2016-02-24: Acquisition of Mementos and Their Content Is More Challenging Than Expected

Recently, we conducted an experiment using mementos for almost 700,000 web pages from more than 20 web archives.  These web pages spanned much of the life of the web (1997-2012). Much has been written about acquiring and extracting text from live web pages, but we believe this is an unparalleled attempt to acquire and extract text from mementos themselves. Our experiment is also distinct from AlNoamany's work and Andy Jackson's work because we are trying to acquire and extract text from mementos across many web archives, rather than just one.

We initially expected the acquisition and text extraction of mementos to be a relatively simple exercise, but quickly discovered that the idiosyncrasies among web archives made these operations much more complex.  We document our findings in a technical report entitled "Rules of Acquisition for Mementos and Their Content".

Our technical report briefly covers the following key points:
  • Special techniques for acquiring mementos from the WebCite on-demand archive (http://www.webcitation.org)
  • Special techniques for dealing with JavaScript redirects created by the Internet Archive
  • An alternative to BeautifulSoup for removing elements and extracting text from mementos
  • Stripping away archive-specific additions to memento content
  • An algorithm for dealing with inaccurate character encoding
  • Differences in whitespace treatment between archives for the same archived page
  • Control characters in HTML and their effect on DOM parsers
  • DOM corruption in various HTML pages, exacerbated by how the archives present the text stored within <noscript> elements
Rather than repeating the entire technical report here, we want to focus on the two issues that may have the greatest impact on others acquiring and experimenting with mementos: acquiring mementos from WebCite and inaccurate character encoding.

Acquisition of Content from WebCite


WebCite is an on-demand archive specializing in archiving web pages used as citations in scholarly work.  An example WebCite page is shown below.
To acquire most memento content, we utilized the cURL data transfer tool.  With this tool, one merely types the following command to save the contents of the URI http://www.example.com:

curl -o outputfile.html http://www.example.com

For WebCite, cURL returns the same HTML frameset content regardless of which URI-M is requested.  We sought to acquire the actual content of a given page for text extraction, so merely using cURL was insufficient.  An example of this HTML is shown below.


Instead of relying on cURL alone, we analyzed the resulting HTML frameset and determined that the content is actually returned by a request to the mainframe.php file.  Unfortunately, merely issuing a request to mainframe.php is insufficient because the cookies sent to the browser indicate which memento should be displayed. We developed custom PhantomJS code, presented as Listing 1 in the technical report, to overcome this issue.  PhantomJS, because it must acquire, parse, and process the content of a page, is much slower than merely using cURL.
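For those experimenting with WebCite themselves, the general idea can be sketched with a cookie-preserving HTTP client: request the URI-M first so that WebCite sets its cookies, then request mainframe.php with those same cookies.  The Python sketch below is only an illustration of that idea, using an assumed URI-M and the requests library; it is not the PhantomJS code from Listing 1 and may not handle every case that the browser-based approach does.

import requests

# Illustrative sketch only: reuse the cookies set by the frameset response
# so that the subsequent mainframe.php request returns the intended memento.
# The URI-M below is a hypothetical example.
URIM = "http://www.webcitation.org/5quVBzjn1"
MAINFRAME = "http://www.webcitation.org/mainframe.php"

with requests.Session() as session:
    # Requesting the URI-M returns the frameset and sets the cookies that
    # indicate which memento should be displayed.
    session.get(URIM, timeout=30)

    # Requesting mainframe.php with the same session sends those cookies,
    # which should yield the memento's content rather than the frameset.
    response = session.get(MAINFRAME, timeout=30)
    print(response.status_code, len(response.content))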

Needing a web browser, rather than an HTTP client alone, to acquire web content is common for the live web, as detailed by Kelly and Brunelle, but we did not anticipate needing a browser simulation tool, such as PhantomJS, to acquire memento content.

In addition to the issue of acquiring mementos, we also discovered reliability problems with WebCite, seen in the figure below.  We routinely needed to reattempt downloads of the same URI-M in order to finally acquire its content.
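A simple retry loop is one way to cope with this kind of intermittent failure.  The sketch below is illustrative only; the retry count, delay, and success test are arbitrary choices rather than the values used in our experiment.

import time
import requests

def fetch_with_retries(uri_m, attempts=5, delay=30):
    # Try to download a URI-M several times before giving up.
    for attempt in range(attempts):
        try:
            response = requests.get(uri_m, timeout=60)
            if response.status_code == 200 and response.content:
                return response.content
        except requests.RequestException:
            pass  # network hiccup; fall through and retry
        time.sleep(delay)  # wait before the next attempt
    raise RuntimeError("failed to acquire %s after %d attempts" % (uri_m, attempts))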

Finally, we experienced rate limiting from WebCite, forcing us to divide our list of URI-Ms and download content from several source networks.

Because of these issues, the acquisition of almost 100,000 mementos from WebCite took more than a month to complete, compared with the acquisition of 1 million mementos from the Internet Archive in two weeks.

Inaccurate Character Encoding


Extracting text from documents requires that the text be decoded properly for processes such as text similarity or topic analysis.   For a subset of mementos, some archives do not present the correct character set in the HTTP Content-Type header.  Even though most web sites now use the UTF-8 character set, a subset of our mementos comes from a time before UTF-8 was widely adopted, so proper decoding becomes an issue.

To address this issue, we developed a simple algorithm that attempts to detect and use the character encoding for a given document.

  1. Use the character set from the HTTP Content-Type header, if present; otherwise try UTF-8.
  2. If a character encoding is discovered in the file contents, as is common for XHTML documents, then try to use that; otherwise try UTF-8.
  3. If decoding with any of the character sets encountered raises an error, raise our own error.

We fall back to UTF-8 because it is an effective superset of many of the character sets for the mementos in our collection, such as ASCII. This algorithm worked for more than 99% of our dataset.
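A minimal Python sketch of this decoding strategy is shown below.  The regular expressions and helper names are simplified illustrations of the algorithm above, not the exact code used in our experiment.

import re

def detect_declared_charset(raw_bytes, content_type_header):
    # Step 1: prefer the character set in the HTTP Content-Type header.
    if content_type_header:
        match = re.search(r'charset=([^\s;]+)', content_type_header, re.I)
        if match:
            return match.group(1).strip('"\'')
    # Step 2: otherwise look for a declaration in the file contents,
    # as is common for XHTML documents (e.g., an XML or meta declaration).
    head = raw_bytes[:1024].decode('ascii', errors='ignore')
    match = re.search(r'charset=["\']?([\w.-]+)', head, re.I)
    if match:
        return match.group(1)
    # Otherwise fall back to UTF-8.
    return 'utf-8'

def decode_memento(raw_bytes, content_type_header):
    charset = detect_declared_charset(raw_bytes, content_type_header)
    try:
        return raw_bytes.decode(charset)
    except (LookupError, UnicodeDecodeError) as error:
        # Step 3: if the declared character set raises an error, raise our own.
        raise ValueError("could not decode memento: %s" % error)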

In the future, we intend to explore the use of confidence-based tools, such as the chardet library, to guess the character set when extracting text.  Such tools take more time than merely using the Content-Type header, but they are necessary when that header is unreliable and algorithms such as ours fail.
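For reference, chardet's detection interface is straightforward, as the brief usage sketch below shows; the file name is hypothetical and the snippet is not part of our experiment's code.

import chardet

# Hypothetical memento file; chardet guesses an encoding and reports a
# confidence score between 0 and 1.
with open('memento.html', 'rb') as f:
    raw_bytes = f.read()

guess = chardet.detect(raw_bytes)
print(guess['encoding'], guess['confidence'])

# Decode with the guessed encoding, falling back to UTF-8 if none is guessed.
text = raw_bytes.decode(guess['encoding'] or 'utf-8', errors='replace')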

Summary


We were able to overcome most of the memento acquisition and text extraction issues encountered in our experiment.  Because we were unaware of the problems we would face, we felt it would be useful to detail our solutions to assist others in their own research and engineering.

--
Shawn M. Jones
PhD Student, Old Dominion University
Graduate Research Assistant, Los Alamos National Laboratory
- and -
Harihar Shankar
Research & Development Engineer, Los Alamos National Laboratory
