2015-08-20: ODU, L3S, Stanford, and Internet Archive Web Archiving Meeting



Two weeks ago (on Aug 3, 2015), I was glad to be invited to visit Internet Archive in San Francisco in order to share our latest work with a set of the Web Archiving pioneers from around the world.

The attendees were Jefferson Bailey and Vinay Goel from IA, Nicholas Taylor and Ahmed AlSum from Stanford, and Wolfgang Nejdl, Ivana Marenzi and Helge Holzmann from L3S.

First, we took a quick introduction to each others mentioning the purpose and the nature of our work to IA.

Then, Nejdl introduced the Alexandria project, and demoed the ArchiveWeb project, which aims to develop tools and techniques to explore and analyze Web archives in a meaningful way. In the project, they develop tools that will allow users to visualize and collaboratively interact with Archive-it collections by adding new resources in the form of tags and comments. Furthermore, it contains a collaborative search and sharing platform.

I presented the off-topic detection work with a live demo for the tool, which can be downloaded and tested from https://github.com/yasmina85/OffTopic-Detection.


The off-topic tool aims to automatically detect when the archived page goes off-topic, which means the page changed through time to move away from the initial scope of the page. The tool suggests a list of off-topic pages based on a specific threshold that is input by the user. Based on evaluating the tool, we suggest values for the threshold in a research paper* that can be used to detect the off-topic pages.

A site for one of the candidates for Egypt’s 2012 presidential election. Many of the captures of hamdeensabhay.com are not about the Egyptian Revolution. Later versions show an expired domain (as does the live Web version).

Examples for the usage of the tool:
--------

Example 1: Detecting off-topic pages in 1826 collection

python detect_off_topic.py -i 1826 -th 0.15
extracting seed list

http://agroecol.umd.edu/Research/index.cfm
http://casademaryland.org

50 URIs are extracted from collection https://archive-it.org/collections/1826
Downloading timemap using uri http://wayback.archive-it.org/1826/timemap/link/http://agroecol.umd.edu/Research/index.cfm
Downloading timemap using uri http://wayback.archive-it.org/1826/timemap/link/http://casademaryland.org

Downloading 4 mementos out of 306
Downloading 14 mementos out of 306

Detecting off-topic mementos using Cosine Similarity method

Similarity memento_uri
0.0 http://wayback.archive-it.org/1826/20131220205908/http://www.mncppc.org/commission_home.html/
0.0 http://wayback.archive-it.org/1826/20141118195815/http://www.mncppc.org/commission_home.html

Example 2: Detecting off-topic pages for http://hamdeensabahy.com/

python detect_off_topic.py -t https://wayback.archive-it.org/2358/timemap/link/http://hamdeensabahy.com/  -m wcount -th -0.85

Downloading 0 mementos out of 270
http://wayback.archive-it.org/2358/20140524131241/http://www.hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130321080254/http://hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130621131337/http://www.hamdeensabahy.com/
http://wayback.archive-it.org/2358/20140602131307/http://hamdeensabahy.com/
http://wayback.archive-it.org/2358/20140528131258/http://www.hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130617131324/http://www.hamdeensabahy.com/


Downloading 4 mementos out of 270

Extracting text from the html

Detecting off-topic mementos using Word Count method

Similarity memento_uri
-0.979434447301 http://wayback.archive-it.org/2358/20121213102904/http://hamdeensabahy.com/

-0.966580976864 http://wayback.archive-it.org/2358/20130321080254/http://hamdeensabahy.com/

-0.94087403599 http://wayback.archive-it.org/2358/20130526131402/http://www.hamdeensabahy.com/

-0.94087403599 http://wayback.archive-it.org/2358/20130527143614/http://www.hamdeensabahy.com/


Nicholas insisted on the importance of the off-topic tool from QA perspective, while Internet Archives folks focused on the required computation resources and how it can be shared with Archive-It partners. The group discussed some user interface options to display the output of the tool.

After the demo, we discussed the importance of the tool, especially in the crawling quality assurance practices.  While demoing ArchiveWeb interface, some of the visualization for pages from different collections showed off-topic pages.  We all agreed that it is important that those pages won’t appear to the users when they browse the collections.

It was amazing to spend time in IA and knowing about the last trend from other research groups. The discussion showed the high reputation of WS-DL research in the web archiving community around the world.

*Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson, Detecting Off-Topic Pages in Web Archives, Proceedings of TPDL 2015, 2015.

----
Yasmin




Comments