Monday, October 3, 2016

2016-10-03: Which States and Topics did the Two Presidential Candidates Mention?

"Team Turtle" in Archive Unleashed in Washington DC
(from left to right: N. Chah, S. Marti, M. Aturban , and I. Amin)
The first presidential debate (H. Clinton v. D. Trump) took place last Monday, September 26, 2016, at Hofstra University in New York. The questions were about topics like the economy, taxes, jobs, and race. During the debate, the candidates mentioned those topics (and other issues) and, in many cases, associated a topic with a particular place or US state (e.g., shootings in Chicago, Illinois, and the crime rate in New York). This reminded me of the work we had done in the second Archives Unleashed Hackathon, held at the Library of Congress in Washington, DC. I worked with "Team Turtle" (Niel Chah, Steve Marti, Mohamed Aturban, and Imaduddin Amin) on analyzing an archived collection, provided by the Library of Congress, about the 2004 Presidential Election (G. Bush v. J. Kerry). The collection contained hundreds of archived web sites in ARC format. These key web sites were maintained by the candidates or their political parties, along with several newspaper sites, and they were crawled on the days around election day (November 2, 2004). The goal of this project was to investigate "How many times did each candidate mention each state?" and "What topics were they talking about?"

In this event, we had limited time (two days) to finish our project and present findings by the end of the second day. Fortunately, we were able to make it through three main steps: (1) extract plain text from ARC files, (2) apply some techniques to extract named entities and topics, and (3) build a visualization tool to better show the results. Our processing scripts are available on GitHub.

[1] Extract textual data from ARC files:

The ARC file format specifies a way to store multiple digital resources in a single file. It is used heavily by the web archive community to store captured web pages (e.g., the Internet Archive's Heritrix writes what it finds on the Web into ARC files of 100MB each). ARC is the predecessor of the now more popular WARC format. We were provided with 145 ARC files, and each of these files contained hundreds of web pages. To read the content of these ARC files, we decided to use Warcbase, an interesting open-source platform for managing web archives. We started by installing Warcbase by following these instructions. Then, we wrote several Scala scripts for Apache Spark to iterate over all ARC files and generate a clean textual version of each page (e.g., by removing all HTML tags). For each archived web page, we extracted its unique ID, crawl date, domain name, full URI, and textual content, as shown below (we hid the content of the web pages due to copyright issues). The results were collected into a single TSV file.
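For readers who want to try something similar without Warcbase, here is a minimal sketch (not our actual Scala scripts) of how plain text could be pulled out of ARC files in Python using warcio and BeautifulSoup; the file paths and output columns are illustrative assumptions.

    # A hypothetical alternative to the Warcbase/Scala scripts described above:
    # extract (crawl date, URI, plain text) records from ARC files into one TSV.
    import glob
    from warcio.archiveiterator import ArchiveIterator
    from bs4 import BeautifulSoup

    with open('pages.tsv', 'w', encoding='utf-8') as out:
        for path in glob.glob('arcs/*.arc.gz'):               # assumed input location
            with open(path, 'rb') as stream:
                # arc2warc=True exposes ARC records through warcio's WARC interface
                for record in ArchiveIterator(stream, arc2warc=True):
                    if record.rec_type != 'response':
                        continue
                    uri = record.rec_headers.get_header('WARC-Target-URI')
                    date = record.rec_headers.get_header('WARC-Date')
                    html = record.content_stream().read()
                    # Strip HTML tags and collapse whitespace to keep only the text
                    text = ' '.join(BeautifulSoup(html, 'html.parser').get_text(' ').split())
                    out.write('\t'.join([date or '', uri or '', text]) + '\n')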

[2] Extract named entities and topics

We used Stanford Named Entity Recognizer (NER) to tag people and places, while for topic modeling, we used the following techniques:
After applying the above techniques, the results were aggregated in a text file that is used as input to the visualization tool (described in step [3]). A part of the results is shown in the table below.
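As a rough illustration of the entity-tagging step (not our exact script), the following sketch uses NLTK's wrapper around the Stanford NER tagger; the model and jar paths are assumptions that depend on a local Stanford NER installation.

    from collections import Counter
    from nltk.tag import StanfordNERTagger
    from nltk.tokenize import word_tokenize

    # Assumed local paths to the Stanford NER model and jar files
    tagger = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz',
                               'stanford-ner.jar')

    def count_locations(text):
        """Count LOCATION entities (e.g., US states) mentioned in one document."""
        tagged = tagger.tag(word_tokenize(text))
        return Counter(token for token, label in tagged if label == 'LOCATION')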

State    Candidate    Frequency of mentioning the state    The most important topic
Mississippi Kerry
Mississippi Bush
Oklahoma Kerry
Oklahoma Bush
Delaware Kerry
Delaware Bush
Minnesota Kerry
Minnesota Bush
Illinois Kerry
Illinois Bush
Georgia Kerry
Georgia Bush
Arkansas Kerry
Arkansas Bush
New Mexico Kerry
New Mexico Bush
Indiana Kerry
Indiana Bush
Maryland Kerry
Maryland Bush
Louisiana Kerry
Louisiana Bush
Texas Kerry
Texas Bush
Tennessee Kerry
Tennessee Bush
Arizona Kerry
Arizona Bush

[3]  Interactive US map 

We decided to build an interactive US map using D3.js. As shown below, the state color indicates the winning party (i.e., red for Republican and blue for Democratic), while the size of a bubble indicates how many times the state was mentioned by the candidate. The visualization required us to provide some information manually, like the winning party for each state. In addition, we entered locations (latitude and longitude) to place the bubbles on the map (two circles for each state). Hovering over a bubble shows the most important topic mentioned by the candidate for that state. If you are interested in interacting with the map, an interactive version is available online.
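To give a sense of the manual data preparation, here is a small sketch (with assumed field names and file layout, not our actual code) of how the aggregated results from step [2] and the hand-entered state metadata could be combined into the JSON records the D3 map reads.

    import csv
    import json

    # Hand-curated metadata: 2004 winning party and an approximate bubble position.
    STATE_META = {
        'Florida': {'winner': 'Republican', 'lat': 27.66, 'lon': -81.52},
        # ... one entry per state
    }

    records = {}
    with open('state_mentions.tsv') as f:            # assumed output of step [2]
        for state, candidate, freq, topic in csv.reader(f, delimiter='\t'):
            rec = records.setdefault(state, dict(STATE_META.get(state, {}), state=state))
            rec.setdefault('mentions', {})[candidate] = int(freq)
            rec.setdefault('top_topic', {})[candidate] = topic

    with open('map_data.json', 'w') as out:
        json.dump(list(records.values()), out, indent=2)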

Looking at the map might help us answer the research questions, but it might also raise other questions, such as why Republicans did not talk about topics related to states like North Dakota, South Dakota, and Utah. Is it because they are always considered "red" states? On the other hand, it is clear that they paid more attention to "swing" states like Colorado and Florida. Finally, I would say that it might be useful to revisit this topic now, as we are close to the 2016 presidential election (H. Clinton v. D. Trump); the same analysis could be applied again to see what newspapers say about this election.

--Mohamed Aturban

2016-10-03: Summary of “Finding Pages on the Unarchived Web"

In this paper, the authors detail their approach to recovering the unarchived Web based on links and anchor text from crawled pages. The data used was from the Dutch 2012 Web archive at the National Library of the Netherlands (KB), totaling about 38 million webpages. The collection was selected by the library based on categories related to Dutch history and social and cultural heritage, and each website is categorized using a UNESCO code. The authors address three research questions: Can we recover a significant fraction of unarchived pages? How rich are the representations of the unarchived pages? And are these representations rich enough to characterize their content?

The link extraction used Hadoop MapReduce and Apache Pig to process all archived webpages and JSoup to extract links from their content. A second MapReduce job indexed the URLs and checked whether or not they were archived. Then the data was deduplicated based on the values of year, anchor text, source, target, and hash code (MD5). In addition, basic cleaning and processing was performed on the data set; the resulting number of pages in the dataset was 11 million webpages. Both external links (inter-server links), which are links between different servers, and site-internal links (intra-server links), which occur within a server, were included in the data set. An Apache Pig script was used to aggregate the extracted links by different elements such as TLD, domain, host, and file type.
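The original pipeline used Pig and JSoup, but the extraction and deduplication logic can be illustrated with a small Python sketch; the field order and MD5-based key below are assumptions that follow the description above, not the paper's code.

    import hashlib
    from bs4 import BeautifulSoup

    def extract_links(source_url, html, crawl_year):
        """Yield one deduplicated link record per (year, anchor, source, target)."""
        seen = set()
        for a in BeautifulSoup(html, 'html.parser').find_all('a', href=True):
            anchor = a.get_text(strip=True)
            target = a['href']
            key = hashlib.md5(
                f'{crawl_year}|{anchor}|{source_url}|{target}'.encode()).hexdigest()
            if key not in seen:
                seen.add(key)
                yield (source_url, target, anchor, crawl_year, key)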

The fields of the processed link records are as follows:
(sourceURL, sourceUnesco, sourceInSeedProperty, targetURL, targetUnesco, targetInSeedProperty, anchorText, crawlDate, targetInArchiveProperty, sourceHash).

There are four main classifications of URLs in this data set, shown in Figure 1:
1- Intentionally archived URLs in the seed list, which make up 92% of the dataset (10.1M).
2- Unintentionally archived URLs due to crawler configuration, which make up 8% of the dataset (0.8M).
3- Inner Aura: unarchived URLs whose parent domain is included in the seed list (5.5M) (20% at depth 4, because 94% are links within the site).
4- Outer Aura: unarchived URLs that do not have a parent domain on the seed list (5.2M) (29.7% at depth 2).

In this work, the Aura is defined as Web documents which are not included in the archived collection but are known to have existed through references to those unarchived Web documents in the archived pages.

They analyzed the four classifications and examined unique hosts, domains, and TLDs. They found that unintentionally archived URLs have a higher percentage of unique hosts, domains, and TLDs compared to intentionally archived URLs, and that the outer Aura has a higher percentage of unique hosts, domains, and TLDs compared to the inner Aura.

When examining the Aura, they found that most of the unarchived Aura points to textual web content. The inner Aura mostly had the (.nl) top-level domain (95.7%), while the outer Aura had 34.7% (.com), 31.1% (.nl), and 18% (.jp). The high percentage of the Japanese TLD is explained by those pages having been unintentionally archived. They also analyzed the indegree of the Aura: all target representations in the outer Aura have at least one source link, 18% have at least 3 links, and 10% have 5 links or more. In addition, the Aura was compared by the number of intra-server and inter-server links; the inner Aura had 94.4% intra-server links, while the outer Aura had 59.7% inter-server links.
The number of unique anchor text words for the inner and outer Aura was similar: 95% had at least one word describing them, 30% had at least three words, and 3% had 10 words or more.

To test the idea of finding missing unarchived Web pages, they took a random sample of 300 websites, of which 150 were homepages and 150 were non-homepages. They made sure the selected websites were either live or archived. They found that 46.7% of the target homepages were found within the top 10 results of the SERP using anchor text, while for non-homepages 46% were found using text obtained from the URLs. By combining anchor text and URL word evidence, both groups did well: 64% of the homepages and 55.3% of the deeper pages could be retrieved. Another random sample of URLs was selected to compare anchor text and words from the link, and they found that homepages can be represented well with anchor text alone, while non-homepages are better represented with both anchor text and words from the link.

They found that the archived pages show evidence of a large number of unarchived pages and websites. They also found that only a few homepages have rich representations. Finally, they found that even with only a few words describing a missing webpage, it can be found within the first rank. Future work includes adding further information, such as surrounding text, and more advanced retrieval models.


-Lulwah M. Alkwai

Tuesday, September 27, 2016

2016-09-27: Introducing Web Archiving in the Summer Workshop

For the last few years, the Department of Computer Science at Old Dominion University has invited a group of undergraduate students from India and hosted them over the summer. They work closely with a research group on some relevant projects. Additionally, researchers from different research groups in the department present their work to the guest students twice a week and introduce the various projects they are working on. The goal of this practice is to allow the visitors to collaborate with graduate students of the department and to encourage them to pursue research studies. The invited students also act as ambassadors, sharing their experience with their colleagues and spreading the word when they go back to India.

This year a group of 16 students from Acharya Institute of Technology and B.N.M. Institute of Technology visited Old Dominion University; they were hosted under the supervision of Ajay Gupta. They worked in the areas of Sensor Networks and Mobile Application Development, researching ways to integrate mobile devices with low-cost sensors to solve problems in health-care-related areas and vehicular networks.

I (Sawood Alam) was selected to represent our Web Science and Digital Libraries Research Group this year, on July 28. Mat and Hany represented the group in the past. I happened to be the last presenter before the students returned to India, by which time they were overloaded with scholarly information. Additionally, the students were not primarily from a Web science or digital libraries background. So, I decided to keep my talk semi-formal and engaging rather than purely scientific. The slides were inspired by my last year's talk in Germany on "Web Archiving: A Brief Introduction".

I began with my presentation slides, entitled "Introducing Web Archiving and WSDL Research Group". I briefly introduced myself with the help of my academic footprint and lexical signature. I described the agenda of the talk and established the motivation for Web archiving. From there, I followed the agenda as laid out, covering topics such as issues and challenges in Web archiving; existing tools, services, and research efforts; my own research on Web archive profiling; and some open research topics in the field of Web archiving. Then I introduced the WSDL Research Group along with all the fun things we do in the lab. Being an Indian, I was able to pull in some cultural references from India to keep the audience engaged and entertained while still staying on the agenda of the talk.

I heard encouraging words from Ajay Gupta, Ariel Sturtevant, and some of the invited students after my talk, who described it as one of the most engaging talks of the entire summer workshop. I would like to thank everyone who was involved in organizing this summer workshop and who gave me the opportunity to introduce my field of interest and the WSDL Research Group.

Sawood Alam

Monday, September 26, 2016

2016-09-26: IIPC Building Better Crawlers Hackathon Trip Report

Trip Report for the IIPC Building Better Crawlers Hackathon in London, UK.                           

On September 22-23, 2016, I attended the IIPC Building Better Crawlers Hackathon (#iipchack) at the British Library in London, UK. Having been to London almost exactly 2 years ago for the Digital Libraries 2014 conference, I was excited to go back, but was more so anticipating collaborating with some folks I had long been in contact with during my tenure as a PhD student researcher at ODU.

The event was a well-organized yet loosely scheduled meeting that resembled more of an "Unconference" than a Hackathon in that the discussion topics were defined as the event progressed rather than a larger portion being devoted to implementation (see the recent Archives Unleashed 1.0 and 2.0 trip reports). The represented organizations were:

Day 0

As everyone arrived at the event from abroad and locally, the event organizer Olga Holownia invited the attendees to an informal get-together meeting at The Skinners Arms. There the conversation was casual but frequently veered into aspects of web archiving and brain picking, which we were repeatedly encouraged to "Save for Tomorrow".

Day 1

The first day began with Andy Jackson (@anjacks0n) welcoming everyone and thanking them for coming despite the short notice and the announcement of the event over the summer. He and Gil Hoggarth (@grhggrth), both of the British Library, kept detailed notes of the conference happenings as they progressed, with Andy keeping an editable open document for other attendees to collaborate on.

Tom Cramer (@tcramer) of Stanford, who mentioned he had organized hackathons in the past, encouraged everyone in attendance (14 in number) to introduce themselves and give a synopsis of their role and their previous work at their respective institutions. He also asked how we could go about making crawling tools accessible to non-web archiving specialists to stimulate conversation.

The responding discussion initiated a theme that ran throughout the hackathon -- web archiving from a web browser.

One tool to accomplish this is Brozzler from Internet Archive, which combines warcprox and Chromium to preserve HTTP content sent over the wire into the WARC format. I had previously attempted to get Brozzler (originally forked from Umbra) up and running but was not successful. Other attendees either had previously tried or had not heard of the software. This transitioned later into Noah Levitt (of Internet Archive) giving an interactive audience-participation walk through of installing, setting up, and using Brozzler.

Prior to the interactive portion of the event, however, Jefferson Bailey (@jefferson_bail) of Internet Archive started a presentation by speaking about WASAPI (Web Archiving Systems API), a specification for defining data transfer of web archives. The specification is a collaboration with the University of North Texas, Rutgers, Stanford via LOCKSS, and other organizations. Jefferson emphasized that the specification is not implementation specific; it does not get into issues like access control, parameters of a specific path, etc. The rationale was that the spec would not be just a preservation data transport tool but also a means of data transfer for researchers. Their in-development implementation takes WARCs, pulls out data to generate a derivative WARC, then defines a Hadoop job using Pig syntax. Noah Levitt added that the Jobs API requires you to supply an operation like "Build CDX" and the WARCs on which you want to perform the operation.

In typical non-linear unconference fashion (also exhibited in this blog post), Noah then gave details on Brozzler (presentation slides). With a room full of Mac and Linux users, installation proved particularly challenging. One issue I had previously run into was latency in starting RethinkDB. This issue was also exhibited by Colin Rosenthal (@colinrosenthal), with him on Linux and me on Mac. Noah's machine, which he showed in a demo as having the exact same versions of all dependencies I had installed, did not show this latency, so your mileage may vary with installation; in the end, both Colin and I (and possibly others) were successful in crawling a few URIs using Brozzler.

Andy added to Noah's interactive session by referencing his effort in Dockerizing Brozzler and his other work in component-izing and Dockerizing the other roles and tasks of the web archiving process with his project Wren. While one such component is the Archival Acid Test project I had created for Digital Libraries 2014, the other sub-projects help with setting up tools that are otherwise difficult or time consuming to configure.

One such tool that was lauded throughout the conference was Alex Osborne's (@atosborne) tinycdxserver; Andy has also created a Dockerized version of tinycdxserver. This tool was new to me, but the reported statistics on CDX querying speed and storage have the potential for significant improvement for large web archives. Per Alex's description of the tool, indexes in tinycdxserver are stored compressed using Facebook's RocksDB and are about a fifth of the size of a flat CDX file. Further, Wayback instances can simply be pointed at a tinycdxserver instance using the built-in RemoteResourceIndex field in the Wayback configuration file, which makes for easy integration.


A wholly unconference-style discussion then commenced with topics we wanted to cover in the second part of the day. After coming up with and classifying various ideas, Andy defined three groups: the Heritrix Wish List, Brozzler, and Automated QA.

Each attendee could join any of the three for further discussion. I chose "Automated QA", given that archival integrity is related to my research topic.

The Heritrix group expressed challenges that its members had encountered in transitioning from Heritrix version 1 to version 3. "The Heritrix 3 console is a step back from Heritrix 1's. Building and running scripts in Heritrix 3 is a pain," was the general sentiment from the group. Another concern was scarce documentation, which might be remedied with funded efforts to improve it, as deep knowledge of the tool's workings is needed to accurately represent its capabilities. Kristinn Sigurðsson (@kristsi), who was involved in the development of H3 (and declined to give a history documenting the non-existence of H2), has since resolved some issues. He and others encouraged me to use his fork of Heritrix 3, my own recommendation inadvertently included:

The Brozzler group first discussed how Brozzler's behavior differs from that of a conventional crawler in its handling of one page or site at a time (a la WARCreate) versus adding discovered URIs to a frontier and seeding those URIs for subsequent crawls. As noted above, Brozzler's use of RethinkDB as both the crawl frontier and the CDX service makes it especially appealing and more scalable. Brozzler allows multiple workers to pull URIs from a pool and report back to a RethinkDB instance. This worked fairly well in my limited but successful testing at the hackathon.
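To make the shared-frontier pattern concrete, here is a rough sketch using the RethinkDB Python driver; the table name, fields, and (deliberately non-atomic) claim logic are my own illustrative assumptions, not Brozzler's actual schema or code.

    import rethinkdb as r

    conn = r.connect('localhost', 28015, db='crawl')   # assumed database name

    def claim_next_uri(worker_id):
        """Claim one pending URI from the shared frontier table (illustration only)."""
        for row in r.table('frontier').filter({'status': 'pending'}).limit(1).run(conn):
            # Mark the row as claimed by this worker; a real system would need an
            # atomic claim, this sketch only shows the shape of the pattern.
            r.table('frontier').get(row['id']).update(
                {'status': 'claimed', 'claimed_by': worker_id}).run(conn)
            return row['uri']
        return None

    def report_done(uri_id, outcome):
        """Report a finished capture back to the shared RethinkDB instance."""
        r.table('frontier').get(uri_id).update(
            {'status': outcome, 'claimed_by': None}).run(conn)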

The Automated QA group first spoke about the National Library of Australia's Bamboo project. The tool consumes Heritrix's (and/or warcprox's) crawl output folder and provides in-progress indexes from WARC files before a crawl finishes. Other statistics can also be added in, as well as automated generation of screenshots for comparing captures on the fly. We also highlighted some particular items that crawlers and browser-based preservation tools have trouble capturing, for example, video formats whose support varies between browsers, URIs defined in the "srcset" attribute, responsive design behaviors, etc. I also referenced my work on Ahmed AlSum's (@aalsum) thumbnail summarization using SimHash, as presented at the Web Archiving Collaboration meeting.

After presentations by the groups, the attendees called it a day and continued the discussions at a nearby pub.

Day 2

The second day commenced with a few questions we had all decided upon and agreed to while at the pub as good discussion topics for the next day. These questions were:

  1. Given five engineers and two years, what would you build?
  2. What are the barriers in training for the current and future crawling software and tools?
Given Five...

Responses to the first included something like Brozzler's frontier but redesigned to allow for continuous crawling instead of a single URI. With a segue toward Heritrix, Kristinn verbally considered the relationship between configurability and scalability. "You typically don't install Heritrix on a virtual machine," he said; "usually a machine for this use requires at least 64 gigabytes of RAM." Also discussed was getting the raw data for a crawl versus being able to get the data needed to replicate the experience, and the particular importance of the latter.

Additionally, there was talk of adapting the scheme used by Brozzler for an Electron application meant for browsing, with the ability to toggle archiving through warcprox (related: see the recent post on WAIL). On the flip side, Kristinn mentioned that it surprised him that we can potentially create a browser of this sort that can interact with a proxy but not build another crawler -- highlighting the lack of options in other Heritrix-like robust archival crawlers.

Barriers in Training

For the second question, those involved with institutional archives seemed to agree that if one were going to hire a crawl engineer, Java and Python experience are a prerequisite to exposure to the archive-specific concepts. For current institutional training practice, Andy stated that he turns new developers in his organization loose on ACT, which is simply a CRUD application, to introduce them to the web archiving domain. Others said it would be useful to have staff exchanges and internships for collaboration and for getting more employees familiar with web archiving.


Another topic arose from the previous conversation about future methods of collaboration. For future work on documentation, more fundamental Getting Started guides as well as test sites for tools would be welcomed. For future communication, the IIPC Slack channel as well as the newly created IIPC GitHub wiki will be the next iteration of the outdated IIPC Tools page and the COPTR initiative.

The whole-group discussion wrapped up with identifying concrete next steps from what was discussed at the event. These included creating setup guides for Brozzler, testing further use cases of Umbra versus Brozzler, future work on access control considerations as currently handled by institutions and the next steps regarding them, and a few other TODOs. A monthly online meeting is also planned to facilitate collaboration between meetings, as well as more continued interaction via Slack instead of a number of outdated, obsolete, or noisy e-mail channels.

In Conclusion...

Attending the IIPC Building Better Crawlers Hackathon was invaluable for establishing contacts and gaining more exposure to the field and the efforts of others. Many of the conversations were open-ended, which led to numerous other topics being discussed and opened the doors to potential new collaborations. I gained a lot of insight from discussing my research topic and others' projects and endeavors. I hope to be involved with future Hackathons-turned-Unconferences from IIPC and appreciate the opportunity I had to attend.

—Mat (@machawk1)

Kristinn Sigurðsson has also written a post about his takeaways from the event.

Since the publication of this post, Tom Cramer has also published his report on the Hackathon.

Wednesday, September 21, 2016

2016-09-20: The promising scene at the end of Ph.D. trail

From right to left, Dr. Nelson (my advisor),
Yousof (my son), Yasmin (myself), Ahmed (my husband)
August 26th marked my last day as a Ph.D. student in the Computer Science department at ODU, while September 26 marks my first day as a Postdoctoral Scholar in Data Curation for the Sciences and Social Sciences at UC Berkeley. I will lead research in the areas of software curation, data science, and digital research methods. I will be honored to work under the supervision of Dr. Erik Mitchell, the Associate University Librarian and Director of Digital Initiatives and Collaborative Services at the University of California, Berkeley. I will have an opportunity to collaborate with many institutions across UC Berkeley, including the Berkeley Institute for Data Science (BIDS) research unit. It is amazing to see the light at the end of the long tunnel. Below, I talk about the long trail I took to reach my academic dream position. I'll recap the topic of my dissertation, then I'll summarize lessons learned at the end.

I started my Ph.D. in January 2011, at the same time that the uprisings of the Jan 25 Egyptian Revolution began. I was witnessing what was happening in Egypt while I was in Norfolk, Virginia. I could not do anything during the 18 days except watch all the news and social media channels, following the events. I wished that my son Yousof, who was less than 2 years old at that time, could know what was happening as I saw it. Luckily, I knew about Archive-It, a subscription service by the Internet Archive that allows institutions to develop, curate, and preserve collections of Web resources. Each collection in Archive-It has two dimensions: time and URI. Understanding the contents and boundaries of these archived collections is a challenge for most people, resulting in a paradox: the larger the collection, the harder it is to understand.

There are multiple collections in Archive-It about the Jan. 25 Egyptian Revolution 

There is more than one collection documenting the Arab Spring, and particularly the Egyptian Revolution. Documenting long-running events such as the Egyptian Revolution results in large collections that have thousands of URIs, and each URI has thousands of copies through time. It is challenging for my son to pick a specific collection to learn the key events of the Egyptian Revolution. My dissertation, entitled "Using Web Archives to Enrich the Live Web Experience Through Storytelling", focused on understanding the holdings of these archived collections.
Inspired by “It was a dark and stormy night”, a well-known storytelling trope:  
We named the proposed framework the Dark and Stormy Archive (DSA) framework, in which we integrate "storytelling" social media and Web archives. In the DSA framework, we identify, evaluate, and select candidate Web pages from archived collections that summarize the holdings of these collections, arrange them in chronological order, and then visualize these pages using tools that users are already familiar with, such as Storify. An example of the output is below. It shows three stories for the three collections about the Egyptian Revolution. The user can gain an understanding of the holdings of each collection from the snippets of each story.

The story of the Arab Spring Collection

The story of  the North Africa and the Middle East collection

The story of the Egyptian Revolution collection

With the help of the Archive-It team and partners, we obtained a ground truth data set for evaluating the stories generated by the DSA framework. We used Amazon Mechanical Turk to evaluate the automatically generated stories against stories that were created by domain experts. The results show that the stories automatically generated by the DSA are indistinguishable from those created by human domain experts, while at the same time both kinds of stories (automatic and human) are easily distinguished from randomly generated stories. I successfully defended my Ph.D. dissertation on 06/16/2016.

Generating persistent stories from themed archived collections will ensure that future generations will be able to browse the past easily. I’m glad that Yousof and future generations will be able to browse and understand the past through generated stories that summarize the holdings of the archived collections.


To continue the WS-DLers’ habit of providing recaps, lessons learned, and recommendations, I will list some of the lessons I learned about what it takes to be a successful Ph.D. student, along with advice for applying in academia. I hope these lessons and advice will be useful for future WS-DLers and grad students. Lessons learned and advice:
  • The first one, and the one I always put in front of me: You can do ANYTHING!!

  • Getting involved in communities in addition to your academic life is useful in many ways. I have participated in many women-in-technology communities, such as the Anita Borg Institute and Arab Women in Computing (ArabWIC), to increase the inclusion of women in technology. I was awarded travel scholarships to attend several well-known women-in-tech conferences: CRA-W (Graduate Cohort 2013), Grace Hopper Celebration of Women in Computing (GHC) 2013, GHC 2014, GHC 2015, and ArabWIC 2015. I am a member of the leadership committee of ArabWIC. Attending these meetings builds maturity and enlarges the personal connections and development that prepare students for future careers, and I gained leadership skills from being part of the ArabWIC leadership committee.
  • Publications matter! If you are in WS-DL, you will have to reach the targeted score 😉. You can learn more about the point system on the wiki. If you plan to apply in academia, the list of publications is a big factor.
  • Teaching is important for applying in academia. 
  • Collaboration is key to increasing your connections and will also help you develop your skills for working in teams.
  • And lastly, being a mom holding a Ph.D. is not easy at all!!
The trail was not easy, but it was worth it. I have learned and changed much since I started the program. Having enthusiastic and great advisors like Dr. Nelson and Dr. Weigle is a huge support that results in a happy ending and an achievement to be proud of.


Tuesday, September 20, 2016

2016-09-20: Carbon Dating the Web, version 3.0

Due to API changes, the old Carbon Date tool is out of date and some modules no longer work, such as Topsy. I have taken up the responsibility of maintaining and extending the service, beginning with the changes now available in Carbon Date v3.0.

Carbon date 3.0

What's new

New services have been added, such as Bing search, Twitter search, and pubdate parsing.

The new software architecture enables us to load given scripts or disable given services at runtime.

The server framework has been changed from CherryPy to Tornado, which is still a minimalist Python web server, but with better performance.
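As a very small sketch of what a Tornado-based endpoint looks like (the route, handler name, and response fields here are my assumptions, not the actual Carbon Date server code):

    import tornado.ioloop
    import tornado.web

    def estimate_creation_date(url):
        # Stand-in for the real module dispatcher described later in this post.
        return 'unknown'

    class CarbonDateHandler(tornado.web.RequestHandler):
        def get(self):
            url = self.get_argument('url')
            # RequestHandler.write() serializes a dict as JSON automatically.
            self.write({'uri': url,
                        'estimated-creation-date': estimate_creation_date(url)})

    if __name__ == '__main__':
        app = tornado.web.Application([(r'/cd', CarbonDateHandler)])
        app.listen(8888)                           # assumed port
        tornado.ioloop.IOLoop.current().start()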

How to use the Carbon Date service

  • Through the website: Given that carbon dating is computationally intensive, the site can only handle 50 concurrent requests, so the web service should be used just for small tests as a courtesy to other users. If you need to Carbon Date a large number of URLs, you should install the application locally. Note that the old link still works.
  • Through local installation: The project source can be found in the project repository; consult it for instructions on how to install the application.

Dockerizing the Carbon Date Tool

Carbon Date now only supports Python 3. Due to potential package conflicts between Python 2 and Python 3 (most machines have Python 2 installed as the default), we recommend running Carbon Date in Docker.

Build the Docker image from source:
  1. Install Docker.
  2. Clone the GitHub source to a local directory.
  3. Run 
  4. Then you can choose either server or local mode
    • server mode

      Don't forget to map your port to the server port in the container.
      Then, in the browser, visit

      for the index page, or
      in the terminal

      for a direct query
    • local mode
or pull the deployed image automatically from Docker Hub:

System Design

To make the Carbon Date tool easier to maintain and develop, the structure of the application has been refactored. The system now has four layers:

When a query is sent to the application, it proceeds as follows:

Add new module to Carbon Date

Now all of the modules are loaded and executed automatically. The module manipulator searches for and calls the entry function of each module. A new module can be loaded and executed automatically, without altering other scripts, if it defines its entry function in the way described below.

Name the module main script as cdGet<Module name>.py
And ensure the entry function is named:

or customize your own entry function name by assigning a string value to the 'entry' variable at the beginning of your script.

For example, consider a new module that uses a search engine to find the potential creation date of a URI. The script should be named accordingly, and the entry function should be:

The caller will pass outputArray, indexOfOutputArray, and displayArray in the kwargs to the function. Note that outputArray is used to compute the earliest creation date, so only one value should be assigned to it. The displayArray holds the values to return for display; these can be the same as the resulting creation date or anything else, in the form of an array of tuples.

In this example, when we get the result back from the search engine, the module assigns the earliest creation date it found to outputArray and appends the values it wants displayed to displayArray.
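Since the original snippet is not reproduced here, below is a hedged skeleton of what such a module might look like; the module and function names, the fake lookup, and the date format are illustrative assumptions based on the description above.

    # -- a hypothetical Carbon Date module skeleton.

    def getExample(url, **kwargs):
        output_array = kwargs['outputArray']        # earliest creation date goes here
        index = kwargs['indexOfOutputArray']        # this module's slot in outputArray
        display_array = kwargs['displayArray']      # (label, value) tuples to display

        # A real module would query its web service here; this is a placeholder.
        estimated_date = '2016-01-01T00:00:00'

        output_array[index] = estimated_date
        display_array.append(('example-service', estimated_date))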

Source maintenance

Some web services may change, so some modules will need to be updated frequently.

Here, the Twitter module should be updated whenever Twitter changes its page hierarchy, because it currently crawls the Twitter search page and parses the timestamp of each tweet in the results. The current algorithm may not work if Twitter moves the tweets' timestamps to other tags in the future.

Thus, the Twitter script will need to be updated periodically until Twitter allows users to retrieve tweets older than one week through the Twitter API.

I am grateful to everyone who helped me with Carbon Date, especially Sawood Alam, who helped greatly with deploying the server and gave countless pieces of advice about refactoring the application, and John Berlin, who advised me to use Tornado instead of CherryPy. Further recommendations or comments about how this service can be improved are welcome and will be appreciated.


Tuesday, September 13, 2016

2016-09-13: Memento and Web Archiving Colloquium at UVa

Yesterday, September 12, I went to the University of Virginia to give a colloquium at the invitation of Robin Ruggaber, to talk with her staff about Memento, Web archiving, and related technologies. I also had the pleasure of meeting with Worthy Martin of the CS department and the Institute for Advanced Technology in the Humanities. I met Robin at CNI Spring 2016, and she was intrigued by our work on using storytelling to summarize archival collections; she was hoping to apply it to their Archive-It collections (which are currently not public). My presentation yesterday was more of an overview of web archiving, although the discussion did cover various details, including a proposal for Memento versioning in Fedora.