Friday, April 19, 2013

2013-04-19: Carbon Dating the Web

In the course of our research we often need to determine when a certain web resource was created. In many cases this is fairly straightforward to answer by examining the resource itself: articles often carry publication timestamps, social media contributions carry posting times, and for other resources the creation date can be estimated by reading the content. This is simple to do by hand, but it is hard to automate when the dataset of resources is large.
To solve this problem we conducted several experiments on determining a resource's creation date automatically. When a resource is created it is often indexed by search engines, archived in public web archives, and shared on social media, leaving trails of its existence. We trace those trails and use the earliest appearance across all of them as a close estimate of the creation date. The timeline below illustrates a common scenario in the lifetime of a resource.




We also examined whether a Last-Modified timestamp exists in the resource’s HTTP response header and whether it can serve as an estimate of the creation date. In addition, we examine the resource’s backlinks and estimate their creation dates, which can be easier to extract and give us further insight into when the resource itself was created.
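
A minimal sketch of this earliest-trail rule, assuming the first-appearance timestamps for each source have already been collected. The function name, the source labels, and the sample values are illustrative only and are not taken from the CarbonDate code.

```python
from datetime import datetime
from typing import Dict, Optional, Tuple


def estimate_creation_date(
    first_seen: Dict[str, Optional[datetime]],
) -> Optional[Tuple[str, datetime]]:
    """Return (source, timestamp) of the earliest trail, or None if no source has one."""
    # Drop sources that produced no timestamp (e.g. a missing Last-Modified header).
    found = {src: ts for src, ts in first_seen.items() if ts is not None}
    if not found:
        return None
    source = min(found, key=found.get)
    return source, found[source]


# Hypothetical first-appearance timestamps for a single resource.
trails = {
    "archives": datetime(2009, 9, 30, 11, 58, 25),
    "search_engine": datetime(2009, 11, 16),
    "social_media": datetime(2009, 11, 9, 20, 53, 20),
    "backlinks": datetime(2011, 1, 16, 21, 42, 12),
    "last_modified": None,  # header missing for this resource
}
print(estimate_creation_date(trails))
```

A source that finds nothing simply drops out of the comparison, so the estimate degrades gracefully when only some trails exist.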

To test the accuracy of our estimates we collected 1200 resources whose creation dates we could extract manually from different sources. Our model was able to estimate a creation date for over 75% of the resources, and it produced the exact creation date for 33%. After validating our model we used it to build an age estimation service which, given a resource’s URL, returns a JSON object containing the creation dates from each source (search engines, archives, social media, backlinks, and others) and the estimated creation date, which is the lowest of them. You can use the service at:  http://cd.cs.odu.edu/cd/<YOUR_URL_HERE>

curl -i http://cd.cs.odu.edu/cd/http://www.mementoweb.org

HTTP/1.0 200 OK
Date: Fri, 01 Mar 2013 04:44:47 GMT
Server: WSGIServer/0.1 Python/2.6.5
Content-Length: 550
Content-Type: application/json; charset=UTF-8

{
      "URI": "http://www.mementoweb.org",
      "Estimated Creation Date": "2009-09-30T11:58:25",
      "Last Modified": "2012-04-20T21:52:07",
      "Bitly": "2011-03-24T10:44:12",
      "Topsy.com": "2009-11-09T20:53:20",
      "Backlinks": "2011-01-16T21:42:12",
      "Google.com": "2009-11-16",
      "Archives": {
            "Earliest": "2009-09-30T11:58:25",
            "By Archive": {
                  "wayback.archive-it.org": "2009-09-30T11:58:25",
                  "api.wayback.archive.org": "2009-09-30T11:58:25",
                  "webarchive.nationalarchives.gov.uk": "2010-04-02T00:00:00"
            }
      }
}
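
If you want to post-process such a response yourself, the following sketch may help. It assumes the response has the shape shown above (a trimmed copy is embedded as `SAMPLE`), walks the nested object, and recomputes the lowest date from the per-source fields. The helper names are hypothetical. Values that are not timestamps, such as the URI or an error string returned in place of a date, are skipped.

```python
import json
from datetime import datetime
from typing import Iterator, Tuple

# Trimmed copy of the response shown above.
SAMPLE = """
{
  "URI": "http://www.mementoweb.org",
  "Estimated Creation Date": "2009-09-30T11:58:25",
  "Last Modified": "2012-04-20T21:52:07",
  "Bitly": "2011-03-24T10:44:12",
  "Topsy.com": "2009-11-09T20:53:20",
  "Backlinks": "2011-01-16T21:42:12",
  "Google.com": "2009-11-16",
  "Archives": {
    "Earliest": "2009-09-30T11:58:25",
    "By Archive": {
      "webarchive.nationalarchives.gov.uk": "2010-04-02T00:00:00"
    }
  }
}
"""


def iter_dates(obj: dict, prefix: str = "") -> Iterator[Tuple[str, datetime]]:
    """Yield (field path, timestamp) for every value that parses as an ISO date."""
    for key, value in obj.items():
        name = prefix + key
        if isinstance(value, dict):
            yield from iter_dates(value, name + "/")
        elif isinstance(value, str):
            try:
                yield name, datetime.fromisoformat(value)
            except ValueError:
                continue  # not a timestamp (URI, error message, ...)


def earliest_source_date(response_text: str) -> Tuple[str, datetime]:
    """Recompute the lowest date from the per-source fields of one response."""
    data = json.loads(response_text)
    # Leave out the service's own summary field so only the sources are compared.
    sources = {k: v for k, v in data.items() if k != "Estimated Creation Date"}
    return min(iter_dates(sources), key=lambda item: item[1])


field, when = earliest_source_date(SAMPLE)
reported = datetime.fromisoformat(json.loads(SAMPLE)["Estimated Creation Date"])
print(field, when.isoformat(), when == reported)
```

Note that `earliest_source_date` raises `ValueError` if no field in the response contains a parseable date.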


We have also published the code on GitHub. You can download it, along with the installation instructions, from: https://github.com/HanySalahEldeen/CarbonDate  To run the service yourself, first register with Bitly and Topsy and get their corresponding API keys. Second, modify the config file by adding your keys. Finally, launch server.py on your designated IP and port.

This work was published at the third annual TempWeb workshop at the WWW 2013 conference in Rio de Janeiro, Brazil.


- Hany M. SalahEldeen, Michael L. Nelson, Carbon Dating The Web: Estimating the Age of Web Resources, Proceedings of TempWeb03, WWW 2013. (Also available as a Technical Report http://arxiv.org/abs/1304.5213).

6 comments:

  1. http://cd.cs.odu.edu/cd/
    No longer seems to work

  2. Our apologies, the server was down for maintenance. Now it is up and running.

  3. "Topsy.com": "Topsy Key has expired",

    Sorry to keep seeming to moan - your website is very useful to anyone researching a news story.
    mikej
    www.i-programmer.info

  4. Hi Mike: the various keys (e.g., Topsy, Bitly) are rate limited; when they exceed X requests/hour, you have to wait until the next hour (or day or whatever). Since this is a public demo service, we can't really control who uses it or how much. Your best bet is to install your own copy of CD w/ your own keys: https://github.com/HanySalahEldeen/CarbonDate It should not be too hard, and it will ensure you get timely results. We're glad you like CD!

  5. Thanks for that - it makes perfect sense.
    We did a news item on it because of me saying how useful it was :-)
    Have a look at
    http://www.i-programmer.info/news/81-web-general/5939-carbon-dating-the-web.html
    and tell us if we got anything wrong.

  6. The article looks good to me -- much thanks for writing it!
