Monday, June 26, 2017

2017-06-26: IIPC Web Archiving Conference (WAC) Trip Report

Mat Kelly reports on the International Internet Preservation Consortium (IIPC) Web Archiving Conference (WAC) 2017 in London, England.                            

In the latter part of Web Archiving Week (#waweek2017) from Wednesday to Friday, Sawood and I attended the International Internet Preservation Consortium (IIPC) Web Archiving Conference (WAC) 2017, held jointly with the RESAW Conference at the Senate House and British Library Knowledge Center in London. Each of the three days had multiple tracks. Reported here are the presentations I attended.

Prior to the keynote, Jane Winters (@jfwinters) of University of London and Nicholas Taylor (@nullhandle) welcomed the crowd with admiration toward the Senate House venue. Leah Lievrouw (@Leah53) from UCLA then began the keynote. In her talk, she walked through the evolution of the Internet as a medium to access information prior to and since the Web.

With reservations about the term "Web 3.0", Leah described a new era marked by the shift from documents to conversations to big data. With a focus on the conference themes, she described the social science and cultural breakdown as it has applied to each Web era.

After the keynote, two concurrent presentation tracks proceeded. I attended a track where Jefferson Bailey (@jefferson_bail) presented "Advancing access and interface for research use of web archives". First citing an updated metric of the Internet Archive's holdings (see Ian's tweet below), Jefferson provided an update on some contemporary holdings and collections by IA, including some details on his GifCities project (introduced with IA's 20th anniversary, see our celebration), which provides searchable access to the archive's holdings of the animated GIFs that once resided on Geocities.com.

In addition to this, Jefferson also highlighted the beta features of the Wayback Machine, including an anchor text-based search algorithm, a MIME-type breakdown, and much more. He also described some other available APIs, including one built on top of WAT files, a metadata format derived from WARC.

Through recent efforts by IA for their anniversary, they had also put together a collection of military PowerPoint slide decks.

Following Jefferson, Niels Brügger (@NielsBr) led a panel consisting of a subset of authors from the first issue of his journal, "Internet Histories". Marc Weber stated that the journal had been in the works for a while. When he initially told people he was looking at the history of the Web in the 1990s, people were puzzled. He went on to compare the Internet to being in its Victorian era, having evolved through 170 years of the telephone and 60 years of being connected through the medium. Of the vast history of the Internet, we have preserved relatively little. He finished by noting that we need to treat history and preservation as something that should be done quickly, as we cannot go back later to find the materials if they are not preserved.

Steve Jones of the University of Illinois at Chicago spoke second about the Programmed Logic for Automatic Teaching Operations (PLATO) system. There were two key interests, he said, in developing for PLATO -- multiplayer games and communication. The original PLATO lab was in a large room, and because the developers could not be bothered to walk to each other's desks, they developed the "Talk" system to communicate and save messages so the same message would not have to be communicated twice. PLATO was not designed for lay users but for professionals, he said, though it was also used by university and high school students. "You saw changes between developers and community values," he said, "seeing development of affordances in the context of the discourse of the developers that archived a set of discussions." Access to the PLATO system is still available.

Jane Winters presented third on the panel stating that there is a lot of archival content that has seen little research engagement. This may be due to continuing work on digitizing traditional texts but it is hard to engage with the history of the 21st century without engaging with the Web. The absence of metadata is another issue. "Our histories are almost inherently online", she said, "but they only gain any real permanence through preservation in Web archives. That's why humanists and historians really need to engage with them."

The tracks then joined together for lunch and split back into separate sessions, where I attended the presentation, "A temporal exploration of the composition of the UK Government Web Archive". In this presentation they examined the evolution of the UK National Archives (@uknatarchives). This was followed by a presentation by Caroline Nyvang (@caobilbao) of the Royal Danish Library that examined current web referencing practices. Her group proposed the persistent web identifier (PWID) format for referencing Web archives, which was eerily similar to the URI semantics often used in another protocol.

Andrew (Andy) Jackson (@anjacks0n) then took the stage to discuss the UK Web Archive's (@UKWebArchive) catalog and the challenges they have faced while considering the inclusion of Web archive material. He detailed a process, represented by a hierarchical diagram, to describe the sorts of transformations required in going from the data to reports and indexes about the data. In doing so, he also compared his process with other archival workflows that would be performed in a conventional library catalog architecture.

Following Andy, Nicola Bingham (@NicolaJBingham) discussed curating collections at the UK Web Archive, which has been archiving since 2013, and the challenges in determining the boundaries and scope of what should be collected. She encouraged researchers to engage with the archive to shape its collections. Their current holdings consist of about 400 terabytes with 11 to 12 billion records, growing by 60 to 70 terabytes and 3 billion records per year. Their primary mission is to collect UK web sites under UK TLDs (like .uk, .scot, .cymru, etc.). Domains are currently capped at 512 megabytes of preserved content, and even then other technical limitations exist in capture (proprietary formats, plugins, robots.txt, etc.).

When Nicola finished, there was a short break. Following that, I traveled upstairs in the Senate House to the "Data, process, and results" workshop, led by Emily Maemura (@emilymaemura). She first described three different research projects where each of the researchers was present, then asked attendees to break out into groups to discuss the various facets of each project in detail with each researcher. I opted to discuss Federico Nanni's (@f_nanni) work with him and a group of other attendees. His work consisted of analyzing and resolving issues in the preservation of the web site of the University of Bologna. The site specifies a robots.txt exclusion, which makes the captures inaccessible to the public, but through his investigation and efforts he was able to change their local policy to allow for further examination of the captures.

With the completion of the workshop, everyone still in attendance joined back together in the Chancellor's Hall of the Senate House as Ian Milligan (@ianmilligan1) and Matthew Weber (@docmattweber) gave a wrap-up of the Archives Unleashed 4.0 Datathon, which had occurred prior to the conference on Monday and Tuesday. Part of the wrap-up was time given to the three top-ranked projects as determined by judges from the British Library. The group I was part of at the Datathon, "Team Intersection", was one of the three, so Jess Ogden (@jessogden) gave a summary presentation. More information on our intersection analysis between multiple data sets can be found on our GitHub.io page. A blog post detailing our report of the Datathon will be posted here in the coming days.

Following the AU 4.0 wrap-up, the audience moved to the British Library Knowledge Center for a panel titled, "Web Archives: truth, lies and politics in the 21st century". I was unable to attend this, opting for further refinement of the two presentations I was to give on the second day of IIPC WAC 2017 (see below).

Day Two

The second day of the conference was split into three concurrent tracks -- two at the Senate House and a third at the British Library Knowledge Center. Given I was slated to give two presentations at the latter (and the venues were about 0.8 miles apart), I opted to attend the sessions at the BL.

Nicholas Taylor opened the session with the scope of the presentations for the day and introduced the first three presenters. First on the bill was Andy Jackson with "Digging documents out of the web archives." He initially compared this talk to the one he had given the day prior (see above) relating to the workflows in cataloging items. In the second day's talk, he discussed the process of the Digital ePrints team and the inefficiencies of its manual process for ingesting new content. Based on this process, his team set up a new harvester that watches targets, extracts the documents and machine-readable metadata from the targets, and submits them to the catalog. Still, issues remained, one being what to identify as the "publication" for e-prints relative to the landing page, assets, and what is actually cataloged. He discussed the need for further experimentation using a variety of workflows to optimize the outcome for quality, to ensure the results are discoverable and accessible, and to keep the process mostly automated.

Ian Milligan and Nick Ruest (@ruebot) followed Andy with their presentation on making their Canadian web archival data sets easier to use. "We want web archives to be used on page 150 in some book," they said, reinforcing that they want the archives to inform the insights instead of the subject necessarily being the archives themselves. They also discussed their extraction and processing workflow of acquiring the data from the Internet Archive and then using Warcbase and other command-line tools to make the data contained within the archives more accessible. Nick said that since last year, when they presented webarchives.ca, they have indexed 10 terabytes, representative of over 200 million Solr docs. Ian also discussed derivative datasets they had produced, including domain and URI counts, full text, and graphs. Making the derivative data sets accessible and usable by researchers is a first step toward their work being used on page 150.

Greg Wiedeman (@GregWiedeman) presented third in the technical session by first giving context of his work at the University at Albany (@ualbany), where they are required to preserve state records with no dedicated web archives staff. Some records have paper equivalents, like archived copies of their Undergraduate Bulletins, while digital versions might consist of Microsoft Word documents corresponding to the paper copies. They are using DACS to describe archives, so they questioned whether they should use it for Web archives. On a technical level, he runs a Python script over their collection of CDX files, which schedules a crawl that is displayed in their catalog as it completes. "Users need to understand where web archives come from," he said, "and need provenance to frame their research questions, which will add weight to their research."
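
Not Greg's script, but as a rough idea of what inspecting a CDX collection can look like, here is a minimal Python sketch. It assumes a common space-delimited CDX layout in which the second column is the 14-digit capture timestamp and the third is the original URI; the field order in real CDX files is declared by their header line and may differ.

```python
from collections import defaultdict

def latest_captures(cdx_path):
    """Map each original URI in a CDX file to its most recent capture timestamp."""
    latest = defaultdict(str)
    with open(cdx_path) as cdx:
        for line in cdx:
            if line.startswith(" CDX"):  # header line naming the columns
                continue
            fields = line.split()
            if len(fields) < 3:
                continue
            timestamp, original_uri = fields[1], fields[2]  # assumed column order
            latest[original_uri] = max(latest[original_uri], timestamp)
    return latest

# Hypothetical usage: decide what to recrawl based on capture age.
# print(latest_captures("collection.cdx"))
```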

A short break commenced, followed by Jefferson Bailey presenting "Who, what, when, where, why, WARC: new tools at the Internet Archive". Initially apologizing for repetition of his prior day's presentation, Jefferson went into some technical details of statistics IA has generated, APIs they have to offer, and new interfaces with media queries of a variety of sorts. They have also begun to use Simhash to identify dissimilarity between related documents.

I (Mat Kelly, @machawk1) presented next with "Archive What I See Now – Personal Web Archiving with WARCs". In this presentation I described the advancements we had made to WARCreate, WAIL, and Mink with support from the National Endowment for the Humanities, which we have reported on in a few prior blog posts. This presentation served as a wrap-up of new modes added to WARCreate, the evolution of WAIL (See Lipstick or Ham then Electric WAILs and Ham), and integration of Mink (#mink #mink #mink) with local Web archives. Slides below for your viewing pleasure.

Lozana Rossenova (@LozanaRossenova) and Ilya Kreymer (@IlyaKreymer) talked next about Webrecorder, and namely about remote browsers. As they showed with a live example of viewing a web archive in a contemporary browser, technologies that are no longer supported are not replayed as expected, often not being visible at all. Their work allows a user to replicate the original experience of the browser of the day to use the technologies as they were (e.g., Flash/Java applet rendering) for a more accurate portrayal of how the page existed at the time. This is particularly important for replicating artwork that depends on these technologies to display. Ilya also described their Web Archiving Manifest (WAM) format, which allows a collection of Web archives to be used in replaying Web pages with fetches performed at the time of replay. This patching technique allows for more accurate replication of the page at a given time.

After Lozana and Ilya finished, the session broke for lunch, then reconvened with Fernando Melo (@Fernando___Melo) describing their work at the publicly available Portuguese Web Archive. He showed their work building an image search of their archive, using an API to describe Charlie Hebdo-related captures. His co-presenter João Nobre went into further details of the image search API, including the ability to parameterize the search by query string, timestamp, first-capture time, and whether it was "safe". Discussion from the audience afterward asked the pair what their basis was for a "safe" image.

Nicholas Taylor spoke about recent work with LOCKSS and WASAPI and the re-architecting of the former to open the potential for further integration with other Web archiving technologies and tools. They recently built a service for bibliographic extraction of metadata for Web harvest and file transfer content, which can then be mapped to the DOM tree. They also performed further work on an audit and repair protocol to validate the integrity of distributed copies.

Jefferson again presented to discuss IMLS-funded APIs they are developing to test transfers to their partners using WASAPI. His group ran surveys showing that 15-20% of Archive-It users download their WARCs to be stored locally. Their WASAPI Data Transfer API returns a JSON object derived from the set of WARCs transferred, including fields like pagination, count, requested URI, etc. Other fields representing an Archive-It ID, checksums, and collection information are also present. Naomi Dushay (@ndushay) then showed a video overview of their deployment procedure.

After another short break, Jack Cushman & Ilya Kreymer tag-teamed to present "Thinking like a hacker: Security Issues in Web Capture and Playback". Through a mock dialog, they discussed issues in securing Web archives and a suite of approaches challenging users to compromise a dummy archive. Ilya and Jack also iterated through various security problems that might arise in serving, storing, and accessing Web archives, including stealing cookies, frame hijacking to display a false record, banner spoofing, etc.

Following Ilya and Jack, I (@machawk1, again) and David Dias (@daviddias) presented "A Collaborative, Secure, and Private InterPlanetary WayBack Web Archiving System using IPFS". This presentation served as follow-on work from the InterPlanetary Wayback (ipwb) project Sawood (@ibnesayeed) had originally built at Archives Unleashed 1.0 and then presented at JCDL 2016, WADL 2016, and TPDL 2016. This work, in collaboration with David of Protocol Labs, who created the InterPlanetary File System (IPFS), was to display some advancements in both IPWB and IPFS. David began with an overview of IPFS, what problem it's trying to solve, its system of content addressing, and its mechanism to facilitate object permanence. I discussed, as with previous presentations, IPWB's integration of web archive (WARC) files with IPFS using an indexing and replay system that utilizes the CDXJ format. One item in David's recent work is bringing IPFS to the browser with a JavaScript port that interfaces with IPFS without the need for a running local IPFS daemon. I had recently introduced encryption and decryption of WARC content to IPWB, allowing for further permanence of archival Web data that may be sensitive in nature. To close the session, we performed a live demo of IPWB consisting of data replication of WARCs from another machine onto the presentation machine.

Following our presentation, Andy Jackson asked for feedback on the sessions and what IIPC can do to support the enthusiasm for open source and collaborative approaches. Discussions commenced among the attendees about how to optimize funding for events, with Jefferson Bailey reiterating that travel eats up a large portion of the cost for such events. Further discussions were had about why the events were not recorded and how to remodel the Hackathon events on the likes of efforts by other organizations, such as Mozilla's Global Sprints, the organization of events by the NodeJS community, and sponsoring developers for the Google Summer of Code. The audience then had further discussions on how to follow up and communicate once the day was over, including the IIPC Slack channel and the IIPC GitHub organization. With that, the second day concluded.

Day 3

By Friday, with my presentations for the trip complete, I had but one obligation left for the conference and the week (other than writing my dissertation, of course): to write the blog post you are reading. This was done while preparing for JCDL 2017 in Toronto the following week (which I attended by proxy, post coming soon). I missed out on the morning sessions, unfortunately, but joined in to catch the end of João Gomes' (@jgomespt) presentation on Arquivo.pt, also presented the prior day. I was saddened to learn that I had missed Martin Klein's (@mart1nkle1n) "Uniform Access to Raw Mementos" detailing his, Los Alamos', and ODU's recent collaborative work in extending Memento to support access to unmodified content, among other characteristics that cause a "Raw Memento" to be transformed. WS-DL's own Shawn Jones (@shawnmjones) has blogged about this on numerous occasions; see Mementos in the Raw and Take Two.

The first full session I was able to attend was Abbie Grotke's (@agrotke) presentation, "Oh my, how the archive has grown...", which detailed how the Library of Congress's Web archive has grown substantially in size with minimal growth in staff. While captivated, I came to know via the conference Twitter stream that Martin's third presentation of the day coincided with Abbie's. Sorry, Martin.

I did manage to switch rooms to see Nicholas Taylor discuss using Web archives in legal cases. He stated that in some cases, social media used by courts may only exist in Web archives and that courts now accept archival web captures as evidence. The first instance of using IA's Wayback Machine in court was in 2004, and its use has been contested many times to no avail. The Internet Archive provided affidavit guidance suggesting that parties ask the court beforehand whether it will consider archived captures as valid evidence. Nicholas alluded to FRE 201, which allows facts to be used as evidence, the basis on which the archive has been used. He also cited various cases where expert testimony on Web archives was used (Khoday v. Symantec Corp., et al.), a defamation case where the IA disclaimer led to it being dismissed as evidence (Judy Stabile v. Paul Smith Limited et al.), and others. Nicholas also cited WS-DL's own Scott Ainsworth's (@Galsondor) work on Temporal Coherence and how a composite memento may not have existed as displayed.

Following Nicholas, Anastasia Aizman and Matt Phillips (@this_phillips) presented "Instruments for Web archive comparison in Perma.cc". In their work with Harvard's Library Innovation Lab (where WS-DL's Alex Nwala was recently a Summer fellow), the Perma team has a goal of allowing users to cite things on the Web, create WARCs of those things, and then be able to organize the captures. Their initial work with the Supreme Court corpus from 1996 to the present found that 70% of the references had rotted. Anastasia asked, "How do we know when a web site has changed, and how do we know which changes are important?"

They used a variety of ways to determine significant change, including MinHash (via calculating the Jaccard coefficients), Hamming distance (via SimHash), and sequence matching using a baseline. As a sample corpus, they took over 2,000 Washington Post articles consisting of over 12,000 resources, examined the SimHash distances, and found big gaps. For MinHash, the distances appeared much closer. In their implementation, they show this to the user on Perma via their banner, which provides an option to highlight file changes between sets of documents.
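
These are standard techniques rather than anything specific to Perma.cc; a toy Python sketch of the two kinds of comparisons, Jaccard similarity over word shingles (which MinHash approximates) and Hamming distance between SimHash fingerprints, might look like the following.

```python
import hashlib

def shingles(text, k=3):
    """Return the set of k-word shingles in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}

def jaccard(a, b):
    """Exact Jaccard coefficient; MinHash is an efficient approximation of this."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def simhash(text, bits=64):
    """A basic unweighted SimHash over whitespace tokens."""
    weights = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming(x, y):
    """Number of differing bits between two fingerprints."""
    return bin(x ^ y).count("1")

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumped over a lazy dog"
print(jaccard(doc_a, doc_b), hamming(simhash(doc_a), simhash(doc_b)))
```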

There was a brief break, and then I attended a session where Peter Webster (@pj_webster) and Chris Fryer (@C_Fryer) discussed their work with the UK Parliamentary Archives. Their recent work consists of capturing the official social media feeds of the members of Parliament, critical as it captures their relationship with the public. They sought to examine the patterns of use and access by the members and determine the level of understanding of the users of their archive. "Users are hard to find and engage," they said, citing that users were largely ignorant about what web archives are. In a second study, they found that users wanted a mechanism for discovery that mapped to an internal view of how Parliament functions. Their studies found many things from web archives that users do not want, but a takeaway is that they uncovered some issues in their assumptions, and their study raised the profile of the Parliamentary Web Archives among their colleagues.

Emily Maemura and Nicholas Worby presented next with their discussion on origin studies as they relate to web archives, provenance, and trust. They examined decisions made in creating collections in Archive-It by the University of Toronto Libraries, namely the collections involving the Canadian political parties, the Toronto 2015 Pan Am Games, and their Global Summitry Archive. From these they determined that the three collections were, respectively, long running, a one-time event, and a collaboratively created archive. For the candidates' sites, they also noticed the implementation of robots.txt exclusions in a supposed attempt to prevent the sites from being archived.

Alexis Antracoli and Jackie Dooley (@minniedw) presented next about their OCLC Research Library Partnership web archive working group. Their examination determined that discoverability was the primary issue for users. One such issue was their example of Princeton using Archive-It without that fact being documented anywhere. Through their study they established use cases for libraries, archives, and researchers. In doing so, they created a data dictionary of characteristics of archives consisting of 14 data elements like Access/rights, Creator, Description, etc., with many fields having a direct mapping to Dublin Core.

With a short break, the final session then began. I attended the session where Jane Winters (@jfwinters) spoke about increasing the visibility of web archives, asking first, "Who is the audience for Web archives?" and then enumerating researchers in the arts, humanities, and social sciences. She then described various examples in the press relating to web archives, including Computer Weekly's report on the Conservatives erasing official records of speeches from IA and Dr. Anat Ben-David's work on getting the .yu TLD restored in IA.

Cynthia Joyce then discussed her work studying Hurricane Katrina's unsearchable archive. Because New Orleans was not a tech-savvy place at the time and it was pre-Twitter, Facebook was young, etc., the personal record was not what it would be were the events to happen today. In her research as a citizen, she attempted to identify themes and stories that would have been missed in mainstream media. She said, "On Archive-It, you can find the Katrina collection ranging from resistance to gratitude." She collected the information only 8-9 years later, and many of the writers never expected their work to be preserved.

For the final presentation of the conference, Colin Post (@werrthe) discussed net-based art and how to go about making such works objects of art history. Colin used Alexei Shulgin's "Homework" as an example that uses pop-ups and self-conscious elements that add to the challenge of preservation. In Natalie Bookchin's course, Alexei Shulgin encouraged artists to turn in homework for grading, also doing so himself. His assignment is dominated by popups, something we view in a different light today. "Archives do not capture the performative aspect of the piece," Colin said. He cited oldweb.today as providing interesting insights into how the page was captured over time, with multiple captures being combined. "When I view the whole piece, it is emulated and artificial; it is disintegrated and inauthentic."

Synopsis

The trip proved very valuable to my research. Not documented in this post is the time between sessions where I was able to speak with some of the presenters about their work as it related to my own, and even with those who were not presenting, to find intersections in our respective research.

Mat (@machawk1)

Friday, June 9, 2017

2017-06-09: InfoVis Spring 2016 Class Projects

I'm way behind in posting about my Spring 2016 offering of CS 725/825 Information Visualization, but better late than never. (Previous semester highlights posts: Spring 2015, Spring/Fall 2013, Fall 2012, Fall 2011)
 
Here are a few projects that I'd like to highlight. (All class projects are listed in my InfoVis Gallery.)

Expanding the WorldVis Simulation
Created by Juliette Pardue, Mridul Sen, Christos Tsolakis


This project (available at http://ws-dl.cs.odu.edu/vis/world-vis/) was an extension of the FluNet visualization, developed as a class project in 2013. The students extended the specialized tool to account for general datasets of quantitative attributes per country over time and added attributes based on continent averages. They also computed summary data for each dataset for each year, so at a glance, the user can see statistical information, including the countries with the minimum and maximum values.

This work was accepted as a poster to IEEE VIS 2016:
Juliette Pardue, Mridul Sen, Christos Tsolakis, Reid Rankin, Ayush Khandelwal and Michele C. Weigle, "WorldVis: A Visualization Tool for World Data," In Proceedings of IEEE VIS. Baltimore, MD, October 2016, poster abstract. (PDF, poster, trip report blog post)



Visualization for Navy Hearing Conservation Program (HCP)
Created by Erika Siregar (@erikaris), Hung Do (@hdo003), Srinivas Havanur


This project (available at http://www.cs.odu.edu/~hdo/InfoVis/navy/final-project/index.html) was also an extension of previous work. The first version of this visualization was built by Lulwah Alkwai.

The aim of this work is to track the hearing levels of workers in the US Navy over time through the Hearing Conservation Program (HCP). The HCP's goal is to detect and prevent noise-induced hearing loss among service members by analyzing their hearing levels over the years. The students analyzed data obtained from the audiogram dataset to produce interactive visualizations using D3.js showing the hearing curves of workers over the years.



ODU Student Demographics
Created by Ravi Majeti, Rajyalakshmi Mukkamala, Shivani Bimavarapu

This project (available at http://webspace.cs.odu.edu/~nmajeti/InfoViz/World/worldmap-template.html) concentrates on ODU international student information. It visualizes the headcount of international graduate and undergraduate students studying at ODU from each country for a particular major in a selected year and visualizes the gender ratio for undergraduate and graduate students in the university for each year. The main goal is to provide an interactive interface for prospective students to analyze the global diversity at ODU and identify whether ODU suits their expectations with respect to alumni from their respective major and country.



Visualizing Web Archives of Moderate Size
Created by John Berlin (@johnaberlin), Joel Rodriguez-Ortiz, Dan Milanko


This work (available at http://jrodgz.github.io/project/) develops a platform for understanding web archives in a multi-user setting. The students used contextual data provided during the archival process to provide a new approach toward identifying the general state of the archives. This metadata allows us to identify the most common domains, archived resources, times, and tags associated with a web collection. The designed tool outlines the most important areas of focus in web archives and gives users a clearer picture of what their collections comprise, both in specific and general terms.




-Michele

Wednesday, April 26, 2017

2017-04-26: Discovering Scholars Everywhere They Tread


Though scholars write articles and papers, they also post a lot of content on the web. Datasets, blog posts (like this one), presentations, and more are posted by scholars as part of scholarly communications. What if we could aggregate the content by scholar, instead of by web site?

Why would we want to do this? We can create stories, or collections of a scholar's work in an interface, much like Storify. We can also index this information and create a search engine that allows a user to search by scholar and find all of their work, not just their published papers, as is offered by Scopus or Web of Science, but their web-based content as well. Finally we can archive their work before the ocean of link rot washes it away.

To accomplish our goal, two main questions must be answered: (1) For a given scholar, how do we create a global scholar profile describing the scholar and constructed from multiple sources? (2) How do we locate the scholar's work on the web and use this global scholarly profile to confirm that we have found their work?

Throughout this post I attempt to determine what resources could be used by a hypothetical automated system to build our global scholar profile and then use it to discover user information on scholarly portals. I also review some scholarly portals to determine what resources they provide that can be used with the global scholar profile. Note: our hypothetical system is currently just attempting to find the websites to which scholars post their content; discovering and processing the content itself is a separate issue.

Building a global scholar profile


Abdel-Hafez and Xu provide "A Survey of User Modeling in Social Media Websites". In that paper, they describe that "modeling users will have different methods between different websites". They discuss the work that has been done on constructing a user model from different social media sites, using a rather broad definition of social media that includes blogs and collaborative portals like wikis. They discuss the problems associated with building a user profile from social media, which inspires the term global scholar profile in this post.

They also provide an overview of the "cold start problem" where insufficient starting information is available to begin using a useful user profile. Existing solutions to the cold start problem in recommender systems, such as those by Lika, Kolomvatsos, and Hadjiefthymiades rely on the use of demographic data to create user profiles, which will not be useful for identifying scholars. Instead, we can use some existing sources containing information about scholars.

The EgoSystem project, by Powell, Shankar, Rodriguez, and Van de Sompel, concerned itself with building a global scholarly profile from several sources of scholarly information. It accepts a scholar's name, the university where they earned their PhD, their fields of study, their current affiliation, their current title, and some keywords noting their field of work. Using this information, the system starts with a web search using the Yahoo BOSS search API (now defunct) with these input terms and the names of portals, such as LinkedIn, Twitter, and Slideshare. After visiting each page in the search results, the system awards points to a page for each string of data that matches. If the points reach a certain threshold, then the page is considered to be a good match for the scholar, and additional data is then acquired via a site API -- or scraped from the web page, if necessary -- and added to the system's knowledge of the scholar for future iterations. This scoring system was inspired by Northern and Nelson's work on disambiguating university students' social media profiles. EgoSystem's data is stored in a graph database for future retrieval and updating, much like the semantic network profiles discussed by Gauch, Speretta, Chandramouli, and Micarelli.
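
As an illustration of this kind of threshold-based matching, a scorer might look like the minimal sketch below. This is not EgoSystem's actual code; the weights and threshold are arbitrary values chosen for illustration.

```python
# Award points to a candidate page for each known attribute of the scholar
# that appears in it, and accept the page once a threshold is reached.
SCORE_WEIGHTS = {"name": 3, "affiliation": 2, "title": 1, "keywords": 1}
THRESHOLD = 5  # arbitrary cutoff for illustration

def score_candidate(page_text, scholar):
    """scholar is a dict, e.g. {"name": "...", "keywords": ["...", "..."]}."""
    text = page_text.lower()
    score = 0
    for field, weight in SCORE_WEIGHTS.items():
        value = scholar.get(field, "")
        values = value if isinstance(value, list) else [value]
        if any(v and v.lower() in text for v in values):
            score += weight
    return score

def is_match(page_text, scholar):
    return score_candidate(page_text, scholar) >= THRESHOLD
```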

Kramer and Boseman created the Innovations in Scholarly Communication project. As part of that project, they developed a list of 400+ Tools and Innovations in Scholarly Communication. Many of the tools on this list are scholarly portals, places where scholars post content.

Our hypothetical system must first build a global scholar profile that can be tested against content from various scholarly portals. To do so, our automated system needs data about a scholar. Many services exist which index and analyze scholars' published works from journals and conference proceedings. All of this can provide information to be used for disambiguation.

If we have access to all of this information, then we should be able to use EgoSystem's scoring method of disambiguation against scholarly portals. What if we do not yet have this information? Given just a name and an affiliation, from what sources can we construct a global scholar profile?

In the table below, I reviewed the documentation for several sources of information about scholars, based on their published works. In the access restrictions section I document which restrictions I found for each source. Included in this table are the name of the web service, which data it provides that is useful to identify a scholar, and the access restrictions of the service. I reviewed each service to determine which fields were available in the output. I did not sign up for any authentication keys, so the data useful for scholar identification comes from each service's documentation. I also only included services that allow one to query by author name.

Service Data Useful for Scholar Identification Access Restrictions
arXiv API
  • Authors and Co-authors
  • Terms from titles
  • Terms from abstracts
  • Terms from documents
  • Affiliations
  • Keywords
None
Clarivate's Web of Science API
  • Authors and Co-authors
  • Terms from titles
  • Terms from abstracts
  • Terms from documents
  • Affiliations
  • Keywords
Institution must be licensed; additional restrictions on data usage
CrossRef REST API
  • Authors and Co-authors
  • Terms from titles
  • Affiliations
  • Keywords
None
Elsevier's Scopus API
  • Authors and Co-authors
  • Terms from titles
  • Terms from abstracts
  • Terms from documents
  • Affiliations
  • Keywords
Institution must be licensed; additional restrictions on data usage
Europe PMC database
  • Authors and Co-authors
  • Terms from titles
  • Terms from abstracts
  • Terms from documents
  • Affiliations
  • Keywords
None
IEEE Xplore Search Gateway
  • Authors and Co-authors
  • Terms from titles
  • Terms from abstracts
  • Affiliations
  • Keywords
None
Microsoft Academic Knowledge API
  • Authors and Co-authors
  • Terms from titles
  • Terms from abstracts
  • Journal/Proceedings Information
  • Affiliations
  • Keywords
Free for 10,000 queries/month, otherwise $0.25 per 1,000 calls
Nature.com OpenSearch API
  • Authors and Co-authors
  • Terms from titles
  • Links to landing pages
Non-commercial use only; all downloaded content must be deleted within a 24-hour period; application requires a "Powered by nature.com" logo; requires signing up for authentication key
OCLC WorldCat Identities API
  • Authors
  • Terms from titles
Non-commercial use only
ORCID API
  • ORCID
  • Other Identifiers
  • Authors and Co-authors
  • Terms from titles
  • Journal/Proceedings Information
  • Links to landing pages
  • Employment
  • Education
  • Links to additional websites
  • Keywords
  • Biography
None
PLOS API
  • Authors and Co-authors
  • Terms from titles
  • Terms from abstracts
  • Terms from documents
  • Affiliations
  • Keywords
Rate limited to 10 requests per minute; data must be attributed to PLOS; requires signing up for authentication key
Springer API Service
  • Terms from Titles
  • Journal/Proceedings Information
  • Keywords
  • Links to landing pages
Requires signing up for authentication key

Some of these services are not free. Microsoft Academic Search API, Elsevier's Scopus, and Web of Science all provide information about scholars and their works, but with limitations and often for a fee. Microsoft Academic Search API has become Microsoft Academic Knowledge API and now limits the user to 10,000 calls per month unless they pay. Scopus API is free of charge, but "full API access is only granted to clients that run within the networks of organizations" with a Scopus subscription. Clarivate's Web of Science API provides access with similar restrictions, "using your institution's subscription entitlements".

There are also restrictions on how a system is permitted to use the data from Web of Science, including which fields can be displayed to the public. Scopus has similar restrictions on text and data mining, which may affect our system's ability to use these sources at all. Furthermore, the Nature.com OpenSearch API requires that any data acquired is refreshed or deleted within a 24-hour period, also making it unlikely to be useful to our system because the data cannot be retained.

Some organizations, such as PubMed Central, offer an OAI-PMH interface that can be used to harvest metadata. Our system can harvest this metadata and provide its own searches. Similarly, other organizations, such as the Hathi Trust Digital Library, offer downloadable datasets of their holdings. Data from API queries is more desirable because it will be more current than data obtained via datasets.
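
As a sketch of what such harvesting could look like, the snippet below walks an OAI-PMH ListRecords response and follows resumption tokens. The endpoint URL is a placeholder, and a real harvester would also parse the Dublin Core metadata inside each record.

```python
import requests
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "https://repository.example.org/oai"  # hypothetical OAI-PMH endpoint

def harvest(base_url, metadata_prefix="oai_dc"):
    """Yield <record> elements, following resumption tokens until exhausted."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        resp = requests.get(base_url, params=params, timeout=30)
        resp.raise_for_status()
        root = ET.fromstring(resp.content)
        for record in root.iter(OAI_NS + "record"):
            yield record
        token = root.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        # Subsequent requests carry only the resumption token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

for i, record in enumerate(harvest(BASE_URL)):
    if i >= 5:
        break
    print(ET.tostring(record, encoding="unicode")[:200])
```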

Not all of these sources are equally reliable for discovering information about scholars. For example, a recent study by Klein and Van de Sompel indicates that, in spite of the information scholars can provide about themselves on ORCID, many do not fill in data that would be useful for identification.

Because the global scholar profile is supposed to be the known good information for future disambiguation, the data gathered for the global scholar profile at this stage may need to be reviewed by a human before we trust it. For example, the screenshot below is from Scopus, and shows multiple entries for Herbert Van de Sompel which refer to the same person.

Scopus has multiple entries for "Herbert Van de Sompel".

Discovering Where Scholars Post Their Work


Once we have a global scholar profile for a scholar, we can search for their content on known scholarly portals. Several methods exist to discover hints as to which scholarly portals contain a scholar's content.

Homepages


If we know a scholar's homepage, it might be another potential source of links to additional content produced by that scholar. I decided to see if scholars acted this way. In August-September of 2016, I used the Microsoft Academic API to find the homepages of the top 99 researchers from each of 13 different knowledge domains. These 1287 scholarly records broke down as shown in the table below, leaving 723 homepages with a 200 status. For those 723 homepages, I downloaded each page and extracted its links.


Total Records: 1287
Records without a homepage: 133
Homepages resulting in soft-404s: 369
Homepages with connection errors: 61
Homepages with too many redirects: 1
Homepages with a 200 status: 723
Homepages containing one or more URIs from the list of scholarly tools: 204

Each link was then compared with the domain names of the tools listed in Kramer and Boseman's 400+ Tools and Innovations. Out of 723 homepages, 204 (28.2%) contained one or more URIs matching a tool from that list. This indicates that homepages could be used as a source of additional sites that may contain the work of the scholar in question.
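
The check itself is straightforward; a rough Python sketch of it might look like the following. The homepage URL and the portal domain list below are placeholders, not the actual list derived from the 400+ Tools spreadsheet.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

# Placeholder stand-ins for the real list of scholarly-portal domains.
PORTAL_DOMAINS = {"figshare.com", "slideshare.net", "github.com", "zenodo.org"}

def portal_links(homepage_url):
    """Return the portal domains linked from a scholar's homepage."""
    resp = requests.get(homepage_url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    found = set()
    for anchor in soup.find_all("a", href=True):
        host = urlparse(anchor["href"]).netloc.lower()
        for domain in PORTAL_DOMAINS:
            # Loose match so www.github.com still counts as github.com.
            if host == domain or host.endswith("." + domain):
                found.add(domain)
    return found

print(portal_links("https://www.example.edu/~scholar/"))  # hypothetical homepage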

Now that the Microsoft Academic API has changed its terms, alternative means of finding homepages will be useful. Fang, Si, and Mathur tested several methods of detecting faculty homepages in web search engine results. Their study provides some direction on locating a scholar's work on the web. They used the Yahoo BOSS API to acquire search results. These search results were then evaluated for accuracy using site-specific heuristics, logistic regression, SVM, and a joint prediction model. They discovered that the joint prediction model outperformed the other methods.

Social Media Profiles


In addition to scholarly databases, social media profiles may offer additional sources for us to find information about scholars. The social graph in services like Twitter and Facebook provides additional dimensions that can be analyzed.

For example, if we know an institution's Twitter account, how likely is it that a scholar follows this account? If we cannot find a scholar's Twitter account using their institution's Twitter account, can we discover them using link prediction techniques like Schall's triadic closeness?
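
As a much simpler stand-in than Schall's triadic closeness, the idea can be illustrated by ranking candidate accounts by the followers they share with the institution's account. The handles and follower sets below are made up, and real follower lists would come from the Twitter API.

```python
def common_neighbor_score(candidate_followers, institution_followers):
    """Count followers shared between a candidate account and the institution."""
    return len(candidate_followers & institution_followers)

institution_followers = {"alice", "bob", "carol", "dan"}
candidates = {
    "@scholar_a": {"alice", "bob", "eve"},
    "@random_user": {"frank"},
}
ranked = sorted(candidates.items(),
                key=lambda kv: common_neighbor_score(kv[1], institution_followers),
                reverse=True)
print(ranked)  # @scholar_a ranks first with two shared followers
```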

In addition, there is ample work in discovering researchers on Twitter. For example, Hadgu and Jäschke used Twitter to determine the relationships between computer scientists, reviewing several machine learning algorithms to discover demographic information, topics, and the most influential computer scientists. Instead of looking at the institution's Twitter account as a base for finding computer scientists, they used the Twitter accounts of scientific conferences. Perhaps our hypothetical system can use conference information from a scholar's publication list in this way.

It is also possible that a scholar's social media posts contain links to websites where they post their data. We can use their social media feeds to discover links to scholarly portals and then disambiguate them.

Querying Portals Directly


On top of these hints, we can query the portals directly using their native capabilities. Unfortunately, the same capabilities for finding scholars are not available at all portals. To discover these capabilities, I started with Kramer and Boseman's 400+ Tools and Innovations in Scholarly Communication. I sorted the list by number of Twitter followers as a proxy for popularity. I then filtered the list for those tools categorized as Publication, Outreach, or Assessment. Finally, I selected the first 36 non-journal portals for which I could find scholarly output hosted on the portal. I then reviewed the different ways of discovering scholars on these sites.

The table below contains the list of portals used in my review. In order to describe the nature of each site, I have classified them according to the categories used in Kaplan and Haenlein's "Users of the world, unite! The challenges and opportunities of Social Media". The categories used in the table below are:
  1. Social Networking applies to portals that allow users to create connections to other users, usually via a "friend" network or by "following". Examples: MethodSpace, Twitter
  2. Blogs encompasses everything from blog posts to magazine articles to forums. Examples: HASTAC, PubPeer, The Conversation
  3. Content Communities involves portals where users share media, such as datasets, videos, and documents, including preprints. Examples: Figshare, BioRxiv, Slideshare, JoVe
  4. Collaborative Works is reserved for portals where users collaboratively change a single product, like a wiki page. Examples: Wikipedia


Portal Kaplan and Haenlein Social Media Classification
Academic Room Blogs, Content Communities
AskforEvidence Blogs
Benchfly Content Communities
BioRxiv Content Communities
BitBucket Content Communities
Dataverse Content Communities
Dryad Content Communities
ExternalDiffusion Blogs
Figshare Content Communities
GitHub Content Communities, Social Networking
GitLab.com Content Communities
Global biodiversity information facility Data Content Communities
HASTAC Blogs
Hypotheses Blogs
JoVe Content Communities
JSTOR daily Blogs
Kaggle Datasets Content Communities
Methodspace Social Networking, Blogs
Nautilus Blogs
Omeka.net Content Communities
Open Science Framework Content Communities
PubMed Commons Blogs
PubPeer Blogs
ScienceBlogs Blogs
Scientopia Blogs
SciLogs Blogs
Silk Content Communities
SlideShare Content Communities
SocialScienceSpace Blogs
SSRN Blogs
Story Collider Content Communities
The Conversation Blogs
The Open Notebook Blogs
United Academics Blogs
Wikipedia (& Wikimedia Commons) Collaborative Works
Zenodo Content Communities

Local Portal Search Engines

I wanted to know if I could find a scholar, by name, in this set of portals using local portal search engines. If such services are present on each portal, then our automated system could submit a scholar's name to the search engine and then scrape the results.

I reviewed whether or not the portal contained profile pages for its users. Profile pages are special web pages that contain user information that can be used to identify the scholar. Contained within a profile page might be the additional information necessary to identify that it belongs to the scholar we are interested in. This is important because a profile page provides a single resource where we might be able to verify that the user has an account on the portal. Without it, our system would need to go through the actual contributions to each portal.

For our 36 portals, 24 contained profile pages. This indicates that 24 portals associate some concept of identity with their content. With the exception of Academic Room, profiles in the portals also provide links to the scholar's contributions to the portal.

This screenshot shows a common case of a search engine that provides profiles in its search results. I have outlined one of the links to a profile in a red box and shown a separate screenshot of the linked profile.
Next, I reviewed each portal to discover if its local search engine, if present, provided profiles as search results. For 13 portals, the local search engine provided profile pages in their search results. This means that I was able to type a scholar's name into the portal's search bar and find their profile page directly linked from the result. In this case, an automated system would only need to scrape the search results pages to find the profile pages. Once the profile pages are acquired, the system can then compare them against what we know about the scholar to determine if the scholar has an identity on that portal. In some cases, a scraper can use pattern matching to eliminate the non-profile URIs from the list of results.
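
A sketch of that scrape-and-filter step appears below; the search URL and the profile-path pattern are hypothetical, and each portal would need its own values.

```python
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

SEARCH_URL = "https://portal.example.org/search"           # hypothetical portal search
PROFILE_PATTERN = re.compile(r"^/(users|authors)/[^/]+$")  # hypothetical profile paths

def candidate_profiles(scholar_name):
    """Return profile URIs linked from a portal's search results page."""
    resp = requests.get(SEARCH_URL, params={"q": scholar_name}, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    profiles = set()
    for anchor in soup.find_all("a", href=True):
        if PROFILE_PATTERN.match(anchor["href"]):
            profiles.add(urljoin(SEARCH_URL, anchor["href"]))
    return profiles

# Each candidate profile would then be scored against the global scholar profile.
```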

An example of a search engine providing profiles in its results is shown above with Figshare. In this case, searching for "Ana Maria Aguilera-Luque" on Figshare leads to a list of landing pages for uploaded content. Content on Figshare is associated with a user, and a clickable link to that user's profile shows up in the search results under the name of the uploaded content.
This screenshot shows an example of a search engine that does not provide profiles in its search results, even though the portal has profiles. The screenshots are of the search results, following the link to the document, and then following the link from the document to the profile page. Each followed link is outlined in a red box.
Unfortunately, this is not the case for all results. For 4 portals, the profile page is only available if one clicks on a search result link, and then clicks on the profile link from that search result. This increases the complexity of our automated system because now it must crawl through more pages before finding a candidate set of profiles to review.

The figure above shows an example of this case, where searching for "Chiara Civardi" on the magazine web site UA Magazine leads one to a list of articles. Each article contains a link to the profile of its author, thus allowing one to reach the scholar's profile.

This screenshot shows an example of a site that does not provide user profiles at all, but does provide search results if a scholar's name shows up in a document.

For 9 portals, the search results are the only source of information we have for a given scholar on the portal. Because the search results may be based on the search terms in the scholar's name, our automated system must crawl through some subset, possibly all, of the results to determine if the scholar has content on the given portal.

The figure above shows a search for "Heather Cucolo" on the audio site "The Story Collider", which leads a user to the list of documents containing that string. Our automated system would need to review the content of the linked pages to determine if the Heather Cucolo we were searching for had content posted on this site.

And for 10 portals, the local search engine was unsuccessful or did not exist. In these cases I had to resort to using a web search engine -- I used Google -- to find a profile page or content belonging to the scholar. I did so using the site search operator and the name of the scholar.

The table below shows the results of my attempt to manually find a scholar's work on each of the 36 portals.


Portal | Profiles Exist? | How did I find portal content based on actual scholar's name? | How did I get from local search results to profile page?
Academic Room | Yes | Web Search |
AskforEvidence | No | Local Search | No profile, only search results
Benchfly | Yes | Web Search |
BioRxiv | No | Local Search | No profile, only search results
BitBucket | Yes | Web Search |
Dataverse | No | Local Search | No profile, only search results
Dryad | No | Local Search | No profile, only search results
ExternalDiffusion | No | Local Search | No profile, only search results
Figshare | Yes | Local Search | Profile pages in results
GitHub | Yes | Local Search w/ Special Settings | Profile pages in results, if correct search used
GitLab.com | Yes | Web Search |
Global biodiversity information facility Data | Yes | Local Search | Click on result, Profile linked from result page
HASTAC | Yes | Local Search | Profile pages in results
Hypotheses | Yes | Web Search |
JoVe | Yes | Local Search | Profile pages in results
JSTOR daily | Yes | Local Search | Click on result, Profile linked from result page
Kaggle Datasets | Yes | Local Search | Profile pages in results
Methodspace | Yes | Local Search | Profile pages in results
Nautilus 3 sentence science | No | Local Search | No profile, only search results
Omeka.net | No | Web Search |
Open Science Framework | Yes | Local Search | Profile pages in results
PubMed Commons | No | Web Search |
PubPeer | No | Local Search | No profile, only search results
ScienceBlogs | Yes | Local Search | Profile pages in results
Scientopia | Yes | Local Search | Profile pages in results
SciLogs | Yes | Web Search |
Silk | No | Web Search |
SlideShare | Yes | Local Search | Click on result, Profile linked from result page
SocialScienceSpace | Yes | Local Search | Profile pages in results
SSRN | Yes | Local Search | Profile pages in results
Story Collider | No | Local Search | No profile, only search results
The Conversation | Yes | Local Search | Profile pages in results
The Open Notebook | Yes | Web Search |
United Academics | Yes | Local Search | Click on result, Profile linked from result page
Wikipedia (& Wikimedia Commons) | Yes | Local Search w/ Special Settings | Profile pages in results, if correct search used
Zenodo | No | Local Search | No profile, only search results


Portal Web APIs

The result pages of local search engines must be scraped. A web API might provide structured data that can be used to effectively find the work of the scholar.

To search for web APIs for each portal, I used the following method:
  1. Look for the terms "developers", "API", "FAQ" on the main page of each portal. If present, follow those links to determine if the resulting resource contained further information on an API.
  2. Use the local search engine to search for these terms
  3. Use Google search with the following queries
    1. site:<hostname> rest api
    2. site:<hostname> soap api
    3. site:<hostname> api
    4. site:<hostname> developer
    5. <hostname> api
    6. <hostname> developer

Using this method, I could only find evidence of web APIs for 14 of the 36 portals. PubPeer's FAQ states that they have an API, but they request that API users contact them for more information, and I could not find their documentation online. I included PubPeer in this count, but was unable to review its documentation.

By reviewing the public API documentation, I was able to confirm that a search for scholars by name on 5 of the portals allowed one to match names to strings in multiple API fields. For example, the Dataverse API allows one to search for a string in multiple fields. The example response in the documentation is for the search term "finch", which does return a result containing an author name of "Finch, Fiona".
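
Based on my reading of the Dataverse documentation, such a query could look like the sketch below; the demo server and the exact response fields are assumptions and may differ across Dataverse versions.

```python
import requests

DATAVERSE = "https://demo.dataverse.org"  # assumed demo instance

def search_dataverse(name):
    """Search a Dataverse installation for items whose metadata mentions a name."""
    resp = requests.get(f"{DATAVERSE}/api/search",
                        params={"q": f'"{name}"', "per_page": 20},
                        timeout=30)
    resp.raise_for_status()
    return resp.json().get("data", {}).get("items", [])

for item in search_dataverse("Finch, Fiona"):
    print(item.get("type"), item.get("name"))
```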

Like most software, some of these APIs were continuing to add functionality. For example, the current version of Zenodo's REST API allows users to deposit data and metadata. The beta version of this API provides the ability to "search published records", but this functionality is not yet documented. This functionality is expected to be available "in the autumn". Zenodo also provides an OAI-PMH interface, so a system could conceivably harvest metadata about all Zenodo records and perform its own searches for scholars.

Other APIs did not provide the ability to search for users based on identity. Much like its local search engine, BitBucket's API requires that one know the ID of the user before querying, which does not help us find scholars on their site. Omeka.net has an API, but Omeka.net contains many sites running the Omeka software, and the users of these sites do not necessarily enable their API. Regardless, Omeka's API documentation states that "users cannot be browsed". I was uncertain if this applied to search queries as well, but found no evidence in the documentation that they supported searching for users, even as keywords.

Below are the results of my review of all 36 portals. It is possible that some of the portals marked "No" actually contain an API, but I was unable to find its documentation or evidence of it using the method above.


Portal Evidence of API Found
Academic Room No
AskforEvidence No
Benchfly No
BioRxiv No
BitBucket Yes
Dataverse Yes
Dryad Yes
ExternalDiffusion No
Figshare Yes
GitHub Yes
GitLab.com Yes
Global biodiversity information facility Data Yes
HASTAC No
Hypotheses No
JoVe No
JSTOR daily No
Kaggle Datasets No
Methodspace No
Nautilus 3 sentence science No
Omeka.net Yes
Open Science Framework Yes
PubMed Commons No
PubPeer Yes
ScienceBlogs No
Scientopia No
SciLogs No
Silk Yes
SlideShare Yes
SocialScienceSpace No
SSRN No
Story Collider No
The Conversation No
The Open Notebook No
United Academics No
Wikipedia (& Wikimedia Commons) Yes
Zenodo Yes

Web Search Engines


If local portal search engines and web APIs are ineffective, we can use web search engines, much like EgoSystem and Yi Fang's work. As noted above, I did need to use web search engines to find profiles for some users when the local portal search engine was either unsuccessful or nonexistent. Depending on the effectiveness of these site-specific services, web search engines may also be useful in lieu of the local search engine or API.

The table below shows four popular search engines, what data is available via their API, and what restrictions any system will encounter with each. As noted before, the Yahoo BOSS API no longer exists, but it is included because Yahoo! is a well-known search engine. DuckDuckGo's Instant Answers API does not provide full search results due to digital rights issues; it focuses on topics, categories, and disambiguation, so "most deep queries (non topic names) will be blank". This leaves Bing and Google as the offerings that may help us, but they have restrictions on the number of times they can be accessed before rate limiting occurs.

Search Engine   Restrictions
Bing            Free for 1K calls per month, for up to 3 months
DuckDuckGo      Rate limited
Google          100 queries per day for free; $5 per 1,000 queries, up to 10K queries per day
Yahoo!          BOSS API defunct as of March 31, 2016

Queries would likely take a form like that used with EgoSystem, e.g., "LinkedIn+Marko+Rodriguez".

Because web search engines can return a large number of results, our hypothetical system would need to have limits on the number of results that it reviews. It would also need to determine the best queries to use for generating results for a given portal.
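A hedged sketch of such a query against the Google Custom Search JSON API appears below; the API key, custom search engine ID, and the choice of ten results per portal are placeholders, not values our system has settled on.

```python
import requests

# Sketch: query the Google Custom Search JSON API for a scholar on a
# given portal, in the spirit of EgoSystem's "LinkedIn+Marko+Rodriguez"
# queries. API_KEY and CSE_ID are placeholders for real credentials.
API_KEY = "YOUR_API_KEY"
CSE_ID = "YOUR_SEARCH_ENGINE_ID"

def search_scholar(portal, scholar_name, limit=10):
    params = {
        "key": API_KEY,
        "cx": CSE_ID,
        "q": "{} {}".format(portal, scholar_name),
        "num": limit,  # cap on the number of results reviewed per portal
    }
    response = requests.get("https://www.googleapis.com/customsearch/v1",
                            params=params)
    response.raise_for_status()
    return [item["link"] for item in response.json().get("items", [])]

print(search_scholar("figshare.com", "Marko Rodriguez"))
```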

Crawling the Portal and Building Our Own Search Engine

If using web search is cost prohibitive or ineffective, we can potentially crawl the sites ourselves and produce our own search engine.

I evaluated each portal to determine whether the website serves a robots.txt file from its root directory in compliance with the Robots Exclusion Protocol. Using this file, a portal indicates to a search engine which URI paths it does not wish to have crawled via the "Disallow" directive. Because a Disallow rule applies only to certain paths or even to certain crawlers, it may not apply to our hypothetical system. I discovered that 29 out of 36 portals have a robots.txt.
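Python's standard library can apply these rules for us. The sketch below is a minimal example; the portal URL and the "ScholarFootprintBot" user agent are invented for illustration.

```python
from urllib import robotparser

# Sketch: check whether a portal's robots.txt allows our (hypothetical)
# crawler to fetch a given path. "ScholarFootprintBot" is a made-up name.
parser = robotparser.RobotFileParser()
parser.set_url("https://figshare.com/robots.txt")
parser.read()

print(parser.can_fetch("ScholarFootprintBot", "https://figshare.com/search"))
```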

Portals may also have a sitemap exposing information about which URIs are available to a crawler. A link to a sitemap can be listed in the robots.txt, but sitemaps may also be located at other paths on the portal. For example, http://www.example.com/path1/sitemap.xml is a sitemap that applies to the path /path1/ and will not contain information for URIs beginning with http://www.example.com/path2. I only examined whether sitemaps were listed in the robots.txt or existed at the root directory of each portal. I discovered that 11 portals listed a sitemap in their robots.txt and 12 portals had a sitemap.xml or sitemap.xml.gz in their root directory.
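The sketch below replicates these two checks for a single portal: it looks for Sitemap: lines in robots.txt and probes for a sitemap at the root of the site. The portal URL is only an example.

```python
import requests
from urllib.parse import urljoin

# Sketch: find sitemaps for one portal by (1) reading Sitemap: lines from
# robots.txt and (2) probing for sitemap.xml or sitemap.xml.gz at the root.
def find_sitemaps(root_url):
    found = []
    robots = requests.get(urljoin(root_url, "/robots.txt"))
    if robots.status_code == 200:
        found += [line.split(":", 1)[1].strip()
                  for line in robots.text.splitlines()
                  if line.lower().startswith("sitemap:")]
    for name in ("/sitemap.xml", "/sitemap.xml.gz"):
        probe = requests.head(urljoin(root_url, name), allow_redirects=True)
        if probe.status_code == 200:
            found.append(urljoin(root_url, name))
    return found

print(find_sitemaps("https://www.biorxiv.org"))
```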

The results of my review of these portals are shown below.


Portal Robots.txt present Sitemap in robots.txt Sitemap in root level directory
Academic Room Yes Yes
AskforEvidence Yes
Benchfly Yes Yes Yes
BioRxiv Yes Yes Yes
BitBucket Yes
Dataverse
Dryad Yes Yes
ExternalDiffusion
Figshare
GitHub Yes
GitLab.com Yes
Global biodiversity information facility Data Yes
HASTAC Yes Yes
Hypotheses Yes Yes
JoVe Yes Yes
JSTOR daily Yes
Kaggle Datasets
Methodspace Yes Yes Yes
Nautilus 3 sentence science Yes Yes Yes
Omeka.net Yes
Open Science Framework Yes
PubMed Commons Yes Yes
PubPeer Yes
ScienceBlogs Yes Yes
Scientopia Yes Yes Yes
SciLogs Yes
Silk Yes
SlideShare Yes Yes
SocialScienceSpace Yes
SSRN Yes
Story Collider Yes Yes Yes
The Conversation Yes Yes Yes
The Open Notebook Yes
United Academics
Wikipedia (& Wikimedia Commons) Yes
Zenodo

It is likely that portals with such technology in place will already be well indexed by search engines.

Next Steps


In searching for information sources to feed our hypothetical system, I discovered several sources of information that can serve as a scholarly footprint. I evaluated the documentation for these systems, with an eye toward what information they provide, but a more extensive evaluation of many of these systems is needed. Other portals, such as ResearchGate and Academia.edu, were not evaluated, but may be useful data sources as well. How often do scholars put useful data in their Twitter, Facebook, or other social media profiles? Also, what can be done to remove human review from the process of generating and verifying a global scholar profile?

Some portals offer multiple options for determining whether a scholar has posted work there. Many have local portal search engines that we can use, but I anecdotally noticed that some local search engines return more precise results than others. Within the context of finding the work of a given scholar, a review of the precision and recall of the search engines on these portals would help determine whether a web search engine is a better choice than the local search engine for a given portal.

Open access journals, such as PLOS, require that authors post their data online, at sites such as Figshare and the Open Science Framework. If we know that a scholar published in an open access journal, can we search one of their journal articles for links to these datasets and thus find the scholar's profile pages on sites such as Figshare and the Open Science Framework?
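One simple, if naive, approach is sketched below: fetch the HTML of an open access article and collect any links that point at known data-sharing portals. The list of portal domains and the use of a regular expression are simplifying assumptions; a production system would more likely parse the article's data availability statement or use the journal's API.

```python
import re
import requests

# Sketch: pull the HTML of an open access article and collect any links
# pointing at known data-sharing portals. The domain list is illustrative.
DATA_PORTALS = ("figshare.com", "osf.io", "zenodo.org", "datadryad.org")

def dataset_links(article_url):
    html = requests.get(article_url).text
    links = re.findall(r'href="([^"]+)"', html)
    return [link for link in links
            if any(portal in link for portal in DATA_PORTALS)]

# Example usage with a placeholder article URL:
# print(dataset_links("https://example.org/article.html"))
```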

Much like the APIs used for the global scholar profile, the APIs for each portal will need to be evaluated for precision and usefulness to our system. Some provide the ability to search for users, but others only provide the ability to find the work of a user if the scholar's user ID is already known.

Preliminary work using web search engines shows promise, but may also require a study to determine how to most effectively build queries that discover scholars on given portals. Such a study would also need to determine the ideal number of search engine results to review before our system stops trying to find the scholar on a portal using this method.

I evaluated 36 portals to determine whether they contained robots.txt files and sitemaps to help search engines crawl them. If a portal has these items in place, does it rank well for our search engine queries when we try to find a scholar by name and portal? How many of the portals lacking these items rank poorly in web search engines?

Ahmed, Low, Aly, and Josifovski studied the dynamic nature of user profiles, modeling how user interests change over time. Gueye, Abdessalem, and Naacke attempted to account for these changes when building recommendation systems. With this evidence that user information changes, how often does information about a scholar change? Scholars publish new work and upload new data. In some cases, such as Figshare, a scholar may post a dataset and never return, but other sites, like United Academics, may feature frequent updates. How often should our hypothetical system refresh its information about scholars?

Our case of disambiguating scholars is a subset of the larger problem of entity resolution: are the records for two items referring to the same item? Lise Getoor and Ashwin Machanavajjhala provide an excellent tutorial on the larger problem of entity resolution. They note that the problem has become even more important to solve now that the web provides a wealth of information from heterogeneous sources. Their summary mentions that different matching techniques work better for certain fields; for example, similarity measures like Jaccard and cosine similarity work well for text and keywords, but not necessarily for names, where Jaro-Winkler performs better. In addition, they cover the use of machine learning and crowdsourcing as ways to augment the simple matching of field contents to one another. Which parts of the global scholar profile are useful for disambiguation/entity resolution? In addition to the scholar, will other entities in the profile need to be resolved as well? What matching techniques are most accurate for each part of the global scholar profile?
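To make that distinction concrete, the from-scratch sketch below computes Jaccard similarity over token sets and Jaro-Winkler similarity over strings; the example names and keywords are invented, and a real system would likely rely on a vetted string-matching library.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity between two token sets."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def jaro(s1, s2):
    """Standard Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(i + window + 1, len2)):
            if not matched2[j] and s2[j] == ch:
                matched1[i], matched2[j] = True, True
                matches += 1
                break
    if not matches:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions /= 2
    return (matches / len1 + matches / len2 +
            (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, scale=0.1):
    """Jaro-Winkler: boost the Jaro score for a shared prefix (up to 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * scale * (1 - j)

# Keyword sets: Jaccard captures the overlap between two (invented) profiles.
print(jaccard({"web", "archiving", "memento"}, {"memento", "web", "preservation"}))  # 0.5
# Name variants: token-level Jaccard treats "shaun" and "shawn" as unrelated,
# while Jaro-Winkler credits the near match and the shared prefix.
print(jaccard("shaun m. jones".split(), "shawn m. jones".split()))   # 0.5
print(jaro_winkler("shaun m. jones", "shawn m. jones"))              # ~0.97
```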

Buccafurri, Lax, Nocera, and Ursino attempted to solve the problem of connecting users across social networks, a concept they referred to as "social internetworking". Recognizing that detecting the same account on different networks is related to the concept of link prediction, they offer an algorithm that takes into account the similarity between user names and the similarity of common neighbors. How many scholarly portals can make use of social graph information?
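As a rough illustration of that idea (and not the authors' actual algorithm), a score could mix user name similarity with the overlap between the two accounts' neighbors, as in the sketch below; the weighting and the example accounts are invented.

```python
from difflib import SequenceMatcher

# Rough illustration only: score how likely two accounts on different
# networks belong to the same scholar by mixing user name similarity
# with the overlap among their neighbors' names.
def same_scholar_score(name_a, neighbors_a, name_b, neighbors_b, alpha=0.6):
    name_sim = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    union = set(neighbors_a) | set(neighbors_b)
    neighbor_sim = len(set(neighbors_a) & set(neighbors_b)) / len(union) if union else 0.0
    return alpha * name_sim + (1 - alpha) * neighbor_sim

# Invented example accounts and neighbor lists:
print(same_scholar_score("sjones", {"mkelly", "snelson", "mweigle"},
                         "shawn.jones", {"mkelly", "mweigle", "jberlin"}))
```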

Lops, Gemmis, Semeraro, Musto, Narducci, and Bux built profiles for use in recommender systems, focusing on two problems. The first is polysemy, where one term can have multiple meanings. The second is synonymy, where many terms have the same meaning. Their work focused on associating tags with users and constructing recommendations based on the terms encountered. For our system, polysemy will need to be investigated because the same term can mean different things in different disciplines (e.g., "port" means something different to a computer hardware engineer than to a network engineer). Our system may become even more confused if presented with terms from interdisciplinary scholars. However, unlike recommender systems, our system will use more than just terms for disambiguation, relying on other, less ambiguous data like affiliations and email addresses. Are polysemy and synonymy, then, issues that our system needs to resolve? For which parts of the global scholar profile (i.e., fields) do they matter?

For the fields we have chosen to identify scholars, which matching techniques are most accurate? As noted before, some algorithms work better for names and user IDs than for keywords. What algorithms, including machine learning, network analysis, and probabilistic soft logic, might best match some of our fields? Do the answers vary between scholarly portals?

Not all scholars may want to have their online work discovered in this way. What techniques can be employed to allow some scholars to opt-out?

Summary


Searching for scholars is slightly easier than searching for other online identities because, by their very nature, scholars produce content that can be useful for disambiguation. In this article, I reviewed sources of data that can be used by a hypothetical automated system seeking to identify the websites to which scholars have posted content. I provided a listing of different services that might help one build a global scholar profile that can be further used to disambiguate scholars online.

In order to discover where scholars post items online, I looked at scholar homepages, social media profiles, and the services offered by the portals themselves. After sampling homepages from the Microsoft Academic Search API, I found that 28.2% of the homepages contained links to websites on Kramer and Bosman's list of 400+ Tools and innovations.

I reviewed the capabilities of 38 scholarly portals themselves, discovering that, in some cases, local portal search engines can be used to locate content posted by a scholar. Any automated tool would need to scrape these results, and reaching a scholar's information on a site falls into one of three patterns. As an alternative to scraping, I also discovered APIs for a number of portals, but could confirm the ability to search for scholars on only 5 of them.

To augment or replace the search capabilities of scholarly portals, I examined the capabilities of web search engine APIs and noted the cost associated with each. Because a few existing research projects looking for scholarly information (e.g., EgoSystem) make use of search engine APIs, I wanted to show that this is still a viable option.

So, sources do exist for building the global scholar profile, and methods exist at known scholarly portals to find the works of scholars at each portal. Evaluating solutions for disambiguation will be the next key step toward finding their footprints in the portals in which they tread.

--Shawn