Rich Citations: Open Data about the Network of Research

Why are citations just binary links? There’s a huge difference between the article you cite once in the introduction alongside 15 others, and the dataset that you cite eight times in the methods and results sections, and once more in the conclusions for good measure. Yet both appear in the list of references as a single chunk of undifferentiated plain text, and they’re indistinguishable in citation databases, nearly all of which are behind paywalls. So literature searches are needlessly difficult, and maps of that literature are incomplete.

To address this problem, we need a better form of academic reference. We need citations that carry detailed information about the citing paper, the cited object, and the relationship between the two. And these citations need to be in a format that both humans and computers can read, available under an open license for anyone to use.

This is exactly what we’ve done here at PLOS. We’ve developed an enriched format for citations, called, appropriately enough, rich citations. Rich citations carry a host of information about the citing and cited entities (A and B, respectively), including:

  • Bibliographic information about A and B, including the full list of authors, titles, dates of publication, journal and publisher information, and unique identifiers (e.g. DOIs) for both;
  • The sections and locations in A where a citation to B appears;
  • The license under which B appears;
  • The CrossMark status of B (updated, retracted, etc.);
  • How many times B is cited within A, and the context in which it is cited;
  • Whether A and B share any authors (self-citation);
  • Any additional works cited by A at the same location as B (i.e. citation groupings);
  • The data types of A and B (e.g. journal article, book, code).
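
To make the list above concrete, here is a minimal Python sketch of how one such record might be modelled. The class and field names (Work, Mention, RichCitation and so on) are illustrative assumptions, not the actual schema returned by the rich citations API, and the DOIs and author names are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Work:
    """Bibliographic details for a citing or cited work (hypothetical fields)."""
    doi: str
    title: str
    authors: List[str]
    work_type: str = "journal article"      # e.g. "book", "code"
    license: Optional[str] = None
    crossmark_status: Optional[str] = None  # e.g. "updated", "retracted"


@dataclass
class Mention:
    """One location in the citing paper where the cited work appears."""
    section: str                            # e.g. "Methods"
    context: str                            # text surrounding the citation
    grouped_with: List[str] = field(default_factory=list)  # DOIs cited at the same spot


@dataclass
class RichCitation:
    citing: Work                            # "A" in the list above
    cited: Work                             # "B" in the list above
    mentions: List[Mention] = field(default_factory=list)

    @property
    def mention_count(self) -> int:
        """How many times B is cited within A."""
        return len(self.mentions)

    @property
    def is_self_citation(self) -> bool:
        """Naive check: exact match on author name strings only."""
        return bool(set(self.citing.authors) & set(self.cited.authors))


# Placeholder data, invented purely for illustration.
paper = Work(doi="10.0000/example.a", title="Citing paper",
             authors=["A. Author", "B. Writer"])
dataset = Work(doi="10.0000/example.b", title="Cited dataset",
               authors=["B. Writer"], work_type="dataset")
citation = RichCitation(
    citing=paper,
    cited=dataset,
    mentions=[Mention("Methods", "We used the data from [3]."),
              Mention("Results", "As in [3], we observe an increase.")],
)
print(citation.mention_count, citation.is_self_citation)
```

Keeping each mention as its own object is one way to preserve the section and context of every in-text citation, rather than collapsing them into a single reference entry. The self-citation check here compares raw name strings, which is only a stand-in for proper author matching.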

As a demonstration of the power of this new citation format, we’ve built a new overlay for PLOS papers, which displays much more information about the references in our papers, and also makes it easier to navigate and search through them. Try it yourself here: http://alpha.richcitations.org.
The suite of open-source tools we’ve built makes it easy to extract and display rich citations for any PLOS paper. The rich citation API is available now for interested developers at http://api.richcitations.org.

 
We’ve started collecting rich citations for all PLOS papers; currently, our database has over 10,000 PLOS papers, including nearly all PLOS Medicine papers. In a few weeks’ time, we’ll have indexed and extracted rich citations from the rest of the PLOS corpus. The ultimate goal is to collect rich citations for the entire scientific literature and provide them as open data for the research community.

This kind of database would be valuable not only in itself but also for the wide variety of applications that could be built on it. With a detailed database of the connections between scientific works, it would be much easier to trace the intellectual history of an idea or fact, and to see the true dependencies between different pieces of the scientific literature. We can also use this database to build better paper recommendation engines, helping readers find new and exciting work related to older work. Such software could also suggest additional works to read and cite while writing manuscripts. And a detailed map of the research literature would give us a more nuanced view of the relationships among published research than is currently available with traditional citation-based metrics.

These are admittedly ambitious goals, but we have already started working with citation researchers and developers from outside PLOS to turn these ideas into reality.

We hosted a hackathon here at PLOS this past weekend where we started to adapt our APIs to work with other publishers’ content. We also spoke with folks from the Wikidata project about other ways we could showcase this data and connect it to resources like Wikipedia. Some computational researchers worked on better algorithms to detect whether two authors are the same person. And the bibliometric researchers we’ve spoken with about this project are keen to start playing with the rich citations dataset, which already contains over half a million distinct references.

We’re excited to roll out rich citations over the coming weeks. If you’ve got any suggestions for us, or if you’d like to hear more about it, please don’t hesitate to post in the comments, or to contact our Labs team directly at ploslabs@plos.org. We’d love to hear from you!


2 Responses to Rich Citations: Open Data about the Network of Research

  1. Jerroen Bosman says:

    Would it be possible to come up with a number of categories that introduce a bit of semantics to rich citations, i.e. why something is cited: support, reject, example, further reading, data source, vanity, etc.? Of course, with new papers this could be done by the author. It will be much harder to do this retrospectively.

  2. Pingback: Rich Citations: PLOS Develops Enriched Citation Format, API Now Available For Developers | LJ INFOdocket
