Josh Nicholson discusses the evolution of a fundamental element of scholarly publishing, citations.
In 1964, Eugene Garfield published a short article entitled "Can Citation Indexing Be Automated?" In it, he describes "citation markers," terms or phrases that "would briefly describe the kind of relationship that exists between the citing and cited documents." He gives a few examples, including "critique," "conclusions wrong," or my personal favorite, "calamity for mankind." He suggests that an intelligent machine would generate such markers by analyzing full-text articles. Here I describe how Garfield's vision has become a reality with the introduction of Smart Citations from scite.
Traditionally, and even now, citations have primarily served as a metric, indicating an article's impact on the scholarly literature by counting how many articles reference it. Citations are also used to assess the impact of researchers, universities, and journals. As a metric, citations have, for better or worse, enabled such entities to be ranked on the proposition that more citations mean more impact, and more impact means better rankings. This simplistic view has dominated academic publishing and tenure and promotion decisions; it has also taken citations, which were once used primarily for information retrieval, and turned them into a measuring stick. Accordingly, many researchers and administrators have called for better ways to assess research and researchers, or for different ways of counting citations. Unlike Garfield, however, most have not proposed a new type of citation system or new types of citations.
If one were to rethink citations today, there are various features and sources of information worth including beyond the descriptive markers Garfield suggested. First, the process would need to be automated, given the sheer scale of publishing today. With that in mind, what would Citations 2.0 look like? In addition to markers describing the relationship between articles, one could show the relationship itself, that is, the citation context. This information, taken directly from the full text of the citing article, would let readers see exactly how one article cites another and quickly tell whether a study has been supported or contrasted in the literature. Moreover, it would show how peers have discussed a study, helping readers better understand the findings themselves. If one extracts the citation context from articles, it also makes sense to show where that context came from. Because research papers share a general structure (IMRaD: Introduction, Methods, Results, and Discussion), one could additionally record the section in which the citation appears and whether one paper cites another multiple times across different sections. Using metadata, one could also flag whether the cited and citing articles share authors, that is, whether the citation is a self-citation. In short, the next generation of citations should go beyond telling us which papers have cited an article and how many times it has been cited, to telling us how and why it has been cited.
Providing rich contextual information on citations has been attempted before. Publishers like PLoS prototyped rich citations, a project that was abandoned without gaining much traction in the community. Mendeley also prototyped showing contextual information, but this never moved past the pilot phase either. More recent efforts by Semantic Scholar and independent research groups have gone further, showing citation sentences, often called "citances," alongside the typical citation metadata. But these initiatives have been mostly academic explorations and have again served only as proofs of principle. In other disciplines, like law, rich citation information has been in use for many decades, enabling a process called Shepardizing.
Shepardizing, named after Frank Shepard and introduced over 100 years ago, allows lawyers and researchers to see whether a case has been overturned, reaffirmed, or questioned. This system helps lawyers make sure they cite good law. As discussed in this interview, Garfield was aware of Shepardizing early on, and it undoubtedly influenced his thinking on scientific citations. He also knew that bringing such a system to a science citation index would be dramatically more complex, both because the research publishing ecosystem is so much larger and because the way scientists cite articles is mostly unstructured (beyond the formatting style of citations). In law, authors are instructed to use specific words as "signals" to indicate how a citation is used, as outlined in The Bluebook citation manual. Beyond the difference in scope and complexity, there is also a critical difference between research and law. Law is fairly binary: a ruling is either good law or it is not, often with some dissenting opinions. Research is much messier; multiple hypotheses often compete simultaneously to explain a phenomenon, and there is no real arbiter of truth except nature itself. Thus, while there are parallels between Shepardizing and what Garfield described, there are clear differences as well.
Scite, a company I co-founded and run today, builds on these previous efforts and has introduced a new citation system called Smart Citations. Smart Citations are now live on millions of research articles from all disciplines and are used by researchers around the world. The introduction of Smart Citations is not simply a result of a new team executing where previous teams failed, although execution is critical. It is the result of converging changes in scholarly publishing, developments in technology, and the need for a better system.
Smart Citations, like traditional citations, show how many times an article has been cited and which papers cite it. Going beyond traditional citations, Smart Citations also indicate how and why an article was cited: they display the surrounding textual context from each citing article, show the section(s) in which the article was referenced, and classify whether the citing statement supports, merely mentions, or contrasts with the claims of the cited paper (Figure 1).
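The elements described above — citation context, section, classification, and shared authorship — can be pictured as a small record attached to each citing statement. The following sketch is illustrative only; the field and type names are my own assumptions, not scite's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

class CitationType(Enum):
    SUPPORTING = "supporting"
    MENTIONING = "mentioning"
    CONTRASTING = "contrasting"

@dataclass
class SmartCitation:
    citing_doi: str
    cited_doi: str
    context: str              # sentence(s) surrounding the in-text citation
    section: str              # IMRaD section where the citation appears
    classification: CitationType
    is_self_citation: bool    # do the citing and cited papers share an author?

# A hypothetical citation record for one citing statement:
citation = SmartCitation(
    citing_doi="10.1234/example.2021",
    cited_doi="10.5678/example.2015",
    context="These results confirm the earlier findings of Smith et al. [12].",
    section="discussion",
    classification=CitationType.SUPPORTING,
    is_self_citation=False,
)
print(citation.classification.value)  # -> supporting
```

Grouping such records by `cited_doi` yields exactly the view described above: not just a citation count, but where, how, and why each citation was made.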
To create Smart Citations, access to full-text articles is required in order to extract the citation context, or citance, from the paper. Thus, the transition of scholarly publishing from subscription-based content to open access has in part helped usher in the next generation of citations by allowing groups to prototype and build new citation features without first needing a convincing business case. Indeed, the Semantic Scholar work described above (S2ORC) and the Colil database were both built using open access articles. The problem is that they also stop there, missing a great deal of relevant content and articles. Scite, too, started with a subset of open articles, which allowed the necessary space to explore and prototype, but it has since worked directly with publishers to index subscription-based articles, providing robust coverage with more than 1 billion Smart Citations extracted from 30 million full-text articles.
Accessing full-text articles is, however, only the first step. Given that most scholarly articles are still formatted as PDFs, and that there are over 8,000 citation styles and numerous PDF layouts, extracting information from research papers is technically very challenging. Indeed, scite (and other tools) rely on sophisticated machine learning approaches to extract citation information. Some of these tools have been years in the making, like GROBID, whereas others, like SciBERT (used for the classification of citations in our system), have only been available for a few years. Again, this highlights the importance of timing in the creation of a new citation system. Automatically processing papers is thus feasible and, given advances in cloud computing, can be done efficiently and at scale. The full details of how Smart Citations are created can be found in our recent publication in Quantitative Science Studies.
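To make the two-stage pipeline concrete — first locate the citing sentences, then classify each one — here is a deliberately toy sketch. scite's production system uses GROBID for document parsing and a fine-tuned SciBERT model for classification; the keyword heuristic and regex-based sentence splitting below are stand-ins of my own invention that only illustrate the shape of the task (citance in, label out).

```python
import re

# Toy stand-in for the SciBERT classification step: a keyword heuristic.
SUPPORT_CUES = ("confirm", "consistent with", "in agreement with", "replicate")
CONTRAST_CUES = ("contrast", "contradict", "fail to replicate", "inconsistent with")

def classify_citance(citance: str) -> str:
    """Label a citing sentence as supporting, contrasting, or mentioning."""
    text = citance.lower()
    if any(cue in text for cue in CONTRAST_CUES):
        return "contrasting"
    if any(cue in text for cue in SUPPORT_CUES):
        return "supporting"
    return "mentioning"

def extract_citances(fulltext: str, marker: str = r"\[\d+\]") -> list[str]:
    """Naive citance extraction: keep sentences containing an in-text marker
    like [12]. Real systems parse the PDF/XML structure instead."""
    sentences = re.split(r"(?<=[.!?])\s+", fulltext)
    return [s for s in sentences if re.search(marker, s)]

body = (
    "Prior work established the baseline [1]. "
    "Our results are consistent with those of Lee et al. [2]. "
    "However, we fail to replicate the effect reported in [3]."
)
for citance in extract_citances(body):
    print(classify_citance(citance), "-", citance)
```

Note that "fail to replicate" must be checked before "replicate": the contrast cues are tested first so that negated support phrases are not misread as supporting, which hints at why a learned model like SciBERT handles this task far better than keyword rules.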
Citations in their current form serve various research tasks, but arguably they are mostly used for bibliometrics research and administrative tasks, such as putting together a tenure and promotion packet, complying with government or ranking agency mandates, or informing journal promotion, subscription, or submission decisions. Citations are certainly used for discovery, allowing users to identify and sort impactful studies in academic search engines like Google Scholar and PubMed. But how citations are used on papers is quite superficial. When you open a research paper, you might look initially at a few key things: where the research was published, who the authors are, where they come from, and some metrics like citations, altmetrics, or downloads and reads. Citations, in this case, are glanced at and used to make a snap judgment about quality and impact. Citation lists are not systematically opened to see how a study has been interpreted, discussed, critiqued, or more generally how it has been cited in subsequent research. Why? Because seeing how an article has been cited, not just how many times, is so time-consuming that it is effectively never done. Consider a paper with 50 citations. To see how it has been cited, you would need to open 50 separate papers, find each in-text citation, and interpret it. That is hours of work, if not days.
With the citation context readily visible, one can quickly see how a paper has been cited, which helps in understanding how it has been received and whether its claims have been subsequently tested by others. This data, and this way of looking at citations, can be applied to journals, researchers, affiliations, and funders, allowing "impact" to be assessed more easily and with more nuance than "more citations equal more impact."
In addition to enriching previous use cases, Smart Citations also open up new ones. For example, searching Smart Citations lets one see directly what experts say about nearly any topic, returning information from the source text rather than just the article's metadata. This turns citations into a searchable conversation of expert insight, critiques, opinions, and data, tied directly to a source article and backed by data and analyses. It also lets researchers see how specific topics, reagents, or really anything has been cited in the literature. You can search for an antibody to quickly find out how others have used it in their experiments. Because the research landscape is so diverse, you can find relevant information on anything from Peppa Pig to paliperidone. Smart Citations don't just help you find relevant papers; they help you find relevant information!
With more and more journals adopting Smart Citations (over 3 million articles display Smart Citations, including those published by the National Academy of Sciences, Royal Society, Wiley, and others) and with more and more full-text papers being indexed, what's next? Assuming all the information contained in citations has been extracted, the next step, in my opinion, is to use that information to "prime the pump" for humans to contribute more. There is a wealth of tacit knowledge researchers hold, but how do we get them to contribute it to the scholarly record so it can be searched and discovered outside of one-on-one conversations? Following Garfield, I predict that annotations to citations, in the form of short notes, paragraphs, or single-figure observations, will soon supplement paper-to-paper citations. Already we see very valuable commentary on social media, but it is not appropriately tied to the publication record, not citable, and not properly preserved; moreover, it is a small fraction of what could be captured from what occurs over the water cooler, in journal clubs, and so forth. Capturing this tacit information, or "taCITATION," will in my view be the next next generation of citations.
JN is Co-founder and CEO of scite.