S2AND: An Improved Author Disambiguation System for Semantic Scholar
Disambiguating author names in scientific literature is a major challenge. A new AI model from Semantic Scholar, S2AND, is designed to help.
by Daniel King & Sergey Feldman
Daniel King is an Applied Research Scientist at AI2 in Seattle, focused on natural language processing, information extraction, and improving Semantic Scholar. Sergey Feldman is a Senior Applied Research Scientist at AI2 in Seattle, focused on natural language processing and machine learning.
Introduction to Author Disambiguation
Semantic Scholar is a free, AI-powered research tool for scientific literature, based at the Allen Institute for AI. The single largest category of complaints and comments we receive (32% of all customer service tickets in 2020) is author name disambiguation (AND). The challenge of AND is one of identity: given the set of all papers, who wrote which ones?
For example, here are three papers by people named Daniel King:
1. ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing by Mark Neumann, Daniel King, Iz Beltagy, Waleed Ammar
2. Fetch & Freight: Standard Platforms for Service Robot Applications by M. Wise, Michael Ferguson, Daniel King, Eric Diehr, David Dymesich
3. Pretrained Language Models for Sequential Sentence Classification by Arman Cohan, Iz Beltagy, Daniel King, Bhavana Dalvi, Daniel S. Weld
The goal of AND is to figure out which Daniel Kings are the same and which ones are different (in this example, the first and third are the same person and the second is different). Sometimes, we have excellent metadata that makes this trivial: ORCIDs and email addresses in particular. But more often than not we have to make educated, model-based guesses. And instead of three papers, we have close to 200 million, and need to cluster every author of every paper.
This is a hard machine learning task. Why? Because there are many distinct people with identical names (homonymity), and plenty of names that look different in various papers but are actually all the same person (synonymity). For example:
Daniel King has a middle initial but only in his first paper: Daniel X. King. (Synonymity)
Daniel King got married and changed his name to Daniel Sting, so all papers after 2019 are under that name. (Synonymity)
Daniel King sometimes publishes under Dan King. (Synonymity)
Daniel King is Russian, publishes mostly in Russian journals, and the transliterations into English are varied: Danil, Danilo, Daneel, etc. (Synonymity)
Semantic Scholar only obtained or extracted the first initial for a few of Daniel’s papers, so all we know is that these papers are by D. King. (Synonymity)
There are two Daniel Kings who both publish about scientific NLP and one who publishes in linguistics journals. (Homonymity)
The preponderance of synonymity examples might make you think that synonymity is the more important problem, but that’s not the case: homonymity is more challenging in practice. Compounding both problems:
paper metadata is not consistently available,
automated PDF metadata extraction systems make mistakes or produce incomplete output, and
there are additional complexities for authors who publish in multiple languages.
In this blog post, we will dive deep into our new, improved AND model, which we’ve named S2AND (pronounced “stand”). S2AND includes the semanticscholar.org production AND system, which reduces errors by about 50% over the previous Semantic Scholar production system.
Before we go into the technical details, we’d like to first acknowledge the many folks that were critical to this work:
interned at Semantic Scholar in 2020 and labored carefully for many hours to curate all datasets and the first baselines.
was a mentor, advisor, and head writer on the paper.
The AND Pipeline
There is a fairly standard pipeline in the AND literature, and S2AND is structured similarly. To understand the pipeline, we have to define the concept of an author record (also called a signature in the S2AND repository): a record is an author name string attached to a given paper. The total number of records in Semantic Scholar at the time of writing is 515,369,250.
1. The records are bucketed into blocks. Usually, blocks are all records that have the same first initial and the same last name. Sergey Feldman, for example, ends up in the “S Feldman” block, along with Sapienza K. Feldman and S. L. Feldman.
2. A pairwise classifier is trained on pairs of records (each of which is a single author name on a single paper). This model is optimized to correctly guess whether two records refer to the same person.
3. The trained classifier is applied to all pairs within each block. We end up with an N x N symmetric similarity matrix, where N is the size of the block. (Our largest block is over 400k records.)
4. Agglomerative clustering with average linkage is applied to each block, using the similarity matrix from step 3 as input.
The pairwise model can be whatever kind of machine learner you prefer: it’s standard binary classification. Agglomerative clustering is nice because you don’t have to specify the number of clusters, only a cutoff parameter that controls how clusters are formed with respect to the pairwise scores; this cutoff can be found by cross-validation, assuming you have training data with complete cluster labels. One of our main contributions to the AND literature is such a dataset, composed of eight existing AND datasets that have all been linked to Semantic Scholar papers and unified under a common format with a full complement of metadata. For more details about the dataset, see our paper.
Challenges and Solutions
Inevitably, the models you make don’t do what you want, and you have to fix them. In this section, we’ll go over a series of challenges that we encountered during the model development process, and our eventual solutions. Some solutions were partial, some cosmetic, and some we aren’t sure were worth the effort. We’ll also show a table of metrics so you have a sense of how important each aspect of the algorithm is.
With complete information about each publication, the author disambiguation problem is not difficult. For example, if we always knew the email and affiliation of every author, we would only need to solve cases where someone changes their affiliation or email, since the combination of these two items is very likely to be unique. However, since Semantic Scholar contains real-life data, we are often missing many pieces of metadata. The only pieces of information that are always available are the paper title and last name, and even these can be mis-extracted. The first way we address this problem is by using the fantastic LightGBM package, which natively handles missing values. We could have instead (a) used imputation methods such as MICE, or (b) used neural networks with specialized machinery for missing values. Both have downsides: (a) means having two stages to optimize instead of one, and (b) means twiddling with neural networks for an extra month or two. LightGBM is battle-tested and has worked well for us before.
During development, we noticed that some missing values seemed to bias the model prediction in strange ways. In one example we noticed that a missing abstract made the model more likely to predict that two papers were by the same author. We traced a couple of these issues back to spurious correlations between specific pieces of metadata being missing and the pairwise label (same author or not). To address this issue, we performed simple data augmentation, randomly knocking out pieces of metadata from a portion of examples in the training set (e.g. removing the author’s affiliation 5% of the time).
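A minimal version of this knockout augmentation (with a hypothetical helper and a made-up feature layout) looks like:

```python
import numpy as np

rng = np.random.default_rng(42)

def knock_out(features, knockout_probs):
    """Return a copy of `features` with metadata randomly masked.

    `knockout_probs` maps a column index to the probability of replacing
    that field with NaN (our missing-value marker), e.g. dropping the
    affiliation feature 5% of the time.
    """
    out = features.copy()
    for col, p in knockout_probs.items():
        mask = rng.random(len(out)) < p
        out[mask, col] = np.nan
    return out

# 1000 training examples, 4 features; pretend column 2 is "affiliation match".
X = np.ones((1000, 4))
X_aug = knock_out(X, {2: 0.05})
```

Training on the augmented copy breaks the spurious correlation between a field being missing and the pairwise label.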
One important way to tell whether two papers are by the same author is to look at their topicality. Our previous AND algorithm used an LDA model for this, but for S2AND we switched to our own SPECTER, a high-quality deep learning approach to embedding titles and abstracts (when available). SPECTER is an indispensable component of S2AND: one of the most important features (according to a SHAP value ranking) is the cosine similarity between the SPECTER embeddings of each pair of papers.
We also have less fuzzy topicality signals: character and word n-gram Jaccard similarity between titles, though these features are much less important according to the SHAP ranking. We had a similar set of n-gram-based features for similarity between abstracts; these were removed from the model entirely without any loss of quality.
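For the curious, character n-gram Jaccard similarity between two titles is easy to sketch (a simplified version; the production features also include word n-grams):

```python
def char_ngrams(text, n=3):
    """Set of lowercase character n-grams of a string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b, n=3):
    """Jaccard similarity |A & B| / |A | B| over character n-gram sets."""
    sa, sb = char_ngrams(a, n), char_ngrams(b, n)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

sim = jaccard("Pretrained Language Models", "Pretrained Language Model")
```

Identical titles score 1.0, unrelated titles score near 0.0, and near-duplicates fall in between.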
The Foolishness of Machine Learning
Despite our best efforts, the model is not always correct (surprise!). This is expected, and acceptable, but some mistakes are worse than others. Mistakes that are harder for users to understand and/or correct are more costly, and we would like to always prevent them. For example, we have a rule that prevents two author records with completely different first names from merging together (e.g. Daniel and David). This is wrong sometimes (people change their names, we have extraction errors, etc), but is mostly correct. The cost of not having the rule would be that sometimes the model would jumble up two people with very different names. This can be time-consuming to correct, especially if there are many papers in the profile. With the rule (assuming an otherwise perfect model), this case is fairly straightforward to correct, by simply merging the two profiles together (e.g. Daniel King and David King).
We also implement a rule that prevents papers in different languages from clustering together. Again, this is wrong because people can publish papers in multiple languages, but the reasoning is similar in that we prefer the sensible correction of merging an (e.g.) English profile with a Spanish profile, over the correction of removing erroneous Spanish papers from an English profile. Additionally, some features of our model are designed to work with English text (e.g. SPECTER embeddings), and so we don’t think the current model is likely to do reasonable things with non-English papers. In general, better handling of non-English papers on Semantic Scholar is an item that we have not yet tackled.
Over-reliance on Name Surface Forms
Early on during model iteration, we noticed that the model was overly reliant on exact name matches, meaning that it would take an exact name match (or lack thereof) as more important than other, possibly contradictory factors, like affiliation match or coauthor match. Why might this be the case? We don’t know, but here are some hypotheses:
The training data is somehow easier than the data in the wild.
Name features are the most predictive but don’t capture the interesting part of the decision boundary, and the model overfits to the surface form features because the interesting part of the decision boundary is harder.
We have too many name features.
We took an ensemble approach to this problem and trained a second pairwise classifier that does not have access to any of the features related to the surface form of the name. To compute the final pairwise score, we averaged the score of this “nameless” model with the score of the full model, and this provided a large performance boost in our ablation experiments.
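A sketch of the two-model ensemble on synthetic data (GradientBoostingClassifier stands in for the gradient-boosted models we actually use, and the feature split is invented for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 500

# Invented feature split: name surface-form features vs. everything else
# (coauthors, affiliation, topicality, ...).
name_feats = rng.normal(size=(n, 2))
other_feats = rng.normal(size=(n, 3))
y = ((name_feats[:, 0] + other_feats[:, 0]) > 0).astype(int)

# The full model sees all features; the "nameless" model never sees the
# name surface-form features.
X_full = np.hstack([name_feats, other_feats])
full = GradientBoostingClassifier(random_state=0).fit(X_full, y)
nameless = GradientBoostingClassifier(random_state=0).fit(other_feats, y)

# Final pairwise score: plain average of the two models' probabilities.
score = 0.5 * (full.predict_proba(X_full)[:, 1]
               + nameless.predict_proba(other_feats)[:, 1])
```

Averaging keeps the name features available while preventing them from single-handedly dominating a decision.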
Because S2AND is based on individual pairwise decisions, it is essentially a local algorithm and lacks global information. One important piece of global information is how common a name is. In other words: the base probability that two people are the same person given their name. For example, a pair of “Jane Smith” papers are less likely to be by the same person than a pair of papers by “Grumsfeld Greco”. The more common a name is, the lower the probability of “same author” should be. To give our models the ability to figure this out, we provide name count features aggregated over the entire Semantic Scholar dataset (190M+ papers).
For example, let’s say we are trying to predict if two papers by “Lia Long” and “Liana Long” are by the same author or not. The model would be given access to the following information:
# of times the first names “Lia” and “Liana” appear in author lists across all papers in our corpus.
# of times the last name “Long” appears in author lists across all papers.
# of times the first initial and last name “L Long” appears in author lists across all papers.
# of times the first and last names “Lia Long” and “Liana Long” appear in author lists across all papers.
In order not to overfit, we didn’t include any entries in the count dictionaries where count = 1. These dictionaries are being released publicly as well.
These turn out to be key features — they are often top-ranked according to SHAP values.
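Building the count features is straightforward; here is a toy version over three author lists (the real dictionaries are aggregated over the whole corpus):

```python
from collections import Counter

author_lists = [
    ["Lia Long", "Sergey Feldman"],
    ["Liana Long", "Daniel King"],
    ["Lia Long", "Mark Neumann"],
]

first, last, first_initial_last, full_name = (
    Counter(), Counter(), Counter(), Counter()
)
for authors in author_lists:
    for name in authors:
        f, l = name.split(" ", 1)
        first[f] += 1                             # e.g. "Lia"
        last[l] += 1                              # e.g. "Long"
        first_initial_last[f[0] + " " + l] += 1   # e.g. "L Long"
        full_name[name] += 1                      # e.g. "Lia Long"

# Drop singleton entries (count = 1) to avoid overfitting, as described above.
full_name = Counter({k: v for k, v in full_name.items() if v > 1})
```

At feature time, a pair like (“Lia Long”, “Liana Long”) is annotated with the relevant counts from each dictionary.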
Training data often isn’t like production data. It’s a top-five headache when trying to put good ML models into production. What we had initially was a model that worked well on existing AND datasets according to the B-cubed metric. So how do we build confidence that our model is actually good? We use multiple evaluation settings, each requiring a different level of manual assessment, and only perform the more involved evaluations after the model passes the less involved ones.
We consider that real data is “out-of-domain” in some sense, with respect to any of our training datasets. We know it is different, but don’t know exactly how. Similarly, each existing dataset is “out-of-domain” for all other training datasets. To partially simulate the transfer to real data, we evaluate our models in a “leave one dataset out” setting, meaning we train on each group of N-1 datasets, and evaluate on the single, held-out dataset. A model that does well on data that looks like training data is not that exciting. But a model that consistently does well on held-out datasets seems more likely to not crash and burn in production.
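The leave-one-dataset-out loop is simple to express. The dataset names and the `train_fn`/`eval_fn` stand-ins below are hypothetical; in reality they would fit the pairwise model on the union of the training datasets and compute B-cubed on the held-out one:

```python
def leave_one_dataset_out(datasets, train_fn, eval_fn):
    """Train on each group of N-1 datasets; evaluate on the held-out one."""
    scores = {}
    for held_out in datasets:
        train_sets = [d for name, d in datasets.items() if name != held_out]
        model = train_fn(train_sets)
        scores[held_out] = eval_fn(model, datasets[held_out])
    return scores

# Toy usage: the "model" is just the number of training sets it saw.
datasets = {"dataset_a": [1], "dataset_b": [2], "dataset_c": [3]}
scores = leave_one_dataset_out(
    datasets,
    train_fn=lambda sets: len(sets),
    eval_fn=lambda model, held_out: model,
)
```

A model whose held-out scores are consistently strong across all N splits is the one we trust.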
We pick a small number of profiles that we are able to assess somewhat easily (ourselves, people we know, western and non-western author names, emerging and established authors), and qualitatively assess model output on them. This allows us to verify a range of profiles of different difficulties and verify qualitatively what happens as we change features with specific goals in mind.
At Semantic Scholar, we allow users to correct their profiles, so we have substantial data from user corrections. This data is messy and hard to use for a number of reasons, but is still useful. Between model variants, we examined profiles with changes in the model’s accuracy according to the corrections. This gave us both a qualitative assessment (do we see a trend in the new mistakes/corrections the model makes?) and a quantitative assessment (is the new model dramatically better/worse according to the user corrections that we have?).
Individually, any of these assessments runs a risk of overfitting to the nuances of the particular evaluation setup. Combined, these assessments give us strong confidence in the model that we have produced.
Quantitative Ablation Evaluation
We’ll now report a user correction metric for many of the solutions discussed above. We used a total of 130 blocks that had corrections in them.
Our normal metric, B-cubed, can’t be used with user correction data because we don’t have 100% disambiguated ground truth data. We instead used a “min edit” distance, which is the smallest number of alterations that need to be made to a clustering to recover the user-corrected profile. Most of the time the answer is 0, so the mean is somewhere between 0 and 1. Smaller is better.
We also report the overall F1 score in terms of pairwise classification (larger is better). The final column shows the number of blocks that have a non-zero edit distance (smaller is better).
Name counts, SPECTER and the nameless model are all clearly useful. Name counts in particular provide gigantic performance gains.
In this evaluation, using a single training dataset instead of all eight gives us only slightly worse performance. This confirms our paper’s finding that a single dataset can get close to the performance of the full union if it’s the right one.
At first glance, it looks like using the augmentation dataset and rules might be hurting performance. This isn’t really true when you manually examine the differences between these models’ outputs — both augmentation and rules add stability, and tend to prevent errors that look egregious to end-users.
Challenges without Solutions
There are some problems with our AND system that have no solutions (yet).
Recall that the first step of the AND pipeline is to pre-cluster papers into blocks such as all papers written by “S Feldman” and all papers written by “L Long”. This means that if two papers by the same author are in different blocks, we have no hope of putting them in the same cluster. There are three main reasons why the same person would appear in multiple blocks:
1. They changed their name (first or last).
2. Name extraction errors. Sometimes “Liana Long” will appear as “Long, Liana” and we’ll get it wrong. Or we’ll mis-extract the author list “Sergey Feldman, Daniel King, Liana Long” as “Feldman Daniel” and “King Liana”.
3. In some cultures the last name is written first, and this may happen on only a subset of an author’s papers.
For (1), we rely on authors to tell us that they’ve changed their name and correct their profiles. Trying to guess a name change is very hard, and getting it wrong is too annoying to authors. A similar logic applies for (3).
For (2), Semantic Scholar researchers are working on a better PDF parser.
Transliteration Information Loss
One big challenge is that we normalize all names to ASCII characters. We do sometimes have names in their original script, but often they have been transliterated somewhere upstream. This causes extra difficulty for some languages. For example, the process of transliterating names in Chinese or Korean to ASCII is lossy, which makes disambiguation harder. Ideally, a disambiguation system would work with the native script of the name, but mostly we do not have that data.
What’s above is only a small fraction of all the ways we tried to improve the algorithm. Here are just a few of our failures.
Triangle Inequality Violations
The clustering step takes as input a matrix of similarities (or distances). Because this matrix is generated in a pairwise fashion, we end up with some inconsistencies. The simplest to think about is a violation of the triangle inequality. What does this mean in AND terms? Let’s say we have 3 records:
(A) “Dogs and Cats: Genetic Similarities and Differences” by Liana Long, et al.
(B) “RNA and DNA: the Hats and Scarves of the Genome” by L. Long, et al.
(C) “MEGA-CRISPR: Pre-Define Your Children” by Lorraina Long, et al.
Our pairwise model may predict something like this:
P(A and B are not the same author) = 0.20
P(A and C are not the same author) = 0.99 (because S2AND knows Lorraina and Liana are unlikely to be synonyms)
P(B and C are not the same author) = 0.30
Clearly this violates the triangle inequality: (B) can be clustered with (A) or with (C), but (A) and (C) cannot be together. We hypothesized that blocks with many triangle inequality violations were the source of some errors, and tried to fix them with a whole raft of techniques. It didn’t work at all, but we did get better at using Cython.
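A brute-force check for such violations over a block’s distance matrix, using the three pairwise scores above as distances:

```python
import numpy as np

def count_triangle_violations(dist):
    """Count triples (i, j, k) where d(i, k) > d(i, j) + d(j, k)."""
    n = len(dist)
    violations = 0
    for i in range(n):
        for k in range(i + 1, n):
            for j in range(n):
                if j in (i, k):
                    continue
                if dist[i, k] > dist[i, j] + dist[j, k] + 1e-12:
                    violations += 1
    return violations

# P(not same author) from the A/B/C example, read as pairwise distances.
dist = np.array([
    [0.00, 0.20, 0.99],
    [0.20, 0.00, 0.30],
    [0.99, 0.30, 0.00],
])
print(count_triangle_violations(dist))  # the A-C distance exceeds A-B + B-C
```

This O(N^3) scan is only practical for small blocks; for large ones we relied on vectorized and Cython implementations.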
More generally, a process that allows clustering decisions to influence other clustering decisions in complex ways should produce better clusterings, both by handling situations like the one above, and allowing partially available metadata (e.g. affiliation) to propagate through an entire cluster.
Guessing Agglomerative Clustering Hyperparameters
Agglomerative clustering has a single hyperparameter, eps, defined by the scikit-learn documentation as “the linkage distance threshold above which clusters will not be merged.” We use a validation set to figure out a good eps value for the model. Based on some exploratory data analysis, we found that larger blocks tended to prefer different values of eps than smaller blocks, which raised a question: can one guess a good eps for each test block?
We trained a regression model that tried to guess the optimal eps using metadata about the test block as input features, and saw no performance improvements.
Ensemble Instead of Union
Taking a union of the training data is not the only way to combine the datasets. We also tried training one model per dataset and ensembling the result. Ensembles are known to be generally powerful in situations where multiple different models are trained on the same dataset. Our situation is different: we are training the same model on different datasets and ensembling those. It did not work. We tried a number of techniques to ensemble multiple clusterings, but the union with a single clustering was superior. And it was faster as well due to only having to cluster a single time — this really makes a difference for the largest blocks we have (200k+ records) where a single run of fastcluster can take an hour and requires dozens of gigabytes of RAM.
Our new and improved author disambiguation model is live on Semantic Scholar! The S2AND model now covers 99%+ of author pages on Semantic Scholar, and we are working to complete the rollout in the coming weeks. Note that we haven’t made any changes to your profile if you’ve claimed it.
We believe S2AND is a significant improvement over the previous system, and we encourage you to claim your page if you haven’t already!
The problem is far from solved, and the latest advancements in NLP and graph ML may well provide further improvements, but we now have a strong baseline to compare against, an evaluation suite that we can use and easily add new datasets to, and a much better understanding of the problem.