Unpaywall Change Notes
Created by: Richard Orr
Modified on: Tue, 11 May, 2021 at 12:29 PM
This page will summarize important changes to our methodology and data sources that we expect to significantly affect the Unpaywall dataset.
2021-02-02: Removed duplicate oa_locations in cases where the same publisher page was determined to have an OA copy in two different ways. Previously two oa_locations representing the same page could be created with different licenses and oa_dates.
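For readers working with older snapshots that still contain such duplicates, the following is a minimal sketch of one way to collapse them on the consumer side. It assumes each oa_location is a dict with a `url` key and that keeping the first entry per URL is acceptable. This note does not say which of the two entries Unpaywall itself keeps, so that choice and the helper name `dedupe_oa_locations` are illustrative only.

```python
def dedupe_oa_locations(oa_locations):
    """Keep the first location seen for each URL and drop later repeats.

    Assumption: two entries describe the same page when their "url" values
    are equal. Which entry Unpaywall keeps is not documented here.
    """
    seen = set()
    unique = []
    for loc in oa_locations:
        key = loc.get("url")
        if key in seen:
            continue
        seen.add(key)
        unique.append(loc)
    return unique


# Hypothetical sample data: the first two entries point at the same page
# but carry different licenses and oa_dates.
locations = [
    {"url": "https://example.org/article/1", "license": "cc-by", "oa_date": "2020-01-01"},
    {"url": "https://example.org/article/1", "license": None, "oa_date": "2020-03-01"},
    {"url": "https://example.org/repo/1", "license": "cc-by", "oa_date": "2020-02-01"},
]
deduped = dedupe_oa_locations(locations)
```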
Updated our list of "detected OA" journals, as described in this FAQ, for 2021. Added about 200,000 Gold articles from 9,000 journals.
Began counting journals using "publisher's own license" in DOAJ as Gold OA. See https://doaj.org/toc/1930-2126, for example. Added about 100,000 Gold articles.
2020-12-31: Changed the version property of preprint locations from "publishedVersion" to "submittedVersion". This affects the preprints we reclassified as Green OA on 2020-05-01. We previously called these published because the preprint is often the final version, but this conflicts with the common expectation that accepted and published versions are peer-reviewed.
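As an illustration of why this matters downstream, here is a minimal sketch of a consumer-side filter that keeps only locations whose version is expected to be peer-reviewed. It assumes each location is a dict with a `version` key holding one of "submittedVersion", "acceptedVersion", or "publishedVersion". The helper name and sample data are hypothetical.

```python
# Versions that readers commonly expect to be peer-reviewed.
PEER_REVIEWED_VERSIONS = {"acceptedVersion", "publishedVersion"}


def peer_reviewed_locations(oa_locations):
    """Return only the locations whose version is accepted or published."""
    return [
        loc for loc in oa_locations
        if loc.get("version") in PEER_REVIEWED_VERSIONS
    ]


# Hypothetical sample: after this change a preprint location reports
# "submittedVersion" and is therefore filtered out.
sample = [
    {"url": "https://example.org/preprint/1", "version": "submittedVersion"},
    {"url": "https://example.org/article/1", "version": "publishedVersion"},
]
reviewed = peer_reviewed_locations(sample)
```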
property to oa_locations, and first_oa_location to DOI records. See What is an OA license?
2020-09-14 - improved detection of Wiley Bronze OA
We improved our Bronze OA validation process for Wiley, which will convert about 1 million Closed or Green articles to Bronze OA over the next few weeks.
2020-05-01 - reclassified items on preprint servers as Green OA
We've reclassified articles hosted on preprint servers to reflect their differences from traditional publishing platforms. Examples of this type of platform are bioRxiv, MDPI Preprints, and ChemRxiv.
As described in What do the types of oa_status mean?, an article is Green OA if the host_type of its best location is "repository". Until now, the URL resolved by an article's persistent DOI URL was always considered to have host_type "publisher", and thus to be either Bronze, Hybrid, or Gold. Now, these locations are considered repositories and the articles are Green. At the time of this writing 170,000 articles are affected by this change.
2020-02-25 - began retroactively applying Crossref metadata updates:
We improved our Crossref data collection so that the latest article metadata is always reflected in Unpaywall, and we're retroactively applying Crossref updates from the last six months. This will affect the data feed for about 15 million articles and will produce larger-than-usual files between 2020-03-05 and 2020-03-19. We expect these files to contain about 8 million lines. The majority of these changes are revisions to published_date, publisher, and genre and do not affect oa_locations or oa_status.
2019-12-08 - added articles from Semantic Scholar:
We're adding about 8 million PDFs hosted by Semantic Scholar.
We already have OA locations for many of these articles, but we expect this to create 3 million new Green OA articles by the end of 2019.
2019-11-14 - improved PDF validation:
Our automated PDF validation processes are now much more robust, allowing us to add about 1.5 million new OA articles. Half of these are in newly-identified Gold OA journals that we were previously unable to spot because these articles looked unavailable to us.