Get started with CORD-19
Apache-2.0 license
The COVID-19 Open Research Dataset (CORD-19)
CORD-19 is a corpus of academic papers about COVID-19 and related coronavirus research. It's curated and maintained by the Semantic Scholar team at the Allen Institute for AI to support text mining and NLP research. Please read our paper for an in-depth description of how it was created:
The final version of CORD-19 was released on June 2, 2022. Since we launched the dataset on March 13, 2020, we have released an updated version of the dataset almost every week. Starting from around 40K articles in its first version, the dataset has grown to index over 1M papers, and includes full text content for nearly 370K papers. We thank you for your support and feedback throughout this process. For more information, please see this blog post. A list of alternate data resources is provided under Other resources.
Important notes
We have cleaned the data enough to support most text mining & NLP research efforts. We do not intend to clean it to the point where it can be used for directly consuming (reading) papers about COVID-19 or coronaviruses. Some amount of error will always remain, which makes CORD-19 more usable for some applications than for others. We leave it up to the user to make this determination, though please feel free to consult us for recommendations.
While CORD-19 was initially released on 2020-03-13, the current schema is defined based on an update on 2020-05-26. Older versions of CORD-19 will not necessarily adhere to exactly the schema defined in this README. Please reach out for help on this if working with old CORD-19 versions.
All versions of CORD-19 can be found HERE.
First published version (2020-03-13): Download Link (size: 0.3GB, md5: a36fe181, sha1: 8fbea927)
Last published version (2022-06-02): Download Link (size: 18.7GB, md5: c557069e, sha1: dd2c32bc)
Dataset Versions Used for TREC-COVID Shared Task
TREC-COVID Shared Task Website:
| TREC-COVID | Date | Changelog | Link to download | md5 | sha1 |
|---|---|---|---|---|---|
| Round 1 | 2020-04-10 | link | cord-19_2020-04-10.tar.gz (1.5GB) | f4c3e742 | 4980d8ee |
| Round 2 | 2020-05-01 | link | cord-19_2020-05-01.tar.gz (1.7GB) | e8c56920 | dc22dbc9 |
| Round 3 | 2020-05-19 | link | cord-19_2020-05-19.tar.gz (2.8GB) | 6424de9c | 1781b935 |
| Round 4 | 2020-06-19 | link | cord-19_2020-06-19.tar.gz (3.3GB) | 47b61215 | fdd0490e |
| Round 5 | 2020-07-16 | link | cord-19_2020-07-16.tar.gz (3.7GB) | 018c4bc4 | 7adcf31a |
Dataset Versions Used for EPIC-QA Shared Task
EPIC-QA Shared Task Website:
| EPIC-QA | Date | Changelog | Link to download | md5 | sha1 |
|---|---|---|---|---|---|
| Preliminary round | 2020-06-19 | link | cord-19_2020-06-19.tar.gz (3.3GB) | 47b61215 | fdd0490e |
| Primary round | 2020-10-22 | link | cord-19_2020-10-22.tar.gz (5.3GB) | 7cb9e743 | 7efe285f |
CORD-19 was released approximately weekly. Each version of the corpus is tagged with a datestamp (e.g. 2020-05-26). Releases look like:
    |-- 2020-05-26/
        |-- changelog
        |-- cord_19_embeddings.tar.gz
        |-- document_parses.tar.gz
        |-- metadata.csv
    |-- 2020-05-27/
    |-- ...
The files in each version are:
When cord_19_embeddings.tar.gz is uncompressed, it is a 769-column CSV file, where the first column is the cord_uid and the remaining columns correspond to a 768-dimensional document embedding. For example:
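As an illustration of that layout, the following sketch parses rows shaped like the embeddings CSV into a dictionary keyed by cord_uid. It assumes the file has no header row, and the function name `read_embeddings` and the sample rows are hypothetical. The demo uses 3 dimensions instead of 768 so that it runs without the real file.

```python
import csv
import io


def read_embeddings(lines, dim=768):
    """Yield (cord_uid, vector) pairs from rows shaped like the embeddings CSV.

    `lines` is any iterable of text lines (an open file works).
    Assumption: no header row. Rows with the wrong width raise ValueError.
    """
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        if len(row) != dim + 1:
            raise ValueError(f"expected {dim + 1} columns, got {len(row)}")
        cord_uid = row[0]
        vector = [float(x) for x in row[1:]]
        yield cord_uid, vector


# Made-up rows standing in for the real file (3 dimensions, not 768).
sample = io.StringIO("abc12345,0.1,-0.2,0.3\nxyz67890,1.0,2.0,3.0\n")
embeddings = dict(read_embeddings(sample, dim=3))
print(len(embeddings), "embeddings loaded")
```

For the real file, pass the opened, uncompressed CSV and leave `dim` at its default of 768.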
When document_parses.tar.gz is uncompressed, it is a directory:
    |-- document_parses/
        |-- pdf_json/
            |-- 80013c44d7d2d3949096511ad6fa424a2c740813.json
            |-- bfe20b3580e7c539c16ce4b1e424caf917d3be39.json
            |-- ...
        |-- pmc_json/
            |-- PMC7096781.xml.json
            |-- PMC7118448.xml.json
            |-- ...
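One way to walk this directory is sketched below. The helper name `iter_parses` is hypothetical, and the demo builds a throwaway directory with empty placeholder parses so that it runs without the real data.

```python
import json
import tempfile
from pathlib import Path


def iter_parses(root):
    """Yield (source, paper_id, parsed_json) for every parse under `root`.

    `source` is the subdirectory name ('pdf_json' or 'pmc_json'); `paper_id`
    is the file name up to the first dot (a sha for PDF parses, a PMC id for
    PMC parses).
    """
    root = Path(root)
    for source in ("pdf_json", "pmc_json"):
        for path in sorted((root / source).glob("*.json")):
            with open(path, encoding="utf-8") as f:
                yield source, path.name.split(".")[0], json.load(f)


# Demo on a temporary directory that mimics the layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "document_parses"
    for sub, name in [
        ("pdf_json", "80013c44d7d2d3949096511ad6fa424a2c740813.json"),
        ("pmc_json", "PMC7096781.xml.json"),
    ]:
        (root / sub).mkdir(parents=True)
        # Placeholder content; real parses carry the full text.
        (root / sub / name).write_text(json.dumps({"body_text": []}))
    found = [(source, paper_id) for source, paper_id, _ in iter_parses(root)]

print(found)
```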
Example usage
We recommend everyone primarily use metadata.csv & augment data when needed with full text in document_parses/. For example, let's say we wanted to collect a bunch of Titles, Abstracts, and Introductions of papers. In Python, such a script might look like:
    import csv
    import os
    import json
    from collections import defaultdict

    cord_uid_to_text = defaultdict(list)

    # open the file
    with open('metadata.csv') as f_in:
        reader = csv.DictReader(f_in)
        for row in reader:

            # access some metadata
            cord_uid = row['cord_uid']
            title = row['title']
            abstract = row['abstract']
            authors = row['authors'].split('; ')

            # access the full text (if available) for Intro
            introduction = []
            if row['pdf_json_files']:
                for json_path in row['pdf_json_files'].split('; '):
                    with open(json_path) as f_json:
                        full_text_dict = json.load(f_json)

                        # grab introduction section from *some* version of the full text
                        for paragraph_dict in full_text_dict['body_text']:
                            paragraph_text = paragraph_dict['text']
                            section_name = paragraph_dict['section']
                            if 'intro' in section_name.lower():
                                introduction.append(paragraph_text)

                        # stop searching other copies of full text if already got introduction
                        if introduction:
                            break

            # save for later usage
            cord_uid_to_text[cord_uid].append({
                'title': title,
                'abstract': abstract,
                'introduction': introduction
            })
metadata.csv overview
We recommend everyone work with metadata.csv as the starting point. This file is comma-separated with the following columns:
Questions about CORD-19
Why can the same cord_uid appear in multiple rows?
This is a very tricky issue, and we have not decided on the best way forward. To explain, let’s take cord_uid=hox2xwjg as an example. Examining the rows with this cord_uid in the metadata file, we see that they are the same paper, but sent from different sources (Elsevier, PMC). The Elsevier row has a DOI and PDF, but the PMC row doesn’t. Furthermore, the PMC ID, publication date, and URL differ between these rows.
Technically all of this data is representative of paper hox2xwjg so we don’t want to remove any of it. But combining them into one cluster would require a schema change to the data, which would break a lot of people’s code. Hopefully this is not too big an issue because only a small percentage of papers are affected, but know that this issue exists and we’re debating the best way forward.
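Until that is resolved, one possible way to cope is to group metadata rows by cord_uid in your own code and decide per group which row to trust. The sketch below is only an illustration: the helper names are hypothetical, the sample rows are made up, and `source_x` and `doi` are assumed column names.

```python
import csv
import io
from collections import defaultdict


def group_rows_by_cord_uid(reader):
    """Map each cord_uid to the list of metadata rows that carry it."""
    groups = defaultdict(list)
    for row in reader:
        groups[row["cord_uid"]].append(row)
    return groups


def duplicated_cord_uids(groups):
    """Return the cord_uids that appear in more than one row."""
    return sorted(uid for uid, rows in groups.items() if len(rows) > 1)


# Made-up rows for illustration; `source_x` and `doi` are assumed columns.
sample = io.StringIO(
    "cord_uid,source_x,doi\n"
    "hox2xwjg,Elsevier,10.0000/example\n"
    "hox2xwjg,PMC,\n"
    "aaaa1111,PMC,\n"
)
groups = group_rows_by_cord_uid(csv.DictReader(sample))
print(duplicated_cord_uids(groups))
```

Keeping every row in the group, rather than dropping duplicates, preserves the fields that only one source provides.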
Why do the PMC JSONs not contain any abstracts, yet the PDF JSONs contain abstracts?
Abstracts in the metadata.csv file are “gold” provided directly from publishers or digital archives. Because PMC is very consistent at providing us “gold” abstracts, we do not bother with parsing the PMC XMLs for abstract text (it’s already in the metadata.csv). As such, the PMC JSONs do not contain abstracts. This is not the case for PDF JSONs. We often obtain PDFs through crawling, and in this manner, we would not have “gold” abstracts provided to us. As such, we still opt to parse the PDF for abstract text, which is why that field exists.
Why do the title/authors in the JSON look different from what’s in the metadata file?
The most likely reason is PDF parsing errors. Occasionally, publishers will have different metadata from what is actually displayed on the PDF itself (e.g. slight differences in author names). We encourage users to use fields in the metadata file by default and only fall back on the JSON when a field is missing from the metadata file.
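The fallback order described above might be sketched as follows. The function name `pick_title` is hypothetical, and the JSON layout (`metadata` containing `title`) is an assumption about the parse format rather than something this README spells out.

```python
def pick_title(metadata_row, parse):
    """Prefer the title from metadata.csv; use the parsed JSON only if blank."""
    title = (metadata_row.get("title") or "").strip()
    if title:
        return title
    # Assumed JSON layout: {"metadata": {"title": ...}}
    return (parse.get("metadata", {}).get("title") or "").strip()


# Made-up values: the parsed title carries a simulated PDF parsing error.
parse = {"metadata": {"title": "A Coronavirus Sludy"}}
print(pick_title({"title": "A Coronavirus Study"}, parse))
print(pick_title({"title": ""}, parse))
```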
Why is the JSON missing certain metadata, like publication dates?
The JSONs are only meant for representing the full text of the PDF in a structured, machine-readable format. Many metadata fields like dates and venues don’t commonly appear on the PDF. Please defer to the metadata file for all such fields, since these come from the publishers directly.
How do you handle paper objects like tables, figures, equations?
Many papers in CORD-19 include HTML table parses. These table parses are available in the document parse files under ref_entries of type table. Note: not all tables will have HTML parses. These parses leverage IBM Watson Discovery capabilities (more details can be found in our paper).
Figure images are currently not available. We’re currently looking into how to best support these. As for equations, we do not do anything special here – the symbols are treated as text and should be included in the text blobs.
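A minimal sketch of pulling the table parses out of one loaded JSON document follows. It assumes `ref_entries` is a dictionary keyed by reference id and that table entries carry `text` (caption) and `html` keys. Only `ref_entries` and the `table` type are named in this README, so the other key names and the helper `extract_tables` are assumptions.

```python
def extract_tables(parse):
    """Return (ref_id, caption, html) for each table entry in a parsed paper.

    `html` is None when the table has no HTML parse.
    """
    tables = []
    for ref_id, entry in parse.get("ref_entries", {}).items():
        if entry.get("type") == "table":
            # `text` and `html` are assumed key names.
            tables.append((ref_id, entry.get("text", ""), entry.get("html")))
    return tables


# Made-up parse for illustration.
parse = {
    "ref_entries": {
        "FIGREF0": {"type": "figure", "text": "Study design."},
        "TABREF0": {"type": "table", "text": "Patient counts.",
                    "html": "<table><tr><td>12</td></tr></table>"},
        "TABREF1": {"type": "table", "text": "Table without an HTML parse."},
    }
}
tables = extract_tables(parse)
print([ref_id for ref_id, _, _ in tables])
```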
What should we do if both PDF and PMC JSONs exist? Or if there are multiple PDF JSONs?
We view these as different attempts/views to represent the same paper/document. Some are going to be higher quality than others. Treat these as separate representations of the same document: you can choose to use one, both, or neither (i.e. just use the metadata fields). On average, we believe the PMC JSONs are cleaner than the PDF JSONs, but that’s not necessarily true for every paper.
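If you decide to prefer PMC parses, one possible ordering is sketched below. The column name `pmc_json_files` and the helper `candidate_parse_paths` are assumptions; the `'; '` separator mirrors how `pdf_json_files` is split in the example script.

```python
def candidate_parse_paths(row):
    """List the parse files for one metadata row, PMC parses first, then PDF."""
    paths = []
    for column in ("pmc_json_files", "pdf_json_files"):  # assumed column names
        value = row.get(column) or ""
        paths.extend(p for p in value.split("; ") if p)
    return paths


# Made-up row for illustration.
row = {
    "pdf_json_files": "document_parses/pdf_json/aaa.json; document_parses/pdf_json/bbb.json",
    "pmc_json_files": "document_parses/pmc_json/PMC7096781.xml.json",
}
print(candidate_parse_paths(row))
```

A caller can then open the paths in order and stop at the first parse that contains the section it needs.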
Why can the same sha appear for different cord_uid?
Let’s take a look at the examples cord_uid=d9v5xtx7 and cord_uid=8avkjc84. They both share the PDF sha=5d0d0bd116976e1412c10a84902894999df4a342. These are two papers we sourced from Elsevier. If you follow the URLs, you’ll notice that they actually retrieve the same PDF despite having different DOIs. This is an upstream error from the publisher, which we can’t necessarily do anything about. Hopefully the number of these cases is small.
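To see how many such cases a given release contains, a sketch along these lines could be run over the metadata rows. The column name `sha`, the `'; '` separator for multiple values, and the helper name are assumptions. The third demo row is made up.

```python
from collections import defaultdict


def shas_shared_across_papers(rows):
    """Map each PDF sha to its cord_uids, keeping shas used by 2+ cord_uids."""
    sha_to_uids = defaultdict(set)
    for row in rows:
        # Assumed: a `sha` column, with multiple values separated by '; '.
        for sha in (row.get("sha") or "").split("; "):
            if sha:
                sha_to_uids[sha].add(row["cord_uid"])
    return {sha: uids for sha, uids in sha_to_uids.items() if len(uids) > 1}


rows = [
    {"cord_uid": "d9v5xtx7", "sha": "5d0d0bd116976e1412c10a84902894999df4a342"},
    {"cord_uid": "8avkjc84", "sha": "5d0d0bd116976e1412c10a84902894999df4a342"},
    {"cord_uid": "aaaa1111", "sha": ""},  # made-up row with no PDF
]
shared = shas_shared_across_papers(rows)
print(shared)
```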
Mailing list
Subscribe to notifications about CORD-19 at:
Please email and for any questions or concerns.
Citing CORD-19
Our paper was accepted to the NLP-COVID workshop at ACL 2020. See the reviews on OpenReview. The paper is available in the ACL Anthology (BibTeX below):
    @inproceedings{wang-etal-2020-cord,
        title = "{CORD-19}: The {COVID-19} Open Research Dataset",
        author = "Wang, Lucy Lu and
          Lo, Kyle and
          Chandrasekhar, Yoganand and
          Reas, Russell and
          Yang, Jiangjiang and
          Burdick, Doug and
          Eide, Darrin and
          Funk, Kathryn and
          Katsis, Yannis and
          Kinney, Rodney Michael and
          Li, Yunyao and
          Liu, Ziyang and
          Merrill, William and
          Mooney, Paul and
          Murdick, Dewey A. and
          Rishi, Devvret and
          Sheehan, Jerry and
          Shen, Zhihong and
          Stilson, Brandon and
          Wade, Alex D. and
          Wang, Kuansan and
          Wang, Nancy Xin Ru and
          Wilhelm, Christopher and
          Xie, Boya and
          Raymond, Douglas M. and
          Weld, Daniel S. and
          Etzioni, Oren and
          Kohlmeier, Sebastian",
        booktitle = "Proceedings of the 1st Workshop on {NLP} for {COVID-19} at {ACL} 2020",
        month = jul,
        year = "2020",
        address = "Online",
        publisher = "Association for Computational Linguistics",
        url = ""
    }
Projects using CORD-19
This is a Google Sheet tracking systems and demos that use CORD-19. Projects are listed in random order. Our focus here is to collect community efforts that might not be discoverable because systems and demos don't always translate to papers (which we can find via citations of CORD-19).
Missing yours or incomplete data? Let us know using this Google Form or email us!
Other resources
S2ORC-doc2json: We use this library to process PDFs and PubMed JATS XML into the format released in CORD-19. This library can be adapted to produce your own versions of the dataset. Source code and instructions for using the library can be found here.
Semantic Scholar API: Metadata, paper abstracts, and citation information for papers we index are available through our API. Documentation here.
S2ORC: A dataset of millions of full text papers processed in the same way as CORD-19, but covering many different fields of science. Not regularly updated; intended for offline research, like model development. Available here.
PubMed Central: The National Library of Medicine (NLM) continues to collaborate with publishers to make COVID-19 and coronavirus-related publications and associated data immediately accessible in PubMed Central (PMC) in human- and machine-readable forms. Available here.
LitCovid: NLM continues to update its LitCovid dataset of COVID-19 related publications to facilitate text mining. Available here.
Contributors 3
kyleclo Kyle Lo
lucylw Lucy Lu Wang
alexwade Alex Wade