Introduction

The National Cancer Institute’s (NCI’s) Genomic Data Commons (GDC)1,2 currently contains NCI-generated data from some of the largest and most comprehensive cancer genomic datasets, including The Cancer Genome Atlas (TCGA, https://cancergenome.nih.gov/) and Therapeutically Applicable Research to Generate Effective Therapies (TARGET, https://ocg.cancer.gov/programs/target). Each of these projects contains a variety of processed and unprocessed molecular data types, including genomics, epigenomics, proteomics, imaging, clinical, and others.

These data, as well as data from future projects, are often generated using different methods, so that joint analysis of multiple datasets are often confounded by batch effects. Based on the lessons learned from TCGA, one of the major goals of the GDC is to create a uniform set of molecular datasets that minimize batch effects due to differences in reference genomes, gene models, analytical algorithms, and processing pipelines. In the GDC, this process is called harmonization and is broken up into two stages: alignment and higher level data generation. In the alignment stage, reads from Next Generation Sequencing (NGS) data are extracted and aligned, or re-aligned, to a single human reference genome sequence using a single pipeline for that particular data type. In the second stage, different higher level data generation pipelines utilize the GDC aligned data to derive summary results, such as somatic mutations or gene expression.

Data production started with processing of Whole Exome Sequencing (WXS) alignment and somatic mutation calling, Whole Genome Sequencing (WGS) alignment, mRNA-Seq alignment and gene/exon level quantification, miRNA-Seq alignment and quantification, TCGA Affymetrix Genome-Wide Human SNP Array 6.0 array copy number segmentation, and Illumina Infinium HumanMethylation450/27 methylation array re-annotation. By the end of May 2016, we had successfully processed more than 56,168 BAM files and FASTQ bundles, including 27,594 WXS, 4691 WGS, 11,969 RNA-Seq and 11,914 miRNA-Seq, from 12,487 patients with a total input file size of 1001.7 TB. From GDC harmonized BAMs, more than 100,000 higher level derived data files have been generated. A total of more than 1400 TB of generated data, including intermediate files, have been produced by GDC data processing pipelines, and stored in GDC storage. Many of them have been imported into the GDC database (described in a companion submission), and have become queryable and downloadable from the GDC Data Portal (https://portal.gdc.cancer.gov/) and API. Users are able to get the up-to-date file summary directly from data repository facet view.

In this work, we discuss the general considerations, implementation details and quality comparisons of a large-scale uniform genomics data analysis that was used by the NCI’s Genomic Data Commons for processing and harmonizing the cancer genomics data that it shares with its users.

Results

Reference genome and gene model

The GDC chose GRCh38 as the reference human genome build for all data analyses, because of its improved coverage and accuracy over the previous major build GRCh373. The GRCh38 major human genome assembly was released by Genome Reference Consortium (GRC) on Dec 2013 with GenBank assembly accession GCA_000001405.15. The complete assembly downloaded from NCBI contains 456 sequences, including 25 continuous chromosomal and mitochondrial sequences, 42 unlocalized scaffolds, 127 unplaced scaffolds, 261 alternative scaffolds, and 1 EBV decoy sequence.

At the time of the development of the GDC pipelines, there was a lack of tool support for alternative scaffolds analysis. As a result, the GDC excluded alternative contigs from the GDC reference sequence, and used GCA_000001405.15_GRCh38_no_alt_analysis_set from the NCBI ftp site as a modified GCA_000001405.15 reference contig set without alternative loci and patches.

In addition to the continuous chromosomal and mitochondrial sequences, unlocalized and unplaced scaffolds described above, the GDC reference sequence also includes 2385 human decoy sequences, together named hs38d1 with GenBank assembly accession GCA_000786075.2. These human decoy sequences are additional unlocalized sequences not officially recognized by GRC at the time of reference release, and having them in the GDC reference helps to reduce false alignments of such reads in other regions4.

Similar to the idea of improving alignment quality with human decoy sequence, we collected genomic sequences from ten types (200 subtypes) of cancer related viruses in the reference genome, together called “virus decoy” as a collection. The additional benefit of having such virus sequences in the reference genome is to give users the opportunity to directly identify virus genome derived reads. These viruses include Human Cytomegalovirus (CMV, HHV-5), Epstein-Barr virus (EBV, HHV-4), Hepatitis B virus (HBV), Hepatitis C virus (HCV), Human immunodeficiency virus (HIV), Human herpesvirus 8 (KSHV, HHV-8), Human T-lymphotropic virus 1 (HTLV-1), Merkel cell polyomavirus (MCV), Simian vacuolating virus 40 (SV40), and Human papillomavirus (HPV) (Table 1). All HPV sequences were obtained from The PapillomaVirus Episteme (PaVE, http://pave.niaid.nih.gov/) instead of NCBI, because this group updated the sequences as new information is confirmed5. The GDC reference including all human and viral decoys can be downloaded from the GDC at the following link (https://api.gdc.cancer.gov/data/62f23fad-0f24-43fb-8844-990d531947cf).

Table 1 Virus Sequences included in the GDC reference genomea.

For variant annotation and RNA-Seq alignment/quantification, we use Human GENCODE release version 22 as the default GRCh38 gene model5,6. This version contains 60,483 genes, including 19,814 protein-coding genes, 15,900 long non-coding RNA genes, 9894 small non-coding RNA genes, and many other types, such as pseudogenes. For miRNA-Seq annotation and qualifications, GDC uses miRBase version 217.

Reproducibility of analysis

All major GDC data production pipelines are written in the Common Workflow Language (CWL, https://www.commonwl.org/). In each workflow, a main CWL file describes how tools and sub-workflows, also written in CWL, can be used in clearly defined steps. All major tools have been containerized using Docker containers to support reproducibility and portability of the workflows. The GDC will be redistributing the main GDC workflows to the research community to support reproducible research.

DNA-Seq alignment, mutation calling, annotation, and somatic variant aggregation

The DNA-Seq alignment process includes initial alignment and a post-alignment optimization process. In the initial alignment step, reads are mapped to GRCh38 using BWA, followed by BAM sorting, merging of read group BAMs into a single BAM, and then duplicate-marking using Picard. When the read length of a read group is larger than 70 bps, BWA MEM8 is used for alignment; or otherwise, BWA Aln8,9 is used.

After initial alignment of WXS data, we follow GATK Best Practices (https://software.broadinstitute.org/gatk/best-practices/) to process all BAMs from the same patient together for a post-alignment optimization process called “co-cleaning” in which includes GATK IndelRealigner and BaseQualityScoreRecalibration (BQSR). IndelRealigner performs local realignment to further improve mapping quality acrossing all reads at loci close to indels, and BQSR detects and fixes systematic errors made by the sequencer when it estimates the quality score of each base call10,11,12. For post-alignment optimization of WGS data, we only employ BQSR, but not IndelRealigner.

GDC uses four somatic variant callers, MuSE13, MuTect212, VarScan214, and SomaticSniper15. MuSE and SomaticSniper only detect point mutations, while MuTect2 and VarScan2 can detect both point mutations and small insertions and deletions (INDELs). The GDC provides these point mutations and small INDELs together as Simple Nucleotide Variants (SNV). INDEL calls from VarScan2 were not included in the initial GDC data release, but are included since release 10.

Artificial chimera reads can form during the Multiple Displacement Amplification reaction16, such as the one used to generate the WGA libraries by REPLI-g. This phenomenon is most obvious in MuTect2 calls as we observed about 10 folds increase of INS/SNP ratio when tumor is a WGA sample (data not shown), compared to other cases in the same project, and most of such INDELs are primarily supported by soft-clipped reads. In order to increase specificity, we re-analyzed all TCGA tumor WGA samples with MuTect2 option –dontUseSoftClippedBases, and successfully removed most of these false-positive INDELs. However, as these artifacts were introduced during library preparation, they could also exist in MuTect2 SNPs, as well as variants called by other tools, in a smaller scale. We recommend users to consume these WGA calls with care.

Raw Somatic Variant Call Format files (VCFs) generated from each pipeline are further processed by caller-specific filters to tag low quality variants in the FILTER column in VCF. For SomaticSniper, variants with Somatic Score (SSC) < 25 are removed. These VCFs are then annotated using Variant Effect Predictor17 to generate Annotated Somatic VCFs.

To allow easier investigation of variants and annotations, the VCFs are transformed into project-level tab-delimited Mutation Annotation Format (MAF) files18. This is done with a custom tool based upon VCF2MAF (https://github.com/mskcc/vcf2maf) from Memorial Sloan Kettering Cancer Center. VCF and MAF files may contain germline variants and therefore all VCFs and MAFs described above are only available as controlled-access files. To access controlled-access files in the GDC, users must obtain approval from the Data Access Committee through dbGaP (database of genotypes and phenotypes)19.

We have also created open-access MAF files by applying stringent criteria to remove potential germline variants. These open-access MAF files are used to support variant visualization and simpler data sharing. We did not produce open-access MAFs for the TARGET program because of privacy concerns of child sample donors. Mutation loads of point mutations and INDELs from both open-access (public) and controlled-access (protected) MAFs of all TCGA projects are displayed in Fig. 1. Of note, this germline-masking process is so stringent that some real somatic variants, for example somatic variants in areas of low sequencing coverage in the paired normal samples, may have been removed from open-access MAF. We encourage users to explore controlled-access MAFs to view additional variants.

Fig. 1: Mutation loads of TCGA projects.
figure 1

GDC-detected somatic variants per sample are displayed by each pipeline (rows), and grouped in each project (columns). Combined counts of point mutations (SNP) and INDELs of either public MAF (dark blue) or protected MAF (light blue) are plotted in separate colors.

Quality assessment of GDC somatic variants

Somatic variant detection is still in a stage of rapid algorithm development, and no single caller is superior to others in every respect20,21,22,23. As shown in Fig. 2a, a direct comparison of SNV calls from different callers show significant overlaps, but also many tool-specific calls. Here, we only compared high quality variants, which means they do not have a non-PASS caller assigned FILTER value or GDC assigned GDC_FILTER values in the somatic MAF files of data release version 10. In summary, 56.0% of the clean variants have been identified by all four callers, 15.1% by three callers, 14.0% by two callers, and 14.9% called by only one caller. Among all four somatic callers, MuTect2 has detected the highest number of unique calls, while SomaticSniper has the least. Additional efforts are needed to examine the validity of such calls, especially those not called by all pipelines, in order to evaluate the performance of each tool, and hopefully lead to machine-learning algorithms that help to unify caller outputs. Please note that these results are only relevant to the GDC implementation of these pipelines and filtering strategy.

Fig. 2: Comparison of GDC somatic variant caller pipelines.
figure 2

The Venn Diagram on the left (a) shows the overlap among four GDC somatic callers. Among all clean variants, 56.0% have been identified by all four callers, 15.1% by three callers, 14.0% by two callers, and 14.9% by only one caller. The Venn Diagram on the right (b) shows recall rate of validated TCGA variants by GDC somatic callers. Among 115,476 TCGA validated variants collected, 3.2% are not recalled by any of the GDC pipelines; 1.2% are recalled by only one pipeline; 9.4% are recalled by two pipelines; 14.6% are recalled by three pipelines; and 71.6% are recalled by all four GDC pipelines.

Evaluation of somatic variant callers often requires comparison with a so-called gold standard dataset that has been extensively sequenced using multiple independent methods23, or using a simulated dataset. In a previous effort to evaluate quality of somatic variant callers, TCGA re-sequenced regions of many called variants using orthogonal sequence technologies, such as Sanger Sequencing, and therefore have produced a set of validated variants that GDC can use to evaluate False-Negatives in current GDC callers. However, no validation experiments were designed for GDC-called variants, thus False-Positives or specificity will not be evaluated in this paper. An independent analysis has been performed by comparing GDC-generated somatic variants to variants previously identified by the TCGA Consortium the same samples, and concluded that both datasets are highly concordant24.

We extracted and lifted over all TCGA validation information from 196 MAFs available on May 2016 to GRCh38 coordinates, and selected mutation calls from the same tumor samples that GDC has also successfully analyzed with all four pipelines. This results in 1,911 unique tumor samples and 115,476 validated variants, across 13 TCGA projects (BLCA, BRCA, CESC, COAD, GBM, KIRC, LAML, OV, PAAD, READ, SARC, THYM, and UCS). Comparison to GDC called somatic variants of the sample tumor sample shows that only 3.2% of TCGA validated variants are not re-discovered by any of the GDC pipelines (Fig. 2b). This number further decreases to 1.6% if we only compare variants from exactly the same tumor and normal aliquot pairs. 95.6% of the validated variants are called by at least two GDC pipelines; 86.2% by at least three pipelines; and 71.6% called by all four GDC pipelines.

To further investigate the impact of cancer type or project to each caller, we analyzed the validated variant recall rate for each pipeline (Fig. 3). In most of the cancer types, MuTect2, MuSE, and VarScan2 show good performance, while SomaticSniper typically recalls less. In particular, SomaticSniper displays a very low average recall rate of <50% in PAAD (Pancreatic Adenocarcinoma). Interestingly, SomaticSniper has the best recall rate for LAML. The algorithm of SomaticSniper may have been designed to better tolerate high-level tumor contaminations in germline control samples, which often exists as infiltration of liquid tumor cells in skin or buccal swabs of LAML patients.

Fig. 3: Recall rate of TCGA validated variants by project.
figure 3

In both plots, n = 125/123/192/182/223/141/215/148/48/200/107/57 biologically independent samples for projects BLCA/BRCA/CESC/COAD/KIRC/LAML/OV/PAAD/READ/SARC/THYM/UCS, respectively. Top: Boxplots of recall rate of TCGA validated variants by 13 projects and four GDC somatic variant calling pipelines. Each dot represents a unique tumor sample. Projects are ordered by decreasing average recall rate from left to right. Bottom: Boxplots of recall rate of TCGA validated variants by number of pipelines combined.

RNA-Seq data processing and quality assessment

RNA-Seq analysis by the GDC makes use of a modified workflow (https://github.com/ucscCancer/icgc_rnaseq_align) created by the International Cancer Genome Consortium. Input RNA-Seq FASTQ files were aligned to GRCh38 using the STAR 2-pass method25, and quantified using HTSeq26 and DEXSeq27. The GDC uses GENCODE v22 as the default gene model for both mRNA-Seq alignment and quantification. DEXSeq exon-level quantification results have not yet been imported into GDC Data Portal at the time of publication.

Gene expression is measured using HTSeq. In this pipeline, only reads or read pairs that can be uniquely assigned to a gene are counted. HTSeq supports mapping of both stranded (either forward or reverse) and unstranded libraries; however, the read assignment is different between these two different modes resulting in an inability to compare across these library types. Because the GDC emphasizes data comparability across projects, we have run HTSeq on all projects as if they were unstranded libraries.

For convenience to the scientific community, the GDC also produces gene level quantification in the units of FPKM (Fragments Per Kilobase of transcript per Million mapped reads)28 and FPKM-UQ (Upper-Quartile Normalized FPKM)28,29. The definition of these units is described in the “Materials and Methods” section. Note that the denominators of such normalizations are read counts of all the protein coding genes, instead of all genes. If users are interested in a different set of genes, they are encouraged to perform a normalization based on the genes they are working on, or to use a more sophisticated method, such as DESeq30 or EdgeR31. In addition, GDC does not perform RNA expression batch corrections.

We also compared GDC FPKM-UQ expression data to the original TCGA upper-quartile normalized RSEM expression values using Spearman correlation. This comparison was performed over 10,243 shared aliquots and 18,038 shared genes in both dataset. The average correlation between the same sample from two datasets (Fig. 4. Top) is 0.944, and the majority of the samples have correlation higher than 0.90.

Fig. 4: Boxplots of Spearman correlation between GDC and TCGA mRNA expression.
figure 4

Top: Boxplots of Sample to Sample Correlation between GDC and TCGA by Project. n = 79/427/1202/309/45/328/48/171/170/546/89/603/321/145/525/423/573/550/86/265/182/186/548/105/265/472/404/156/564/121/199/56/80 biologically independent samples for projects ACC/BLCA/BRCA/CESC/CHOL/COAD/DLBC/ESCA/GBM/HNSC/KICH/KIRC/KIRP/LAML/LGG/LIHC/LUAD/LUSC/MESO/OV/PAAD/PCPG/PRAD/READ/SARC/SKCM/STAD/TGCT/THCA/THYM/UCEC/UCS/UVM, respectively. Bottom: Combined Boxplots and Density Plots of Gene to Gene Correlation between GDC and TCGA. All genes are categorized by four GDC groups (Q1–4) based on their average expression values. Mean and standard deviation of gene to gene Spearman’s correlations are calculated by these four groups.

We can also measure the relative expression of the same gene among different samples. The average correlation between the same gene from two datasets is 0.929. We suspected that most of the deviation is from sporadic low-level expressed genes. To address this concern, we further categorized genes into four quartile groups (Q1–Q4) based on their average expression values in the GDC, and then examined gene level correlations within each of these four groups with TCGA results (Fig. 4. Bottom). We observed an average correlation of 0.98 in high-level expressed Q3 and Q4 groups and much lower values in Q1 and Q2.

miRNA data processing and quality assessment

The GDC miRNA quantification analysis workflow is based on the profiling pipeline that was developed by the British Columbia Genome Sciences Centre32. After realignment of miRNA-Seq reads to GRCh38 using BWA Aln, the profiling pipeline generates TCGA-formatted miRNA gene expression and isoform expression results by comparing the individual reads to sequence feature annotations in miRBase v.217. Of note, however, the tool only annotates those reads that have an exact match with known miRNAs in miRBase and therefore does not identify novel miRNA or transcript with mutations. Similar to RNA-Seq, miRNA expression data has not been batch corrected.

We compared TPM (Transcripts Per Kilobase Million) normalized miRNA gene expression from GDC GRCh38 pipeline to the original TCGA Hg19 pipeline of the same aliquot using Spearman correlation. Similar to what we have described before for mRNA expression, only 9516 shared aliquots, and 641 shared mature miRNAs are used in this comparison. As shown in the boxplots at the bottom of Fig. 5, the majority of the samples have a correlation coefficient of greater than 0.975, with an average of 0.984. Three cancer types, colon adenocarcinoma (COAD), rectum adenocarcinoma (READ), and ovarian carcinoma (OV), have relatively lower correlation, which may reflect specific sensitivities of miRNA species in these cancer types to the reference genome and miRNA database versions.

Fig. 5: Boxplots of Spearman correlation between GDC and TCGA miRNA expression.
figure 5

Top: Boxplots of Sample to Sample Correlation between GDC and TCGA by Project. n = 80/429/849/312/45/221/47/198/5/532/91/326/326/103/526/424/498/387/87/461/183/187/547/76/263/452/430/156/569/126/444/56/80 biologically independent samples for projects ACC/BLCA/BRCA/CESC/CHOL/COAD/DLBC/ESCA/GBM/HNSC/KICH/KIRC/KIRP/LAML/LGG/LIHC/LUAD/LUSC/MESO/OV/PAAD/PCPG/PRAD/READ/SARC/SKCM/STAD/TGCT/THCA/THYM/UCEC/UCS/UVM, respectively. Bottom: Combined Boxplots and Density Plots of miRNA to miRNA Correlation between GDC and TCGA by Average Expression Level. All miRNAs are categorized in “Low-Expressed” and “Other” groups. Mean and standard deviation of miRNA to miRNA Spearman’s correlations are shown.

Expression level groups are also created in the miRNA datasets for detailed correlation comparisons (Fig. 5. Bottom). Because there are many fewer miRNAs compared to mRNA genes, and many of them are expressed at low expression levels, we categorized miRNA into two groups. In the “Low-Expressed” group, all miRNAs show low (Q1 or Q2) in both TCGA and GDC quantifications; and the rest of miRNAs belong to the “Other” group. The average Spearman correlation in these two groups are about 0.956 and 0.979, respectively.

Array-based copy number variation data processing

TCGA used Affymetrix Genome-Wide Human SNP 6.0 (SNP6) array data to identify genomic regions that have Copy Number Variations (CNV) by aggregation of typed loci into larger contiguous regions. Direct liftover of region boundaries from hg19 to GRCh38 results in fragmented segments with poor data quality due to change of probe loci between different genome builds. For this reason, the GDC SNP6 pipeline is built onto the existing TCGA level 2 tangent-normalized copy number data generated by Birdsuite33 and uses the DNAcopy version 1.44.034 R-package to perform a Circular Binary Segmentation (CBS) analysis35 with GRCh38 probeset metadata. This pipeline normalizes noisy intensity measurements into chromosomal regions of equal copy number in the form of segment mean values, which are equal to log2(copy-number/2), so that diploid regions will have a segment mean of zero, amplified regions will have positive values, and deletions will have negative values.

To be consistent with the original TCGA data, two different output files were produced for each input: copy number segment files that generated from all probes, and masked copy number segment files, equivalent to the original TCGA “nocnv” file, generated by excluding certain probes that have been previously identified to carry copy number variations in a pool of germline samples.

Array-based methylation data processing

TCGA used Illumina Infinium HumanMethylation27 (HM27) and HumanMethylation450 (HM450) BeadChip to measure the level of methylation at known CpG sites as beta values, calculated from array intensities (Level 2 data) as Beta = M/(M + U), where M is the methylated probe intensity and U is the unmethylated probe intensity. The GDC inherited these beta values from existing hg19-based TCGA Level 3 DNA methylation data, and re-annotated each probeset with new metadata information based on GRCh38 and GENCODE v22.

Integrated genomic data clustering

To show how users can take advantages of GDC harmonized datasets for cross-project analysis, here we demonstrate integrated analysis with t-distributed stochastic neighbor embedding (t-SNE)36,37 to reduce high dimensional information in the molecular data to two dimensions (Fig. 6). We utilized four distinct GDC generated molecular features of TCGA primary tumor samples, including mRNA expression, miRNA expression, methylation and copy number variation. Our results are shown only for TCGA data, but could be expanded to other GDC projects.

Fig. 6: 2D t-SNE clustering of 32 TCGA projects.
figure 6

Combined genomic and epigenomic signals of 4 data types from each TCGA patient, including somatic gene-level copy number, somatic RNA expression, somatic miRNA expression and somatic DNA CpG methylation patterns, are aggregated and superimposed into a 2-dimensional space using t-SNE algorithm. Each dot in the plot represents one patient, and patients from different TCGA projects (and cancer types) are distinguished color and shape of the dot.

In our result, different cancer types are well separated and some interesting patterns arise. For example, it has been previously suggested that colon and rectum cancer should be grouped as one colorectal cancer38 and that esophageal adenocarcinomas resembles a subtype of stomach cancer39. Our method supports those observations. In our 2D-clustering the majority of the colon cancer (COAD) samples are co-clustered with rectum cancer (READ) samples, and some esophageal cancer samples cluster together with stomach cancer (STAD) samples. We also identify a few samples designated as primary tumor in one cancer type, but which cluster together with samples from another cancer type. For example, a paraganglioma and phenochromocytoma sample is located within a cluster of adrenocortical carcinoma samples.

Discussion

The rapid decrease in sequencing costs has led to a rapid increase in the resources needed for storage and computation. The GDC provides a solution for this problem by centralizing storage and processing of genomics data. This model currently enables researchers to perform analysis in three different ways: (1) Quick data analysis and exploration using existing GDC visualization tools without the need to download files; (2) Data analysis using GDC generated high level processed data. These files are much smaller than the raw data; (3) Resource-extensive data analysis by downloading raw sequencing data to other data centers or commercial clouds.

The genomic data available through the GDC are analyzed uniformly using common algorithms and pipelines, and will be reanalyzed in the future as improved algorithms and methods are developed. As there is no consensus among the scientific community on the best algorithms on somatic variant detection, the GDC implements four callers, and each generates its own set of variant calling output. As shown in Fig. 2a, b, only 55.97% of the variants are called by all four pipelines; while this category also includes 71.57% of the TCGA validated variants. This suggests a simple strategy to combine results from multiple callers to increase specificity. The boxplots in the bottom of Fig. 3 show the effect of a combined caller approach to the recovery rate of TCGA validated variants by different cancer types. The downside of using a consensus call is a decrease in sensitivity. While we cannot measure sensitivity directly, a good approach may be to combine results from only two somatic callers rather than require a variant be called by all four pipelines. The GDC is in the development stage to generate a new set of MAFs that contain merged results from each tool.

The GDC data harmonization process also makes cross-project analysis much simpler, and reduces artificial findings due to differences in algorithms. Users will be then able to perform joint analysis of data originating from different projects because all data are processed by the same tools and represented in the same format.

Major GDC production workflows are available in GitHub https://github.com/NCI-GDC/gdc-workflow-overview. As workflow updates are inevitable for an on-going data processing project like the GDC, users are able to find workflow versions using the GDC data portal and API. In addition, major version changes are also described in detail in the GDC workflow documentation. When new workflows are introduced, GDC will reprocess old data if necessary.

The NCI’s Genomic Data Commons has provided a feasible and scalable model for easy sharing of large set of genomics data. We hope our efforts in data sharing and uniform data processing will be valuable not only to researchers, but also to clinicians, patients, and other interested parties, and help accelerate the long-term goal of precision oncology.

Methods

Pipeline development and production

The GDC takes advantage of containerization technology and built pipelines using Docker (www.docker.com) to ensure pipeline scalability, portability, and reproducibility. The alignment was managed by an in-house job management system that creates new virtual machines of the designated configuration on demand in an OpenStack environment. The corresponding Docker container that holds the entire workflow then runs inside of the virtual machine for an additional layer of security.

During development of high-level data generation pipelines, we transitioned from single Docker workflows to using Common Workflow Language (CWL, www.commonwl.org) to describe analysis workflows of multiple Dockerized tools. CWL provides an additional transparent layer between workflow description and workflow execution, that allows even better scalability through parallel execution and portability. The GDC managed CWL pipeline production uses the Slurm Workload Manager (slurm.schedmd.com).

DNA-Seq alignment and co-cleaning

The GDC DNA-Seq alignment pipeline follows GATK Best Practices (https://software.broadinstitute.org/gatk/best-practices/). The main steps include regenerating FASTQ files from BAM input on a per-read group basis using biobambam240 and alignment by read group using BWA (version 0.7.12) in both paired-end and single-end mode. This was followed by BAM sort, merge, and MarkDuplicates using Picard41,42. The GDC maps reads using either BWA MEM mode if length is equal or larger than 70 bp or BWA Aln mode if below. BWA Aln is also used when mapping older FASTQ reads formatted with Illumina-1.3 and Illumina-1.5 quality scores. Multiple QC metrics were collected both before and after realignment using FastQC43, samtools41 and Picard42. Re-aligned BAM files from the same patient are then collected together for co-cleaning using GATK 3.6 IndelRealigner and BaseQualityScoreRecalibration.

The TCGA and TARGET BAMs to be harmonized were originally processed over a relatively large time scale in relation to the development of NGS technology. Some originated from as early as 2010 and were generated by a variety of workflows and reference genomes (11 different reference genomes in total including variants of hg18, hg19, GRCh36, and GRCh37). Some of the workflows had introduced incorrect information into the data, such as sample swapping and mislabeled experimental strategies, which were corrected in the GDC harmonization process.

Somatic variant calling, filtering, and MAF generation

The initial GDC release includes four somatic variant callers: MuTect2, VarScan2, MuSE, and SomaticSniper.

MuTect2 is built upon the capability of local de novo assembly by HaplotypeCaller and somatic genotyping engine of Mutect. Mutect applies a Bayesian classifier to detect somatic mutations11. The GDC uses MuTect2 tools from the GATK nightly-2016-02-25-gf39d340 version. Before tumor normal pairs can be used for somatic variant calling, it is important to generate a Panel of Normals (PoNs) filter that contains calling artifacts and potential germline variants. As mentioned previously, whole genome amplified (WGA) samples are analyzed with dontUseSoftClippedBases turned on.

VarScan2 is another somatic variant caller that identifies both SNV and INDELS. It uses heuristics and statistics to identify variants and considers the confounding impacts of read depth, base quality, variant allele frequency and statistical significance13. GDC uses VarScan2 version 2.3.9. The first step of VarScan2 calling is to generate a mpileup file of both tumor and normal BAMs using samtools for a single mpileup file. We set the quality cutoff for samtools to be 1 and also disabled Base Alignment Quality score computation. The mpileup is then used as input to VarScan Somatic to generate a VCF file that contains both SNP and INDEL calls. The resulting VCF is filtered for significant calls using VarScan ProcessSomatic.

MuSE calls somatic variants using Markov Substitution model for Evolution12. The first step, “MuSE call”, estimates the equilibrium frequencies of all four alleles and presents the maximum a posteriori on every genomics locus. The second step, “MuSE sump”, performs a tier based cutoff based on a sample-specific error model which also takes dbSNP information into account. GDC uses MuSE version 1.0rc_submission_c039ffa. Parallelization can be implemented for the first step of MuSE, based on genomic chunks, which can accelerate the production close to linear. The GDC currently only passes calls with quality filter “PASS” to the GDC public MAF files; however, variants with other quality Tier values could also be considered a user’s discretion.

SomaticSniper is a somatic variant caller that only identifies SNPs. It uses a bayesian inference to compare genotype likelihoods between tumor and normals14. GDC uses the default parameter settings of SomaticSniper version 1.0.5.0.

In addition to the built-in filters in each somatic caller, the GDC also applies additional filtering tools to label caller-generated variants. Because these filters are frequently updated, we have highlighted only a few of the major steps below.

  1. A.

    False Positive Filter (FPFilter, https://github.com/ucscCancer/fpfilter-tool) was applied to both VarScan2 and SomaticSniper VCFs.

  2. B.

    SomaticSniper variants with SSC < 25 are removed from annotated VCFs. This is the only step in the entire GDC somatic variant pipeline in which low-quality variants are removed, instead of tagged.

  3. C.

    A WXS Panel of Normals was generated internally by MuTect2 calling on about 5,000 TCGA normal WXS samples in artifact detection mode and combined using GATK CombineVariants. The GDC received the sample list from the TCGA Genomics Data Analysis Center (GDAC) as TCGA normal samples that were previously identified to be free of hematopoiesis events (unpublished) at the time of GDC data processing. This PoN is not only used as a MuTect2 built-in filter44, but also applied to the other three somatic calling outputs in a similar manner.

  4. D.

    d-ToxoG (http://archive.broadinstitute.org/cancer/cga/dtoxog) is used to remove oxoG artifacts from point mutation calls. These artifacts were generated due to oxidative DNA damage during sample preparation45.

  5. E.

    DKFZ Strandbias Filter (https://github.com/eilslabs/DKFZBiasFilter) is used to tag variants that are supported with significant bias from one strand direction compared to the other.

Mutation Annotation Format (MAF) is a tab-delimited text file with aggregated mutation information from VCF Files and are generated on a project-level. The GDC currently produces two types of MAF files: controlled-access MAFs that contain all variants in VCFs, and open-access somatic MAFs that contain filtered variants and reduced germline contaminations and thus considered “high quality”. Any user can explore the open-access somatic MAF for high quality calls; while a more sophisticated user may want to apply for dbGaP access to obtain the superset of mutations in the controlled-access MAF. With the larger set of mutations they may perform custom filtering based on FILTER and GDC_FILTER columns, or collect information that was removed from the open-access version, such as supporting read depth in the normal samples.

The specification of the GDC MAF can be found at https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/.

RNA-Seq alignment

The RNA-Seq alignment pipeline performs alignments of raw reads against the reference genome using a two-pass approach using STAR25. The first pass alignment recognizes splice junctions in the sample, and the second pass uses those splice junctions to perform the final alignment. STAR version 2.4.0f1 was initially used and we then switched to version 2.4.2a that fixed a bug and allowed us to complete processing all the input files. If both BAM and FASTQ input files existed, only FASTQ files were used.

HTSeq mRNA quantification

The HTSeq (version 0.6.p1)26 pipeline is used for calculating the number of reads that align to different genes in the genome. As mentioned previously, only reads that can be uniquely assigned to a gene are counted. The GDC ran HTSeq on all samples as unstranded libraries in order to maintain consistency for cross-sample comparisons.

The raw counts are normalized into Fragment per kilobase million mapped reads (FPKM) and Upper Quartile Normalized FPKM (FPKM-UQ) using all protein-coding genes as the denominator.

$$FPKM = \frac{{RC_g \times 10^9}}{{RC_{pc} \times L}}$$
(1)
$$FPKM - UQ = \frac{{RC_g \times 10^9}}{{RC_{g75} \times L}}$$
(2)

where,

  • RCg: Number of reads mapped to the gene

  • RCpc: Number of reads mapped to all protein-coding genes

  • RCg75: The 75th percentile read count value for genes in the sample

  • L: Length of the gene in base pairs

DexSeq exon quantification

The GDC has also generated exon-level quantification using the DEXSeq27,46 pipeline. The first step in this pipeline is to create the flattened General Feature Format (GFF) file, which essentially collapses the information for multiple transcripts spanning the same exon into exon counting bins for that exon. Once the flattened GFF file is obtained, the number of reads that overlap with each exon counting bin are calculated. The result is a flat file which has raw counts for each exon. This data type is not currently available in the GDC data portal and will appear in a later data release.

miRNA-Seq alignment and profiling

The GDC miRNA harmonization pipeline begins with a realignment of TCGA and TARGET miRNA-Seq reads using a similar strategy of the GDC DNA-Seq alignment pipeline. Because reads of miRNA-Seq are typically short, only BWA Aln was used.

miRNA quantification is done with a modified version of the miRNA Profiling Pipeline v0.2.732 from BCGSC (British Columbia Genome Sequencing Center). In this pipeline, miRNA species and miRNA isoforms are counted differently, and normalized Reads Per Million values are also derived. The final results from each miRNA-Seq sample is a miRNA species quantification file and a miRNA isoform quantification file, in a human-readable format compatible to the original TCGA data.

SNP 6.0 array copy number segmentation

The hg19-based probeset metadata were obtained from the Affymetrix website, and then lifted over to GRCh38. Probes with reference bases not matching between hg19 and GRCh38 were removed.

To generate Copy Number Segment file, all SNP and CNV probes are used for CBS calculation, with the only exception that probes in the Pseudo-Autosomal (PAR) regions were removed in males prior to calculation. To generate the Masked Copy Number Segment file from this result, all probesets in chromosome Y and in the frequent copy number variant regions in germlines obtained from GenePattern were also removed prior to calculation.

Methylation array beta value annotation

Using probe sequence information provided in the manufacturer’s manifest, HM27 and HM450 probes were remapped to the GRCh38 reference genome47. Type II probes with a mapping quality of <10, or Type I probes for which the methylated and unmethylated probes map to different locations in the genome, and/or had a mapping quality of <10, had an entry of ‘*’ for the ‘chr’ field, and ‘−1’ for coordinates47. These coordinates were then used to identify the associated transcripts from GENCODE v22, the associated CpG island, and the CpG sites’ distance from each of these features. Multiple transcripts overlapping the target CpG were separated with semicolons. Beta values were inherited from existing TCGA Level 3 DNA methylation data (hg19-based) based on Probe IDs.

Variant comparison

The same genetic variant can be represented in VCF format in multiple different ways48, and many of these discrepancies can not be easily solved by existing normalization tools. In order to reduce false-positive annotations, GDC requires a strict matching of Chromosome, Position, and Alternative Alleles during implementation of MAF annotations. However, in various variant comparisons in this paper, we applied a loose matching strategy to regard two variants the same if they have overlapping regions between starting and ending positions. This is particularly useful when a non-INDEL caller, such as SomaticSniper or MuSE, represents INDEL sites as point mutations.

t-SNE clustering

mRNA expression count, miRNA expression count, Copy Number Segmentation, and Methylation Beta Values were collected from the GDC Data Portal. For mRNA expression and miRNA expression, we removed low-expressed genes and miRNAs if 99% or more samples have less than or equal to 1 count. For Copy Number Segments, we computed average segmentation means on each gene weighted on overlapped length between segments and genes. For methylation data, we removed probes that are empty in more than 5% of the samples, and imputed the remaining empty values with the probeset mean. The methylation data is also randomly down-sampled by 25% of the probes to reduce computational burden.

To integrate these data together for t-SNE clustering, we first computed the top 200 Principal Components (PCs) from each of the four data types using the R function prcomp. The top 200, ranked by “variation explained”, from these 800 combined PCs were selected and further scaled to the arbitrary weights of 3:3:1:1 for mRNA:Methylation:miRNA:CNV, and finally used as input for the t-SNE analysis with the R package Rtsne. We gave less weights to miRNA and copy number because they don’t separate cancer types as well compared to the other two data types. We found that the 3:3:1:1 ratio gives the best visual performance in the scatter plot compared to the other combinations tested. We ran t-SNE 1000 times with random seeds and displayed the result that minimizes the cost function49.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.