Whole-exome sequencing (WES), the analysis of all known protein-coding sequences in the human genome, has been used for many years in clinical genetic diagnostics. Although whole-genome sequencing (WGS) is becoming cheaper as sequencing costs decrease, it remains significantly more expensive in terms of sequencing, data processing, and storage. At an average coverage of 30x, a WGS dataset contains more than 6 times the sequencing data of a typical WES analysis at 120x. On the other hand, WGS covers nearly the complete genome, while WES is limited to coding regions and adjacent regions such as intron borders and UTRs. Furthermore, the enrichment step used in WES protocols may result in more uneven coverage, leaving some relevant coding regions insufficiently covered.
To understand these differences, we compared a deep-sequenced WGS dataset (average coverage 133x) with WES data. The analysis has two aims. The first is to establish how the two methods differ in their coverage of diagnostically relevant regions. To this end, we used a combined database of all known coding sequences from CCDS and all known disease-causing non-coding mutations from HGMD. The second is to show that most missing regions are not due to the library preparation (WES vs. WGS) but to issues of mappability and necessary filtering steps during data processing. By using deep WGS as the gold standard, our results can help researchers evaluate the varying claims made by sequencing providers about the strengths and limitations of WES analyses.
For WES analyses, we show data from both a commercially available exome kit and CeGaT’s proprietary Exome Xtra. CeGaT Exome Xtra is based on Twist’s Core exome, with Twist’s RefSeq spike-in added to cover further relevant genes and transcripts. It is augmented with (1) all manually curated coding and non-coding regions from CeGaT’s over 20 diagnostic panels covering hundreds of inherited diseases, (2) all pathogenic and likely pathogenic non-coding variants described in HGMD, (3) all pathogenic and likely pathogenic non-coding variants published in ClinVar, (4) the complete mitochondrial genome, (5) the remaining coding regions from CeGaT’s gene database, which is based on CCDS, Gencode, Ensembl, and curated RefSeq, and (6) regions with pharmacogenetically relevant variants in selected genes.
We evaluate three metrics: average coverage, evenness of coverage (Fold80), and completeness of coverage (% covered ≥30x). Evenness of coverage indicates how efficient a protocol is: the more even the coverage, the less raw data is needed to cover most regions well. Fold80 measures the width of the coverage distribution (see figure 1). Its optimal value is 1.0 (all regions covered equally), and good enrichment protocols reach values of 1.1-1.3. Finally, completeness is the most important metric for clinical diagnostics, as incomplete coverage means that some regions cannot be evaluated, which reduces the sensitivity of the test.
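As an illustration only, the three metrics can be sketched in a few lines of Python. The sketch assumes the commonly used definition of Fold80, namely mean coverage divided by the 20th-percentile coverage (the depth that 80% of target bases reach or exceed); the text above does not spell out the exact formula, so this is an assumption. The function name `coverage_metrics` and the example depths are hypothetical, not taken from the analysis described here.

```python
def coverage_metrics(depths, min_depth=30):
    """Summarise per-base read depths over the target regions.

    depths: list of per-base coverage values (hypothetical input format).
    min_depth: threshold for completeness (30x in the text above).
    """
    if not depths:
        raise ValueError("depths must not be empty")

    n = len(depths)
    average = sum(depths) / n

    # 20th percentile: after sorting, at least 80% of bases have a depth
    # greater than or equal to the value at index n // 5.
    p20 = sorted(depths)[n // 5]

    # Assumed Fold80 definition: mean coverage / 20th-percentile coverage.
    # A perfectly even distribution gives exactly 1.0.
    fold80 = average / p20 if p20 > 0 else float("inf")

    # Completeness: percentage of bases covered at or above min_depth.
    pct_covered = 100.0 * sum(1 for d in depths if d >= min_depth) / n

    return {"average": average, "fold80": fold80, "pct_covered": pct_covered}


# Made-up example: ten target bases with uneven coverage.
example = [10, 20, 30, 30, 30, 40, 40, 40, 50, 60]
metrics = coverage_metrics(example)
print(metrics)  # average 35.0, fold80 about 1.17, pct_covered 80.0
```

In this toy example, two of ten bases fall below 30x, so completeness is 80% even though the average coverage is 35x. This is why completeness, not average coverage, is the metric that determines diagnostic sensitivity.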