Tutorial on Whole Exome Sequencing Analysis

Amit U Sinha, Ph.D
Last Updated: Nov 7, 2019

Next-generation sequencing (NGS) methods have made large-scale, massively parallel DNA sequencing analysis increasingly practical. Among NGS methods, whole exome sequencing (WES) aims to sequence and detect variations in the exonic regions of the genome.

WES vs WGS: Advantages and Disadvantages

Although whole genome sequencing (WGS) can be used for genetic diagnosis, WES can be the better method depending on disease type and complexity. WES is, first of all, cheaper: it has lower data storage costs and requires less laborious downstream data analysis than WGS. The reduction in data storage is substantial, with 90 GB or more needed for a typical WGS file compared to 5-6 GB for a WES file.
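To put the storage figures in perspective, here is a minimal sketch of the arithmetic. The per-sample sizes are the ones quoted above; the cohort size of 100 samples and the function names are assumptions made purely for illustration.

```python
# Per-sample file sizes quoted in the text (GB).
WGS_GB_PER_SAMPLE = 90.0
WES_GB_PER_SAMPLE_RANGE = (5.0, 6.0)  # low and high end of the quoted range


def cohort_storage_gb(n_samples: int, gb_per_sample: float) -> float:
    """Total storage in GB for n_samples files of the given size."""
    return n_samples * gb_per_sample


def fold_reduction(wgs_gb: float, wes_gb: float) -> float:
    """How many times smaller a WES file is than a WGS file."""
    return wgs_gb / wes_gb


if __name__ == "__main__":
    n_samples = 100  # hypothetical cohort size, not from the text
    print(f"WGS: {cohort_storage_gb(n_samples, WGS_GB_PER_SAMPLE):.0f} GB")
    for wes_gb in WES_GB_PER_SAMPLE_RANGE:
        total = cohort_storage_gb(n_samples, wes_gb)
        fold = fold_reduction(WGS_GB_PER_SAMPLE, wes_gb)
        print(f"WES at {wes_gb} GB/sample: {total:.0f} GB ({fold:.0f}x smaller)")
```

With these figures, a WES file is roughly 15 to 18 times smaller than a WGS file, and the gap grows linearly with the number of samples.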

The biggest advantage of WGS is that it has broader coverage of the genome and allows for the detection of more variant types. But even though only about 2% of the genome corresponds to coding regions, about 90% of known disease-causing variants map there. Therefore, despite the difference in coverage, whole exome sequencing analysis remains a cost-effective alternative to whole genome sequencing.

Clinical Relevance of Mutations and Structural Variants

The WES approach has applications ranging from point variant to structural variant identification. Within the point mutation class, single nucleotide variants (SNVs) are the most frequent type observed. The common classes of small coding variants studied include synonymous, missense, nonsense, in-frame, frameshift, and splice-site mutations. Insertions or deletions (indels) of 2-30 base pairs are another common type of mutation detected by WES. Although WGS is generally preferred for the identification of structural variants, WES also allows for the detection of copy number variants (CNVs) and other chromosomal deletions.

During downstream analysis, the mutation class strongly influences how the clinical relevance of a variant is assessed. In general, most variants identified in WES analysis are synonymous and therefore do not affect the encoded protein, save for some specific cases. Similarly, depending on the probe set design, WES may also detect a few intronic mutations, which typically do not have clinical relevance. In contrast, missense variants cause amino acid changes in the protein and can be highly informative, depending on the disease mechanism. Nonsense and frameshift mutations may have a drastic effect on protein function, since they introduce a premature stop codon and shift the reading frame by insertion or deletion of base pairs, respectively. In-frame mutations, by contrast, insert or delete base pairs in multiples of three and therefore, unlike frameshift mutations, preserve the reading frame.
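The mutation classes described above can be told apart with very little logic. The following is a minimal sketch, not how any particular annotation tool works: it assumes the reference and alternate codons (or alleles) are already known and in frame, uses the standard genetic code, and ignores splice-site and multi-codon effects. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), with codons enumerated
# in T, C, A, G order at each position; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snv(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-codon substitution by its effect on the protein."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"   # a premature stop codon is introduced
    if ref_aa == "*":
        return "stop-lost"  # the original stop codon is removed
    return "missense"       # one amino acid is replaced by another


def classify_indel(ref_allele: str, alt_allele: str) -> str:
    """Classify a coding indel by whether it preserves the reading frame."""
    length_change = abs(len(alt_allele) - len(ref_allele))
    if length_change == 0:
        raise ValueError("alleles have equal length; this is not an indel")
    return "in-frame" if length_change % 3 == 0 else "frameshift"


if __name__ == "__main__":
    print(classify_snv("GAA", "GAG"))    # Glu -> Glu
    print(classify_snv("GAG", "GTG"))    # Glu -> Val
    print(classify_snv("CAG", "TAG"))    # Gln -> stop
    print(classify_indel("A", "ATTT"))   # three bases inserted
    print(classify_indel("ACT", "A"))    # two bases deleted
```

The indel check makes the difference between in-frame and frameshift mutations concrete: only the length change modulo three matters.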

Probe Design

A crucial step in WES is exon enrichment, in which coding regions are captured through hybridization with DNA probes. Typically, the probes hybridize to their target sequences and are bound to magnetic beads, so that the captured fragments can be pulled down and then amplified. Commercial kits may differ in probe type and capture method, so it is essential to consider which exome capture kit is used: a poor choice could lead to non-uniform coverage of some regions. Probes can also be custom-designed, depending on the goals of the investigation. To this end, public databases can be used to select target regions to be amplified. Some details must be considered before designing probes for targeting exons, however, because many factors can alter the quality of WES results, such as GC-rich regions, DNA fragment quality, insert size, and the presence of repetitive elements in the sequence.
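GC content is one of the easier factors to check when evaluating candidate target regions or probes. Below is a minimal sketch of such a check. The 35% and 65% thresholds are illustrative assumptions, not values recommended by any kit vendor, and the function names are hypothetical.

```python
def gc_fraction(sequence: str) -> float:
    """Fraction of G and C among the unambiguous (A/C/G/T) bases."""
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return (seq.count("G") + seq.count("C")) / counted


def flag_probe(sequence: str, low: float = 0.35, high: float = 0.65) -> str:
    """Label a probe by GC content; thresholds are illustrative only."""
    gc = gc_fraction(sequence)
    if gc < low:
        return "AT-rich"
    if gc > high:
        return "GC-rich"
    return "balanced"


if __name__ == "__main__":
    for probe in ("GCGCGCGCAT", "ATATATATGC", "ATGCATGC"):
        print(probe, f"{gc_fraction(probe):.2f}", flag_probe(probe))
```

Regions flagged at either extreme are the ones most likely to show uneven capture, so they are worth inspecting before committing to a design.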

Analyzing WES Data

High-quality results in exome analysis depend heavily on how the dataset is processed. Protocols for whole exome sequencing data analysis therefore include several steps: quality control (QC), raw read preprocessing, short read mapping, post-alignment processing, variant calling and annotation, and variant prioritization. Raw data may contain contaminants and artifacts introduced during the sequencing process, such as sequencing errors, low-quality reads, adapters, and duplicates. QC metrics assess the quality of the data by generating basic statistical measures of depth, coverage, adapter content, GC content, and base distribution. Basepair’s pipelines implement QC using the fastp tool.
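To give a feel for what these QC statistics are, here is a minimal sketch that computes a few of them from FASTQ records. It is not how fastp works internally. It assumes unwrapped four-line records and Phred+33 quality encoding, and all names are hypothetical.

```python
from statistics import mean
from typing import Iterable, Iterator, NamedTuple


class Read(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield reads from four-line FASTQ records (no line wrapping assumed)."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, sequence, _plus, quality = stripped[i:i + 4]
        yield Read(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> list:
    """Decode a quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


def qc_summary(reads: Iterable[Read]) -> dict:
    """Basic per-dataset statistics: read count, length, GC, base quality."""
    reads = list(reads)
    if not reads:
        raise ValueError("no reads to summarize")
    all_bases = "".join(read.sequence.upper() for read in reads)
    all_scores = [s for read in reads for s in phred_scores(read.quality)]
    gc_bases = all_bases.count("G") + all_bases.count("C")
    return {
        "reads": len(reads),
        "mean_length": mean(len(read.sequence) for read in reads),
        "gc_fraction": gc_bases / len(all_bases),
        "mean_quality": mean(all_scores),
    }


if __name__ == "__main__":
    example = "@r1\nGGCC\n+\nIIII\n@r2\nATAT\n+\n!!!!\n"
    print(qc_summary(parse_fastq(example.splitlines())))
```

Real QC tools report these measures per position and per read as well, which is what makes problems such as quality drop-off toward the end of reads visible.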

Since artifacts are present in raw data, read preprocessing steps such as trimming, filtering, or adapter clipping are strongly recommended to avoid mapping biases during the read alignment step. For mapping reads to the reference genome, Basepair supports two leading tools: Bowtie and BWA. Both perform reference-based mapping. At this stage, critical QC metrics such as depth and coverage of genomic regions are evaluated. After this, post-alignment processing steps remove multi-mapped and duplicated reads to minimize allelic biases during the variant calling step.
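The preprocessing steps mentioned above can be sketched in a few lines. This is a simplified illustration, assuming exact adapter matches and Phred+33 qualities; production trimmers also handle partial and mismatched adapters. The function names, the quality threshold of 20, and the minimum length are assumptions chosen for the example.

```python
from typing import Optional, Tuple


def clip_adapter(sequence: str, quality: str, adapter: str) -> Tuple[str, str]:
    """Remove the first exact adapter match and everything after it."""
    if not adapter:
        return sequence, quality
    index = sequence.find(adapter)
    if index == -1:
        return sequence, quality
    return sequence[:index], quality[:index]


def trim_low_quality_tail(sequence: str, quality: str,
                          min_quality: int = 20,
                          offset: int = 33) -> Tuple[str, str]:
    """Drop bases from the 3' end until one meets the quality threshold."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


def preprocess(sequence: str, quality: str, adapter: str,
               min_quality: int = 20,
               min_length: int = 20) -> Optional[Tuple[str, str]]:
    """Clip, trim, then filter: returns None if the read ends up too short."""
    sequence, quality = clip_adapter(sequence, quality, adapter)
    sequence, quality = trim_low_quality_tail(sequence, quality, min_quality)
    if len(sequence) < min_length:
        return None
    return sequence, quality


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGC"  # example adapter sequence
    seq = "ACGTACGTAC" + adapter
    qual = "I" * 8 + "##" + "I" * len(adapter)
    print(preprocess(seq, qual, adapter, min_length=5))
```

Clipping comes before quality trimming here so that high-quality adapter bases do not shield a low-quality tail in the genuine insert.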

The variant calling step calculates the probability that a genetic variant is truly present in the sample analyzed. One of the most popular software packages for variant calling is GATK. To avoid false-positive SNP calls, it is important to set proper parameters, such as the maximum read depth per position, the minimum number of gapped reads, and base alignment quality recalculation, which refines the base qualities used for calling. Next, variant annotation integrates relevant information about each variant called. Here, tools such as SnpEff/SnpSift and VEP help annotate variant types, their effects on genes (such as changes in amino acids), impact, and frequency of occurrence in human populations (e.g. using the dbSNP database). This information is crucial for downstream filtration and prioritization in exome sequencing analysis.
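To illustrate what "calculating the probability that a variant is present" can mean, here is a deliberately simplified sketch. It is not the model GATK uses. It assumes a diploid sample, a single uniform per-read error rate, independent reads, and equal prior probability for each genotype. The function name and the default error rate of 1% are assumptions.

```python
from math import comb


def genotype_posteriors(ref_reads: int, alt_reads: int,
                        error_rate: float = 0.01) -> dict:
    """Posterior probability of each diploid genotype under a binomial model.

    Genotypes are written as 0/0 (homozygous reference), 0/1 (heterozygous)
    and 1/1 (homozygous alternate). Flat priors are assumed.
    """
    if not 0.0 < error_rate < 0.5:
        raise ValueError("error_rate must be between 0 and 0.5")
    total = ref_reads + alt_reads
    likelihoods = {}
    for genotype, alt_fraction in (("0/0", 0.0), ("0/1", 0.5), ("1/1", 1.0)):
        # Chance that one read shows the alternate base: either a true
        # alternate allele read correctly, or a reference allele misread.
        p_alt = alt_fraction * (1 - error_rate) + (1 - alt_fraction) * error_rate
        likelihoods[genotype] = (
            comb(total, alt_reads) * p_alt ** alt_reads * (1 - p_alt) ** ref_reads
        )
    norm = sum(likelihoods.values())
    return {genotype: value / norm for genotype, value in likelihoods.items()}


if __name__ == "__main__":
    for ref, alt in ((20, 0), (10, 10), (19, 1), (0, 20)):
        posteriors = genotype_posteriors(ref, alt)
        best = max(posteriors, key=posteriors.get)
        print(f"ref={ref} alt={alt} -> most likely genotype {best}")
```

The case with a single alternate read out of twenty shows why depth matters: with enough reads, one mismatch is better explained by sequencing error than by a real heterozygous variant.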

Hundreds to thousands of variants can potentially be obtained from exome sequencing, so reducing the search space for causative variants is very challenging. As a first pass, users can sort the variants found by effect, impact, and zygosity. More sophisticated statistical tests might be useful, though they usually require a considerable sample size. As an alternative to direct filtration of WES data, users can perform genome-wide association studies (GWAS), phenotype- or genotype-based approaches, gene-specific analysis, and family-based studies, depending on the experimental study design.
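A first-pass filter of this kind might look like the sketch below. It assumes variants have already been annotated with an impact category (the HIGH, MODERATE, LOW, MODIFIER scale used by SnpEff and VEP), a zygosity, and a population allele frequency. The record layout, the gene names, and the 1% frequency cutoff are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence

# Lower rank means higher priority.
IMPACT_RANK = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "MODIFIER": 3}


@dataclass(frozen=True)
class Variant:
    gene: str
    effect: str           # e.g. "missense", "frameshift"
    impact: str           # one of the IMPACT_RANK keys
    zygosity: str         # "het" or "hom"
    population_af: float  # allele frequency in a reference population


def prioritize(variants: Iterable[Variant],
               max_population_af: float = 0.01,
               keep_impacts: Sequence[str] = ("HIGH", "MODERATE")) -> List[Variant]:
    """Keep rare, potentially damaging variants, most severe and rarest first."""
    kept = [
        v for v in variants
        if v.population_af <= max_population_af and v.impact in keep_impacts
    ]
    return sorted(kept, key=lambda v: (IMPACT_RANK[v.impact], v.population_af))


if __name__ == "__main__":
    # Made-up records, for demonstration only.
    candidates = [
        Variant("GENE_A", "synonymous", "LOW", "het", 0.0005),
        Variant("GENE_B", "missense", "MODERATE", "het", 0.002),
        Variant("GENE_C", "frameshift", "HIGH", "hom", 0.0001),
        Variant("GENE_D", "missense", "MODERATE", "het", 0.15),
    ]
    for variant in prioritize(candidates):
        print(variant.gene, variant.effect, variant.impact, variant.zygosity)
```

Filtering on population frequency first tends to remove the bulk of candidates, since common variants are unlikely to cause rare disease. Zygosity can then be checked against the expected inheritance pattern.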

Learn more about Basepair’s whole exome sequencing pipelines on our product page.

