Introduction to Variant Calling: QC, Alignment, Deduplication, Variant Annotation

Amit U Sinha, Ph.D
Last Updated: January 3, 2020

The variant calling pipeline identifies single nucleotide variants in whole genome and exome data. The variants are identified by comparing an individual’s sequencing data with a reference sequence. Variant analysis is a crucial procedure for whole exome, targeted panel, and whole genome sequencing. The pipeline consists of a series of interlinked sequential steps:

Quality Control

The first step of a variant calling pipeline is to evaluate the quality of the raw sequencing data. Sequencing platforms such as Illumina produce raw reads in FASTQ format, which stores the nucleotide sequence of each read and the associated quality scores. Reads with poor-quality base calls are removed. Adapter sequences that remain attached to the raw reads must also be removed before downstream analysis. The choice of tool depends on the data type, the amount of adapter content, and other sequencing artifacts. Tool speed and accuracy are also important factors.

The final quality control step is the removal of very short reads with fewer than 20 bases. Shorter reads are more likely to map ambiguously to multiple locations on the reference genome and bias SNP calling.

 

Input/Output | Entity                   | Tool                   | Format
Input        | Raw reads                | Trimmomatic / cutadapt | FASTQ
Output       | Quality controlled reads |                        | FASTQ
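
The quality and length filtering described above can be sketched in a few lines of Python. This is a simplified illustration, not how Trimmomatic or cutadapt work internally: it trims only low-quality bases from the 3' end, does no adapter removal, and assumes Phred+33 quality encoding. The quality cutoff of 20 and all function names are assumptions made for this example; only the 20-base minimum length comes from the text.

```python
from typing import Iterable, Iterator, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality string)

MIN_LENGTH = 20    # reads shorter than this are dropped, as described above
MIN_QUALITY = 20   # assumed Phred cutoff for 3' trimming (illustrative only)
PHRED_OFFSET = 33  # assumes Phred+33 quality encoding


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield (header, sequence, quality) tuples from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # skip the '+' separator line
        qual = next(it).rstrip("\n")
        yield header, seq, qual


def trim_low_quality_tail(seq: str, qual: str,
                          min_quality: int = MIN_QUALITY) -> Tuple[str, str]:
    """Drop bases from the 3' end while their Phred score is below the cutoff."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - PHRED_OFFSET < min_quality:
        end -= 1
    return seq[:end], qual[:end]


def quality_filter(records: Iterable[FastqRecord],
                   min_length: int = MIN_LENGTH,
                   min_quality: int = MIN_QUALITY) -> Iterator[FastqRecord]:
    """Trim each read, then keep it only if it is still long enough."""
    for header, seq, qual in records:
        seq, qual = trim_low_quality_tail(seq, qual, min_quality)
        if len(seq) >= min_length:
            yield header, seq, qual
```

To apply it to a file, pass an open file handle to `parse_fastq` and iterate over `quality_filter(...)`. For real data, a dedicated tool remains the better choice because it also handles adapters and paired-end reads.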


Alignment

Filtered reads are mapped to the reference genome with the Burrows-Wheeler Aligner (BWA), using either the BWA-MEM or the BWA-aln algorithm [1]. Other aligners, such as Bowtie 2, can also be used depending on read length and on whether the reads are single-end or paired-end [2]. All of these aligners take reads in FASTQ format as input and produce Sequence Alignment/Map (SAM) files. In subsequent steps, the SAM file is converted to the Binary Alignment/Map (BAM) format to reduce the storage size of the alignment file. The files used in the alignment step and the expected output formats are given below:

Input/Output | Entity        | Tool                  | Format
Input        | Raw reads     | BWA or bowtie, Picard | FASTQ
Output       | Aligned reads |                       | SAM/BAM
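
As a rough sketch, the alignment and the SAM-to-BAM conversion could be driven from Python as shown below. Several things here are assumptions rather than part of the pipeline described above: the use of `samtools sort` for the conversion (the table lists Picard), the presence of `bwa` and `samtools` on the PATH, a reference that has already been indexed with `bwa index`, and the function names and thread count.

```python
import subprocess
from typing import List, Sequence, Tuple


def build_alignment_commands(reference: str,
                             fastq_files: Sequence[str],
                             out_bam: str,
                             threads: int = 4) -> Tuple[List[str], List[str]]:
    """Return the argument lists for `bwa mem` and `samtools sort`."""
    if len(fastq_files) not in (1, 2):
        raise ValueError("expected one (single-end) or two (paired-end) FASTQ files")
    # bwa mem writes SAM to standard output
    align = ["bwa", "mem", "-t", str(threads), reference, *fastq_files]
    # samtools sort reads that SAM from standard input ("-") and writes a sorted BAM
    sort = ["samtools", "sort", "-o", out_bam, "-"]
    return align, sort


def run_alignment(reference: str,
                  fastq_files: Sequence[str],
                  out_bam: str,
                  threads: int = 4) -> None:
    """Pipe `bwa mem` into `samtools sort` so no intermediate SAM file is stored."""
    align, sort = build_alignment_commands(reference, fastq_files, out_bam, threads)
    with subprocess.Popen(align, stdout=subprocess.PIPE) as bwa:
        subprocess.run(sort, stdin=bwa.stdout, check=True)
    if bwa.returncode != 0:
        raise subprocess.CalledProcessError(bwa.returncode, align)
```

Piping the aligner output straight into the sorter avoids writing the larger SAM file to disk, which is the storage concern mentioned above.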

Deduplication

Multi-mapped, duplicate, and supplementary reads must be excluded from downstream analysis to reduce the chance of false positive results. We use Picard for this purpose. Only uniquely aligned reads are used for downstream variant identification.

Input/Output | Entity                 | Tool   | Format
Input        | Mapped reads           | Picard | BAM
Output       | Uniquely aligned reads |        | BAM
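
To make the filtering criteria concrete, the sketch below applies them to alignments in SAM text form using the FLAG and MAPQ columns. It is not a description of what Picard does internally. It assumes that duplicates have already been flagged (bit 0x400) by a duplicate-marking tool, and it treats a mapping quality of 0 as a sign of multi-mapping, which is a common convention for BWA output but still an assumption. The function names and the `min_mapq` default are illustrative.

```python
from typing import Iterable, Iterator

# Bit values defined by the SAM format
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800

EXCLUDE_MASK = (FLAG_UNMAPPED | FLAG_SECONDARY |
                FLAG_DUPLICATE | FLAG_SUPPLEMENTARY)


def keep_alignment(sam_line: str, min_mapq: int = 1) -> bool:
    """Return True for a primary, non-duplicate alignment with MAPQ >= min_mapq."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: bitwise FLAG
    mapq = int(fields[4])   # column 5: mapping quality
    if flag & EXCLUDE_MASK:
        return False
    return mapq >= min_mapq


def filter_sam(lines: Iterable[str], min_mapq: int = 1) -> Iterator[str]:
    """Pass header lines through unchanged and keep only the wanted alignments."""
    for line in lines:
        if line.startswith("@") or keep_alignment(line, min_mapq):
            yield line
```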

Local Realignment Around Indels and Variant Calling

The alignment step may produce artifacts, especially around the indels. In some cases, the reads covering the start or the end of an indel are incorrectly mapped, which results in variations between the reference and the reads near the misaligned regions. The local realignment step corrects these artifacts.

The Genome Analysis Toolkit (GATK) is an important tool for realignment and variant calling. GATK calls raw variants for each sample, compares them against known variants to calibrate the calls, and computes a false discovery rate for each variant. A GATK algorithm called HaplotypeCaller identifies all the possible variants in the processed aligned reads. GATK writes the raw variants to a Variant Call Format (VCF) file. The tools and their input and output file formats are given below.

Input/Output | Entity                  | Tool                 | Format
Input        | Uniquely aligned reads  | GATK HaplotypeCaller | BAM
Output       | List of variants (SNPs) |                      | VCF
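
A minimal sketch of this step, under stated assumptions, is shown below. The first function builds a basic HaplotypeCaller command line for GATK 4, assuming the `gatk` launcher is on the PATH and that the reference and BAM files are already indexed. The remaining functions read the fixed columns of the resulting VCF and separate single nucleotide variants from other records. The function names, the `Variant` type, and the file names in the usage note are hypothetical.

```python
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alts: Tuple[str, ...]


def build_haplotypecaller_command(reference: str, bam: str, out_vcf: str) -> List[str]:
    """Argument list for a basic GATK 4 HaplotypeCaller run."""
    return ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam, "-O", out_vcf]


def read_variants(lines: Iterable[str]) -> Iterator[Variant]:
    """Parse CHROM, POS, REF and ALT from VCF data lines, skipping '#' headers."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        yield Variant(chrom, int(pos), ref, tuple(alt.split(",")))


def is_snv(variant: Variant) -> bool:
    """True when REF and every ALT allele are a single A, C, G or T."""
    bases = set("ACGT")
    return variant.ref.upper() in bases and all(
        alt.upper() in bases for alt in variant.alts
    )
```

For example, `build_haplotypecaller_command("ref.fa", "sample.bam", "sample.vcf.gz")` returns a list that can be passed to `subprocess.run(..., check=True)`.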

Variant Annotation

The variant annotation step aims to identify the function and effect of each identified SNP using SNP annotation tools. In this phase, biological information is extracted and functional information is assigned to DNA variants based on available data such as nucleic acid and protein sequences. SnpEff is an open source variant annotation tool [3]. It annotates variants based on their genomic locations and predicts their coding effects on genes, which helps detect deleterious variants. Basepair uses two variant databases: dbSNP, one of the most comprehensive databases of nucleotide variations, and ClinVar, a collection of reports on the relationships between human variations and phenotypes [4]. ClinVar data is collected from clinical tests, research studies, and other literature.
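
To show what annotated output can look like in practice, the sketch below reads the annotation that SnpEff adds to the INFO column of a VCF record. It assumes the `ANN` field layout in which each comma-separated entry is pipe-delimited and begins with allele, effect, impact, and gene name; this should be checked against the `ANN` description in the header of the actual VCF. The function names and dictionary keys are illustrative.

```python
from typing import Dict, List, Union

# Assumed order of the first four pipe-separated ANN subfields
ANN_FIELDS = ("allele", "effect", "impact", "gene")


def parse_info(info: str) -> Dict[str, Union[str, bool]]:
    """Split a VCF INFO column into key/value pairs; bare keys become True."""
    parsed: Dict[str, Union[str, bool]] = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def snpeff_annotations(info: str) -> List[Dict[str, str]]:
    """Return one dict per ANN entry with the leading subfields named."""
    ann = parse_info(info).get("ANN")
    if not isinstance(ann, str):
        return []  # record carries no SnpEff annotation
    return [dict(zip(ANN_FIELDS, entry.split("|")[:4])) for entry in ann.split(",")]
```

A downstream filter could then keep, for instance, only the records whose entries report a `HIGH` or `MODERATE` impact.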

Importance of Identification of Variants and Annotation

Variant identification generates a detailed catalog of the variations in an individual’s genome and helps identify the specific DNA changes that underlie different diseases. Variants act as important markers and play a crucial role in genome-wide association studies. More precisely, variants help in disease gene discovery. Identifying the genomic variants that are key players in disease provides promising targets for precision medicine. Many such mutations are linked with Mendelian disorders. Moreover, SNP-based arrays such as the Axiom array help improve crop yield. SNP annotation is an important method for computationally predicting the deleterious effects of SNPs and their role in disease in living organisms. It also identifies the SNPs present in exonic, transcription regulatory, and many other functional genomic regions.

Visualization of SNPs

Genome browsers have enabled researchers to visualize their aligned reads. This is an important step in investigating the data. Genome browsers like those offered by Basepair provide an opportunity to see the variants present in the aligned reads. 

Variant Validation

Single nucleotide variants can be validated using Sanger sequencing or the microarray genotyping used in genome-wide association studies (GWAS). Sanger sequencing is considered the gold standard for confirming and validating SNPs. Variant calls can also be genotyped using various Affymetrix genome-wide SNP assays [5]. In addition, a computational algorithm called MutationValidator cross-validates variants from NGS data by generating a validation matrix and classifies mutations as somatic, germline, or artifactual.
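
Once validation genotypes are available, comparing them with the pipeline calls is simple bookkeeping. The sketch below is one hypothetical way to summarize that comparison; it is not the MutationValidator algorithm. It assumes every site was assayed by both methods and that variants are keyed by chromosome, position, reference allele, and alternate allele. All names are illustrative.

```python
from typing import Dict, Set, Tuple

VariantKey = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def validation_summary(called: Set[VariantKey],
                       validated: Set[VariantKey]) -> Dict[str, float]:
    """Count the calls confirmed, unconfirmed, and missed relative to validation."""
    confirmed = called & validated
    return {
        "confirmed": len(confirmed),
        "unconfirmed_calls": len(called - validated),
        "missed": len(validated - called),
        "fraction_confirmed": len(confirmed) / len(called) if called else 0.0,
    }
```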

Learn more about Basepair’s whole exome and whole genome sequencing pipelines on our product page.

References 

[1] Li, H., & Durbin, R. (2010). Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics, 26(5), 589-595.

[2] Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), 357. 

[3] Cingolani, P., Platts, A., Wang, L. L., Coon, M., Nguyen, T., Wang, L., … & Ruden, D. M. (2012). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly, 6(2), 80-92.

[4] Landrum, M. J., Lee, J. M., Riley, G. R., Jang, W., Rubinstein, W. S., Church, D. M., & Maglott, D. R. (2013). ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic acids research, 42(D1), D980-D985.

[5] Pirooznia, M., Kramer, M., Parla, J., Goes, F. S., Potash, J. B., McCombie, W. R., & Zandi, P. P. (2014). Validation and assessment of variant calling pipelines for next-generation sequencing. Human genomics, 8(1), 14.

 
