Introduction to Variant Calling: QC, Alignment, Deduplication, Variant Annotation
Amit U Sinha, Ph.D
Last Updated: January 3, 2020
The variant calling pipeline identifies single nucleotide variants present within whole genome and exome data. The variants are identified by comparing an individual’s sequencing reads with a reference sequence. Variant analysis is a crucial procedure for whole exome, targeted panel, and whole genome sequencing. The variant calling pipeline consists of a series of interlinked sequential steps:
Quality Control
The first step of a variant calling pipeline is to evaluate the quality of the raw sequencing data. Sequencing platforms such as Illumina produce raw reads in FASTQ format, which stores each read’s nucleotide sequence together with its per-base quality scores. Reads with poor-quality base calls are removed. Adapter sequences, which remain attached to the raw reads, must also be removed before downstream analysis. The choice of tool depends on the data type, the amount of adapter content, and other sequencing artifacts. Tool speed and accuracy are also important factors.
The final quality control step is the removal of very short reads with fewer than 20 bases. Shorter reads are more likely to map ambiguously to multiple locations on the reference genome and to bias SNP calling.
Input/Output | Entity | Tool | Format |
Input | Raw reads | Trimmomatic / cutadapt | FASTQ |
Output | Quality controlled reads | Trimmomatic / cutadapt | FASTQ |
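The length and quality filters described above can be sketched in a few lines of Python. This is a minimal illustration, not how Trimmomatic or cutadapt work internally: it assumes Phred+33 encoded qualities, uses a hypothetical mean-quality cutoff (`MIN_MEAN_Q`), and does not attempt adapter trimming. The function names and example reads are invented for the example.

```python
from typing import Iterable, Iterator, Tuple

MIN_LENGTH = 20    # reads with fewer than 20 bases are dropped (from the text)
MIN_MEAN_Q = 20.0  # hypothetical mean-quality cutoff, chosen for illustration

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    stripped = [line.rstrip("\n") for line in lines]
    for i in range(0, len(stripped) - 3, 4):
        header, seq, _plus, qual = stripped[i:i + 4]
        yield header, seq, qual


def mean_quality(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a read, assuming Phred+33 encoded qualities."""
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def passes_qc(seq: str, qual: str) -> bool:
    """Keep reads that are long enough and have acceptable mean quality."""
    return len(seq) >= MIN_LENGTH and mean_quality(qual) >= MIN_MEAN_Q


# Three made-up reads: one good, one too short, one of low quality.
example = [
    "@read1", "ACGTA" * 5, "+", "I" * 25,
    "@read2", "ACGTACGTAC", "+", "I" * 10,
    "@read3", "ACGTA" * 5, "+", "#" * 25,
]
kept = [header for header, seq, qual in parse_fastq(example)
        if passes_qc(seq, qual)]
print(kept)
```

Real trimming tools typically clip low-quality bases and adapters from read ends before applying the length filter, so the order of operations matters: a read can become too short only after trimming.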
Alignment
Filtered reads are mapped to the reference genome using the Burrows-Wheeler Aligner (BWA), with either its BWA-MEM or BWA-aln algorithm [1]. Other aligners such as Bowtie 2 can also be used, depending on the length of the (single or paired-end) reads [2]. All of these aligners take reads in FASTQ format as input and produce Sequence Alignment/Map (SAM) files. In subsequent steps, the SAM file is converted into its binary equivalent (BAM) to reduce the storage size of the alignment file. The files used in the alignment step and the expected output formats are given below:
Input/Output | Entity | Tool | Format |
Input | Raw reads | BWA or Bowtie, Picard | FASTQ |
Output | Aligned reads | BWA or Bowtie, Picard | SAM/BAM |
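To make the SAM output more concrete, the sketch below reads the first six of the eleven mandatory tab-separated fields of each alignment line, as defined in the SAM specification. It is an illustration only: the `SamRecord` type, the function name, and the example lines are invented here, and a real pipeline would use a dedicated library or tool rather than hand-written parsing.

```python
from typing import Iterable, Iterator, NamedTuple


class SamRecord(NamedTuple):
    qname: str  # read name
    flag: int   # bitwise FLAG
    rname: str  # reference sequence name
    pos: int    # 1-based leftmost mapping position
    mapq: int   # mapping quality
    cigar: str  # CIGAR string describing the alignment


def parse_sam(lines: Iterable[str]) -> Iterator[SamRecord]:
    """Yield alignment records, skipping header lines that start with '@'."""
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            raise ValueError("a SAM alignment line has at least 11 fields")
        yield SamRecord(fields[0], int(fields[1]), fields[2],
                        int(fields[3]), int(fields[4]), fields[5])


# Made-up example: a small header, one mapped read, and one unmapped read.
seq, qual = "ACGTA" * 5, "I" * 25
sam_lines = [
    "@HD\tVN:1.6\tSO:unsorted",
    "@SQ\tSN:chr1\tLN:1000",
    f"read1\t0\tchr1\t100\t60\t25M\t*\t0\t0\t{seq}\t{qual}",
    f"read2\t4\t*\t0\t0\t*\t*\t0\t0\t{seq}\t{qual}",
]
records = list(parse_sam(sam_lines))
mapped = [r for r in records if not (r.flag & 0x4)]  # 0x4 = read unmapped
print([r.qname for r in mapped])
```

BAM stores the same records in a compressed binary form, which is why the conversion reduces file size without losing information.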
Deduplication
Multi-mapped, duplicated, and supplementary reads must be excluded from downstream analysis to reduce the chance of false positive results. We use Picard for this purpose. Only uniquely aligned reads are used in downstream variant identification.
Input/Output | Entity | Tool | Format |
Input | Mapped reads | Picard | BAM |
Output | Uniquely aligned reads | Picard | BAM |
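The filtering logic above can be expressed in terms of the FLAG bits defined in the SAM specification. The sketch below is not Picard's implementation; it only shows how reads already flagged as duplicate, secondary, or supplementary could be excluded. Treating a mapping quality of 0 as a sign of multi-mapping is an assumption that holds for some aligners (BWA among them) but not all, and the `min_mapq` default is hypothetical.

```python
# FLAG bits from the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800

EXCLUDE_MASK = (FLAG_UNMAPPED | FLAG_SECONDARY
                | FLAG_DUPLICATE | FLAG_SUPPLEMENTARY)


def keep_for_calling(flag: int, mapq: int, min_mapq: int = 1) -> bool:
    """Return True for primary, non-duplicate reads with MAPQ >= min_mapq.

    Assumption: MAPQ 0 marks an ambiguously (multi-) mapped read.
    """
    return (flag & EXCLUDE_MASK) == 0 and mapq >= min_mapq


# Made-up (FLAG, MAPQ) pairs.
examples = {
    "primary, unique": (99, 60),      # paired, proper pair, first in pair
    "duplicate": (99 | FLAG_DUPLICATE, 60),
    "supplementary": (FLAG_SUPPLEMENTARY, 60),
    "secondary": (FLAG_SECONDARY, 0),
    "multi-mapped": (0, 0),
}
for label, (flag, mapq) in examples.items():
    print(label, keep_for_calling(flag, mapq))
```

Duplicate-marking tools usually set the duplicate bit rather than deleting reads outright, so a filter of this kind is what actually removes them from the analysis.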
Local Realignment Around Indels and Variant Calling
The alignment step may produce artifacts, especially around insertions and deletions (indels). In some cases, reads covering the start or the end of an indel are mapped incorrectly, which produces spurious mismatches between the reference and the reads near the misaligned region. The local realignment step corrects these artifacts.
The Genome Analysis Toolkit (GATK) is a widely used toolkit for realignment and variant calling. GATK calls raw variants for each sample, compares them against known variants to apply a calibration method, and computes a false discovery rate for each variant. A GATK tool called HaplotypeCaller identifies the possible variants in the processed aligned reads. GATK outputs the raw variants in Variant Call Format (VCF). Details of the tools and their output file formats are given below.
Input/Output | Entity | Tool | Format |
Input | Uniquely aligned reads | GATK HaplotypeCaller | BAM |
Output | List of variants (SNPs) | GATK HaplotypeCaller | VCF |
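As an illustration of what the VCF output contains, the sketch below reads the fixed leading columns of a VCF data line (CHROM, POS, ID, REF, ALT) and classifies each alternate allele by comparing its length with the reference allele. It is a simplified example with made-up records: symbolic alleles such as `<DEL>` and the `*` allele are not handled, and the type labels are chosen here for illustration.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int         # 1-based position of the first REF base
    ref: str
    alts: List[str]  # one entry per alternate allele


def parse_vcf_line(line: str) -> Variant:
    """Parse the CHROM, POS, REF, and ALT columns of a VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line has at least 8 columns")
    chrom, pos, _id, ref, alt = fields[:5]
    return Variant(chrom, int(pos), ref, alt.split(","))


def variant_type(ref: str, alt: str) -> str:
    """Classify an allele pair by length (symbolic alleles not handled)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "MNV"  # same length, more than one base


# Made-up VCF content for illustration.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG\t50\tPASS\t.",
    "chr1\t20000\t.\tAT\tA\t45\tPASS\t.",
    "chr2\t300\t.\tC\tCGG,T\t60\tPASS\t.",
]
calls = []
for line in vcf_lines:
    if line.startswith("#"):  # skip meta-information and header lines
        continue
    v = parse_vcf_line(line)
    calls.append((v.chrom, v.pos, [variant_type(v.ref, a) for a in v.alts]))
print(calls)
```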
Variant Annotation
The variant annotation step aims to identify the function and effect of all identified SNPs using SNP annotation tools. In this phase, biological information is extracted and functional information is assigned to DNA variants based on available data such as nucleic acid and protein sequences. SnpEff is an open source variant annotation tool [3]. It annotates variants based on their genomic locations, predicts their coding effects on genes, and helps flag potentially deleterious variants. Basepair uses two variant databases: dbSNP, a comprehensive database of nucleotide variations, and ClinVar, which collects reports of the relationships between human variations and phenotypes [4]. ClinVar gathers its data from clinical tests, research studies, and other literature.
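SnpEff writes its predictions into the INFO column of the VCF as an `ANN` field, where each annotation is a `|`-separated record beginning with the allele, the effect, the putative impact (HIGH, MODERATE, LOW, or MODIFIER), and the gene name. The sketch below extracts just those first four subfields. The INFO string, gene names, and helper function are hypothetical, and the remaining `ANN` subfields are deliberately ignored.

```python
from typing import Dict, List

# First four ANN subfields; the rest are ignored in this sketch.
ANN_KEYS = ["allele", "effect", "impact", "gene"]


def parse_ann(info: str) -> List[Dict[str, str]]:
    """Return one dict per annotation found in the ANN entry of INFO."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            annotations = entry[len("ANN="):].split(",")
            return [dict(zip(ANN_KEYS, ann.split("|")))
                    for ann in annotations]
    return []  # no ANN entry present


# Made-up INFO column with two annotations for the same allele.
info = ("DP=35;"
        "ANN=G|missense_variant|MODERATE|GENE1|ID1,"
        "G|upstream_gene_variant|MODIFIER|GENE2|ID2")
annotations = parse_ann(info)

# Keep only annotations predicted to change the protein substantially.
notable = [a for a in annotations if a["impact"] in {"HIGH", "MODERATE"}]
print(notable)
```

Filtering on the impact category is a common first pass when looking for deleterious variants, although the right cutoff depends on the study.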
Importance of Identification of Variants and Annotation
Variant identification generates a detailed catalog of variations in an individual’s genome and helps identify the specific DNA changes that underlie different diseases. Variants play a crucial role in genome-wide association studies, where they act as important markers. More precisely, variants aid disease gene discovery. Identifying the genomic variants that are key players in disease yields promising targets for precision medicine. Many of these mutations are linked with Mendelian disorders. Moreover, SNP-based arrays such as the Axiom array help improve crop yield. SNP annotation is an important method for computationally predicting the deleterious effects of SNPs and their role in disease. SNP annotation also identifies the SNPs present in exonic, transcription regulatory, and other functional genomic regions.
Visualization of SNPs
Genome browsers have enabled researchers to visualize their aligned reads. This is an important step in investigating the data. Genome browsers like those offered by Basepair provide an opportunity to see the variants present in the aligned reads.
Variant Validation
Single nucleotide variants can be validated using Sanger sequencing or the microarray genotyping used in genome-wide association studies (GWAS). Sanger sequencing is considered the gold standard for confirming and validating SNPs. Variant calls can also be genotyped using various Affymetrix genome-wide SNP assays [5]. In addition, a computational algorithm called MutationValidator cross-validates variants from NGS data by generating a validation matrix and classifies mutations as somatic, germline, or artifactual.
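One simple way to summarize a validation experiment is to compare the pipeline's calls with those confirmed by an orthogonal method at the same sites. The sketch below is a generic concordance calculation and is not the MutationValidator algorithm. Variants are keyed by chromosome, position, reference allele, and alternate allele, and all values are made up.

```python
from typing import Dict, Set, Tuple

Call = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def concordance(ngs: Set[Call], validated: Set[Call]) -> Dict[str, float]:
    """Compare pipeline calls with calls from a validation assay."""
    confirmed = ngs & validated    # called and confirmed
    unconfirmed = ngs - validated  # called but not confirmed
    missed = validated - ngs       # confirmed but not called
    return {
        "confirmed": len(confirmed),
        "unconfirmed": len(unconfirmed),
        "missed": len(missed),
        "precision": len(confirmed) / len(ngs) if ngs else 0.0,
        "sensitivity": len(confirmed) / len(validated) if validated else 0.0,
    }


# Made-up call sets restricted to the sites tested by the validation assay.
ngs_calls = {
    ("chr1", 12345, "A", "G"),
    ("chr1", 20000, "C", "T"),
    ("chr2", 300, "G", "A"),
    ("chr3", 4500, "T", "C"),
}
validated_calls = {
    ("chr1", 12345, "A", "G"),
    ("chr1", 20000, "C", "T"),
    ("chr2", 300, "G", "A"),
    ("chr4", 700, "A", "T"),
}
summary = concordance(ngs_calls, validated_calls)
print(summary)
```

Because validation assays usually cover only selected sites, both call sets should be restricted to those sites before the comparison; otherwise unconfirmed calls are overcounted.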
Learn more about Basepair’s whole exome and whole genome sequencing pipelines on our product page.
References
[1] Li, H., & Durbin, R. (2010). Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics, 26(5), 589-595.
[2] Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357-359.
[3] Cingolani, P., Platts, A., Wang, L. L., Coon, M., Nguyen, T., Wang, L., … & Ruden, D. M. (2012). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly, 6(2), 80-92.
[4] Landrum, M. J., Lee, J. M., Riley, G. R., Jang, W., Rubinstein, W. S., Church, D. M., & Maglott, D. R. (2013). ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Research, 42(D1), D980-D985.
[5] Pirooznia, M., Kramer, M., Parla, J., Goes, F. S., Potash, J. B., McCombie, W. R., & Zandi, P. P. (2014). Validation and assessment of variant calling pipelines for next-generation sequencing. Human Genomics, 8(1), 14.