In recent years, Next-Generation Sequencing (NGS) technologies have revolutionized the field of genomics, enabling researchers to study the genomes of organisms with unprecedented speed and accuracy. However, the large amounts of data generated by these technologies can be overwhelming, making it challenging to extract meaningful information. Variant calling and analysis is a crucial step in NGS data analysis, allowing researchers to identify genetic variations and understand their potential impact on biological function. In this article, we provide an overview of variant calling and analysis in NGS data, covering the basic concepts, methods, and tools used in this process.
What is Variant Calling?
Variant calling is the process of identifying differences between a reference genome and the genome of an individual or a population of individuals. These differences, known as variants, include single nucleotide polymorphisms (SNPs), insertions, deletions, and structural variations. Variant calling is an important step in NGS data analysis, as it provides insights into the genetic variation that underlies phenotypic differences between individuals and populations.
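To make the idea concrete, here is a deliberately naive Python sketch of calling a SNP at a single position from the read bases piled up over it. The thresholds and data are illustrative only; real callers such as GATK or FreeBayes use statistical models over base qualities, mapping qualities, and genotype likelihoods rather than simple counts.

```python
from collections import Counter

def call_snp(ref_base, pileup_bases, min_depth=10, min_alt_frac=0.2):
    """Naive SNP call at one position: report an alternate allele if
    enough reads support it. Thresholds are illustrative, not tuned."""
    depth = len(pileup_bases)
    if depth < min_depth:
        return None  # too little coverage to call confidently
    counts = Counter(pileup_bases)
    # Most frequent non-reference base, if any
    alt, alt_count = max(
        ((b, c) for b, c in counts.items() if b != ref_base),
        key=lambda x: x[1], default=(None, 0),
    )
    if alt is not None and alt_count / depth >= min_alt_frac:
        return {"ref": ref_base, "alt": alt, "depth": depth,
                "alt_frac": alt_count / depth}
    return None

# 6 of 12 reads carry a G over reference A: a candidate SNP.
print(call_snp("A", "AAAAAAGGGGGG"))
# All reads match the reference: no call.
print(call_snp("A", "AAAAAAAAAAAA"))
```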
Variant Calling Methods
There are several methods used for variant calling in NGS data analysis, each with its own advantages and disadvantages. The most commonly used methods are listed below:
Alignment-based methods
Alignment-based methods map the sequencing reads to a reference genome and identify variants from the differences between the reads and the reference. Commonly used tools include SAMtools/BCFtools, GATK (typically run on BWA alignments), and FreeBayes.
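The core of the alignment-based approach can be sketched in a few lines of Python: given a read and the reference it aligned to, report the positions where they disagree. This assumes an ungapped alignment for simplicity; real callers work from the CIGAR string and handle indels, soft clipping, and base qualities.

```python
def mismatches(reference, read, pos):
    """Report (position, ref_base, read_base) mismatches for a read
    aligned to the reference at 0-based offset `pos`, assuming an
    ungapped alignment (real tools parse the CIGAR string instead)."""
    out = []
    for i, base in enumerate(read):
        ref_base = reference[pos + i]
        if base != ref_base:
            out.append((pos + i, ref_base, base))
    return out

ref = "ACGTACGTACGT"
print(mismatches(ref, "ACGTACCT", 0))  # [(6, 'G', 'C')]
```

Aggregating such mismatches across many overlapping reads, and filtering them by depth and quality, is essentially what the tools above do at scale.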
De novo assembly-based methods
De novo assembly-based methods involve assembling the sequencing reads into contigs or scaffolds, and then identifying variants based on the differences between the assembled genome and the reference genome. These methods include ABySS and SOAPdenovo.
Hybrid methods
Hybrid methods combine alignment-based and de novo assembly-based approaches, capturing the advantages of both. These methods include FermiKit and Cortex.
Variant Analysis
Once variants have been called, they need to be analyzed to understand their potential impact on biological function. This analysis involves annotating the variants, predicting their functional consequences, and interpreting the results in the context of existing knowledge. The most commonly used tools for variant analysis are listed below:
Variant annotation tools
Variant annotation tools provide information on the functional impact of variants, such as their location in the genome, their effect on protein coding regions, and their frequency in the population. These tools include ANNOVAR, SnpEff, and VEP.
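At its simplest, positional annotation is an interval lookup: does the variant fall inside a known gene? The sketch below uses a tiny hypothetical gene table; real annotators such as ANNOVAR, SnpEff, and VEP use full transcript models from databases like RefSeq or Ensembl and also classify the coding consequence (missense, synonymous, frameshift, and so on).

```python
# Hypothetical gene intervals (chrom, start, end, name); illustrative only.
GENES = [
    ("chr1", 1000, 2000, "GENE_A"),
    ("chr1", 5000, 7000, "GENE_B"),
]

def annotate(chrom, pos):
    """Return the gene a variant position falls in, or 'intergenic'."""
    for g_chrom, start, end, name in GENES:
        if chrom == g_chrom and start <= pos <= end:
            return name
    return "intergenic"

print(annotate("chr1", 1500))  # GENE_A
print(annotate("chr1", 3000))  # intergenic
```

For genome-scale annotation a linear scan is too slow; production tools index the intervals (e.g. with interval trees or tabix) for fast lookup.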
Pathway and network analysis tools
Pathway and network analysis tools analyze the impact of variants in the context of biological processes. Tools such as Variant Enrichment Analysis (VEA), GENEASE, and PathVisio provide insight into the potential functional consequences of variants by leveraging information from curated databases (Reactome, KEGG, WikiPathways).
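The statistical core of many of these tools is an enrichment test: are variant-affected genes overrepresented in a pathway relative to chance? A one-sided hypergeometric test, sketched below with only the Python standard library, captures the idea; the gene counts used in the example are illustrative.

```python
from math import comb

def enrichment_p(universe, pathway, hits, hits_in_pathway):
    """One-sided hypergeometric p-value: the probability of observing
    at least `hits_in_pathway` pathway genes among `hits` affected
    genes drawn from a universe of `universe` genes."""
    total = comb(universe, hits)
    p = 0.0
    for k in range(hits_in_pathway, min(hits, pathway) + 1):
        p += comb(pathway, k) * comb(universe - pathway, hits - k) / total
    return p

# 5 of 10 affected genes fall in a 50-gene pathway out of ~20,000 genes:
# far more overlap than expected by chance, so the p-value is tiny.
print(enrichment_p(20000, 50, 10, 5))
```

Real pathway tools add multiple-testing correction (e.g. Benjamini-Hochberg) across the thousands of pathways tested.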
Challenges in Variant Calling and Analysis
Variant calling and analysis in NGS data can be challenging, as the data is complex and noisy, and the methods and tools used are not perfect. Some of the challenges in variant calling and analysis are listed below:
False positives and false negatives
Variant calling methods can produce false positive and false negative results, leading to errors in downstream analysis. False positives occur when a variant is called where there is no variation, while false negatives occur when a variant is missed.
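These error modes are usually quantified by benchmarking a callset against a truth set (such as Genome in a Bottle), summarizing precision and recall. A minimal sketch, with made-up variant records keyed by (chrom, pos, ref, alt):

```python
def callset_metrics(called, truth):
    """Compare a callset against a truth set of (chrom, pos, ref, alt)
    records and report error counts and summary metrics."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)
    fp = len(called - truth)   # called, but not real (false positives)
    fn = len(truth - called)   # real, but missed (false negatives)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}

truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")}
called = {("chr1", 100, "A", "G"), ("chr1", 300, "G", "A")}
print(callset_metrics(called, truth))
# one true positive, one false positive, one false negative
```

Dedicated benchmarking tools (e.g. hap.py) additionally normalize variant representation before matching, since the same indel can be written in several equivalent ways.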
Reference bias
Variant calling methods rely on a reference genome, which may not represent the genetic diversity within a population or species. This reference bias can result in the underrepresentation of certain types of variants or the overrepresentation of others.
Allelic drop-out and amplification bias
Allelic drop-out and amplification bias can occur during the amplification of DNA samples for sequencing, leading to incomplete or biased representation of alleles in the sequencing data. This can affect the accuracy and completeness of variant calling and analysis.
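One common way to catch such bias downstream is an allele-balance filter: a true heterozygote should show roughly half reference and half alternate reads, so strongly skewed sites are suspect. A minimal sketch, with illustrative thresholds:

```python
def allele_balance_flag(ref_count, alt_count, lo=0.25, hi=0.75):
    """Flag heterozygous calls whose allele balance is far from the
    ~0.5 expected for a true het; strong skew can indicate allelic
    drop-out or amplification bias. Thresholds are illustrative."""
    depth = ref_count + alt_count
    if depth == 0:
        return "no_data"
    ab = alt_count / depth
    return "pass" if lo <= ab <= hi else "skewed"

print(allele_balance_flag(14, 16))  # pass   (AB ~ 0.53)
print(allele_balance_flag(27, 3))   # skewed (AB = 0.10)
```

Production pipelines apply the same idea statistically, e.g. via a binomial test on the allele counts rather than fixed cutoffs.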
Data size and computational resources
NGS data analysis can generate large amounts of data, requiring significant computational resources for storage, processing, and analysis. The size of the data can also affect the accuracy and completeness of variant calling and analysis, as some methods may not be scalable to larger datasets.
To combat this, the process can be accelerated in several ways:
- Parallelization: GATK can be run as multiple instances on different regions of the genome simultaneously, for example with GNU Parallel or Apache Spark.
- More computing power: upgrading to faster CPUs, adding more RAM, or using GPUs and FPGAs can significantly reduce processing time.
- Optimized preprocessing: steps such as base quality score recalibration (BQSR; see our earlier blog post on that topic) and duplicate marking can be time-consuming, and running them in parallel or optimizing them reduces the overall runtime.
- Downsampling: analyzing a subset of reads can significantly reduce processing time without compromising the quality of the results, and is particularly useful for testing or prototyping.
- Faster algorithms: some variant calling algorithms are faster than others, and companies like Sentieon have optimized GATK for both speed and accuracy.
- Cloud computing: running these methods directly on platforms like AWS, or through bioinformatics platforms such as Basepair, provides scalable and cost-effective computing resources, making it possible to process large amounts of data in a shorter amount of time.
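The region-based parallelization mentioned above can be sketched with the Python standard library alone: split each chromosome into windows and process them concurrently. The worker here is a placeholder; a real one would invoke the variant caller on its region (e.g. via GATK's -L interval option) and the per-region VCFs would be merged afterwards. Chromosome lengths shown are for the GRCh38 reference.

```python
from multiprocessing import Pool

def make_regions(chrom_lengths, window=50_000_000):
    """Split chromosomes into fixed-size windows, each of which can be
    processed independently (the scatter step of scatter-gather runs)."""
    regions = []
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, window):
            regions.append((chrom, start, min(start + window, length)))
    return regions

def call_region(region):
    # Placeholder worker: a real implementation would run the caller
    # on this interval and return the path to a per-region VCF.
    chrom, start, end = region
    return f"{chrom}:{start}-{end}"

if __name__ == "__main__":
    regions = make_regions({"chr1": 248_956_422, "chr2": 242_193_529})
    with Pool(4) as pool:
        results = pool.map(call_region, regions)
    print(len(results), "regions processed")
```

The same scatter-gather pattern underlies both GNU Parallel invocations on a single machine and Spark- or cloud-based runs across many machines.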