Whole genome sequencing (WGS) refers to sequencing the entire genome of a sample in a high-throughput manner. Unlike Sanger sequencing, WGS is a next-generation sequencing (NGS) technology that achieves comprehensive coverage of the genome at a far better cost per base. The WGS process is based on a shotgun approach, in which the DNA is fragmented to generate short sequences that are sequenced in parallel. In general, the data processing steps of whole genome sequencing analysis aim to identify variants by mapping the short reads generated during sequencing to a reference genome. Alternatively, if no reference genome is available for an organism, a de novo assembly approach can be used to build contiguous sequences. Following these data processing steps, different approaches can be used to call single nucleotide variants, identify structural variants, analyze copy number variation, and perform haplotype studies [1, 2].
Whole Genome Sequencing Analysis Pipeline
The computational tools used for genome sequence analysis are organized into several processing steps. By running the FASTQ files obtained after sequencing through these analysis pipelines, we obtain a final file that contains the genetic variants in a sample. The steps can be classified as pre-VCF, which includes all the data processing needed to generate a VCF file, and post-VCF, which extracts and annotates variants from an existing VCF. Together, these steps include quality control, alignment, data post-processing, variant calling, filtering, and annotation [3, 4].
Using the FASTQ files generated during sequencing, the quality control step estimates metrics and visualizes the quality of the generated fragments. Artifacts and low-quality fragments identified in the data can be removed according to previously defined parameters. Here, important metrics such as depth, coverage, sequencing adapter content, and error rate are inferred. Bioinformatics tools commonly used in this step include FastQC and fastp, both of which are used by Basepair in our WGS pipelines.
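To make the filtering idea concrete, the sketch below (an illustration, not Basepair's actual implementation; the threshold and helper names are our own) parses FASTQ records, computes each read's mean Phred quality from the Phred+33 ASCII encoding, and keeps only reads above a chosen cutoff:

```python
def parse_fastq(lines):
    """Yield (header, sequence, quality) tuples from FASTQ lines."""
    it = iter(lines)
    for header in it:
        seq = next(it)
        next(it)          # '+' separator line
        qual = next(it)
        yield header.strip(), seq.strip(), qual.strip()

def mean_phred(qual, offset=33):
    """Mean Phred score of an ASCII quality string (Phred+33 encoding)."""
    return sum(ord(c) - offset for c in qual) / len(qual)

def filter_reads(records, min_quality=20.0):
    """Keep reads whose mean base quality meets the threshold."""
    return [r for r in records if mean_phred(r[2]) >= min_quality]

fastq = [
    "@read1", "ACGTACGT", "+", "IIIIIIII",   # 'I' = Phred 40: high quality
    "@read2", "ACGTACGT", "+", "!!!!!!!!",   # '!' = Phred 0: very low quality
]
kept = filter_reads(list(parse_fastq(fastq)))   # only read1 survives
```

Tools like fastp apply the same principle with richer rules (per-base trimming, adapter removal) rather than a single per-read mean.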
When the reference genome is known, aligning short reads to it usually requires a genome indexing step, which improves computational efficiency during mapping. The files produced during indexing vary according to the software used. Next, the reads are mapped to the reference sequence. The most commonly used aligner for WGS data is BWA, and Basepair offers a pipeline using this tool. The output of this step is a SAM or BAM file containing information on the aligned reads. In the case of de novo assembly, the algorithms used to perform this analysis are based on contig assembly, scaffolding, and gap-filling of the draft genome from the sequenced fragments. Basepair offers a de novo assembly pipeline that uses the Trinity tool.
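A SAM record is a tab-separated line whose mandatory fields include the read name, reference name, 1-based position, mapping quality, and a CIGAR string describing how the read aligns. The hedged sketch below (the example record is hypothetical) parses a CIGAR string to compute how many reference bases an alignment spans:

```python
import re

def cigar_reference_span(cigar):
    """Number of reference bases consumed by a CIGAR string.
    M/=/X (alignment matches), D (deletions), and N (skips) consume the
    reference; I (insertions) and S (soft clips) do not."""
    span = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op in "MDN=X":
            span += int(length)
    return span

# One hypothetical SAM alignment line (the 11 mandatory fields):
sam_line = ("read1\t0\tchr1\t1000\t60\t5M2I3M1D4M\t*\t0\t0"
            "\tACGTACGTACGTAC\tIIIIIIIIIIIIII")
fields = sam_line.split("\t")
pos, cigar = int(fields[3]), fields[5]
end = pos + cigar_reference_span(cigar) - 1   # last reference base covered
```

Here the 2 bp insertion adds read bases without consuming reference, while the 1 bp deletion consumes reference without read bases, so the alignment covers 13 reference positions.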
After obtaining the SAM or BAM files, reads that are uniquely mapped to the reference genome must be sorted and filtered. This post-processing step helps minimize errors during variant calling. SAMtools, Sambamba, and Picard are widely used tools for manipulating SAM and BAM files. Reads that map to multiple places in the genome, as well as duplicate reads, are generally not used for variant calling. Additionally, because mismatches around insertion and deletion (INDEL) regions introduce mapping bias, INDEL realignment is another best practice recommended for whole genome sequencing analysis. This post-processing step is performed on Basepair using GATK software.
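The core idea behind duplicate marking (as done by Picard MarkDuplicates or sambamba markdup) can be sketched in a few lines: alignments starting at the same reference position on the same strand are treated as likely PCR duplicates, and only the best-quality copy is retained. This is a simplification of the real algorithms, which also consider mate positions and clipping:

```python
def mark_duplicates(alignments):
    """alignments: list of (name, chrom, pos, strand, mapq) tuples.
    Returns retained alignments, keeping the highest mapq per
    (chrom, pos, strand) key."""
    best = {}
    for aln in alignments:
        key = (aln[1], aln[2], aln[3])        # (chrom, pos, strand)
        if key not in best or aln[4] > best[key][4]:
            best[key] = aln
    return sorted(best.values(), key=lambda a: (a[1], a[2]))

reads = [
    ("r1", "chr1", 100, "+", 60),
    ("r2", "chr1", 100, "+", 30),   # duplicate of r1, lower mapq: dropped
    ("r3", "chr1", 250, "-", 60),
]
kept = mark_duplicates(reads)       # r1 and r3 remain
```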
Variant Calling and Filtering
The variant calling step aims to identify the polymorphic positions in the DNA of a sample. Usually, the algorithms used for this analysis are based on the likelihood that a given variant (SNV or INDEL) exists at a position covered by the BAM file. The identified variants are stored in VCF file format. The next step is to filter the variants to retain only those that meet minimum quality criteria, such as base quality and depth. Finally, an additional annotation step can be used to integrate information and improve variant filtration and prioritization. Learn more about the databases Basepair uses for this step in our detailed overview of variant calling.
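A minimal sketch of hard filtering on VCF records follows: keep only variants whose QUAL and INFO/DP fields meet chosen thresholds. Real pipelines do this with tools such as bcftools or GATK VariantFiltration, and the thresholds below are purely illustrative:

```python
def parse_info(info):
    """Turn an INFO field like 'DP=35;AF=0.5' into a dict."""
    pairs = (item.split("=", 1) for item in info.split(";") if "=" in item)
    return dict(pairs)

def passes(vcf_line, min_qual=30.0, min_depth=10):
    """True if the record's QUAL and INFO/DP meet the thresholds."""
    chrom, pos, vid, ref, alt, qual, flt, info = vcf_line.split("\t")[:8]
    return (float(qual) >= min_qual
            and int(parse_info(info).get("DP", 0)) >= min_depth)

records = [
    "chr1\t1000\t.\tA\tG\t99.0\tPASS\tDP=35",   # kept
    "chr1\t2000\t.\tC\tT\t12.3\tPASS\tDP=40",   # low QUAL: dropped
    "chr1\t3000\t.\tG\tA\t88.0\tPASS\tDP=4",    # low depth: dropped
]
kept = [r for r in records if passes(r)]
```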
High-throughput sequencing can also be used to identify large genetic variants greater than 50 bp in size, such as structural variants (SVs) and copy number variations (CNVs). These variants include unbalanced deletions and duplications, insertions, inversions, and translocations. By using discordant alignment and read-depth features, this approach allows a large number of SVs and CNVs to be detected in a sample. Basepair uses GATK for CNV analysis and Manta for structural variant discovery.
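One of the discordant-alignment signals exploited by SV callers such as Manta can be sketched simply: a read pair whose observed insert size deviates strongly from the library's expected insert-size distribution may span a deletion (insert too large) or an insertion (insert too small). The library mean, standard deviation, and cutoff below are illustrative assumptions; real callers estimate them from the data:

```python
def discordant_pairs(insert_sizes, lib_mean=350.0, lib_sd=30.0, n_sigma=3.0):
    """Indices of read pairs whose insert size lies more than n_sigma
    library standard deviations from the expected library mean."""
    cutoff = n_sigma * lib_sd
    return [i for i, size in enumerate(insert_sizes)
            if abs(size - lib_mean) > cutoff]

# Pairs 0, 1, and 3 look concordant; pair 2 hints at a deletion
# between its mates (observed insert far larger than expected).
sizes = [342, 361, 2050, 347]
flagged = discordant_pairs(sizes)
```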
The variants called from WGS analysis can also be used to reconstruct phased haplotypes by combining data from public population databases and incorporating pedigree information.
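What "phased" means is visible directly in a VCF genotype (GT) field: `|` assigns each allele to a specific haplotype, while `/` means the assignment is unknown. The small sketch below (illustrative helper, not a phasing algorithm) reads the two haplotype allele sequences off a set of phased genotypes:

```python
def split_haplotypes(genotypes):
    """genotypes: list of (ref, alt, gt) per site, gt like '0|1'.
    Returns the two haplotypes as allele strings, or None if any
    site is unphased ('/' separator)."""
    hap1, hap2 = [], []
    for ref, alt, gt in genotypes:
        if "|" not in gt:
            return None                  # unphased site: cannot resolve
        a, b = gt.split("|")
        alleles = [ref, alt]
        hap1.append(alleles[int(a)])
        hap2.append(alleles[int(b)])
    return "".join(hap1), "".join(hap2)

sites = [("A", "G", "0|1"), ("C", "T", "1|0"), ("G", "A", "0|0")]
haps = split_haplotypes(sites)           # ("ATG", "GCG")
```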
1. Rahman, Kathleen M., Meredith E. Camp, Nripesh Prasad, Anthony K. McNeel, Shawn E. Levy, Frank F. Bartol, and Carol A. Bagnell. 2016. “Age and Nursing Affect the Neonatal Porcine Uterine Transcriptome.” Biology of Reproduction 94 (2): 46.
2. Ng, Pauline C., and Ewen F. Kirkness. 2010. “Whole Genome Sequencing.” Methods in Molecular Biology 628: 215–26.
3. Kosugi, Shunichi, Yukihide Momozawa, Xiaoxi Liu, Chikashi Terao, Michiaki Kubo, and Yoichiro Kamatani. 2019. “Comprehensive Evaluation of Structural Variation Detection Algorithms for Whole Genome Sequencing.” Genome Biology 20 (1): 117.
4. Van der Auwera, Geraldine A., Mauricio O. Carneiro, Christopher Hartl, Ryan Poplin, Guillermo del Angel, Ami Levy-Moonshine, et al. 2013. “From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline.” Current Protocols in Bioinformatics.