Basepair develops SaaS to manage, analyze and interpret next generation sequencing (NGS) data. Scientists and physicians at Harvard Medical School, NYU, UCSD, etc. use Basepair to make breakthrough medical discoveries in cancer, AIDS, etc, and provide better medical care to patients. Founded by a scientist from Harvard Medical School, Basepair is driven by the mission to use technology to improve healthcare. Our team is analytical, fast-paced and informal.
This is an opportunity to make a significant impact at an early stage startup. We’re a small team, working closely together to make people’s lives better! Your voice will be heard and you will have the autonomy to set your vision. We strongly encourage ongoing learning to be the best in your position. We provide a competitive salary, meaningful equity excellent healthcare and other benefits.
Before starting the analysis, it is important to know if the raw data is good quality. The primary metric to check this is the quality score. [The quality score is created as part of the sequencing process and is included in the Fastq file.] Next important step is to check for overrepresented sequences. These arise due to contamination of sample, dimerization of sequencing adapters. Further, you can check for distribution of the 4 bases, % GC, etc.
Alignment (aka mapping)
In this step we take each read and find the matching region in the reference genome. This is the most computationally intensive step because the matching needs to permit mismatches, indels, etc., between the read and the reference. The alignment output is stored in bam file format, which contains read name, sequence, quality score, position of alignment and other meta data.
Since we know that the sequence data is derived from mRNA, this information can be used to access the quality of the data. We measure the percentage of reads aligning to exons, introns and intergenic regions. Greater than 70% alignment to exons is considered good.
The (digital) expression count is the number of reads aligning to the exons of a gene. The count could be done at two levels, either for each transcript or for each gene. Both approach have their pros and cons. For genes with multiple isoforms, using transcript counts lets you identify the specific isoforms that are expressed. However, since the number of transcripts is greater than number of genes, using the former leads to a higher multiple hypothesis correction for differential expression test.
Apart from the expression count of each gene, it is useful to be able to see the raw data at loci of interest. The aligned reads are converted to genome coverage, which is the number of reads at each position in the genome. The most common formats for genomic coverage supported by most genome browsers are wig and bigWig. The bigWig files no longer contain individual reads and are much smaller than the bam files.
Up and down regulated genes
After running the differential expression test, you may create lists of up and down regulated genes. The least stringent threshold is a (multiple hypothesis corrected) P-value lower than 0.05. Additionally, genes may be filtered if the fold change is less than a threshold. For most downstream analysis, it is good to have a list of 200-1000 genes.
The heatmap is a visual representation of showing the changes in gene expression across the samples. The values in the rows are usually normalized, the color red used to show genes with higher expression and blue for lower expression.
The two most important metric for differential expression is the the p-value which captures the statistical significance and the fold change which shows the effect size. The relationship between them is in the shape of a V, as genes with very low fold change (e.g., going down 2-fold) and very high fold change (e.g, going up 2.5 fold) have a low p-value. The volcano plot is useful to determine the thresholds for making up-regulated and down-regulated genes.