Among the factors responsible for scaring people away from bioinformatics, the various input and output filetypes surely rank very high.
If you want to plot heat maps using ngsplot, you need a BAM file. If you want to use deeptools instead, also to make heat maps, you need a bigWig file.
What is a bigWig file, you ask? It is a file format developed by the fine folks at UCSC, used to visualize genomic coverage. Want to use Broad’s IGV browser instead of UCSC’s genome browser? bigWig will work, but IGV will want you to make TDF files, which are not useful anywhere else.
Filetypes get murkier when you begin looking at interval files, the ones where you need to save genomic locations: chromosome this, start-end that. Now you are looking at BED or GFF or GTF. The last two are the same, but not really. Nobody knows anymore. And if these three filetypes are not enough, featureCounts generously offers to support the SAF format. UCSC will not have any of this. Having taken matters into their own hands, they have introduced refFlat, refGene, etc.
Now BED files can also be extended to store paired-end data: welcome, BEDPE. Please join the ranks of BED3, BED4, BED5, BED6 and BED12.
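To see why interval formats cause so much grief, here is a minimal Python sketch that turns one BED line into one SAF row. The function name bed_to_saf is made up for illustration. It assumes BED coordinates are 0-based and half-open while SAF coordinates are 1-based and inclusive, and it falls back to "+" when the BED line carries no usable strand, which is a simplification rather than a rule of either format.

```python
def bed_to_saf(bed_line):
    """Convert one BED line into a SAF row: GeneID, Chr, Start, End, Strand.

    Hypothetical helper for illustration only.
    """
    fields = bed_line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])

    # BED3 has no name column, so build one from the coordinates.
    name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"

    # Assumption: default to "+" when the strand is missing or ".".
    strand = fields[5] if len(fields) > 5 and fields[5] in ("+", "-") else "+"

    # 0-based half-open -> 1-based inclusive: only the start shifts by one.
    return "\t".join([name, chrom, str(start + 1), str(end), strand])


print(bed_to_saf("chr1\t999\t2000\tgeneA\t0\t-"))
```

Same interval, same gene, and yet the start coordinate changes by one. Forget that shift and every feature is silently off by a base.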
An important criterion when choosing a filetype is whether it is human readable (e.g., TSV) or machine readable (e.g., JSON). Hello VCF – the filetype that has managed the impossible feat of being neither machine readable nor human readable. Because it combines tab-separated columns with several values concatenated inside a single field, it needs a sophisticated parser just to get at the data. And it is sure to crash your Excel should you try to open it on your desktop for a quick look.
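For a taste of what that parser has to deal with, here is a rough Python sketch that unpacks a single VCF data line. The helper parse_vcf_line and the example record are invented for illustration, and the sketch ignores headers, escaping and type information from the meta lines. For real work you would reach for a dedicated library instead.

```python
def parse_vcf_line(line):
    """Split one VCF data line into a nested dict (illustrative helper)."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, filt, info = cols[:8]

    # INFO packs many values into one column: key=value pairs joined by ';',
    # bare keys acting as flags, and ',' separating multiple values.
    info_dict = {}
    for item in info.split(";"):
        if item in ("", "."):
            continue
        key, sep, value = item.partition("=")
        info_dict[key] = value.split(",") if sep else True

    record = {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),
        "qual": qual,
        "filter": filt.split(";"),
        "info": info_dict,
        "samples": [],
    }

    # Optional per-sample columns: FORMAT names the ':'-separated subfields.
    if len(cols) > 9:
        keys = cols[8].split(":")
        record["samples"] = [dict(zip(keys, s.split(":"))) for s in cols[9:]]
    return record


# Made-up record with two alternate alleles and two samples.
line = "chr1\t12345\trs1\tA\tG,T\t50\tPASS\tDP=14;AF=0.5,0.25;DB\tGT:DP\t0/1:8\t1/2:6"
rec = parse_vcf_line(line)
print(rec["info"])
print(rec["samples"])
```

That is four different separators (tab, semicolon, comma, colon) in a single line, before you even look at the header.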
If there is one big sucker in bioinformatics, it has to be file formats.
One of the primary design goals for Basepair was to mitigate the complexity around filetypes as much as practically possible. Basepair is not designed around files, but around samples.
Want to run expression count on your RNA-Seq data? Just select the sample. [We will figure out if the input is FASTQ or BAM.] Want to run differential expression? Just select two groups of samples and we will automatically pick the expression count files. The output will be reformatted for whatever downstream software you may need next.
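How might a tool tell FASTQ from BAM without asking you? The Python sketch below is one possible approach, not a description of how Basepair actually does it. It is a guess based on the file contents alone: BAM files are BGZF (gzip-compatible) streams whose decompressed bytes begin with "BAM" followed by the byte 0x01, and FASTQ records begin with "@". The function name sniff_format and its return labels are made up for illustration.

```python
import gzip
import os
import tempfile


def sniff_format(path):
    """Guess 'bam', 'fastq.gz', 'fastq' or 'unknown' from the first bytes."""
    with open(path, "rb") as fh:
        head = fh.read(2)

    if head == b"\x1f\x8b":  # gzip magic number; BGZF shares it
        with gzip.open(path, "rb") as fh:
            magic = fh.read(4)
        if magic == b"BAM\x01":
            return "bam"
        return "fastq.gz" if magic[:1] == b"@" else "unknown"

    # Uncompressed: a FASTQ record starts with '@'.
    return "fastq" if head[:1] == b"@" else "unknown"


# Demo on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    fq = os.path.join(tmp, "sample_reads")
    with open(fq, "wb") as fh:
        fh.write(b"@read1\nACGT\n+\nIIII\n")
    print(sniff_format(fq))  # fastq
```

Note that the file in the demo has no extension at all. Looking inside the file rather than trusting its name is what lets you stop thinking about filetypes.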
Think about your data, not file types!