The Problem

Bioinformatics has hit a ceiling. There are only three ways to keep up with the volume of genomic data being produced in the world’s research labs by today’s sequencers from Illumina and newer instrument makers such as Element Biosciences and Complete Genomics. AWS and other cloud providers have done a tremendous job of making more and more compute power available at an increasingly affordable price. Increasingly sophisticated algorithms, including AI, are steadily improving our ability to automate processes and handle growing sample volumes. But unless something is done to enable more people to do the actual analysis, especially for routine work, I would argue that we won’t be able to keep up and the wealth of knowledge hidden in the data will remain undiscovered.

The exponential growth we have seen in the amount of data we are able to create acts as a double-edged sword.  On the one hand, it allows for rapid improvements in technology and, consequently, in our lives.  The phone you might be reading this on is orders of magnitude more powerful than the computer that brought humans to the moon.  On the other hand, exponential growth creates a massive discrepancy between the amount of data we have created and the amount of insight we have gained from that data.

Ask anyone well versed in a big data field and they will tell you we can now easily create far more data than we can reliably analyze.  So we have enormous amounts of information available to us, but no great way to make sense of it.

Genomics is the big data field for biologists and it is growing incredibly fast.  For a real-world signal of that growth, look at the fastest-growing ETF of 2020: ARKG, the ARK Genomic Revolution ETF.  This is not just some hypothetical problem; it is a real-world problem with real-world applications, and the speed of progress has been limited by solvable bottlenecks in genomic data analysis.

An (overly simplified) example of the old way of working

So how is this data created and subsequently analyzed today?  First, a research scientist at a university, research institute or a biopharma company comes up with a testable hypothesis.  Let’s say they believe a (made-up) gene called XYZ downregulates (decreases) the expression of another gene called p53 (a cancer-fighting gene).  

Then they have to generate biological samples.  So they raise some mice with XYZ knocked out (deleted) and some mice with XYZ still functioning.  They take tissue samples from both populations of mice, perform some chemical manipulations to extract and amplify the genetic material, and send those samples to be sequenced.  They get raw data back from sequencing, but the raw data is not interpretable by itself; it is just a series of letters (A, C, G, T) that number in the billions.  No human can understand it in that form.
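To make that concrete, here is a minimal sketch (in Python) of what that raw output looks like: a FASTQ file is just millions of short records, each one a read of A/C/G/T letters plus a quality string.  The file name and the example read shown are hypothetical, just to give a sense of scale.

```python
# Hypothetical example: peek at the first few reads of a FASTQ file and
# tally the bases seen. The file name is a placeholder.
def peek_fastq(path, n_reads=3):
    total_bases = 0
    with open(path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:                  # in FASTQ, every 4th line (offset 1) is the read sequence
                seq = line.strip()
                total_bases += len(seq)
                if i // 4 < n_reads:
                    print(seq)              # e.g. GATTTGGGGTTCAAAGCAGTATCGATC...
    print(f"{total_bases:,} bases in this file -- a real experiment has billions")

peek_fastq("xyz_ko_rep1_R1.fastq")
```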

If they are like most researchers, they send that raw data to a Computational Biologist (bioinformatician) for analysis.  At a university, there may be a Bioinformatics core, or at a biopharma company, a Bioinformatics department.  But while there may be hundreds of labs at a given university, with multiple different experiments going on at the same time, the Bioinformatics core may have at best five or ten people working in it.

The bioinformaticians are doing the best they can: they work through the data diligently and return the results in Excel files with gene expression levels, plus a few PDFs with graphics like heatmaps and volcano plots.  Just that first pass can take weeks while the research scientist waits to see if their hypothesis has any merit.
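For a sense of what that first-pass deliverable contains, here is a hedged sketch of how a volcano plot comes out of such a table.  The input file and column names (log2FoldChange, padj) and the significance cutoffs are assumptions for illustration, not any particular pipeline’s output format.

```python
# Illustrative only: turn a hypothetical differential-expression table into
# a volcano plot (log2 fold change vs. -log10 adjusted p-value).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

de = pd.read_csv("xyz_ko_vs_wt_results.csv")        # hypothetical DE results table
de["neg_log10_padj"] = -np.log10(de["padj"])

# Flag genes that pass assumed significance cutoffs
significant = (de["padj"] < 0.05) & (de["log2FoldChange"].abs() > 1)

plt.scatter(de["log2FoldChange"], de["neg_log10_padj"],
            c=np.where(significant, "tab:red", "lightgrey"), s=5)
plt.xlabel("log2 fold change (XYZ knockout vs. wild type)")
plt.ylabel("-log10 adjusted p-value")
plt.title("Volcano plot: XYZ KO vs. WT")
plt.savefig("volcano.pdf")
```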

But that’s not the end, because great research is iterative.  The scientist may see something interesting in the initial reports that they want to explore in more detail, they may want to ask a follow-up question by interrogating the data further, or they may simply need a graph to look slightly different for a publication.

So they go back to the bioinformatician, who has already started working on the next lab’s data (as they should! They’ve got eight more labs waiting behind them for data analysis).  So the original researcher has to wait another four weeks to get an answer, and by then they may have lost the insight.  Or someone else has now published a paper looking at the same question, and this researcher just wasted months of their time with nothing to show for it from a publication perspective.

Every other step in this research process has been and is continuously optimized.  There are lab companies where you can buy mice with genes already edited.  There are Library Prep Kit manufacturers who sell “kits” of the chemicals you need to extract and sequence the genetic material you are interested in.  There are service providers where you can send your samples for sequencing if your sequencing core is backed up.  There are even full-service companies that will do every step of this process for you, but they still rely on these specialist computational biologists to manually analyze the data.

And this is not a post to rag on bioinformatics scientists.  They are doing a superhuman job working with multiple researchers and cranking out as much processed data as they can.  But you can’t solve an exponential data problem with sheer manpower.  In genomics, the data scientists are at max capacity and they cannot give the data from every experiment the attention it deserves.  They also don’t have the same emotional attachment and scientific curiosity for each experiment as the bench scientist who put months of thought and effort into it.

The Conclusion

The solution? We need to empower the bench scientists to analyze their own data so they can put the time in, ask questions of it, and iterate on their research more quickly.  We need to democratize bioinformatics.

I remember speaking with a computational scientist who said something along the lines of, “We cannot trust a bench scientist to know how to analyze their own data.”  I say, “We must!”  Who else knows the biology of their experiment better than they do?  Who else cares as much about that specific gene in that specific research organism, or about trying to treat or cure that rare disease?  Sure, it is complicated, with different tools and input parameters and edge cases and all of those problems, but what other option do we have?

I have just an undergraduate degree in biology and less than four years working in the field, but I can tell you that for most cases STAR is a better alignment tool for RNA-seq data than TopHat.  I can tell you that GATK4 is slower but more accurate than FreeBayes for variant detection.  And I can tell you that Juicer desperately needs parallelization and optimization for Hi-C data analysis.  If I can get to this admittedly high-level understanding, then the brilliant biologists I work with can be trusted to run routine bioinformatics analyses themselves.
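To show how routine “routine” really is, here is a rough sketch of an RNA-seq alignment run with STAR, wrapped in Python.  The index path, sample names, and thread count are placeholders; the flags are standard STAR options, but real runs need parameters chosen for the specific experiment.

```python
# Sketch of a routine STAR alignment run. Paths and file names are
# placeholders; flags are common STAR options, not a one-size-fits-all recipe.
import subprocess

star_cmd = [
    "STAR",
    "--runThreadN", "8",                          # CPU threads
    "--genomeDir", "/refs/mm10_star_index",       # prebuilt genome index (hypothetical path)
    "--readFilesIn", "ko_rep1_R1.fastq.gz", "ko_rep1_R2.fastq.gz",
    "--readFilesCommand", "zcat",                 # reads are gzipped
    "--outSAMtype", "BAM", "SortedByCoordinate",  # write a coordinate-sorted BAM
    "--quantMode", "GeneCounts",                  # also produce per-gene read counts
    "--outFileNamePrefix", "ko_rep1_",
]
subprocess.run(star_cmd, check=True)
```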

I speak to research scientists with Ph.D.s in genetics all day, and I can also tell you that most of them see bioinformatics as a gigantic black box.  The people who generate the samples and interpret the end results don’t understand how the data is analyzed.  That is a disconnect we cannot accept if we want to get the most out of a field with incredible potential to save lives, improve our health, reduce suffering, and so much more.

Remember when I mentioned before that this is not a rant about bioinformaticians?  You may be wondering what their role will be in this brave new world where bench scientists can analyze their own data.  Well, they can focus their time and energy on bigger and much more interesting problems rather than running the same analysis pipelines over and over on raw data.  

They can develop new tools and new pipelines for the shiny new data types that are being produced in genomics all the time.  They can work on integrating data types so we can see how epigenetic changes interact with expression data or variant data.  And they can help with those edge cases where the bench scientists will still struggle because in a growing field like genomics you can bet those edge cases will always exist.  There are so many fascinating and important questions for computational biologists to work on and it will be a vastly superior use of their time and knowledge.

The genomics field has come a long way and it has done so much good in the world (see: a Covid vaccine out in less than a year), but the journey is far from finished.  It is time to build a new era of bioinformatics that makes the most of the genomic data being generated every day.  As in any industry solving complex computational problems, there is commercially available software that attempts to solve this, each option with its own pros and cons (look out for a follow-up blog breaking down those differences).  But if you want to hear specifically about Basepair’s solution, or you want to share your own opinions on the assumptions and conclusions I have made today, feel free to reach out to me directly at sam@basepairtech.com with your thoughts.  Whether you believe I hit the nail on the head or you think I am completely out of my mind, I would love to hear from you.  I just ask that if it’s the latter, please say it nicely.  Because this article, unlike many others being posted these days, was written by a real human and not ChatGPT.