Introduction
Congratulations on your investment in a new sequencing instrument! You now hold the key to unlocking the secrets hidden within the vast world of genomics. But the journey doesn’t end with data generation; it begins there. To effectively harness the potential of your instrument, you need a powerful, user-friendly, and secure solution for NGS (Next Generation Sequencing) data analysis and visualization. If, like many others, you have chosen one of Complete Genomics’ market-leading instruments, the company offers a suite of on-premises solutions, including ZTRON Lite and Pro, to perform your analysis. However, more and more life sciences organizations around the world are turning to cloud computing as an attractive alternative for scaling workloads. Its benefits include obviating the need for upfront capital investment, and paying only for what you use rather than having to purchase with peak usage in mind. This blog introduces some of the more important aspects to consider when leveraging cloud computing to scale, deploy and orchestrate analysis workflows.
One of the most popular cloud infrastructures is Amazon Web Services (AWS), whose Amazon Elastic Compute Cloud (Amazon EC2) provides secure and resizable compute capacity for virtually any workload. It can efficiently process the large datasets that NGS experiments regularly generate, where a single sample can run to several hundred gigabytes. Ultimately, the way you choose to analyze your NGS data can greatly impact the outcome of your research. Running NGS workflows natively on AWS can be a powerful approach and represents an exciting opportunity to accelerate time to scientific insight in a cost-effective way. However, what is often underestimated is that efficient NGS workflow implementation, deployment and orchestration on AWS can require an investment of time and resources, as well as careful planning and attention to detail. In this blog, we’ll briefly list some of the more important aspects you need to consider when building out your NGS data analysis infrastructure directly on AWS, as well as the essential roles and skill sets needed to make this a reality. Not sure you have the time and resources needed in house? Never fear! Help is at hand with Basepair, a Software-as-a-Service (SaaS) bioinformatics platform for NGS analysis that can get you up and running in a fraction of the time, without all of the cloud computing engineers and DevOps specialists that would otherwise be needed.
Building an Infrastructure Yourself Directly on AWS
AWS Account and Infrastructure Setup
When first getting going with AWS, you will need to create an AWS account or use an existing one that has already been set up by your organization, paying particular attention to ensuring that the account is properly configured and linked to your organization’s billing information. As part of the infrastructure you will need to establish a Virtual Private Cloud (VPC) and network architecture that aligns with your organization’s security and data segregation requirements.
Computational Resources
Next you’ll need to decide which computational resources to use for each type of analysis, including the CPU, memory and storage required. Whilst AWS has a number of EC2 instance types to choose from, it isn’t always easy to know which is the best one for each NGS tool or workflow. In addition, you’ll likely want to take advantage of AWS ‘Spot Instances’, a pricing option that lets you use spare EC2 capacity at a much lower price than On-Demand Instances and can significantly reduce the cost of running workloads in the AWS cloud. However, the catch is that this is “spare” capacity: AWS can reclaim a Spot Instance with short notice (typically two minutes) when it needs the capacity back. Note that Spot pricing no longer works as an auction; the Spot price is set by AWS and adjusts gradually with supply and demand, and you can optionally cap the maximum price you are willing to pay. This means your NGS workflows should be designed with fault tolerance in mind, so that they can handle interruptions gracefully.
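To make the fault-tolerance point concrete, here is a minimal checkpoint-and-resume sketch in Python. It is an illustration under stated assumptions rather than a recommended implementation: the step names, the `run_with_checkpoints` helper and the local JSON checkpoint file are all hypothetical, and the `interrupted` callable stands in for whatever mechanism you use to detect a Spot interruption notice. A real pipeline would typically keep the checkpoint on durable storage rather than on the instance’s local disk.

```python
import json
import os
import tempfile

# Hypothetical pipeline steps, for illustration only.
STEPS = ["qc", "alignment", "variant_calling"]


def load_completed(checkpoint_path):
    """Return the list of steps already recorded as finished."""
    if not os.path.exists(checkpoint_path):
        return []
    with open(checkpoint_path) as handle:
        return json.load(handle)


def save_completed(checkpoint_path, completed):
    """Write the checkpoint atomically so an interruption cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(checkpoint_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(completed, handle)
    os.replace(tmp_path, checkpoint_path)


def run_with_checkpoints(checkpoint_path, run_step, interrupted=lambda: False):
    """Run the remaining steps, stopping cleanly if an interruption is signalled.

    `interrupted` is a placeholder for whatever detects a Spot interruption
    notice; here it is simply a callable returning True or False.
    """
    completed = load_completed(checkpoint_path)
    for step in STEPS:
        if step in completed:
            continue  # finished before the last interruption
        if interrupted():
            break  # do not start new work; resume on the next instance
        run_step(step)
        completed.append(step)
        save_completed(checkpoint_path, completed)
    return completed
```

Because progress is recorded after every step, a replacement instance that calls `run_with_checkpoints` with the same checkpoint skips the finished steps instead of starting the sample again from scratch.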
Data Storage and Management
The other major infrastructure required on AWS is data storage. Most life sciences organizations working with NGS data have traditionally used Amazon S3 (Simple Storage Service) to ensure durability, availability and scalability for these large data sets. Whilst S3 is relatively quick and straightforward to set up, it is important to put some thought into a well-structured data storage strategy, so that data is organized in a way that makes it readily accessible to individuals with the appropriate permissions and access rights. Moreover, putting in place a robust data backup and versioning plan is essential to safeguard against data loss and accidental changes.
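One way to picture a “well-structured” storage strategy is a predictable object key layout. The sketch below assumes a hypothetical `<project>/<sample>/<stage>/<filename>` convention; the stage names and both helper functions are illustrative assumptions, not an AWS or Basepair convention.

```python
from pathlib import PurePosixPath

# Hypothetical stage names; adjust to your own pipeline.
ALLOWED_STAGES = {"raw", "aligned", "variants", "reports"}


def object_key(project, sample_id, stage, filename):
    """Build a predictable object key: <project>/<sample>/<stage>/<filename>."""
    if stage not in ALLOWED_STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    for part in (project, sample_id, filename):
        if not part or "/" in part:
            raise ValueError(f"invalid key component: {part!r}")
    return str(PurePosixPath(project, sample_id, stage, filename))


def parse_key(key):
    """Split a key produced by object_key back into its four components."""
    parts = PurePosixPath(key).parts
    if len(parts) != 4:
        raise ValueError(f"unexpected key layout: {key!r}")
    project, sample_id, stage, filename = parts
    return {
        "project": project,
        "sample_id": sample_id,
        "stage": stage,
        "filename": filename,
    }
```

With a layout like this, everything belonging to a project shares a common prefix, which makes it simpler to scope permissions and retention rules to a project or a processing stage.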
AWS HealthOmics
Even though S3 has typically been the way to go for life sciences organizations, more and more are now turning to AWS HealthOmics as a way to not just store, but also analyze and derive insights from NGS data. In a nutshell, AWS HealthOmics is a suite of specialized and secure data storage and analysis tools designed for the healthcare and life sciences industries. It aims to simplify the management of complex genomics, health, and life sciences data while ensuring data privacy, security, and compliance with industry-specific regulations. It differs from the more generic AWS tools for storage and data analysis in being specifically tailored to healthcare and life sciences data. For example, whilst Amazon S3 can be configured for HIPAA, GDPR and other compliance standards, AWS HealthOmics comes with built-in compliance features and security protocols, simplifying the process of setting up secure, compliant environments on AWS. All in all, whilst AWS HealthOmics represents a significant leap forward in healthcare and life sciences data management and analysis, it still requires domain experts familiar with AWS tools to set up, manage, extend and maintain the workflows and solutions that are built on top of it.
Security and Access Control
Next it’s time to start thinking about how to ensure that your data is secure and that only people with the correct permissions are able to access it. This is all the more important with genomic and associated data, which is often considered PHI (protected health information). Data security and governance could quite rightly be the topic of an entirely separate blog post, but suffice it to say that, as part of this step, IAM (Identity and Access Management) roles and policies need to be put in place to control who can access resources and perform specific actions. The data itself also needs to be encrypted at rest and in transit, and access needs to be monitored and audited in order to detect potential security breaches or unusual activity.
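As a rough illustration of what such a policy looks like, the Python sketch below builds an IAM policy document that grants read-only access to a single prefix in an S3 bucket. The bucket name, the prefix and the `read_only_policy` helper are hypothetical; the policy grammar (`Version`, `Statement`, `Effect`, `Action`, `Resource`, `Condition`) and the `s3:ListBucket` and `s3:GetObject` actions are standard IAM. Treat it as a starting point to adapt, not a complete security configuration.

```python
import json


def read_only_policy(bucket, prefix):
    """Return an IAM policy document granting read-only access to one prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket, restricted to the prefix.
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Reading is granted only on objects under the prefix.
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# Hypothetical bucket and project prefix, for illustration only.
policy = read_only_policy("example-ngs-bucket", "projA/")
print(json.dumps(policy, indent=2))
```

Generating policies from a small function like this, rather than editing JSON by hand for each collaborator, makes it easier to keep grants narrow and consistent.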
Installation of NGS Analysis Tools and Pipelines
Once these basic infrastructure elements have been considered and implemented, the appropriate NGS tools and software can be installed for the data type of interest. However, this is often easier said than done. Choosing the right NGS analysis tools can be a daunting task, especially for those who are new to NGS or to a specific data type, largely due to the sheer diversity of data types, experimental designs and the ever-evolving landscape of bioinformatics tools. Close attention needs to be paid to software dependencies and their versions in order to ensure compatibility, operational availability and reproducibility. Nevertheless, by staying informed, benchmarking, collaborating, and iterating, bioinformaticians can go a long way towards unlocking the potential of the genomic data being generated, turning raw information into invaluable insights that drive scientific discovery and innovation.
Workflow Automation
Next up comes workflow automation. Whilst a few samples here and there can be analyzed manually, the exponential growth of genomic data has made efficient, reproducible and error-free analysis paramount. This is where workflow automation in NGS analysis comes into play. By embracing automation, researchers gain speed, efficiency, and reliability, allowing them to focus on the most critical aspect of their work: discovering the secrets hidden within the vast genetic code. Automation isn’t just a tool; it’s the key to unlocking the true potential of genomics research. It starts with automating the analysis and data processing steps using workflow tools and languages such as Nextflow and WDL, configuring and connecting the various tools used for steps such as QA/QC, alignment and variant calling.
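At its core, a workflow manager tracks which steps depend on which and runs them in a valid order. The Python sketch below illustrates only that idea, using the standard library’s `graphlib`; it is not how Nextflow or WDL engines are implemented or invoked, and the step names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (hypothetical step names).
DEPENDENCIES = {
    "qc": set(),
    "alignment": {"qc"},
    "variant_calling": {"alignment"},
    "report": {"qc", "variant_calling"},
}


def execution_order(dependencies):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def run_pipeline(dependencies, actions):
    """Run each step's action in dependency order and collect the results.

    Every action receives the results gathered so far, so a step can read
    the output of the steps it depends on.
    """
    results = {}
    for step in execution_order(dependencies):
        results[step] = actions[step](results)
    return results
```

Real workflow engines add a great deal on top of this ordering, such as parallel execution, retries, containerized tools and resumption of partially completed runs, which is why it is usually better to adopt one than to build your own.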
Monitoring and Optimization
Once workflows are up and running, it’s important to implement a monitoring and alerting system, for example with Amazon CloudWatch, to keep an eye on EC2 instance performance and resource utilization. As discussed previously, NGS data analysis involves a series of intricate steps, from data preprocessing to variant calling and interpretation. These workflows can be computationally intensive and generate substantial data. Given the diverse nature of NGS projects, no single set of parameters or resources fits all. Continual monitoring and optimization are essential to adapt to the evolving demands of each analysis, and workflows should be regularly reviewed to identify and address inefficiencies so that they remain efficient and cost effective.
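A simple form of this review is to look for pipeline steps that consistently use far less CPU than the instance they run on provides. The sketch below assumes you have already exported CPU utilization samples (in percent) per step from your monitoring system; the `flag_underused` helper and the 40% threshold are arbitrary, hypothetical choices for illustration.

```python
from statistics import mean


def flag_underused(cpu_samples_by_step, threshold_percent=40.0):
    """Return steps whose average CPU utilization is below the threshold.

    The result maps each flagged step to its average utilization, rounded
    to one decimal place. Steps with no samples are skipped.
    """
    flagged = {}
    for step, samples in cpu_samples_by_step.items():
        if not samples:
            continue  # no data collected yet for this step
        average = mean(samples)
        if average < threshold_percent:
            flagged[step] = round(average, 1)
    return flagged
```

Steps flagged this way are candidates for a smaller or differently shaped instance type; memory and disk I/O deserve the same scrutiny before any change is made.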
Data Sharing and Collaboration
Now that the raw data has been (efficiently) processed using best practice secondary analysis tools, there is typically a need to share the results with colleagues, collaborators and even customers. This can be particularly challenging with an on-premises solution for NGS data analysis, as firewalls can slow down or even interrupt data transfers, especially with the large data sets that are typical of NGS. Whilst cloud infrastructures greatly simplify this step, it still requires effort to set up secure collaboration protocols to enable efficient teamwork and data sharing among researchers. IAM policies and roles often need to be extended to collaborators and kept track of, ensuring that only those with the appropriate permissions are able to access a given data set.
Documentation and Knowledge Sharing
Unfortunately, setting up NGS workflows isn’t a one-and-done exercise. Best practice for any developer of analysis pipelines is to maintain comprehensive documentation of how they work, the resources required and any interdependencies that may be important to the functioning of a larger solution. This includes pipeline specifications, installation guides and data storage strategies. The support team also needs to be sufficiently resourced to answer the inevitable questions from users of these new tools, especially users in the various R&D teams who are not computationally savvy. It’s also important to consider how to do this in a way that enables business continuity. The last thing you want is for employees with specialist knowledge of how everything works to leave the organization and take that knowledge with them. Make sure that you have a way of documenting how the various pieces have been put together and how they work with each other so that someone else can pick things up relatively quickly as employees come and go.
Resourcing All of this Requires a Dedicated Team of Individuals
OK, these are the important considerations when setting up a cloud infrastructure to process your NGS data efficiently, cost effectively and securely. All in all, depending on what an organization has already hired for, this often requires putting in place a team that consists of some or all of the following individuals and/or skill sets:
- Cloud architects to design and manage your cloud-based system
- DevOps engineers for smooth, automated workflows that leverage tools like AWS CloudFormation and CI/CD pipelines to automate tasks, reduce errors, and make deployments a breeze
- Bioinformaticians to understand the NGS tools, workflows and data preprocessing required for the type of data being generated, not to mention interpreting the results in order to extract the maximum value from the data
- Systems administrators to keep your AWS infrastructure running smoothly and make sure the storage and compute resources are always available when you need them
- Security specialists to ensure that your data is protected and complies with the necessary regulations
- Data managers responsible for data governance, organization and backup strategies in order to ensure data integrity and accessibility, so you can always find what you need
- Cost management specialists who understand AWS billing in order to help keep your spending in check
With all this in mind, it should come as no surprise that, whilst potentially rewarding, setting up a cloud infrastructure directly on AWS can take some time and resources, even if NGS-optimized tools such as those included in the AWS HealthOmics suite are being used. On the flip side, by design it offers a greater degree of customization and control. Therefore, if a life sciences organization sees analysis as part of its secret sauce or as a way of differentiating itself from the competition, there is merit in considering this approach. If, on the other hand, the bioinformatics infrastructure itself is not seen as a differentiating factor for an organization’s core business, or if a high degree of customization is not paramount, then a platform approach, provided it has certain key characteristics, can greatly accelerate the migration and deployment of workflows into production.
The Basepair Bioinformatics Platform Approach
In the fast-evolving landscape of genomics and NGS, sequencing instrument manufacturers such as Complete Genomics are always seeking new ways to provide their customers with tools and services that simplify the NGS analysis process and enhance the value of the data being generated. One such example is offering their customers a bioinformatics platform approach to efficiently analyzing and visualizing NGS data in the cloud, particularly on AWS, which holds a significant share of the cloud computing market and is used by the majority of customers. At a high level, a platform approach to running bioinformatics workflows on AWS abstracts away much of the work needed to set up the infrastructure described above, accelerating migration, deployment and scaling of NGS workflows, whilst potentially offering a wealth of other advantages related to efficiency, resource management and cost optimization.
As we have seen already, building such a software platform from scratch would be a significant undertaking for any instrument manufacturer, not to mention the time it would take to release something that could be used in production by customers. Looking at all of the platforms available in the market today, Basepair, a pioneering Software as a Service (SaaS) bioinformatics company, has emerged as a frontrunner in this endeavor. For those unfamiliar with its groundbreaking approach, here is a high-level look at the benefits of leveraging Basepair for a rapid, secure, cost effective way of analyzing NGS data on AWS.
Seamless Integration with AWS
In this digital age, data security and compliance are paramount. Basepair’s close partnership with AWS and seamless integration with its tools and services provide an ideal solution for organizations working with NGS data. For customers who don’t have their own AWS account, data can be uploaded to Basepair’s securely hosted solution through a drag & drop approach, direct integration with BaseSpace, the CLI/API or from an FTP server. What really sets it apart, however, is that for organizations that would prefer to analyze their data securely within their own AWS accounts, Basepair’s platform can be configured to leverage their own compute and storage resources, ensuring compliance with local data residency regulations and providing complete control over their data. Most importantly, this approach enables users to leverage the scalability and computational power of the cloud while ensuring maximum data security and integrity. Because Basepair is SaaS, it also comes with an ultra-low operational burden, with no need for installation inside an AWS account, significantly reducing the time and effort needed to support and maintain the solution.
Usability by Researchers of all Backgrounds
As previously discussed, NGS data analysis can be a formidable task, requiring a deep understanding of bioinformatics, computational resources, and complex command-line tools. For many researchers and bench scientists, this poses a steep learning curve and can be a barrier to running more sequencing experiments. Basepair’s graphical user interface (GUI) is designed for use by bench scientists who want to take a low code/no code approach to bioinformatics. As such, it is ideal for organizations wanting to enable research scientists with little to no computational background to run routine analyses, freeing up bioinformaticians’ time for more advanced, and arguably more valuable, data interpretation tasks. It takes under 30 minutes to learn how to perform your own analyses through Basepair’s point & click GUI. At the same time, everything a user does through the GUI is API driven on the back end, meaning that all of it, and more, can also be done from the command line should advanced users want more control over the way that pipelines are run.
Interactive Data Visualization
Static data reports and downloadable flat files are no longer sufficient for the complex insights researchers seek. Basepair not only provides automated NGS analysis using industry standard best practice tools such as BWA and GATK, it also comes with interactive data visualization reports, optimized for each data type, that empower users to explore and interpret their NGS data dynamically. All of the images and visualizations in these reports can be downloaded in high-resolution SVG format, ready for publication or inclusion in presentations.
Support for Multiple Data Types
Organizations can no longer afford to purchase multiple solutions for making sense of their NGS data. Basepair comes with out-of-the-box support for all the main application areas including genetics (WGS, WES, panels, CRISPR), transcriptomics (bulk and single cell RNASeq) and epigenetics (ATACSeq, CUT&RUN/TAG, ChIPSeq, etc.), with direct support for running standard workflow languages (e.g., Nextflow and WDL) and the ability to deploy cGplatform if needed.
Cost-Effective Scalability
Data analysis requirements in genomics can vary significantly depending on project size and sample volume. Basepair’s cloud-based platform allows for cost-effective scalability, ranging from a pay-as-you-go usage model with no upfront license fee for customers with smaller numbers of samples, up to annual license fees that afford a much lower price per sample for those with larger sample volumes. Furthermore, connecting Basepair to an organization’s own AWS account enables the organization to benefit from economies of scale and any credits offered by its cloud provider, and enables Basepair to include a larger number of samples in its annual license structure at no additional cost.
As an additional consideration, customers can further optimize and simplify the cloud resources needed to power interactive analysis and NGS pipeline execution via an integration between Basepair and MemVerge’s Memory Machine Cloud. MemVerge’s software brings together the best of cloud resource automation with modern checkpoint and recovery techniques to enable interactive analyses and NGS pipeline execution at 60% lower compute cost. These cost savings are achieved through a performance file system tuned for NGS analysis and the use of EC2 Spot instances with built-in checkpoint and recovery, eliminating the risk of losing your work or having to repeatedly run batch jobs from the beginning due to Spot reclaims.
To supplement these standout, differentiating capabilities, Basepair also comes with all of the features you might expect of an industry leading bioinformatics platform, including real-time collaboration, compliance & data security (such as HIPAA for clinical data or GDPR for international researchers) and others.
Conclusion
Organizations seeking to deploy NGS workflows face a choice: managing everything directly on AWS or utilizing a bioinformatics SaaS platform approach such as the one offered through Basepair. Both options have their advantages and disadvantages, and the choice depends on specific organizational needs and priorities. Ultimately, the choice should align with the scale and nature of your research, what you consider as providing a competitive advantage, along with the time and resources you feel you can make available to put towards a successful cloud computing infrastructure. No single approach is universally superior, but understanding the pros and cons of each can help you make an informed decision. Consider your research goals, timeline, budget, and available expertise to select the most suitable approach for your NGS data analysis. By carefully considering and implementing these critical elements, organizations can harness the full potential of cloud-based genomics research while maintaining data security, accessibility, and cost-effectiveness.