Introduction to ATAC-Seq Analysis & Pipelines
Amit U Sinha, Ph.D
Last Updated: March 5, 2020
The ATAC-seq method
Assay for Transposase-Accessible Chromatin using sequencing, more commonly referred to as ATAC-seq, is a high-throughput sequencing technology that relies on a transposase to study chromatin accessibility at a genome-wide level. Chromatin accessibility is the degree to which nuclear macromolecules can physically contact chromatinized DNA, and it plays a role in regulating gene expression. NGS adapters are loaded onto the transposase, which fragments the DNA in open chromatin regions and inserts the adapters in the same reaction. The prepared library is then sequenced on an NGS platform.
The ATAC-seq method was first described by Jason Buenrostro and colleagues, who were looking for new methods to study open chromatin structure, nucleosome positions, and transcription factor binding. To overcome the limitations of previous methods, the team developed a new assay that profiles the overall epigenetic state from a smaller amount of starting material (cells). The method depends on the hyperactive Tn5 transposase used earlier in the Nextera library preparation approach, which is now owned by Illumina.
ATAC-seq uses hyperactive Tn5 transposase to cut DNA and insert adapters at regions of increased accessibility. These regions are later sequenced on an NGS platform. First, cells are collected and lysed, followed by the transposition reaction and purification. Next, PCR amplification and library preparation are performed. Finally, Illumina high-throughput sequencing is performed to obtain reads in paired-end mode.
Unlike ChIP-seq, DNase-seq, MNase-seq, and FAIRE-seq, which require a larger number of cells as starting material, ATAC-seq can work with a smaller sample size. ATAC-seq is considered a faster and more cost-effective approach to studying genome-wide chromatin accessibility. It is also used to study nucleosome positions in accessible regions of the genome. Furthermore, it does not require a separate adapter ligation step, gel purification, or crosslink reversal.
Overview of the ATAC-seq pipeline
The ATAC-seq pipeline consists of several steps that process the raw data into meaningful results. The first step in ATAC-seq data analysis is quality control (QC) of the raw reads. The fastp tool can be used to perform quality control, adapter trimming, quality filtering, and per-read quality cutting.
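As a rough illustration of this QC step, the sketch below assembles a fastp command line for one paired-end sample. The file names, the Phred threshold of 20, and the helper name `build_fastp_command` are assumptions made for the example, not settings taken from any particular pipeline.

```python
import shlex


def build_fastp_command(r1, r2, out_r1, out_r2, report_prefix, threads=4):
    """Return a fastp argument list for one paired-end sample (illustrative settings)."""
    return [
        "fastp",
        "-i", r1,                        # input read 1
        "-I", r2,                        # input read 2
        "-o", out_r1,                    # trimmed/filtered read 1
        "-O", out_r2,                    # trimmed/filtered read 2
        "--detect_adapter_for_pe",       # enable adapter auto-detection for paired-end data
        "--cut_tail",                    # sliding-window quality cutting from the 3' end
        "-q", "20",                      # assumed threshold: a base is "qualified" at Phred >= 20
        "-j", f"{report_prefix}.json",   # JSON QC report
        "-h", f"{report_prefix}.html",   # HTML QC report
        "-w", str(threads),              # worker threads
    ]


if __name__ == "__main__":
    # Hypothetical file names, used only to show the resulting command.
    cmd = build_fastp_command(
        "sample_R1.fastq.gz", "sample_R2.fastq.gz",
        "sample_R1.trimmed.fastq.gz", "sample_R2.trimmed.fastq.gz",
        "sample_fastp",
    )
    print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where fastp is installed; building it as a list rather than a string avoids shell quoting problems with unusual file names.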
The next step involves mapping the trimmed reads to the reference genome of interest using a tool like Bowtie2. The alignment can be performed in end-to-end, paired-end mode because the adapters are already trimmed. Before proceeding to the next step, mapping statistics can be obtained to see how many read pairs mapped concordantly. All unmapped reads need to be filtered out, as they carry no positional information and may be the result of sequencing errors.
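To make the concordant-mapping check concrete, here is a minimal sketch that extracts the concordant rate from the alignment summary Bowtie2 prints for paired-end runs. The summary text is an example in the layout Bowtie2 typically writes to stderr; the regular expressions assume that layout, and `concordant_rate` is a hypothetical helper name.

```python
import re

# Example paired-end summary in the layout Bowtie2 typically prints (numbers are illustrative).
EXAMPLE_SUMMARY = """\
10000 reads; of these:
  10000 (100.00%) were paired; of these:
    650 (6.50%) aligned concordantly 0 times
    8823 (88.23%) aligned concordantly exactly 1 time
    527 (5.27%) aligned concordantly >1 times
    ----
    650 pairs aligned concordantly 0 times; of these:
      34 (5.23%) aligned discordantly 1 time
96.70% overall alignment rate
"""

_PAIRED = re.compile(r"^\s*(\d+) \([\d.]+%\) were paired; of these:\s*$")
_CONCORDANT = re.compile(
    r"^\s*(\d+) \([\d.]+%\) aligned concordantly (0 times|exactly 1 time|>1 times)\s*$"
)


def concordant_rate(summary_text):
    """Return the fraction of read pairs that aligned concordantly at least once."""
    total_pairs = None
    concordant = 0
    for line in summary_text.splitlines():
        paired = _PAIRED.match(line)
        if paired:
            total_pairs = int(paired.group(1))
            continue
        hit = _CONCORDANT.match(line)
        if hit and hit.group(2) != "0 times":
            concordant += int(hit.group(1))  # unique and multi-mapping concordant pairs
    if not total_pairs:
        raise ValueError("no paired-end summary line found")
    return concordant / total_pairs


if __name__ == "__main__":
    print(f"concordant pairs: {concordant_rate(EXAMPLE_SUMMARY):.2%}")
```

A low concordant rate is usually a reason to revisit trimming or the choice of reference before continuing with filtering.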
ATAC-seq data contains many reads that align to the mitochondrial genome. Mitochondrial DNA is nucleosome-free and therefore very accessible to Tn5 insertion. These reads, along with low-quality reads and reads that are not properly paired, must be removed before proceeding to the downstream analysis.
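The filtering rules in this step can be sketched directly on SAM text records. The example below drops unmapped reads, reads that are not properly paired, mitochondrial reads, and reads below a mapping-quality cutoff. The mitochondrial contig names (`chrM`, `MT`) and the MAPQ cutoff of 30 are assumptions that depend on the reference and on project conventions; in practice this is commonly done with a tool such as samtools rather than hand-written code.

```python
MITO_NAMES = {"chrM", "MT"}   # assumption: mitochondrial contig name depends on the reference
FLAG_PROPER_PAIR = 0x2        # SAM flag bit: read mapped in a proper pair
FLAG_UNMAPPED = 0x4           # SAM flag bit: read unmapped


def keep_read(sam_line, min_mapq=30):
    """Return True if an alignment record passes all filters."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])     # column 2: bitwise FLAG
    rname = fields[2]         # column 3: reference sequence name
    mapq = int(fields[4])     # column 5: mapping quality
    if flag & FLAG_UNMAPPED:
        return False
    if not flag & FLAG_PROPER_PAIR:
        return False
    if rname in MITO_NAMES:
        return False
    return mapq >= min_mapq


def filter_sam(lines, min_mapq=30):
    """Yield header lines unchanged and only those records that pass keep_read."""
    for line in lines:
        if line.startswith("@") or keep_read(line, min_mapq):
            yield line


if __name__ == "__main__":
    records = [
        "@HD\tVN:1.6",
        "r1\t99\tchr1\t100\t42\t50M\t=\t200\t150\t*\t*",   # kept
        "r2\t99\tchrM\t100\t42\t50M\t=\t200\t150\t*\t*",   # mitochondrial, dropped
        "r3\t99\tchr1\t100\t5\t50M\t=\t200\t150\t*\t*",    # low MAPQ, dropped
    ]
    for line in filter_sam(records):
        print(line)
```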
The PCR amplification step can amplify the same original DNA fragment many times, which leads to multiple read pairs aligning to the same genomic position. These PCR duplicates overrepresent certain DNA fragments and need to be removed before downstream analysis. To remove duplicates, the Picard MarkDuplicates program can be used. In the subsequent step, the insert size (the length of the fragment spanned by the read pair R1 and R2) is checked with Picard CollectInsertSizeMetrics. Checking the insert size reveals the fragment length distribution of a sample.
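The idea behind both checks can be shown on a toy list of fragments. This is a deliberately simplified sketch: it treats two fragments as duplicates when their chromosome, start, and end are identical, whereas Picard MarkDuplicates applies more detailed criteria (for example, read orientation and which copy to retain). The function names and the `(chrom, start, end)` tuple layout are assumptions for the example.

```python
from collections import Counter


def remove_duplicates(fragments):
    """Keep the first occurrence of each (chrom, start, end) fragment, preserving order."""
    seen = set()
    unique = []
    for frag in fragments:
        if frag not in seen:
            seen.add(frag)
            unique.append(frag)
    return unique


def insert_size_histogram(fragments):
    """Count fragments per insert size, where size = end - start (0-based, half-open)."""
    return Counter(end - start for _chrom, start, end in fragments)


if __name__ == "__main__":
    fragments = [
        ("chr1", 100, 250),
        ("chr1", 100, 250),   # same coordinates as the first fragment: treated as a duplicate
        ("chr1", 300, 500),
        ("chr2", 100, 250),   # same coordinates on a different chromosome: not a duplicate
    ]
    unique = remove_duplicates(fragments)
    for size, count in sorted(insert_size_histogram(unique).items()):
        print(size, count)
```

Plotting the histogram for a real sample gives the fragment length distribution that CollectInsertSizeMetrics reports.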
After the pre-processing steps above, peaks can be identified using MACS2. Peak calling finds regions of read enrichment that correspond to potential open chromatin regions. MACS2 outputs the called peaks in a BED-like format (narrowPeak), which contains peak coordinates in addition to fold enrichment, p-values, and other statistics.
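For readers who want to work with the peak file programmatically, the sketch below parses narrowPeak lines and keeps peaks under a q-value cutoff. It assumes the ten-column narrowPeak layout that MACS2 writes, in which columns 7 to 9 hold fold enrichment, -log10(p-value), and -log10(q-value), and column 10 holds the summit offset from the peak start. The example line, the cutoff of 0.01, and the helper names are illustrative.

```python
import math
from typing import NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int               # 0-based start
    end: int                 # end (exclusive)
    name: str
    fold_enrichment: float   # column 7
    neg_log10_p: float       # column 8
    neg_log10_q: float       # column 9
    summit_offset: int       # column 10, relative to start


def parse_narrowpeak(line):
    """Parse one tab-separated narrowPeak line into a Peak."""
    f = line.rstrip("\n").split("\t")
    return Peak(f[0], int(f[1]), int(f[2]), f[3],
                float(f[6]), float(f[7]), float(f[8]), int(f[9]))


def significant_peaks(lines, max_q=0.01):
    """Return peaks whose q-value is at most max_q (compared on the -log10 scale)."""
    min_score = -math.log10(max_q)
    peaks = (parse_narrowpeak(line) for line in lines if line.strip())
    return [p for p in peaks if p.neg_log10_q >= min_score]


if __name__ == "__main__":
    # Illustrative line, not real data.
    example = "chr1\t1000\t1500\tsample_peak_1\t250\t.\t8.5\t30.2\t25.0\t240"
    for peak in significant_peaks([example]):
        print(peak.chrom, peak.start + peak.summit_offset, peak.fold_enrichment)
```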
Overview of results
The coverage file obtained from MACS2 contains the details of the read coverage in peaks. This file needs to be converted into bedGraph format so it can be visualized using a genome browser (e.g. IGV or the UCSC Genome Browser). Genome browsers allow visualization of the peak signal alongside genome annotations such as transcription factor binding sites, promoters, exons, and intergenic regions. If the goal of the project is to examine selected regions, a heatmap can be computed; one is always included in Basepair’s ATAC-seq analysis reports.
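To show what a bedGraph track contains, the following sketch computes per-base fragment depth from a toy list of fragments and writes it as bedGraph lines (chromosome, 0-based start, end, value). It is a simplified illustration of the format, not a reproduction of how MACS2 builds its pileup; the function name and the input layout are assumptions.

```python
from collections import defaultdict


def coverage_bedgraph(fragments):
    """Yield bedGraph lines for runs of constant, non-zero depth.

    fragments: iterable of (chrom, start, end) with 0-based, half-open coordinates.
    """
    # Record +1 where a fragment starts and -1 where it ends, per chromosome.
    events = defaultdict(lambda: defaultdict(int))
    for chrom, start, end in fragments:
        events[chrom][start] += 1
        events[chrom][end] -= 1

    for chrom in sorted(events):
        depth = 0
        prev = None
        for pos in sorted(events[chrom]):
            delta = events[chrom][pos]
            if delta == 0:
                continue  # depth does not change here, so the current run continues
            if prev is not None and depth > 0:
                yield f"{chrom}\t{prev}\t{pos}\t{depth}"
            depth += delta
            prev = pos


if __name__ == "__main__":
    toy = [("chr1", 100, 200), ("chr1", 150, 250)]
    for line in coverage_bedgraph(toy):
        print(line)
```

Zero-depth regions are omitted, which keeps the track compact; genome browsers treat missing intervals as no signal.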
1. Sun, Y., Miao, N., & Sun, T. (2019). Detect accessible chromatin using ATAC-sequencing, from principle to applications. Hereditas, 156(1), 1-9.
2. Klemm, S. L., Shipony, Z., & Greenleaf, W. J. (2019). Chromatin accessibility and the regulatory epigenome. Nature Reviews Genetics, 20(4), 207-220.
3. Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y., & Greenleaf, W. J. (2013). Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nature Methods, 10(12), 1213.
4. Buenrostro, J. D., Wu, B., Chang, H. Y., & Greenleaf, W. J. (2015). ATAC‐seq: a method for assaying chromatin accessibility genome‐wide. Current Protocols in Molecular Biology, 109(1), 21-29.
5. Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884-i890.