NGS alignment tools have steadily become faster, to the point where sorting the aligned data now takes more time than the alignment itself. SAMtools 1.x took a big step forward on sorting by adding support for multi-threaded sorts.

We compared how quickly SAMtools and sambamba sort a 25 GB unsorted BAM file. Sambamba was about 2x faster: the following violin plot shows that SAMtools took about 20 minutes, while sambamba sorted the same file in about 10 minutes. The narrower distribution for sambamba also indicates that its run time is more predictable than that of SAMtools.

Figure: Sorting BAM files, SAMtools vs. sambamba

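One way to put numbers on both observations (the speedup and the narrower distribution) is sketched below. This is an illustration only: the function name `summarize` and the choice of median and coefficient of variation are assumptions, not the analysis used for the plot, and the individual run times are not listed in this post, so none are shown.

```python
import statistics


def summarize(times_a, times_b):
    """Compare two lists of run times (any consistent unit).

    Returns (speedup, spread_a, spread_b), where speedup is
    median(times_a) / median(times_b) and each spread is the
    coefficient of variation (stdev / mean) of that tool's runs.
    Each list needs at least two values for stdev to be defined.
    """
    speedup = statistics.median(times_a) / statistics.median(times_b)
    spread_a = statistics.stdev(times_a) / statistics.mean(times_a)
    spread_b = statistics.stdev(times_b) / statistics.mean(times_b)
    return speedup, spread_a, spread_b
```

Passing the SAMtools run times as `times_a` and the sambamba run times as `times_b` gives a speedup above 1 when sambamba is faster, and a smaller spread value for whichever tool is more consistent from run to run.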
We were curious why SAMtools took twice as long as sambamba, so we plotted CPU and memory usage for both applications.

Although SAMtools supports multiple threads, it did not parallelize the sort well. For the first half of the run, it used essentially a single thread, with occasional spikes in CPU usage; for the second half, CPU usage was a little over 20%. Sambamba used 30-40% of the CPU for the first half and over 90% for the second half.

Figure: CPU usage for sorting BAM files, SAMtools vs. sambamba

SAMtools did better in memory utilization, steadily consuming ~50 GB of memory after ramping up. Note that we had provisioned only 45 GB for SAMtools (1.5 GB for each of the 30 threads), so its actual usage exceeded what we specified; to leave headroom, one should specify only 80-90% of the available memory to SAMtools. Sambamba used close to the 45 GB we specified for the first 5 minutes, then dropped to 2 GB.
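The per-thread arithmetic above can be sketched as a small helper. This is an illustration only: the function name `per_thread_memory_mb`, the default fraction of 0.75 (which is what 45 GB out of 60 GB works out to on the test machine), and the use of decimal units (1 GB = 1000 MB) are assumptions, not something SAMtools provides. It relies on `samtools sort -m` being a per-thread limit, as in the command used for this test.

```python
def per_thread_memory_mb(total_gb, threads, fraction=0.75):
    """Return a per-thread memory value in MB for `samtools sort -m`.

    total_gb: RAM on the machine, in GB (assumed 1 GB = 1000 MB).
    threads:  value passed to `samtools sort -@`.
    fraction: share of RAM to budget for the sort (hypothetical default).
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    budget_mb = total_gb * 1000 * fraction
    return int(budget_mb // threads)


# 60 GB machine, 30 threads, 75% of RAM budgeted: 1500 MB per thread,
# which matches the -m 1500M used in this benchmark.
print(f"-m {per_thread_memory_mb(60, 30)}M")
```

Because SAMtools ended up using more than the total we specified, the fraction should be chosen so that this overshoot still fits in physical memory.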

Figure: Memory usage for sorting BAM files, SAMtools vs. sambamba

These data suggest that sambamba sorts BAM files faster because it makes better use of multiple processors.

Methods

The tests were run on an AWS c3.8xlarge instance (32 cores, 60 GB RAM), with the files stored on local storage. The unsorted BAM file was generated by STAR.

SAMtools (version 1.2) and sambamba (version 0.6.3) were each run 10 times, alternating between the two, to reduce bias. The applications were run with the following options:

sambamba sort -t 30 -m 45G -o Input.hg19.sambamba-sort.bam Input.hg19.Aligned.out.bam

samtools sort -@ 30 -m 1500M -T __sam_tmp__ -o Input.hg19.samtools-sort.bam Input.hg19.Aligned.out.bam
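The alternating runs could be driven by a small script along the following lines. This is a sketch under stated assumptions, not the script used for these tests: the names `run_alternating`, `run_checked`, and `COMMANDS` are hypothetical, and wall-clock time is measured around each process with `time.perf_counter`. The argument lists are copied from the two commands above.

```python
import subprocess
import time


def run_checked(argv):
    """Run one command and raise if it exits with a non-zero status."""
    subprocess.run(argv, check=True)


def run_alternating(commands, repeats=10, runner=run_checked):
    """Time each command `repeats` times, interleaving the tools.

    commands: dict mapping a label to an argument list.
    runner:   callable that executes one argument list; it can be
              replaced with a stub for a dry run.
    Returns a dict mapping each label to a list of wall-clock seconds.
    """
    timings = {label: [] for label in commands}
    for _ in range(repeats):
        # One run of every tool per round, so gradual changes in the
        # machine's state affect both tools about equally.
        for label, argv in commands.items():
            start = time.perf_counter()
            runner(argv)
            timings[label].append(time.perf_counter() - start)
    return timings


COMMANDS = {
    "sambamba": ["sambamba", "sort", "-t", "30", "-m", "45G",
                 "-o", "Input.hg19.sambamba-sort.bam",
                 "Input.hg19.Aligned.out.bam"],
    "samtools": ["samtools", "sort", "-@", "30", "-m", "1500M",
                 "-T", "__sam_tmp__",
                 "-o", "Input.hg19.samtools-sort.bam",
                 "Input.hg19.Aligned.out.bam"],
}

# To run the benchmark for real (requires both tools and the input file):
# timings = run_alternating(COMMANDS)
```

Interleaving the two tools within each round, rather than running all ten repeats of one tool first, is what keeps background effects from favoring either tool.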

CPU, memory, and disk usage were collected with dstat at 15-second intervals. The plots were generated with the Python packages seaborn and matplotlib.
