Comparing common compression tools using real-world data
Compression tools can significantly reduce the amount of disk space consumed by your data. In this article, we will look at the effectiveness of some compression tools on real-world data sets, make some recommendations, and perhaps persuade you that compression is worth the effort.
Compression on a single processor core
Lossless compression of files is a great way to save space, and therefore money, on storage costs. Not all compression tools are equal, however, and your experience will vary depending on which of the wide range of available tools you use. There is also a historical perception that compression and decompression are slow and time consuming, introducing unnecessary delays into the workflow.
Although you will need to decompress any compressed data before it is used, decompression can be very rapid and most of the time won't introduce significant delay to your workflow.
We will confine all of the tools to use a single core, to enable direct comparison.
Tools
We will be performing tests using the following popular open-source tools:
- gzip v1.10 (compiled from source)
- xz v5.2.2 (CentOS repository)
- bzip2 v1.0.6 (CentOS repository)
- pigz v2.4 - a parallel implementation of gzip (compiled from source)
- zstd v1.4.1 - touted as a fast compression algorithm (compiled from source)
Other tools considered include lrzip and lz4.
Some of these tools offer multi-threaded options, which give impressive results in conjunction with the QMUL HPC Cluster, where Research Data typically resides.
Datasets
Two datasets were selected for testing:
- a human genome GRCh38 reference assembly file, size ~3.3GB uncompressed
- the Linux kernel v5.2, size 831MB uncompressed
The human genome reference file contains only the characters {g,a,c,t,G,A,C,T,N}, and is typical of the genome reference files commonly used in bioinformatics, where effective compression matters because large data files are in frequent use. The scope of this article is to investigate how well a range of generalist tools works across a variety of datasets, while bearing in mind that specialist tools may produce better compression for a narrow range of data types. For example, GeCo compressed the human genome file to around half the size achieved by the general tools. However, these specialist tools carry risk: some of the projects are abandoned, some produce proprietary formats, and some don't necessarily decompress to a file identical to the original! It's better to stick to popular open-source tools under active development.
The Linux kernel has over 20 million lines of code, mostly heavily commented C.
The files were obtained via the following Linux commands, and subsequently uncompressed onto high-performance networked NVMe storage.
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.fna.gz
wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.2.tar.xz
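As a rough sketch of the preparation step (the exact commands aren't part of the original write-up; the assumption here is that the kernel source was kept as a single uncompressed tarball for the tests):
gunzip GCF_000001405.39_GRCh38.p13_genomic.fna.gz    # leaves the ~3.3GB FASTA file
xz -d linux-5.2.tar.xz                               # leaves linux-5.2.tar, the ~831MB uncompressed tarball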
Method
Compression tests were run on a QMUL HPC Cluster node with Intel Gold 6126 Skylake processors running CentOS 7.6. Single-core tests were run while the machine was otherwise idle, to avoid any thermal throttling from other computations running on the machine. For each compression tool, the test data was compressed once at each available compression level, observing runtime and resulting file sizes using the time and ls commands. The runs were repeated to confirm the results. pigz and zstd were constrained to use one thread for the single-core tests, since they attempt to use more cores by default.
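A minimal sketch of such a run, purely for illustration (the loop below uses gzip on the genome file; the filenames and level range are assumptions, and pigz and zstd were additionally given -p 1 and -T1 respectively to keep them on a single thread):
for level in $(seq 1 9); do
    /usr/bin/time -p gzip -k -${level} GCF_000001405.39_GRCh38.p13_genomic.fna   # time one compression run
    ls -l GCF_000001405.39_GRCh38.p13_genomic.fna.gz                             # record the compressed size
    rm GCF_000001405.39_GRCh38.p13_genomic.fna.gz                                # remove before the next level
done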
Definitions
Compression speed is defined as: (uncompressed data size in GB) / (time taken to compress in seconds)
Decompression speed is defined as: (uncompressed data size in GB) / (time taken to decompress in seconds)
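For example (figures purely illustrative), if the ~3.3GB genome file took 60 seconds to compress, the compression speed would be 3.3 / 60 ≈ 0.055 GB/s.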
Single-Core Test Results
Compiler optimisation
Some preliminary tests were run to compare gzip binaries compiled with and without the gcc compiler optimisation flags -O3 -march=native. On average, gzip 1.10 compiled from source with gcc and no compiler flags was 10-20% slower than both gzip 1.5 from the CentOS repository and gzip 1.10 compiled with the -O3 -march=native flags.
The CentOS rpm package had been compiled with the following standard CentOS toolchain flags:
-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong \
--param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic
For the following tests, we continued with gzip 1.10 compiled with the gcc optimisation flags.
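For anyone wishing to reproduce the optimised build, a sketch along these lines should work (the source URL and the standard GNU build steps are our assumptions):
wget https://ftp.gnu.org/gnu/gzip/gzip-1.10.tar.gz
tar -xf gzip-1.10.tar.gz && cd gzip-1.10
./configure CFLAGS="-O3 -march=native"    # apply the optimisation flags discussed above
make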
Genome dataset
Speeds
zstd was the fastest of the tools tested. zstd in fact offers 19 compression levels; only the first nine are shown on this graph.
Most decompression is single-threaded, except for pigz, which can offer improved performance if allowed to use two cores. Both single-core and dual-core pigz results are shown in the chart below. zstd was the clear winner for decompression speed, offering around ten times the performance of bzip2 and xz.
File size
xz was able to deliver the smallest file size for the compressed genomics data, although at a cost in speed. gzip and pigz both produce the same file size and are represented only once in the chart below; the only difference between them is computational performance.
Linux kernel dataset
Processing speeds
The Linux kernel dataset contains a lot of C code and English text. It serves well as an example of a large code base, and should compress and decompress better than the genome file.
zstd again performed very well compared with the alternatives, with bzip2 and xz the slowest. Considering the fastest option for each tool, at level 1 compression zstd was able to produce a file 20.5% of the original size in a blistering 2.18 seconds. At the other end of the scale, bzip2 took 56 seconds on its fastest setting, compressing the data to 17.5% of its original size, while the fastest setting of xz took 29 seconds and compressed the data to around 18%.
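To make the comparison concrete, the fastest setting of each tool corresponds to invocations along these lines (using linux-5.2.tar as the uncompressed kernel tarball is an assumption; -k simply keeps the input file):
zstd -1 -T1 -k linux-5.2.tar    # level 1, confined to one thread
bzip2 -1 -k linux-5.2.tar
xz -1 -T1 -k linux-5.2.tar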
The Linux kernel dataset decompressed at a higher rate than the genomics data. zstd out-performed the other tools again, taking around 1.4 seconds to decompress, with xz and bzip2 the slowest, taking around 8 seconds and 17 seconds respectively - still not bad for an 800MB uncompressed file.
Compressed file size
The kernel dataset compressed more effectively than the genome dataset, due to its more varied content. xz produced the smallest compressed files, an excellent 12.19% of the original file size, taking 314 seconds, with zstd getting close at its higher levels, producing a 13.02% result in 275 seconds when confined to a single core.
Comparing size against time showed the versatility of zstd across its compression levels: it can provide fast compression or high compression, and sometimes both at once. xz reached the parts that other tools couldn't in terms of compressed size, but with a trade-off against time. bzip2 was not particularly versatile, operating in a narrow time/size range. gzip, while quite fast, couldn't match the file sizes produced by the other tools.
Default compression levels
Compression tools give quite varied results depending on the level of compression selected by the user. The default level (i.e. the level chosen by the tool if none is specified) is rather arbitrary and may not be best for your use case, so it's worth knowing that other options may suit you better. Choosing a lower compression level will still give fairly good results in a very short computation time; conversely, if you are very constrained on space, crank up the level and wait a bit longer for your results.
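For instance, the level is just a numeric flag in each of these tools, so moving away from the default is trivial (each line below is an alternative invocation, and the filename is illustrative):
gzip -1 mydata.tar    # fastest gzip level, larger output (the default is -6)
gzip -9 mydata.tar    # slowest gzip level, smallest output
zstd -19 mydata.tar   # near-maximum zstd compression (the default is level 3)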
The charts below demonstrate the varying performance of the tools when the default level is selected. Note that pigz, while based on gzip and producing compatible .gz files of the same size (which can also be uncompressed with gzip), decompressed considerably faster. zstd can also be used to compress and decompress .gz files at speeds comparable to pigz, in addition to handling its own .zst format.
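As a brief illustration of that last point (this assumes your zstd build includes zlib support, which is required for gzip handling, and the filename is illustrative):
zstd --format=gzip -k bigfile.txt    # writes bigfile.txt.gz, readable by ordinary gzip
zstd -d bigfile.txt.gz -o copy.txt   # zstd can also decompress .gz files directly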
The main take-away is to choose the best level for your situation, rather than assuming the default option is best.
Summary
Although the charts show varying performance depending on the choice of tool, it's worth keeping some perspective. The 800MB kernel file compressed to between 12% and 24% of its original size, in times ranging from 2 to 314 seconds with single-core compression. Each compression tool had a range of only around 5% in file size between its slowest and fastest compression levels.
For sites offering files for download, where bandwidth and storage are provided at cost, the xz format produces small file sizes, although xz was one of the slower tools for decompression. For large files that are accessed often, fast decompression may be your main driver.
zstd is an incredibly versatile tool and performed very well across the board - the best all-rounder, coping well with both source code and tricky genome data. In addition to producing .zst files, with the --format option it can operate on .gz, .xz and lz4-format files too, with performance comparable to pigz and xz, enabling you to use a single tool to handle most of the compressed files you encounter.
If you are already compressing your larger data files as part of your workflow, well done. If you have traditionally been using bzip2 or gzip, consider whether you would get better performance from another tool. If you aren't currently compressing your larger files, maybe you could consider it - hopefully we've persuaded you that it's not as laborious as you thought. Since Research Projects pay for their Research Data Storage, it's worth considering, and it has knock-on benefits too: compressed files take less time to copy over the network and reduce the number of tapes we need for backups.
Title image: v2osk on Unsplash
Human Genome image: dollar_bin