Speed Data File Compression With Pigz or Pbzip2

The pigz and pbzip2 commands are available on AFRL DSRC allocated systems.

Pigz

Pigz ("pig-zee") is a data file compression utility that can dramatically speed up compression on a compute node by spreading the work across the node's cores.

To use pigz, you must first load the module.

module load pigz

Usage

Pigz is a parallel implementation of gzip and a fully functional replacement for it. Pigz exploits multiple cores when compressing data files. By default, pigz uses all available cores on the node on which it is running.

Syntax:

tar -cf filename.tar.gz --use-compress-program=pigz directory_name

Example 1: Using pigz with the tar command to compress a directory (here, a directory named myfilename). This syntax uses all the cores available on the node to perform the compression. Do this only on a compute node.
tar -cf myfilename.tar.gz --use-compress-program=pigz myfilename

Example 2: Using pigz to compress files through a shell pipe. This method is slower, but can be done on the interactive nodes as long as the number of cores is kept low (x=4). The "-" after -cf tells tar to write the archive to standard output so that pigz can read it from the pipe.
tar -cf - ./myDir | pigz -p x > myfilename.tar.gz

Note: Replace x with the number of cores pigz should use. 4 seems optimal for interactive use.
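
The pipe form in Example 2 can be sketched end to end as shown here. This is only an illustrative sketch, not a site-provided script: the directory demo_dir, the archive name, and the THREADS variable are hypothetical, and the fallback to plain gzip when pigz is not on the PATH is an assumption added so the sketch still runs where the module has not been loaded.

```shell
#!/bin/sh
# Sketch: archive a directory through a pipe with a capped thread count.
# All names below (demo_dir, demo_dir.tar.gz, THREADS) are hypothetical.
set -e

THREADS=4                      # cap suggested above for interactive nodes
WORK=$(mktemp -d)              # scratch area so nothing existing is touched
mkdir -p "$WORK/demo_dir"
echo "sample data" > "$WORK/demo_dir/a.txt"

# Assumption: fall back to plain gzip if pigz has not been loaded.
if command -v pigz >/dev/null 2>&1; then
    COMPRESS="pigz -p $THREADS"
else
    COMPRESS="gzip"
fi

# "-f -" makes tar write the archive to standard output for the pipe.
tar -cf - -C "$WORK" demo_dir | $COMPRESS > "$WORK/demo_dir.tar.gz"

# Verify: test the compressed stream, then list the archive contents.
gzip -t "$WORK/demo_dir.tar.gz"
LISTING=$(tar -tzf "$WORK/demo_dir.tar.gz")
echo "$LISTING"
```

Because pigz writes standard gzip-format output, the ordinary gzip -t and tar -tzf commands can check the result.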

Observations

Because pigz uses all available cores on a node by default, invoke it on a compute node rather than an interactive node, unless you limit the number of cores with the "-p" option shown in Example 2. The general idea is to use a dedicated compute node to generate compressed files faster than standard tar/gzip. Doing so also avoids impacting interactive users.
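
One way to apply this rule consistently is a small helper that picks the core count based on where the command runs. The following is a hedged sketch: pick_threads is a hypothetical function name, and it assumes the batch scheduler sets the PBS_JOBID environment variable inside a job, which may not hold on every system.

```shell
# Hypothetical helper: choose how many cores a compressor may use.
pick_threads() {
    # Assumption: PBS_JOBID is set only inside a batch job.
    if [ -n "${PBS_JOBID:-}" ]; then
        # On a compute node: report the number of online cores.
        getconf _NPROCESSORS_ONLN
    else
        # On an interactive node: stay at the low limit suggested above.
        echo 4
    fi
}

# Possible use with the pipe form from Example 2:
#   tar -cf - ./myDir | pigz -p "$(pick_threads)" > myfilename.tar.gz
```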

Pbzip2

Pbzip2 is a parallel implementation of the bzip2 block-sorting file compression utility and achieves near-linear speedup on multi-core compute nodes.

To use pbzip2, you must first load the module.

module load pbzip2

Usage

Pbzip2 is a fully functional replacement for bzip2. By default, pbzip2 will use all available cores on the local compute node.

Syntax:

tar -cf filename.tar.bz2 --use-compress-program=pbzip2 directory_name

Example 1: Using pbzip2 with the tar command to compress a directory (here, a directory named myfilename). This syntax uses all the cores available on the node to perform the compression. Do this only on a compute node. Because pbzip2 writes bzip2-format output, the archive is named with a .tar.bz2 extension.
tar -cf myfilename.tar.bz2 --use-compress-program=pbzip2 myfilename

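Example 2: Using pbzip2 to compress files through a shell pipe. This method is slower, but can be done on the interactive nodes as long as the number of cores is kept low (x=4). The "-" after -cf tells tar to write the archive to standard output, and -c tells pbzip2 to write its result to standard output. Unlike pigz, pbzip2 expects the core count attached directly to -p with no space (for example, -p4).
tar -cf - ./myDir | pbzip2 -c -px > myfilename.tar.bz2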

Note: Replace x with the number of cores pbzip2 should use. 4 seems optimal for interactive use.
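
A round trip through the pipe form, including extraction, might look like the sketch below. It is illustrative only: demo_dir, restore, and THREADS are hypothetical names, and the fallback to serial bzip2 when pbzip2 is not on the PATH is an assumption added so the sketch can run where the module has not been loaded.

```shell
#!/bin/sh
# Sketch: bzip2-format round trip through pipes with a capped thread count.
# demo_dir, restore, and THREADS are hypothetical names.
THREADS=4
WORK=$(mktemp -d)
mkdir -p "$WORK/demo_dir" "$WORK/restore"
echo "sample data" > "$WORK/demo_dir/a.txt"

# Assumption: use serial bzip2 when pbzip2 has not been loaded.
if command -v pbzip2 >/dev/null 2>&1; then
    BZ="pbzip2 -p$THREADS"     # pbzip2 takes the count attached to -p
elif command -v bzip2 >/dev/null 2>&1; then
    BZ="bzip2"
else
    BZ=""
fi

if [ -n "$BZ" ]; then
    # Compress: tar streams to stdout, the compressor writes .tar.bz2.
    tar -cf - -C "$WORK" demo_dir | $BZ -c > "$WORK/demo_dir.tar.bz2"
    # Extract: decompress to stdout and let tar unpack the stream.
    $BZ -dc "$WORK/demo_dir.tar.bz2" | tar -xf - -C "$WORK/restore"
    RESTORED=$(cat "$WORK/restore/demo_dir/a.txt")
    echo "$RESTORED"
else
    echo "no bzip2-compatible compressor found" >&2
fi
```

Archives written by pbzip2 are ordinary bzip2 files, so they can also be unpacked with serial bzip2 on systems where pbzip2 is not installed.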

Observations

Because pbzip2 uses all available cores on a node by default, invoke it on a compute node rather than an interactive node, unless you limit the number of cores with the "-p" option shown in Example 2. The general idea is to use a dedicated compute node to generate compressed files faster than standard tar/bzip2. Doing so also avoids impacting interactive users.