AFRL DSRC: Speed Data File Compression With Pigz or Pbzip2

The pigz and pbzip2 commands are available on AFRL DSRC allocated systems.

Pigz

Pigz ("pig-zee") is a parallel data file compression utility. On a compute node it speeds up compression by spreading the work across the node's cores.

To use pigz, you must first load the module.

module load pigz

Usage

Pigz is a parallel implementation of gzip and a fully functional replacement for it. Pigz exploits multiple cores when compressing data files. By default, pigz uses all available cores on the node on which it is running.

Syntax:

tar -cf filename.tar.gz --use-compress-program=pigz directory_name

Example 1: Using pigz with the tar command to compress a directory (here, a directory named myfilename). This syntax uses all the cores available on the node to perform the compression. Run this only on a compute node.
tar -cf myfilename.tar.gz --use-compress-program=pigz myfilename

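Example 2: Using pigz to compress files through a shell pipe. This method is slower because it uses fewer cores, but it can be run on the interactive nodes as long as the number of cores is kept low (x=4). The "-f -" argument makes tar write the archive to standard output so that pigz can read it from the pipe.
tar -cf - ./myDir | pigz -p x > myfilename.tar.gz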

Note: Replace x with the number of cores pigz should use. 4 seems optimal for interactive use.
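
The pipe form in Example 2 can be tried end to end on a small scratch directory. The following is a minimal sketch, not a site-provided script: the scratch directory, the file names, and the compress_gz helper are hypothetical, and the fallback to gzip is an assumption added only so the sketch also runs where pigz is not on the PATH.

```shell
# Hypothetical scratch area so the demo never touches real data.
workdir=$(mktemp -d)
mkdir -p "$workdir/myDir"
printf 'sample data\n' > "$workdir/myDir/a.txt"
printf 'more sample data\n' > "$workdir/myDir/b.txt"

# Hypothetical helper: compress stdin to stdout.
# Uses pigz capped at 4 threads when available, otherwise plain gzip
# (the fallback is an assumption for portability of this sketch).
compress_gz() {
    if command -v pigz >/dev/null 2>&1; then
        pigz -p 4
    else
        gzip
    fi
}

# "-f -" sends the tar archive to stdout so the pipe has data to compress.
tar -cf - -C "$workdir" myDir | compress_gz > "$workdir/myDir.tar.gz"

# List the members to confirm the result is a readable gzip-compressed tar file.
listing=$(tar -tzf "$workdir/myDir.tar.gz")
printf '%s\n' "$listing"
```

Listing the archive right after creating it is a cheap way to catch an empty or truncated file before the original directory is removed.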

Observations

Because pigz uses all available cores on a node by default, invoke it on a compute node rather than an interactive node. If you must run it on an interactive node, limit the number of cores with the "-p" option shown in Example 2. The general idea is to use a dedicated compute node to generate compressed files faster than standard tar/gzip, which also avoids impacting interactive users.
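
The rule above can be sketched as a small helper that chooses the value for "-p". This is an illustration under stated assumptions, not a site-provided tool: it assumes the batch scheduler sets the PBS_JOBID environment variable inside a job (substitute the variable your scheduler actually sets), and the names LOGIN_CAP and pick_threads are hypothetical.

```shell
# Hypothetical cap for interactive nodes, taken from the x=4 guidance above.
LOGIN_CAP=4

# Hypothetical helper: print the thread count to pass to "pigz -p".
# Assumption: PBS_JOBID is set only inside a batch job on a compute node.
pick_threads() {
    cores=$(nproc)
    if [ -n "${PBS_JOBID:-}" ]; then
        # Inside a batch job: use every core the node reports.
        echo "$cores"
    elif [ "$cores" -lt "$LOGIN_CAP" ]; then
        # Interactive node with fewer cores than the cap.
        echo "$cores"
    else
        # Interactive node: stay at the cap.
        echo "$LOGIN_CAP"
    fi
}

threads=$(pick_threads)
echo "would run: pigz -p $threads"
```

A wrapper like this lets the same archive command be reused in batch scripts and interactive sessions without editing the core count by hand.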

Pbzip2

Pbzip2 is a parallel implementation of the bzip2 block-sorting file compression utility and achieves near-linear speedup on multi-core compute nodes.

To use pbzip2, you must first load the module.

module load pbzip2

Usage

Pbzip2 is a fully functional replacement for bzip2. By default, pbzip2 will use all available cores on the local compute node.

Syntax:

tar -cf filename.tar.bz2 --use-compress-program=pbzip2 directory_name

Example 1: Using pbzip2 with the tar command to compress a directory (here, a directory named myfilename). This syntax uses all the cores available on the node to perform the compression. Run this only on a compute node. The output is bzip2 data, so the archive is named with a .tar.bz2 extension.
tar -cf myfilename.tar.bz2 --use-compress-program=pbzip2 myfilename

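Example 2: Using pbzip2 to compress files through a shell pipe. This method is slower because it uses fewer cores, but it can be run on the interactive nodes as long as the number of cores is kept low (x=4). The "-f -" argument makes tar write the archive to standard output so that pbzip2 can read it from the pipe. Pbzip2 expects the core count attached directly to the flag, with no space (for example, -p4).
tar -cf - ./myDir | pbzip2 -px > myfilename.tar.bz2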

Note: Replace x with the number of cores pbzip2 should use. 4 seems optimal for interactive use.
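
The pbzip2 pipe form can be checked in the same way on a small scratch directory. This is a minimal sketch under stated assumptions: the scratch directory and file names are hypothetical, and the fallback to bzip2 (or skipping the step when neither tool is installed) is added only so the sketch runs outside the DSRC environment.

```shell
# Hypothetical scratch area so the demo never touches real data.
workdir=$(mktemp -d)
mkdir -p "$workdir/myDir"
printf 'sample data\n' > "$workdir/myDir/a.txt"
archive="$workdir/myDir.tar.bz2"

# Prefer pbzip2 capped at 4 threads; fall back to bzip2 if it is all we have.
# "-c" writes the compressed stream to stdout in both tools.
if command -v pbzip2 >/dev/null 2>&1; then
    tar -cf - -C "$workdir" myDir | pbzip2 -c -p4 > "$archive"
elif command -v bzip2 >/dev/null 2>&1; then
    tar -cf - -C "$workdir" myDir | bzip2 -c > "$archive"
else
    echo "no bzip2-compatible compressor found; skipping" >&2
fi

# A bzip2 stream starts with the three bytes "BZh"; check them as a
# quick sanity test that the archive is not gzip data or empty.
if [ -f "$archive" ]; then
    magic=$(head -c 3 "$archive")
    echo "magic bytes: $magic"
fi
```

Checking the leading bytes is a quick way to confirm that a file named .tar.bz2 really holds bzip2 data.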

Observations

Because pbzip2 uses all available cores on a node by default, invoke it on a compute node rather than an interactive node. If you must run it on an interactive node, limit the number of cores with the "-p" option shown in Example 2. The general idea is to use a dedicated compute node to generate compressed files faster than standard tar/bzip2, which also avoids impacting interactive users.