
Speed Data File Compression With Pigz or Pbzip2

The pigz and pbzip2 commands are available on our Lightning, Predator, Spirit, and Utility Server systems.

Pigz

Pigz ("pig-zee") is a data file compression utility that can dramatically speed up compression on a compute node.

Usage

Pigz is a parallel implementation of gzip and a fully functional replacement for it. Pigz exploits multiple cores when compressing data files; by default, it uses all available cores on the node on which it is running.
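
Pigz can also be invoked directly on individual files, just as gzip would be. A minimal sketch (the file name below is a placeholder):

# Compress a single file, keeping the original and limiting pigz to 4 cores
pigz -p 4 -k largefile.dat

# Decompress; "pigz -d largefile.dat.gz" is equivalent if unpigz is not installed
unpigz largefile.dat.gz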

Syntax:

tar -cf filename.tar.gz --use-compress-program=pigz directory_name

Example 1: Using pigz with the tar command to compress a directory. This syntax uses all of the cores available on the node to perform the compression, so do this only on a compute node.
tar -cf myfilename.tar.gz --use-compress-program=pigz myfilename

Example 2: Using pigz to compress files through a shell pipe. This method is slower, but can be done on the interactive nodes as long as the number of cores is kept low (for example, x=4).
tar -cf - myfilename | pigz -p x > myfilename.tar.gz

Note: Replace x with the number of cores pigz should use. 4 seems optimal for interactive use.
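
To extract an archive created this way, pass the same program to tar; recent GNU tar invokes it with -d automatically when decompressing. Because pigz writes standard gzip format, ordinary tar/gzip can read the archive as well.

tar -xf myfilename.tar.gz --use-compress-program=pigz

# or, with standard tools
tar -xzf myfilename.tar.gz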

Recommendation

Copy your data to $CENTER, then log in to the Utility Server to post-process. When complete, start an interactive batch session (via "qsub -I -A $ACCOUNT", as sketched below) and execute the tar command from Example 1 above, or add it to your job script. Then archive the results.
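
An interactive request might look like the following; the resource and queue options are only illustrative, so adjust select, ncpus, walltime, and the queue name to match the target system.

qsub -A $ACCOUNT -l select=1:ncpus=16 -l walltime=01:00:00 -q debug -I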

Example 3: Compressing files using a job script.

#!/bin/csh
#PBS -A $ACCOUNT
#PBS -N compress_results
# Add any resource and walltime directives your system requires,
# for example "#PBS -l select=..." and "#PBS -l walltime=...".
cd $CENTER
tar -cf myfilename.tar.gz --use-compress-program=pigz myfilename
exit
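
If the script above is saved as, say, compress_results.pbs (the file name is arbitrary), it can be submitted and monitored with:

qsub compress_results.pbs
qstat -u $USER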

Then archive the results from an interactive node.

cp myfilename.tar.gz $ARCHIVE_HOME

or

/usr/bin/rcp myfilename.tar.gz ${ARCHIVE_HOST}:${ARCHIVE_HOME}/
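
The copy can also be appended to the job script so that compression and archiving run unattended in one batch job. This is only a sketch, and it assumes the compute nodes on your system are allowed to rcp to ${ARCHIVE_HOST}; otherwise archive from a login or interactive node as shown above.

#!/bin/csh
#PBS -A $ACCOUNT
#PBS -N compress_and_archive
cd $CENTER
tar -cf myfilename.tar.gz --use-compress-program=pigz myfilename
/usr/bin/rcp myfilename.tar.gz ${ARCHIVE_HOST}:${ARCHIVE_HOME}/
exit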

Observations

Since pigz uses all available cores on a node to compress files, it should be invoked on a compute node and not on an interactive node unless you limit the number of cores with the "-p" option shown in Example 2 above. The general idea is to use a dedicated compute node to generate compressed files faster than standard tar/gzip, while avoiding any impact on interactive users.
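
A quick way to see the benefit on a compute node is to time the serial and parallel versions on the same directory (file names are placeholders; the speedup depends on the data and the core count):

time tar -czf serial.tar.gz myfilename
time tar -cf parallel.tar.gz --use-compress-program=pigz myfilename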

Pbzip2

Pbzip2 is a parallel implementation of the bzip2 block-sorting file compression utility and achieves near-linear speedup on multi-core compute nodes.

Usage

Pbzip2 is a fully functional replacement for bzip2. By default, pbzip2 will use all available cores on the local compute node.
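
Like pigz, pbzip2 can be run directly on individual files; a minimal sketch (the file name is a placeholder):

# Compress with 4 cores, keeping the original file
pbzip2 -p4 -k largefile.dat

# Decompress; decompression is parallel only for files pbzip2 itself created
pbzip2 -d largefile.dat.bz2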

Syntax:

tar -cf filename.tar.bz2 --use-compress-program=pbzip2 directory_name

Example 1: Using pbzip2 with the tar command to compress a directory. This syntax uses all of the cores available on the node to perform the compression, so do this only on a compute node.
tar -cf myfilename.tar.bz2 --use-compress-program=pbzip2 myfilename

Example 2: Using pbzip2 to compress files through a shell pipe. This method is slower, but can be done on the interactive nodes as long as the number of cores is kept low (for example, x=4).
tar -cf - myfilename | pbzip2 -px > myfilename.tar.bz2

Note: Replace x with the number of cores pbzip2 should use; pbzip2 takes the count with no space after -p (for example, -p4). 4 seems optimal for interactive use.
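
To extract, pass pbzip2 to tar the same way; the archive is ordinary bzip2 format, so standard tar/bzip2 can read it as well.

tar -xf myfilename.tar.bz2 --use-compress-program=pbzip2

# or, with standard tools
tar -xjf myfilename.tar.bz2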

Recommendation

Copy your data to $CENTER, then log in to the Utility Server to post-process. When complete, start an interactive batch session (via "qsub -I -A $ACCOUNT") and execute the tar command from Example 1 above, or add it to your job script. Then archive the results.

Example 3: Compressing files using a job script.

#!/bin/csh
#PBS -A $ACCOUNT
#PBS -N compress_results
# Add any resource and walltime directives your system requires,
# for example "#PBS -l select=..." and "#PBS -l walltime=...".
cd $CENTER
tar -cf myfilename.tar.bz2 --use-compress-program=pbzip2 myfilename
exit

Then archive the results from an interactive node.

cp myfilename.tar.bz2 $ARCHIVE_HOME

or

/usr/bin/rcp myfilename.tar.bz2 ${ARCHIVE_HOST}:${ARCHIVE_HOME}/

Observations

Since pbzip2 uses all available cores on a node to compress files, it should be invoked on a compute node and not on an interactive node unless you limit the number of cores with the "-p" option shown in Example 2 above. The general idea is to use a dedicated compute node to generate compressed files faster than standard tar/bzip2, while avoiding any impact on interactive users.
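
Before choosing a value for -p on a shared interactive node, it can help to check how many cores the node actually has; on Linux systems, for example:

nproc
# or, if nproc is unavailable
grep -c ^processor /proc/cpuinfo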