TFM:File compression

From ProgSoc Wiki


File compression

Piers Johnson and Murray Grant

Compression is all about saving disk space. You are almost certainly aware of file compression software such as pkzip or WinZip if you are a PC user, or StuffIt if you prefer Macs. Similar applications exist for Unix.

Compression Programs

The standard compression utilities for Unix are compress, gzip, and bzip2.[1] When used with an archiver such as tar, you've got a bunch of neato utilities.



    gzip [ -acdfhlLnNrtvV19 ] [-S suffix] [ name ... ]
    gunzip [ -acfhlLnNrtvV ] [-S suffix] [ name ... ]

Gzip reduces the size of the named files using Lempel-Ziv coding (LZ77).[2] Whenever possible, each file is replaced by one with the extension .gz, while keeping the same ownership modes, access and modification times. If no files are specified, or if the file name is "-", stdin is compressed to stdout. Gzip will only compress normal files, and will ignore symbolic links. If a compressed file's name is too long for the file system, gzip will truncate it. Gzip keeps the original file name and timestamp in the compressed file. Compressed files can be restored using gzip -d, gunzip or zcat.

gunzip can decompress files compressed with gzip, zip, compress, compress -H and pack. The detection of the compression method is automatic. zip'd files can only be decompressed if they have just one member. Use unzip for other files.

Important command line options are:

  • -a --ascii Ascii text mode. Converts CR & LF according to your system.
  • -c --stdout --to-stdout Write to stdout. The output goes to stdout and the original files are left unchanged.
  • -d --decompress Decompress.
  • -h --help Help. Shows the help screen and exits.
  • -q --quiet Suppress warnings.
  • -r --recursive Recurse Directories. If any specified files are directories, they will be recursed into and all the files there will be compressed (if using gzip) or decompressed (gunzip).
  • -S .suf --suffix .suf Change suffix. Any suffix may be given (i.e. not just .suf), but beware of using suffixes used by other compression methods. A null suffix forces gunzip to try to decompress all files, regardless of suffix: gunzip -S "" * (or *.* for DOS)
  • -t --test Test. Check the archive integrity.
  • -v --verbose Verbose. Display more information.
  • -# --fast --best Compression type. Compression speed is regulated by the digit specified where -1 or --fast is the fastest, least compressing method; and -9 or --best is the slowest but most compressed method.

Both gzip and gunzip have man pages, which have more information on their usage.
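
As a rough illustration of the options above, the following sketch compresses one file at the fastest and the best level, tests both results and checks the round trip. The file names and the scratch directory are made up for the example, and it assumes a gzip that accepts the options listed above.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a compressible sample file (hypothetical name).
i=0
while [ "$i" -lt 500 ]; do
    echo "line $i: the quick brown fox jumps over the lazy dog"
    i=$((i + 1))
done > sample.txt

# -c writes to stdout, so sample.txt itself is left in place.
gzip -c -1 sample.txt > fast.gz
gzip -c -9 sample.txt > best.gz

# -t checks integrity without writing anything.
gzip -t fast.gz best.gz && echo "both archives are intact"

# Decompress to stdout and compare against the original.
gzip -dc best.gz | cmp - sample.txt && echo "round trip matches"

# Compare the sizes by eye.
ls -l sample.txt fast.gz best.gz
```

Because -c was used, sample.txt is still there afterwards; a plain gzip sample.txt would have replaced it with sample.txt.gz.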

An example of using gzip and gunzip :

kaleid@niflheim:~/pub$ ls -l
-rw-r--r--    1 ligos    users       15662 Mar 10 17:00 tfm-unix.txt
kaleid@niflheim:~/pub$ gzip tfm-unix.txt
kaleid@niflheim:~/pub$ ls -l
-rw-r--r--    1 ligos    users        6865 Mar 10 17:00 tfm-unix.txt.gz
kaleid@niflheim:~/pub$ gunzip tfm-unix.txt.gz
kaleid@niflheim:~/pub$ ls -l
-rw-r--r--    1 ligos    users       15662 Mar 10 17:00 tfm-unix.txt



    bzip2 [ -cdfkqstvzVL123456789 ] [ filenames ...  ]
    bunzip2 [ -fkvsVL ] [ filenames ...  ] 

The bzip2 program is a more advanced and recent compressor. It compresses files using the Burrows-Wheeler block sorting text compression algorithm, and Huffman coding. Compression is generally considerably better than that achieved by more conventional LZ77/LZ78-based compressors (such as gzip). bzip2 can give 10-30% smaller files than gzip, or it can produce a similarly-sized output while taking far longer.[3] In general, bzip2 functions in the same way as gzip. It has similar command line options and a helpful man page. It appends a .bz2 extension to any files compressed by it.

An example of bzip2:

kaleid@niflheim:~/pub$ ls -l
-rw-r--r--    1 ligos    users       15662 Mar 10 17:00 tfm-unix.txt
kaleid@niflheim:~/pub$ bzip2 tfm-unix.txt
kaleid@niflheim:~/pub$ ls -l
-rw-r--r--    1 ligos    users        6421 Mar 10 17:00 tfm-unix.txt.bz2
kaleid@niflheim:~/pub$ bunzip2 tfm-unix.txt.bz2
kaleid@niflheim:~/pub$ ls -l
-rw-r--r--    1 ligos    users       15662 Mar 10 17:00 tfm-unix.txt


    compress [-fv] [-b bits] [file...]
    compress [-cfv] [-b bits] [file]
    uncompress [-cfv] [ file...]

compress used to be dealt with in greater detail in years gone by. It has been superseded by the gzip and bzip2 programs, which have a better compression ratio. Files compressed using compress have a .Z extension and can be uncompressed using gunzip or uncompress.

Here's the relevant part of the same example as above, but using compress:

kaleid@niflheim:~/pub$ ls -l
-rw-r--r--    1 ligos    users        8299 Mar 10 17:00 tfm-unix.txt.Z

Archiving Programs

Unix zip

The Unix utility zip works almost identically to PKZIP. All the command line options are the same - except for -k, which simulates a PKZIP made zipfile. All this really means is that it trims Unix file names down to the 8.3 format used by DOS. To find out how zip works, you can just type zip at the command line and the parameters will be listed. For more information, type man zip.

Tar, the Tape Archiver

tar is possibly one of the most useful utilities around. It works much like zip, but was originally designed for use with tape backups. Usage:

    tar [ - ] c|r|t|u|x [ bBefFhilmopvwX014578 ] [ tarfile ]
    [ blocksize ] [ exclude-file ] [ -I include-file ]
    filename1 filename2 ...  -C directory filenameN ...

tar archives multiple files into a single archive, called a tarfile, and extracts them from it. A tarfile was historically a magnetic tape, but nowadays it's most commonly a regular file on a hard drive or CD/DVD. tar's actions are controlled by the single-character arguments. Arguments you might want to pass to tar are file or directory names that specify which files to archive or extract. In all cases, the appearance of a directory name refers recursively to the files and subdirectories of that directory: as the name suggests, tar is particularly good for archive backups.

Tar has many options, most of which are irrelevant for everyday use. If you really need to know what they do, you should read the man page for tar.

If you type tar --help on a machine with GNU tar, you will be shown the following list, among many other options:

This is GNU tar, the tape archiving program.
choose one of the following:
-A, --catenate,
   --concatenate       append tar files to an archive
-c, --create            create a new archive
-d, --diff,
   --compare           find differences between archive and file system
--delete                delete from the archive (not for use on mag tapes!)
-r, --append            append files to the end of an archive
-t, --list              list the contents of an archive
-u, --update            only append files that are newer than copy in archive
-x, --extract,
    --get               extract files from an archive
Other options:
-f, --file [HOSTNAME:]F use archive file or device F (default /dev/rmt8)
-G, --incremental       create/list/extract old GNU-format incremental backup
-g, --listed-incremental F create/list/extract new GNU-format incremental backup
-M, --multi-volume      create/list/extract multi-volume archive
-O, --to-stdout         extract files to standard output
-Z, --compress,
    --uncompress        filter the archive through compress
-z, --gzip,
    --ungzip            filter the archive through gzip
-j, --bzip2                        filter the archive through bzip2

The crucial set of crtux perform the following actions :

  • c Create a new tarfile.
  • r Write the named files on the end of a tarfile.
  • t List the contents of a tarfile.
  • u Add the named files to the tarfile if not already there or if they have been modified since they were last archived.
  • x Extract files from a tarfile. Only named files will be extracted, if no files are named, all files will be extracted. If a named file is a directory, all files in that directory will be extracted.
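
A minimal sketch of the c, r, t and x actions follows. The directory tree and file names are invented for the example, and everything happens in a scratch directory.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical directory tree to archive.
mkdir -p project/old_progs
echo "hello" > project/readme.txt
echo "old"   > project/old_progs/prog.c
echo "extra" > notes.txt

# c: create a new tarfile; naming a directory archives it recursively.
tar cf project.tar project

# r: append another file to the end of the existing tarfile.
tar rf project.tar notes.txt

# t: list the contents.
tar tf project.tar

# x: extract everything, here into a separate directory to avoid overwriting.
mkdir restore
(cd restore && tar xf ../project.tar)
cmp project/readme.txt restore/project/readme.txt && echo "extracted copy matches"
```

Note that r only works on an uncompressed tarfile, which is why no compression option is used here.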

Possibly the most important function modifier is f. This indicates that you are working with a file rather than a tape. As you may never work with a tape, I suggest you commit this to memory. Use it every time you use tar, without question. It needs to be the last option specified before you type the name of the file to manipulate.

Other useful options include:

  • -h Makes tar follow symbolic links and archive the files they point to, rather than archiving the link itself (which is the default).
  • -v The verbose option. Makes tar show what it's doing as it's working.
  • -z Newer versions of tar support this option, which filters the archive through gzip: the tarfile is compressed when creating and decompressed before untarring. This makes untarring .tar.gz files as simple as tar zxvf filename.tar.gz.
  • -j Even more recent versions of tar also support -j, which is similar to -z but uses bzip2 and bunzip2 rather than gzip and gunzip.

If a tarfile is given as "-", tar writes to stdout when archiving, or reads from stdin when extracting.

You may encounter a version of tar that doesn't support any magical decompression switches. The workaround is easy; use a pipe.

kaleid@niflheim:~$ bunzip2 -c sample.tar.bz2 | tar xvf -


kaleid@niflheim:~$  gunzip -c sample.tar.gz | tar xf -
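
The same pipe trick works in the other direction when creating an archive. The sketch below, using made-up file names in a scratch directory, builds a .tar.gz through a pipe, unpacks it through a pipe, and then unpacks it again with the z option to show that the results agree. It assumes a tar that supports z for the last step.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical files to archive.
mkdir sample
echo "one" > sample/a.txt
echo "two" > sample/b.txt

# Archiving: "-" as the tarfile makes tar write the archive to stdout.
tar cf - sample | gzip -c > sample.tar.gz

# Extracting: gunzip -c decompresses to stdout, and tar reads it from stdin.
mkdir piped
gunzip -c sample.tar.gz | (cd piped && tar xf -)

# Where tar supports z, this single command does the same job.
mkdir direct
(cd direct && tar xzf ../sample.tar.gz)

diff -r piped direct && echo "both methods extracted the same files"
```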

An example of an everyday use of tar could be something like this:

-rw-------  1 kaleid      39331 May 31  2011 sysprog.tar.gz
-rw-------  1 kaleid     157118 Mar  3  2011 tfm94.tar.gz
drwx------  2 kaleid       2048 Feb  5 14:56 tfm95
kaleid@niflheim:~/stuff$ tar -xzvf sysprog.tar.gz      


kaleid@niflheim:~/stuff$ ls sys*
drwx------  3 kaleid       1024 May 31  2011 sysprog
-rw-------  1 kaleid      39331 May 31  2011 sysprog.tar.gz

Notice that I used the z option to pipe the tarfile through gzip before sending it to tar. In fact, if you background the tar process (e.g. tar xzf thing.tar.gz &) then use ps to look at your processes, you'll see that tar has started a gunzip process. Notice also that directories have been recursed, with most of the files going into a directory called sysprog, and a couple into a subdirectory of that called old_progs. Furthermore, I used the v option so that all the file generation messages would appear. If I had not done this, nothing would have appeared to happen until the next prompt appeared. It is best not to use the v option if you background the process, so as not to have all the warnings and messages getting in your way. You may have spotted that I used -xzvf rather than xzvf. This is merely personal choice, as either will work.

If you want to send a number of files to someone else using email, one of the most sensible solutions is to tar them all into a file, which you can then attach. This should be all you need to do - in the past, many mailers didn't support MIME natively, but now any decent modern MUA can cope with binary attachments.[4]

Encoded Files

You have probably already seen encoded files by the time you read this. They turn up quite regularly in Usenet news[5]. Most of the time, decoding will be done for you by the mail or news reader you're using. But there are times when you'll need to encode or decode them manually.



    uuencode [ source-file ] decode_pathname

    uudecode [ -p ] [ encoded-file ]

uuencode converts a binary file into an encoded representation that can be sent using mail. It encodes the contents of the source file, or stdin if no source-file argument is given. The decode_pathname argument is required. The decode_pathname is included in the encoded file's header as the name of the file into which uudecode is to place the binary (decoded) data. uuencode also includes the permission modes of the source file (except setuid, setgid, and sticky-bits), so that the decoded file is created with those same permission modes. uuencode outputs to stdout. You must redirect the output by typing something like uuencode thing.file thing.file > thing.uue, otherwise you will just see it all rushing by your window.

uudecode reads an encoded-file, strips off any leading and trailing lines added by mailer programs, and recreates the original binary data with the filename and the mode specified in the header.

The encoded file is an ordinary portable character set text file; it can be edited by any text editor. It is best only to change the mode or decode_pathname in the header to avoid corrupting the decoded binary.

uudecode has the command line option -p, which makes it output to stdout instead of the file specified in the header.

An example of using uuencode and uudecode :

kaleid@niflheim:~/public_html/tuva$ ls -al
total 23
drwxr-xr-x  2 kaleid        512 Feb  6 14:01 .
drwx--x--x 14 kaleid       1536 Feb  3 06:00 ..
-rw-r--r--  1 kaleid       6619 Dec 14  2010 flag24.gif
-rw-r--r--  1 kaleid       2413 Dec 14  2010 hello.gif
-rw-r--r--  1 kaleid       2722 Dec 14 11:46 index.html
-rw-r--r--  1 kaleid       5530 Mar  7  2011 tuvaemb.gif
kaleid@niflheim:~/public_html/tuva$ uuencode hello.gif hello2.gif > hello.uue
kaleid@niflheim:~/public_html/tuva$ ls
flag24.gif      hello.uue       tuvaemb.gif 
hello.gif       index.html       
kaleid@niflheim:~/public_html/tuva$ head -3 hello.uue
begin 644 hello2.gif

Uuencoded files are about 35% larger than their sources - this is because 3 bytes become 4, along with the control information. This means that it will take longer to transmit an encoded file than to transmit a binary.
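
The overhead can be estimated with a little shell arithmetic. This is only a sketch: it assumes the traditional uuencode layout (45 source bytes per line, each line carrying one length character, the encoded data and a newline, plus a begin line, a terminating zero-length line and an end line) and uses the size of hello.gif and the name hello2.gif from the example above.

```shell
# Estimate the size of a uuencoded file from the size of its source.
src_bytes=2413                 # size of hello.gif in the listing above
name="hello2.gif"              # decode_pathname stored in the header

full_lines=$((src_bytes / 45))
rest=$((src_bytes % 45))

# Every full line is 1 length character + 60 encoded characters + newline.
body=$((full_lines * 62))

# A final short line encodes the leftover bytes in groups of three.
if [ "$rest" -gt 0 ]; then
    body=$((body + 1 + (rest + 2) / 3 * 4 + 1))
fi

# "begin 644 NAME" plus newline, then the terminating line and "end".
header=$((10 + ${#name} + 1))
trailer=$((2 + 4))
total=$((header + body + trailer))

echo "estimated uuencoded size: $total bytes"
echo "overhead: $(( (total - src_bytes) * 100 / src_bytes ))%"
```

For a file this small the fixed header and trailer lines push the figure a little above a third; they matter less as files get larger.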



    mimencode [-u] [-b] [-q] [-p] [file name] [-o outputfile]

mimencode converts a byte stream into, or out of, one of the standard mail encoding formats defined by MIME, the proposed standard for internet multimedia mail formats. By default, mimencode reads stdin, and sends a "base64" encoded version of the input to stdout. The command line options are:

  • -b Use "base64" encoding. This is the default, so isn't really necessary.
  • -q Use "quoted-printable" encoding instead of "base64".
  • -u Decode rather than encode.
  • -p Translate decoded CR and LF sequences according to that required by the file system (e.g. CR/LF for DOS, LF for Unix, CR for classic Mac OS). This option is only of use with base64 encoding (the -b option).
  • file name If a file name is given, this file will replace stdin as the input for mimencode.
  • -o outputfile Output to the file specified instead of to stdout.
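
On systems where mimencode is not installed, the same base64 round trip can be sketched with the base64 utility from GNU coreutils. This is a stand-in, not mimencode itself: base64 -d plays the role of mimencode -u, and the file names are made up for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical binary input: a few bytes that are not all printable.
printf 'GIF89a\001\002\003\377\n' > hello.bin

# Encode: the file is read and base64 text is written to stdout.
base64 hello.bin > hello.b64

# Decode: -d turns the text back into the original bytes.
base64 -d hello.b64 > hello2.bin

cmp hello.bin hello2.bin && echo "decoded copy is identical"

# The encoded form is plain text and safe to paste into a mail message.
cat hello.b64
```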

Here I am, trying to use mimencode :

kaleid@niflheim:~/public_html/tuva$ ls -al
total 23
drwxr-xr-x  2 kaleid        512 Feb  6 14:10 .
drwx--x--x 14 kaleid       1536 Feb  3 06:00 ..
-rw-r--r--  1 kaleid       6619 Dec 14  2010 flag24.gif
-rw-r--r--  1 kaleid       2413 Dec 14  2010 hello.gif
-rw-r--r--  1 kaleid       2722 Dec 14 11:46 index.html
-rw-r--r--  1 kaleid       5530 Mar  7  2011 tuvaemb.gif
kaleid@niflheim:~/public_html/tuva$ mimencode hello.gif -o hello.mime
kaleid@niflheim:~/public_html/tuva$ ls
flag24.gif      hello.mime      tuvaemb.gif 
hello.gif       index.html       
kaleid@niflheim:~/public_html/tuva$ head -2 hello.mime

kaleid@niflheim:~/public_html/tuva$ mimencode -u hello.mime > hello2.gif
kaleid@niflheim:~/public_html/tuva$ ls -al
total 30
drwxr-xr-x  2 kaleid        512 Feb  6 14:28 .
drwx--x--x 14 kaleid       1536 Feb  3 06:00 ..
-rw-r--r--  1 kaleid       6619 Dec 14  1994 flag24.gif
-rw-r--r--  1 kaleid       2413 Dec 14  1994 hello.gif
-rw-------  1 kaleid       3265 Feb  6 14:26 hello.mime
-rw-------  1 kaleid       2413 Feb  6 14:28 hello2.gif
-rw-r--r--  1 kaleid       2722 Dec 14 11:46 index.html
-rw-r--r--  1 kaleid       5530 Mar  7  1995 tuvaemb.gif

You will notice several things about the above: I used a redirection ( mimencode -u hello.mime > hello2.gif) instead of the -o option. There is no functional difference; either can be used. The encoded file (hello.mime) is slightly smaller than the uuencoded version was, but mimencode doesn't save any of the file permissions, and if you forget what a mimencoded file is, there's no way of finding out its name because mimencoded files don't have a header. Some mail and news readers which handle base64 attach headers which give you this information, but mimencode will fail[6] if these headers are there.
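
One way to strip such headers before decoding is sketched below. It assumes the headers are separated from the encoded body by a single blank line, as in a mail message, and that nothing follows the body. The message is invented for the example, and base64 -d from GNU coreutils stands in for mimencode -u.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a hypothetical saved message: two header lines, a blank line,
# then the base64 body (the text "hello world" plus a newline, encoded).
cat > message.txt <<'EOF'
Content-Type: text/plain; name="greeting.txt"
Content-Transfer-Encoding: base64

aGVsbG8gd29ybGQK
EOF

# Delete everything from line 1 through the first blank line.
sed '1,/^$/d' message.txt > body.b64

# Decode the bare body.
base64 -d body.b64 > greeting.txt
cat greeting.txt
```

The header lines are also where the original file name can be found, so it is worth reading them before throwing them away.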


A newer standard on Usenet is yEnc, which functions along similar lines to base64 and uu encoding but uses previously-forbidden characters, so its encodings take up significantly less space. A free encoder for Unix is available, which is useful for decoding any binary attachments that you might find while browsing the newsgroups.

  1. There is also pack, but it's even more obsolete than compress.
  2. Read up on this and compression in general when you've got the time. It's really interesting! ---Ed.
  3. bzip2 is also very memory-hungry, because the more memory it uses the better its compression. Don't use bzip2 to back up your hard drive.
  4. It will do this by including them as MIME attachments.
  5. See the Usenet News Service chapter.
  6. In fact, it's worse than that. mimencode won't fail. It'll produce garbage output and won't warn you it didn't work. You must strip headers completely from base64 encoded files if you are going to use mimencode.