Kompress: An data compression exercise in C

The students in Jim Holmes' Computer Science 2 class (fall 1996) worked on a semester-long project to produce a file compression program. Many of the students were also in Eric Gossett's Discrete Math class. The Huffman compression algorithm was presented in the Discrete Math class. The algorithm was then translated into software using the supporting data structures that were developed in the initial portions of the Computer Science 2 class.

The students learned about design reviews by submitting their program designs to formal reviews. The project was specified in the analysis document below (translated into HTML from the original LaTeX format). The initial sections of the design document appear after the analysis document.

KomPress -- A file compression project for CS2

Jim Holmes
Fall, 1996

Introduction

With the current rates of increase of information being processed and stored, especially audio and visual information, it is rather important that we find efficient ways to save the information in formats requiring minimal storage space.

Our CS2 project this year is a little less ambitious. Rather than trying to compress general forms of data, we will concentrate on compressing textual information. What is required is a program, complete with a corresponding analysis, design and user interface documentation which compresses textual information, stores the result in a file and removes the original uncompressed information. Of course we also will need to provide the reverse uncompress capability.

Input to System

The system will be a command-line driven system. It should be invoked by issuing the following command at the UN*X prompt.

KomPress [ -v] [filename]

where the -v option requests statistical information on the percentage reduction achieved and filename indicates the name of a file containing textual information to be compressed.

If filename is a file of ASCII characters, a Huffman compression algorithm will be used to do the actual encoding.
If filename is a file already compressed by the KomPress system, uncompression will be performed.
If the file contains any other kind of data an explanation for no action will be provided for the user.

Usual UN*X invocations should be available. In particular, the absence of correct option or file name information should result in the standard `usage' information being displayed.

Output from System

All normal information (i.e., statistical info) is to be sent to stdout.
All error messages are to be sent to stderr.
Upon successful completion of encoding (compressing), the information should be stored in a file named filename.bz (Bethel Zip), where filename was the original name of the file.
Upon successful completion of decoding (uncompressing) and in the event that the original filename contained a final bz extension, the uncompressed information should be stored in a file having name with .bz removed. If no such extension exists on the original compressed file, the user should be warned that the system intends to overwrite the original file and should be given the option of aborting the uncompression.

KomPress Design Document

Jim Holmes
Fall, 1996

abstract

KomPress is a simplified version of the UNIX compression utility, compress. It uses a recent, though rather simple Huffman compression algorithm and provides for a reduced set of user services specified via normal command line switches.

Introduction

There are a number of purposes of this project.

Introduction to Data Abstraction
One of the major concepts in the second computer science course, as described in the ACM curriculum is the idea of encapsulating data structures and then defining interfaces of functions providing the only means of access and manipulation of the data. In this project, we will see several places where this approach to program design actually is at least as helpful as it is a pain to do. Please be advised that as projects increase in size and as system designs are modified, the real benefit from data abstraction becomes very clear.
Gentle Introduction to Software Engineering
Computer scientists, especially budding ones, seem to learn by moving from the concrete to the abstract. This project, with its somewhat simplistic analysis and design documents, as well as the rather abbreviated design, code and testing reviews should be viewed as only a first pass at the very important software engineering skills demanded in the field today and taught in our upper level courses here at Bethel. In addition, all work on the project will be done using the make utility. Though not the most current of these kinds of development tools, it is probably most basic and the mastery of it provides skills that can be carried over to the more modern systems encountered in industry.
Provides actual lab experience for Huffman algorithm
One of the major topics in Discrete Math is the notion of a tree. And one of the more recent and interesting applications of trees is the Huffman compression algorithm. This project has as one of its major goals the student mastery of programming techniques commonly used with tree data structures. This involves extensive use of pointers and recursive programming. It is also usually the case that we all experience extensive use of debuggers as we track down pesky errors we have made in our programming.

Fundamental Algorithm

The essential idea of the Huffman compression algorithm is to build a tree having the various characters as leaves, but placing the frequently used characters as close to the root as possible and the infrequently use ones as far away as necessary. A corresponding character (taking 8 bits in ASCII) would then be encoded as a 0/1-sequence by noting whether we move to a left child (0) as we traverse the tree toward the character or the right child (1). If the frequently used characters are close to the top, then we shouldn't need anywhere near 8 levels in the tree to locate them. And the infrequently used characters, though probably requiring more than 8 characters, are used so infrequently that the overall effect is to reduce the total number of bits needed for the original character sequence, sometimes by as much as 40%!