Kompress: An data compression exercise in C
The students in Jim Holmes'
Computer Science 2 class (fall 1996) worked on a semester-long project to produce a file compression program. Many of the students were also in Eric Gossett's
Discrete Math class. The Huffman compression algorithm was presented in the
Discrete Math class. The algorithm was then translated into software using the supporting data structures that were developed in the initial portions of the
Computer Science 2 class.
The students learned about design reviews by submitting their program designs to formal reviews. The project was specified in the analysis document below (translated into HTML from the original LaTeX format). The initial sections of the design document appear after the analysis document.
KomPress -- A file compression project for CS2
Jim Holmes
Fall, 1996
Introduction
With the current rates of increase of information being processed and
stored, especially audio and visual information, it is rather important
that we find efficient ways to save the information in formats requiring
minimal storage space.
Our CS2 project this year is a little less ambitious. Rather than trying
to compress general forms of data, we will concentrate on compressing
textual information. What is required is a program, complete with
a corresponding analysis, design and user interface documentation which
compresses textual information, stores the result in a file and removes
the original uncompressed information. Of course we also will need to
provide the reverse uncompress capability.
Input to System
The system will be a command-line driven system. It should be invoked by
issuing the following command at the UN*X prompt.
KomPress [ -v] [filename]
where the -v option requests statistical information on the
percentage reduction achieved and filename indicates the name of
a file containing textual information to be compressed.
- If filename is a file of ASCII characters, a Huffman compression
algorithm will be used to do the actual encoding.
- If filename is a file already compressed by the KomPress system,
uncompression will be performed.
- If the file contains any other kind of data an explanation for no action
will be provided for the user.
Usual UN*X invocations should be available. In particular, the absence of
correct option or file name information should result in the standard
`usage' information being displayed.
Output from System
- All normal information (i.e., statistical info) is to be
sent to stdout.
- All error messages are to be sent to stderr.
- Upon successful completion of encoding (compressing), the
information should be stored in a file named filename.bz (Bethel Zip), where filename was the original name of the file.
- Upon successful completion of decoding (uncompressing)
and in the event that the original filename contained a
final bz extension, the uncompressed information should be
stored in a file having name with .bz removed. If no such
extension exists on the original compressed file, the user should
be warned that the system intends to overwrite the original file
and should be given the option of aborting the uncompression.
Jim Holmes
Fall, 1996
abstract
KomPress is a simplified version of the
UNIX compression
utility,
compress. It uses a recent, though rather simple
Huffman compression algorithm and provides for a reduced set of
user services specified via normal command line switches.
Introduction
There are a number of purposes of this project.
- Introduction to Data Abstraction
One of the major concepts in the second computer science course,
as described in the ACM curriculum is the idea of encapsulating
data structures and then defining interfaces of functions providing
the only means of access and manipulation of the data. In this
project, we will see several places where this approach to
program design actually is at least as helpful as it is a pain
to do. Please be advised that as projects increase in size and
as system designs are modified, the real benefit from data
abstraction becomes very clear.
- Gentle Introduction to Software Engineering
Computer scientists, especially budding ones, seem to learn by
moving from the concrete to the abstract. This project, with
its somewhat simplistic analysis and design documents, as well
as the rather abbreviated design, code and testing reviews
should be viewed as only a first pass at the very important
software engineering skills demanded in the field today and taught
in our upper level courses here at Bethel. In addition, all
work on the project will be done using the make utility.
Though
not the most current of these kinds of development tools, it is
probably most basic and the mastery of it provides skills that
can be carried over to the more modern systems encountered in
industry.
- Provides actual lab experience for Huffman algorithm
One of the major topics in Discrete Math is the notion of a tree. And one of the more recent and interesting applications
of trees is the Huffman compression algorithm. This project
has as one of its major goals the student mastery of programming
techniques commonly used with tree data structures. This involves
extensive use of pointers and recursive programming. It is also
usually the case that we all experience extensive use of debuggers
as we track down pesky errors we have made in our programming.
Fundamental Algorithm
The essential idea of the Huffman compression algorithm is to build a tree
having the various characters as leaves, but placing the frequently
used characters as close to the root as possible and the infrequently
use ones as far away as necessary. A corresponding character (taking 8
bits in ASCII) would then be encoded as a 0/1-sequence by noting whether
we move to a left child (0) as we traverse the tree toward the character
or the right child (1). If the frequently used characters are close to the
top, then we shouldn't need anywhere near 8 levels in the tree to locate
them. And the infrequently used characters, though probably requiring more
than 8 characters, are used so infrequently that the overall effect is to
reduce the total number of bits needed for the original character sequence,
sometimes by as much as 40%!