Julian's Science Experiments

    Image & Data Compression
    K-12 Experiments & Background Information
    For Science Labs, Lesson Plans, Class Activities & Science Fair Projects
    For Elementary, Middle and High School Students and Teachers

    Data Compression Background Information


    Data compression is the process of encoding information using fewer bits than an unencoded representation would use, through use of specific encoding schemes.

    Image compression is the application of data compression to digital images. In effect, the objective is to reduce redundancy in the image data so that it can be stored or transmitted in an efficient form.


    Data compression is a set of steps for packing data into a smaller space, while allowing the original data to be seen again. Compression is a two-way process: a compression algorithm can be used to make a data package smaller, but it can also be run the other way, to decompress the package into its original form. Data compression is useful in computing to save disk space, or to reduce the bandwidth used when sending data (e.g., over the internet).

    Lossless compression

    Lossless compression packs data in such a way that the compressed package can be decompressed, and the data can be pulled out exactly the same as it went in. This is very important for computer programs and archives, since even a very small change in a computer program can make it unusable.

    This type of compression works by reducing how much waste space is in a piece of data. For example, if you receive a data package which contains "AAAAABBBB", you could compress that into "5A4B", which has the same meaning but takes up less space. This type of compression is called "run-length encoding", because you define how long the "run" of a character is. In the above example, there are two runs: a run of 5 A's, and another of 4 B's.
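    The run-length idea above can be sketched in a few lines of Python. This is a minimal illustration, not a production format: the function names rle_encode and rle_decode are made up for this example, and the decoder assumes the original text contains no digits.

```python
from itertools import groupby

def rle_encode(text: str) -> str:
    """Replace each run of identical characters with '<count><char>'."""
    return "".join(f"{len(list(group))}{char}" for char, group in groupby(text))

def rle_decode(encoded: str) -> str:
    """Invert rle_encode. Assumes the original text contained no digits."""
    result = []
    count = ""
    for char in encoded:
        if char.isdigit():
            count += char          # counts may span several digits, e.g. "12A"
        else:
            result.append(char * int(count))
            count = ""
    return "".join(result)

packed = rle_encode("AAAAABBBB")
print(packed)               # 5A4B
print(rle_decode(packed))   # AAAAABBBB
```

    Decoding gives back exactly what went in, which is what makes the scheme lossless.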

    The problem with run-length encoding is that it only pays off when the data contains long runs of the same value. If you receive a package with "ABBAABAAB" inside, that can be compressed into "1A2B2A1B2A1B"; but that's longer than the original! In this case, there's another method that can be used: checking how often a particular value comes up in the whole data package. This is often called frequency compression.

    The most common kind of frequency compression is called Huffman coding, after David A. Huffman, the computer scientist who came up with the idea. The basic plan is to give each distinct value in a piece of data a code: values that crop up all the time get shorter codes, and values that only show up once or twice get longer codes.
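    As a rough sketch of how such a code table could be built, the Python below repeatedly merges the two least frequent groups of symbols, which is the usual textbook way to construct a Huffman code. The function name huffman_codes and the sample message are assumptions made for this example.

```python
import heapq
from collections import Counter

def huffman_codes(data: str) -> dict:
    """Build a prefix-free code table: frequent symbols get shorter codes."""
    counts = Counter(data)
    if len(counts) == 1:                      # edge case: only one distinct symbol
        return {next(iter(counts)): "0"}
    # Heap entries: (total count, tie-breaker, {symbol: code-so-far}).
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)     # the two least frequent groups
        n2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

message = "AAAAAAAABBBC"          # hypothetical sample: A is common, C is rare
codes = huffman_codes(message)
bits = "".join(codes[ch] for ch in message)
print(codes)
print(len(bits), "bits instead of", 8 * len(message))   # 16 bits instead of 96
```

    In this sample the common "A" ends up with a 1-bit code while "B" and "C" get 2-bit codes, so the 12-character message needs 16 bits rather than the 96 bits of plain 8-bit characters (not counting the space needed to store the code table itself).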

    Examples of lossless compression

    • Archiving formats: Zip, GZip, bZip2, 7-Zip, etc.
    • Images/diagrams: GIF, PNG, PCX
    • Program compressors: UPX

    Lossy compression

    For some types of data, compression can go a lot further; this is most often the case with media files, like music and images. Beyond a certain level of fine-grain detail, or past a particularly high tone, people do not notice if the information is missing. As a result, it can simply be removed from the data.

    Of course, this won't work for computer programs and other such data where every piece is important; throwing away large pieces of a computer program is generally unhealthy for the program.
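    One simple way to picture "throwing away fine detail" is quantization: rounding each value to a coarser grid. The sketch below is only an illustration under assumed sample values and an assumed step size of 16; real formats such as JPEG and MP3 use far more elaborate schemes.

```python
def quantize(samples, step):
    """Lossy step: snap each sample to the nearest multiple of `step`."""
    return [round(s / step) for s in samples]

def dequantize(levels, step):
    """Rebuild approximate samples; the discarded detail cannot be recovered."""
    return [level * step for level in levels]

samples = [100, 103, 98, 101, 150, 147, 149, 151]   # hypothetical 8-bit values
step = 16
levels = quantize(samples, step)
restored = dequantize(levels, step)
print(levels)     # [6, 6, 6, 6, 9, 9, 9, 9]
print(restored)   # [96, 96, 96, 96, 144, 144, 144, 144]
```

    The restored values are close to the originals but not identical. As a side effect, the quantized data now contains long runs of the same value, which a lossless method can then pack very tightly.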

    Examples of lossy compression

    • Images: JPEG
    • Audio: MP3, Windows Media Audio (WMA)
    • Video: MPEG, DivX, Windows Media Video (WMV)

    Topics of Interest

    In computer science and information theory, data compression or source coding is the process of encoding information using fewer bits (or other information-bearing units) than an unencoded representation would use, through use of specific encoding schemes.

    As with any communication, compressed data communication only works when both the sender and receiver of the information understand the encoding scheme. For example, this text makes sense only if the receiver understands that it is intended to be interpreted as characters representing the English language. Similarly, compressed data can only be understood if the decoding method is known by the receiver.

    Compression is useful because it helps reduce the consumption of expensive resources, such as hard disk space or transmission bandwidth. On the downside, compressed data must be decompressed to be used, and this extra processing may be detrimental to some applications. For instance, a compression scheme for video may require expensive hardware for the video to be decompressed fast enough to be viewed as it's being decompressed (the option of decompressing the video in full before watching it may be inconvenient, and requires storage space for the decompressed video). The design of data compression schemes therefore involves trade-offs among various factors, including the degree of compression, the amount of distortion introduced (if using a lossy compression scheme), and the computational resources required to compress and decompress the data.

    Lossless versus lossy compression:

    Lossless compression algorithms usually exploit statistical redundancy in such a way as to represent the sender's data more concisely without error. Lossless compression is possible because most real-world data has statistical redundancy. For example, in English text, the letter 'e' is much more common than the letter 'z', and the probability that the letter 'q' will be followed by the letter 'z' is very small.

    Another kind of compression, called lossy data compression or perceptual coding, is possible if some loss of fidelity is acceptable. Generally, a lossy data compression will be guided by research on how people perceive the data in question. For example, the human eye is more sensitive to subtle variations in luminance than it is to variations in color. JPEG image compression works in part by "rounding off" some of this less-important information. Lossy data compression provides a way to obtain the best fidelity for a given amount of compression. In some cases, transparent (unnoticeable) compression is desired; in other cases, fidelity is sacrificed to reduce the amount of data as much as possible.
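    The statistical redundancy that lossless methods exploit can be measured. A minimal sketch, assuming we look only at how often each single character occurs (and ignore patterns such as 'q' being followed by 'u'): the Shannon entropy below estimates the average number of bits per character that such a frequency-based code would need. The function name and the sample sentence are made up for this example.

```python
import math
from collections import Counter

def entropy_bits_per_symbol(text: str) -> float:
    """Shannon entropy of the symbol frequencies, in bits per symbol."""
    counts = Counter(text)
    total = len(text)
    return sum((n / total) * math.log2(total / n) for n in counts.values())

sample = "the quick brown fox jumps over the lazy dog and then sleeps"  # hypothetical
print(round(entropy_bits_per_symbol(sample), 2), "bits per character")
print(entropy_bits_per_symbol("abcd"))   # 2.0 (four equally likely symbols)
print(entropy_bits_per_symbol("aaaa"))   # 0.0 (completely predictable)
```

    Any result below 8 bits per character suggests that storing the text as plain 8-bit characters wastes space that a lossless compressor can reclaim.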

    Lossless compression schemes are reversible so that the original data can be reconstructed, while lossy schemes accept some loss of data in order to achieve higher compression.

    However, lossless data compression algorithms will always fail to compress some files; indeed, no lossless algorithm can make every possible input smaller, and any such algorithm will necessarily fail to compress data containing no discernible patterns. Attempts to compress data that has been compressed already will therefore usually result in an expansion, as will attempts to compress all but the most trivially encrypted data.
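    This limit is easy to observe with Python's standard zlib module. The sketch below compresses 10,000 bytes with an obvious pattern and 10,000 pseudo-random bytes; the seed and the sample data are arbitrary choices made for this example.

```python
import random
import zlib

rng = random.Random(0)                       # fixed seed so runs are repeatable
patterned = b"ABAB" * 2500                   # 10,000 bytes with an obvious pattern
noise = bytes(rng.getrandbits(8) for _ in range(10_000))  # no discernible pattern

for label, data in [("patterned", patterned), ("noise", noise)]:
    packed = zlib.compress(data, 9)
    assert zlib.decompress(packed) == data   # lossless round trip
    print(f"{label}: {len(data)} -> {len(packed)} bytes")
```

    The patterned data should shrink to a tiny fraction of its size, while the random bytes should come out slightly larger than they went in, because the compressed format adds a small amount of overhead of its own.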

    In practice, lossy data compression will also come to a point where compressing again does not work, although an extremely lossy algorithm, such as one that always removes the last byte of a file, will always compress a file up to the point where it is empty.

    An example of lossless vs. lossy compression is the following string:

        25.888888888

    This string can be compressed as:

        25.[9]8

    Interpreted as, "twenty five point 9 eights", the original string is perfectly recreated, just written in a smaller form. In a lossy system, using

        26

    instead, the exact original data is lost, at the benefit of a smaller file size.
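    The difference can be checked directly. The sketch below takes the string 25.888888888 ("twenty five point 9 eights"), stores it losslessly as a prefix plus a run of nine 8s, and compares that with a lossy version that keeps only the rounded value. The variable names are made up for this example.

```python
original = "25.888888888"

# Lossless: store the prefix, the run length, and the repeated digit.
prefix, run_length, digit = "25.", 9, "8"     # "twenty five point 9 eights"
assert prefix + digit * run_length == original

# Lossy: keep only the rounded value; the digits after the point are gone.
lossy = str(round(float(original)))
print(lossy)                                  # 26
print(abs(float(original) - float(lossy)))    # size of the error introduced
```

    The lossless form rebuilds every character of the original, while nothing in "26" says what the discarded digits were.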

    Applications:

    The above is a very simple example of run-length encoding, wherein large runs of consecutive identical data values are replaced by a simple code with the data value and length of the run. This is an example of lossless data compression. Lossless compression is often used to optimize disk space on office computers, or to make better use of the connection bandwidth in a computer network. For symbolic data such as spreadsheets, text, executable programs, etc., losslessness is essential because changing even a single bit cannot be tolerated (except in some limited cases).

    For visual and audio data, some loss of quality can be tolerated without losing the essential nature of the data. By taking advantage of the limitations of the human sensory system, a great deal of space can be saved while producing an output which is nearly indistinguishable from the original. These lossy data compression methods typically offer a three-way tradeoff between compression speed, compressed data size and quality loss.

    Lossy image compression is used in digital cameras, to increase storage capacities with minimal degradation of picture quality. Similarly, DVDs use the lossy MPEG-2 Video codec for video compression.

    In lossy audio compression, methods of psychoacoustics are used to remove non-audible (or less audible) components of the signal. Compression of human speech is often performed with even more specialized techniques, so that "speech compression" or "voice coding" is sometimes distinguished as a separate discipline from "audio compression". Voice compression is used in Internet telephony, for example, while audio compression is used for CD ripping and is decoded by audio players.

    There is a close connection between machine learning and compression: a system that predicts the posterior probabilities of a sequence given its entire history can be used for optimal data compression (by using arithmetic coding on the output distribution), while an optimal compressor can be used for prediction (by finding the symbol that compresses best, given the previous history). This equivalence has been used as justification for data compression as a benchmark for "general intelligence".
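    A small sketch can make the prediction side of this connection concrete. Assuming a very simple model that predicts each symbol from counts of the symbols seen so far (with add-one smoothing), a symbol predicted with probability p costs about -log2(p) bits under arithmetic coding, so a better predictor gives a shorter encoding. The function name, the two-letter alphabet and the sample strings are all made up for this example.

```python
import math
from collections import Counter

def adaptive_code_length(text: str, alphabet: str) -> float:
    """Approximate bits an arithmetic coder would need under an adaptive model.

    Each symbol is predicted from the counts of symbols seen so far
    (add-one smoothing); a symbol with probability p costs -log2(p) bits.
    """
    counts = Counter()
    seen = 0
    bits = 0.0
    for symbol in text:
        p = (counts[symbol] + 1) / (seen + len(alphabet))
        bits += -math.log2(p)
        counts[symbol] += 1       # learn from the symbol just coded
        seen += 1
    return bits

alphabet = "AB"
easy = adaptive_code_length("AAAAAAAAAAAAAAAB", alphabet)   # predictable history
hard = adaptive_code_length("ABBABAABBBABAABA", alphabet)   # harder to predict
print(round(easy, 2), round(hard, 2))
```

    Both strings are 16 symbols long, but the predictable one needs roughly half as many bits, because the model's guesses about the next symbol are right more often.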

    Source: Wikipedia (All text is available under the terms of the GNU Free Documentation License and Creative Commons Attribution-ShareAlike License.)

    Last updated: June 2013
    Copyright © 2003-2013 Julian Rubin