Saturday, March 18, 2006

A Compression Problem

This is a problem of coming up with a workable formulation.

Consider compressing a string s=s_1...s_n using a Ziv-Lempel algorithm (say you have compressed s_1...s_i; replace the longest prefix s_{i+1}...s_j of s_{i+1}...s_n that appears earlier in s by say two pointers to its occurrence followed by s_{j+1}, and continue). What is its compressed size? In practice, one can compress s and look at the resulting number of bytes, but one needs an abstract compression measure to reason about compression from a theoretical point of view. The compression measure algorithmicists are happy with so to count the number of phrases (s_{i+1} .. s_j's) of the string that is generated by the Ziv-Lempel algorithm. It is a nice combinatorial quantity to work with, and has been useful in several algorithmic results including compressed matching.

Similarly, the problem of interest now is to come up with a compression measure for the Burrows-Wheeler (BW) compression. It is a gem of a compression method. It permutes the string and compresses the outcome using move-to-front, run-length and possibly arithmetic coding. The analyses in theoretical computer science have studied compression ratio of the string under the BW method in terms of the empirical entropy of the string; see for example, Manzini. It will be nice to have a different compression measure, something simple and combinatorial.

0 Comments:

Post a Comment

<< Home