Wednesday, December 09, 2009

Compressing Many

We now know how to compress individual strings, down to the various notions of optimal entropy inherent in Lempel-Ziv to Burrows-Wheeler methods. But what is the theory of compressing a set S of strings, ie, what are optimal compression methods and suitable notions of entropy? I can imagine approaches based on (i) models for the set of strings, (ii) grammar based compression, (iii) converting S into a tree, or (iv) concatenating the individual strings in some order to reduce to the single string problem, etc., but I am curious about the optimal entropy.

Labels:

Hi Muthu,

What is the notions of optimal entropy inherent in the Burrows-Wheeler method? Lempel-Ziv compression achieves the entropy of a Markov source (up to lower order terms), but the common variants of Burrows-Wheeler compression do not...

Also, about grammar based compression, do you know of any good theoretical analysis of such compression? (Hopefully even a survey?)

One of my strong feeling is that Markov models are not a very good model to measure compressors on (partly because English is not generated by Markov model, and in fact there are known "gaps" between the compressibility of English and of Markov models which have reasonable parameters); but I don't know of any other good model to measure on (hopefully something that would model English better, or at least would model another aspect of English than that modeled by Markov models)

So, I don't think that we even understand the case of one string.

(about BWT compression, you might be interested in my paper with Haim from CPM '07:
)

11:06 AM
Anonymous said...

* Burrows-Wheeler achieves optimality wrt k-th order empirical entropy, which may be more interesting than entropy of Markov sources in some cases. A good reference is this work of Giovanni Manzini.

* Grammar based compression. There is a paper by Charikar et al, and another by Rytter both of which you can find by search. There seems to be this survey as well.

- Metoo

3:10 AM

Hi Muthu,

Thanks for the info!

* The work you point to shows an upper bound of 5 times the empirical entropy (plus a lower order term), rather than 1 times the empirical entropy. The term "optimality", as far as I know, only means that you're 1 times the empirical entropy.

Also, optimality on a Markov source is a strictly weaker property than optimality w.r.t. empirical entropy. In my original message I referred to our proof that BWT based compressors are provably not optimal on Markov sources (even those with realistic parameters). This implies that they're not optimal with respect to empirical entropy as well.

* Thanks for the link to the survey. I didn't know it before. I wonder if there are any known compressors that are optimal on Markov models (or w.r.t. empirical entropy) and are also good w.r.t the smallest grammar. Also, another interesting thing is that since English is not captured by Markov models but also not captured by small grammars, I wonder if one could define a reasonably-natural hybrid model that would capture both concepts, and hopefully would be a better model for English as far as compression is concerned..

11:20 PM
Anonymous said...

* Right, I was comfortable with near-optimality of BW wrt empirical entropy.

* In general, would be good to see some reductions between these models (Markov, grammar, hybrid ...).

-- Metoo

5:25 AM