## Monday, March 27, 2006

I just came back from a walk to the east village (not lower or far east, just the plain east side). The days and nights of March are deceptive: they hint at the vanishing winter and the arriving spring, but catch you somewhere in between most times. Then there are Mondays, when NY chefs take a break, restaurants close early, the fish and the produce are not fresh, and one might as well stay home and avoid the subs. Finally, the madness: 2nd Avenue Deli is shut down (an altercation with the landlord, it seems) and the perennial Kiev has closed after 30 years. Blocks have shuttered storefronts and graffiti. There are concrete barriers and city workers laying streets, construction crews working under rigged lights to get new condos ready soonest, machinery in noisy motion throughout, and the whole city feels like a war zone. Some days a walk through New York manages to be not charming.

## Saturday, March 25, 2006

### Theory circles

I saw a wonderful cartoon by Patrick Hardin. The caption says, "To be honest, I never would have invented the wheel if not for Urg's groundbreaking theoretical work with the circle." The cartoon itself is a gem: think of cavemen.

### Car Talk

I have wondered: are there any Tom and Rays out there who could do a ComputerTalk in the inimitable style of Car Talk, that Lousy Radio Show? ComputerTalk, DesktopTalk, PCTalk, LapTalk? A quick Google search does not reveal anything like Car Talk for computers.

## Saturday, March 18, 2006

### New York City: The Browsing Lane.

NY Songlines: a lovely ASCII description of New York City, block by block, building by building. It is positioned as a virtual walking tour, but I think of it as a browsing treat. If you are a New York City resident, you learn not to react when Keri Russell is in the elevator with you with her groceries, Natalie Portman has brunch with family not far from you at Sarabeth's, or Yoko Ono gets a drink at the table next to you. You pay attention to where Daniel Day-Lewis is a regular, where Jim Jarmusch hangs out, or where Al Pacino and Penelope Ann Miller step out of the taxi in Carlito's Way. Over time, each of us has a file in our head much like the one at NY Songlines and browses along the memory lane.

### A Compression Problem

This is a problem of coming up with a workable formulation.

Consider compressing a string s=s_1...s_n using a Ziv-Lempel algorithm (say you have compressed s_1...s_i; replace the longest prefix s_{i+1}...s_j of s_{i+1}...s_n that appears earlier in s by, say, two pointers to its occurrence followed by s_{j+1}, and continue). What is its compressed size? In practice, one can compress s and look at the resulting number of bytes, but one needs an abstract compression measure to reason about compression from a theoretical point of view. The compression measure algorithmicists are happy with is the number of phrases (the s_{i+1}...s_j's) that the Ziv-Lempel algorithm generates for the string. It is a nice combinatorial quantity to work with, and has been useful in several algorithmic results including compressed matching.
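To make the measure concrete, here is a small sketch that counts the phrases of this parse; the quadratic substring search stands in for the suffix-tree machinery one would really use:

```python
def lz_phrase_count(s):
    """Count phrases in the Ziv-Lempel-style parse described above:
    each phrase copies the longest prefix of the unparsed suffix that
    also occurs starting at an earlier position, then appends one
    fresh character."""
    n = len(s)
    i = 0
    phrases = 0
    while i < n:
        length = 0
        # extend while s[i : i+length+1] occurs starting before position i
        while i + length < n and s.find(s[i:i + length + 1], 0, i + length) != -1:
            length += 1
        i += length + 1  # copied part plus the one new character
        phrases += 1
    return phrases
```

For example, lz_phrase_count("aaaa") is 2: the parse is "a" followed by "aaa" (a copy that overlaps its source, as Ziv-Lempel parses allow).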

Similarly, the problem of interest now is to come up with a compression measure for the Burrows-Wheeler (BW) compression. It is a gem of a compression method. It permutes the string and compresses the outcome using move-to-front, run-length and possibly arithmetic coding. The analyses in theoretical computer science have studied the compression ratio of a string under the BW method in terms of the empirical entropy of the string; see, for example, Manzini. It would be nice to have a different compression measure, something simple and combinatorial.
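A toy sketch of the first two stages may help fix ideas; the `$` end-of-string sentinel and the naive rotation sort are illustrative simplifications, not how production coders do it:

```python
def bwt(s, sentinel="$"):
    """Burrows-Wheeler transform: append a sentinel, sort all
    rotations of the string, and read off the last column."""
    t = s + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rotations)

def move_to_front(s):
    """Move-to-front coding: repeated characters in the BWT output
    turn into small numbers (runs become runs of zeros), which
    run-length and arithmetic coding then compress well."""
    table = sorted(set(s))
    out = []
    for ch in s:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))
    return out
```

For instance, bwt("banana") is "annb$aa", whose repeated characters surface as zeros in the move-to-front code [1, 3, 0, 3, 3, 3, 0].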

## Wednesday, March 15, 2006

### A Scheduling Problem

Problem: Jobs arrive online (arrival time a_i, processing time p_i known at arrival) and have to be scheduled with preemption on one machine. The stretch of a job is s_i = (c_i - a_i)/p_i where c_i is the completion time of job i. Design an online algorithm to minimize the max-stretch, i.e., max_i s_i.
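As a concrete baseline (not one of the max-stretch algorithms themselves), one can simulate preemptive SRPT on one machine and measure the max stretch it achieves on an instance; a sketch:

```python
import heapq

def srpt_max_stretch(jobs):
    """Simulate shortest-remaining-processing-time (SRPT) with
    preemption on one machine and return max_i (c_i - a_i) / p_i.

    jobs: list of (arrival a_i, processing p_i) pairs."""
    events = sorted(jobs)              # arrivals in time order
    heap = []                          # (remaining, arrival, processing)
    t = 0.0
    max_stretch = 0.0
    k, n = 0, len(events)
    while k < n or heap:
        if not heap:                   # machine idle: jump to next arrival
            t = max(t, events[k][0])
        while k < n and events[k][0] <= t:   # admit everything arrived by t
            a, p = events[k]
            heapq.heappush(heap, (p, a, p))
            k += 1
        rem, a, p = heapq.heappop(heap)
        # run until completion or the next arrival, whichever is first
        horizon = events[k][0] if k < n else float("inf")
        run = min(rem, horizon - t)
        t += run
        if run < rem:
            heapq.heappush(heap, (rem - run, a, p))  # preempted
        else:
            max_stretch = max(max_stretch, (t - a) / p)
    return max_stretch
```

On [(0, 4), (1, 1)], SRPT preempts the long job for the short one: the short job's stretch is 1 and the long job's is 5/4.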

What is known: Say d= (max_i p_i)/(min_i p_i). If there are three or more job sizes, the lower bound on the competitive ratio is roughly d^{1/3} and the best known upper bound is roughly d^{1/2}.

Why is it interesting: Stretch is a natural measure. It seems one needs a new idea, and that may yield a nice new scheduling algorithm useful to test in practice. The related sum-stretch optimization is nearly solved: shortest-remaining-processing-time (SRPT) does pretty well for it, and there are even results for minimizing weighted flow time, \sum_i w_i(c_i - a_i), for arbitrary weights by Chekuri, Khanna and Zhu, and later by others.

Social commentary: I was reminded of this problem because Cliff Stein and I were discussing work some time ago, and the discussion was whether the pursuit of stretch optimization had been worthwhile for the scheduling algorithms community. That SRPT is a near-optimizer for sum-stretch is interesting, but it says something about SRPT that has been known for a long time. Did we get any new algorithms, insights or hard challenges? This is the kind of discussion researchers have, reflecting on the past to see, for example, whether this (roughly) 10-year-old research direction has been impactful. While the outcome is unclear, there is a ray of hope: the results on minimizing max-stretch have been encouraging. The known algorithms do give new ways to schedule, have found applications in web servers, file transfers over the net and systems, and there is a technically challenging open problem (stated above).

## Sunday, March 12, 2006

### Geometry and Anecdote

In the Putnam or Math Olympiad exams, historically, the problems on geometry see the worst performance from the students. I am not an enthusiast of geometric problems in general, but in homage to Suresh's Geomblog, I pay attention to them sometimes. Here is a partial anecdote I heard from Bill Steiger, and I would welcome any pointers. In one of the Antarctic expeditions (Shackleton's? Scott's?), the crew of the ice-wrecked ship had to choose a few things to take with them when they abandoned the ship. Apparently, one of the items a crew member carried with him was a volume of the Encyclopaedia Britannica, and it was the volume on Geometry. Touching. Let Algebra and Trigonometry be drowned.

### Modern dance

I watched the Mark Morris Dance Group perform at BAM. Usually I like watching contemporary dance in a small space where I can not only see the definition of body, shape and muscles of the dancers, but also hear the thump of their bare feet on the wooden floors, their silenced breathing, the swirl of their sweat, the toss of their hair and the minute rustles of their clinging clothes. The 10 or so other audience members experience it too, and believe me, in New York City the audience is equally theatrical and notices everything. The MMDG's performance was in a large opera hall at BAM and consequently seemed so much more sanitized. It was still exciting. I got a sense of the weird geometry of their motion in groups, with overlapping spirals that shrank, collapsed, regrew and merged.

### Streaming exercise

Here is an (HW?) exercise in streaming algorithm design.

Given a stream of items a_1,...,a_n, a_i \in [1,u], find the average of items below the median or the median of items below the average.

There is previous work on finding the median (or more generally, the quantiles) and the kth frequency moments \sum_i f_i^k, where f_i is the frequency of i (k=1 is the average). The exercise here is an example of cascaded aggregation: computing the kth frequency moment following a quantile computation.

As is standard in streaming algorithms, one has to use space and per-item update time of polylog(n,u) and obtain reasonable approximation.
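An exact offline reference helps pin down the target quantity. This sketch takes "below the median" to mean strictly below the lower median (one reasonable reading), and of course uses linear space, which a streaming algorithm may not:

```python
def below_median_average(stream):
    """Exact, linear-space answer to the first question: the average
    of the items strictly below the (lower) median of the stream."""
    xs = sorted(stream)
    med = xs[(len(xs) - 1) // 2]          # lower median
    low = [x for x in xs if x < med]
    return sum(low) / len(low) if low else med
```

A streaming version could, for instance, approximate the median with a quantile sketch in one pass and average the items below it in a second; doing the cascaded aggregation in a single pass with polylog space is the point of the exercise.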

## Sunday, March 05, 2006

### PODS and Streams

ACM Conf on Principles of Database Systems (PODS) has always had a healthy number of papers from the FOCS/STOC/SODA researchers (living up to the "principles" part of PODS?). True this time as well. See accepted papers at PODS 2006. Also, one is beginning to see more papers from Google (D. Sivakumar, Gagan Aggarwal and An Zhu have papers) in PODS.

There are a bunch of streaming/sublinear space algorithms papers:

- Approximate Quantiles and the Order of the Stream, by Sudipto Guha and Andrew McGregor (UPenn)
- Counting Triangles in Data Streams, by Luciana Buriol (Universidade Federal de Santa Maria), Gereon Frahling (University of Paderborn), Stefano Leonardi and Alberto Marchetti-Spaccamela (University of Rome La Sapienza)
- Randomized Computations on Large Data Sets: Tight Lower Bounds, by Martin Grohe, Andre Hernich and Nicole Schweikardt (Humboldt-University Berlin)
- Deterministic k-set structure, by Sumit Ganguly and Anirban Majumder (IIT Kanpur)

and a few more. So, streaming thrives in PODS and elsewhere.

### Blue

It is a bright blue morning. BAM hosts the weekend of the BAMkids film festival and at 2:15 will screen Catfish Blues by Hoel Caouissin. It is a story of someone who invents songs while he fishes and dreams of being a blues musician, eventually becoming, after a blues-like life odyssey, the legendary Catfish Blues. Catfish blues, the song, has roots in the animal-like characterization of blacks, in fishing and its imagery. Here are two fun versions.

“I wished I was a catfish, swimming down in the sea;

I ‘d have some good woman, fishing after me.”

“I wants to make this one right.

This is the (best) one I got.”

Then he hung his head an’ went to singin’ an’ cryin’:

“Now, I wished that I was a bullfrog, swimmin’ in that deep blue sea.

Lord, I would have all these good lookin’ women, now, now, now, fishin after me.

Fishin’ after (guitar completes line)…

I mean after …..

Sure nuff, after me”.

## Wednesday, March 01, 2006

### Group Testing Problems (One known, One less known?)

I recently gave a talk at Yale (thanks to Dan Spielman for hosting; harmless exposé: Ravi Kannan is a coffee aficionado) on Compressed Sensing. In the proof outline, I described using two different parallel group testing procedures via k-selectors, one for filtering and the other for verification. I realized I do a bad job of talking about group testing problems in general: I find it easy to go to a coin-weighing puzzle and connect with the audience; but one has to map the current problem to group testing, which typically involves technical details, and then I drop the ball transitioning from there to pointing out the novel group testing variant I need for my talk. IOW, I make all group testing problems look alike! While I work on that, here are two coin-weighing puzzles, one a classic and the other sort of not well known.

PARTY FAVORITE: Given 12 identical-looking coins, one of which is defective (it may be heavy or light with respect to the good coins), use at most 3 pan-balance weighings (i.e., only comparisons of the weights of sets of coins) to find the defective coin. Show that with 14 coins, one needs more than 3 weighings. Show that with 13 coins, one needs more than 3 weighings, but 3 weighings will suffice if one has an *extra* good coin.

All these claims can be generalized to roughly (3^w)/2 coins and w weighings.
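The generalization has a clean vector view that can be machine-checked: give each coin a nonzero vector in {-1,0,1}^w (one per {v, -v} pair), with signs flipped so that every weighing puts equally many coins on each pan; the sequence of outcomes then equals the defective coin's vector (if heavy) or its negation (if light). A brute-force sketch for w = 3, not the classic hand construction:

```python
from itertools import product

def build_scheme():
    """Search for a non-adaptive 3-weighing scheme for 12 coins.
    Coin i sits on the left pan of weighing k if scheme[i][k] == 1,
    on the right pan if -1, and aside if 0."""
    vecs = [v for v in product((-1, 0, 1), repeat=3) if any(v)]
    pairs, seen = [], set()
    for v in vecs:                      # one representative per {v, -v} pair
        if v not in seen:
            seen.add(v)
            seen.add(tuple(-x for x in v))
            pairs.append(v)
    for drop in range(13):              # 13 pairs exist; keep 12 of them
        cand = pairs[:drop] + pairs[drop + 1:]
        for signs in product((1, -1), repeat=12):
            scheme = [tuple(s * x for x in v) for s, v in zip(signs, cand)]
            # balanced: each weighing has as many left coins as right coins
            if all(sum(v[k] for v in scheme) == 0 for k in range(3)):
                return scheme
    return None

def decode(scheme, outcomes):
    """outcomes[k]: +1 if the left pan of weighing k was heavy, -1 if
    light, 0 if balanced.  Returns (coin index, 'heavy' or 'light')."""
    o = tuple(outcomes)
    for i, v in enumerate(scheme):
        if v == o:
            return i, "heavy"
        if tuple(-x for x in v) == o:
            return i, "light"
```

Since all 24 vectors ±v are distinct and nonzero, every (coin, heavy/light) state yields a distinct outcome triple; this is the roughly (3^w)/2 count above at w = 3.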

LESS KNOWN: We have three coins (a, b, c) with respective partners (a', b', c'). Each coin is Heavy or Light; if a coin is Heavy, its partner is Light, and vice versa. All Heavy coins weigh the same, and likewise all Light coins. Using at most 2 pan-balance weighings, find the state of each of the coins.

Extra credit: It is impossible to solve in 2 weighings if none of the coins is provided with a partner; an adaptive procedure will work if any one coin has a partner; a non-adaptive procedure will work if any two coins have partners; the puzzle as stated has redundant information because all three coins have partners.

RESEARCH: Generalize the puzzle above to n coins and w weighings, and get precise bounds.
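The non-adaptive claim in the extra credit is easy to machine-check: enumerate pairs of fixed weighings over the six coins and test whether the pair of outcome signs separates all 8 states. A brute-force sketch (the encoding of coins as indices 0..5 is mine):

```python
from itertools import combinations, product

def solve_partner_puzzle():
    """Find two fixed pan-balance weighings that determine the states
    of coins a, b, c (indices 0-2) given their partners a', b', c'
    (indices 3-5).  Heavy = 1, Light = 0; a partner always has the
    opposite state, so 8 states in all."""
    states = list(product((0, 1), repeat=3))       # states of a, b, c

    def weight(state, coin):
        return state[coin] if coin < 3 else 1 - state[coin - 3]

    # all weighings with equally many coins on each (disjoint) pan
    pans = []
    for size in (1, 2, 3):
        for left in combinations(range(6), size):
            rest = [c for c in range(6) if c not in left]
            for right in combinations(rest, size):
                pans.append((left, right))

    def outcome(state, pan):
        l = sum(weight(state, c) for c in pan[0])
        r = sum(weight(state, c) for c in pan[1])
        return (l > r) - (l < r)                   # -1, 0, or +1

    for w1, w2 in combinations(pans, 2):
        if len({(outcome(s, w1), outcome(s, w2)) for s in states}) == 8:
            return w1, w2          # the two signatures separate all states
    return None
```

One valid scheme, verifiable by hand over the 8 states: weigh {a, b} against {c, a'}, then {b, c} against {a, b'}; the search above will return some such pair.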
