Saturday, July 26, 2008

Don't Cares

Given text string t[1...n] and pattern string p[1...m] over alphabet of size k, the string matching problem is to find all i's such that t[i]...t[i+m-1]=p[1]...p[m], ie., \AND_j (t[i+j-1] matches p[j]). Fischer and Paterson in '74 observed that solving this problem is "like" computing convolution where given vector a[1...n] and b[1...m], we compute c[i]= \sum_j (a[i+j-1] \prod b[j] ); this can be done in O(nlog m) time.

This approach solved string matching where some of the positions were don't cares" that matched any symbol, using 2\log k convolutions, therefore in time O(n log k log m) altogether. This is something of a cult of a problem, with a lot of attention. What follows is my tale of embarassment in research.

In my thesis work in '94, I observed using some relationship to Sperner's Lemma that log k + 1/2 loglog k + O(1) convolutions would suffice. Piotr Indyk found a sophisticated way to solve the entire problem in O(nlog m) randomized time. Adam Kalai blew me away with a page long note in SODA with a randomized reduction to O(1) convolutions. Then, Cole and Hariharan found nifty encoding techniques, some using complex and rational arithmetic, to do it in O(1) convolutions in the worst case.

When I was in Bristol, Raphael described his '06 solution to me in one mathematical sentence: \AND_j (t[i+j-1] matches p[j]) follows from \sum_j (t[i+j-1]-p[j])^2 X_{i+j-1} Y_j, where X_{i+j-1} is the indicator variable for if t[i+j-1] is a don't care (and likewise, Y for p). Now expand that expression out and in 3 convolutions, you are done. I came across something called the Bridge of Sighs in Venice, that moment with Raphael was it.