Friday, December 22, 2006

Sampling, Sketching or One Following the Other

I was at the Workshop on Algorithms for Data Streams at IIT Kanpur the past few days. A question that was posed: we can use sketches (projection of the input signal along random directions) to approximate different functions on streams, but in applications such as IP traffic analysis, one can not afford to do even the work for sketching on each new item; so, we can only subsample from the stream first before applying the sketching solutions. How well does this procedure work?

I have a way to formalize this question as a theory problem. Consider each item and ask: should I skip this item, or should I sketch it? If this can be decided quickly (ie., o(sketching time)), and we can still guarantee approximation of desired functions, we have a win. The results I have with my coauthors Supratik Bhattacharyya, Andre Madeira and Tao Ye are that this is possible for many functions. There are two tricks:

• you have to use the norm of the skipped part to make this decision, and
• you need to estimate the norm of the skipped portion quickly ie in o(sketching time) per item. this is somewhat circular?

The approach provides a way to actually design nontrivial algorithms and prove theorems about approximations. For example, we get a factor 4+ eps approximation for estimating F_2 and get constant factor speedup over best known F_2 estimation on streams (which gets 1+eps approximation). Improvements to this will be cool. Our paper has other results, theoretical and applied. In the applied world, one of the biggest win we get is in finding heavy-hitters within contents of IP packets, where this method allows us to skip entire packets without processing them. Better understanding of this can help tremendously where people have to resort to specialized hardware to solve this problem.

Disclaimer: I have some randomized skipping ideas that gives better results than ones in this paper for some cases, but more on that later.