Sampling, Sketching or One Following the Other
I have a way to formalize this question as a theory problem. Consider each item and ask: should I skip this item, or should I sketch it? If this can be decided quickly (ie., o(sketching time)), and we can still guarantee approximation of desired functions, we have a win. The results I have with my coauthors Supratik Bhattacharyya, Andre Madeira and Tao Ye are that this is possible for many functions. There are two tricks:
- you have to use the norm of the skipped part to make this decision, and
- you need to estimate the norm of the skipped portion quickly ie in o(sketching time) per item. this is somewhat circular?
The approach provides a way to actually design nontrivial algorithms and prove theorems about approximations. For example, we get a factor 4+ eps approximation for estimating F_2 and get constant factor speedup over best known F_2 estimation on streams (which gets 1+eps approximation). Improvements to this will be cool. Our paper has other results, theoretical and applied. In the applied world, one of the biggest win we get is in finding heavy-hitters within contents of IP packets, where this method allows us to skip entire packets without processing them. Better understanding of this can help tremendously where people have to resort to specialized hardware to solve this problem.
Disclaimer: I have some randomized skipping ideas that gives better results than ones in this paper for some cases, but more on that later.