Thursday, October 25, 2007

Sampling Research

To theoretical computer scientists, random sampling comes naturally. We think it is simple, elegant even, and useful. In practice, random sampling presents problems. If you walk into a cafe in NY, 90%+ of the people have Macs (and look stylish); walk into an airport, and 90%+ of the people use Windows machines (and look much less chic). Estimating the market share of these products from either of these "samples", alone or jointly, is doable but difficult. Worse, I have been working for the past year or so on putting sampling primitives into a database engine. Different users need different kinds of samples (e.g., biased, distinct, fixed size, persistent, over different attributes, with or without replacement); no single sample meets all requirements, and maintaining all these different samples is a systems nightmare. Also, users want the same answer if they rerun a query! Take all this into account, and putting sampling primitives into a database becomes a research problem.
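To make the "same answer on rerun" requirement concrete, here is a minimal sketch of one common trick: decide membership in the sample by hashing each row's key rather than by drawing fresh random numbers. This is only an illustration, not the design of the engine mentioned above; the names `in_sample`, `sample_rows`, and `salt`, and the toy `rows` table, are all made up for the example.

```python
import hashlib

def in_sample(key, rate, salt="sample-v1"):
    """Return True if `key` belongs to a repeatable sample of roughly `rate`.

    Hypothetical helper: the decision depends only on (salt, key), so
    rerunning a query selects exactly the same rows.
    """
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64) and keep the
    # key if it falls below the threshold that corresponds to `rate`.
    return int.from_bytes(digest[:8], "big") < int(rate * 2**64)

def sample_rows(rows, key_of, rate, salt="sample-v1"):
    """Filter `rows` down to the ones whose key hashes into the sample."""
    return [r for r in rows if in_sample(key_of(r), rate, salt)]

# Toy table, invented for the example.
rows = [{"id": i, "os": "mac" if i % 3 == 0 else "windows"} for i in range(10_000)]

first = sample_rows(rows, lambda r: r["id"], 0.1)
second = sample_rows(rows, lambda r: r["id"], 0.1)   # rerun: identical result
smaller = sample_rows(rows, lambda r: r["id"], 0.01)  # nested inside `first`
```

Two things fall out of this choice. Because the threshold only grows with `rate`, the 1% sample is a subset of the 10% sample, and hashing on an attribute value instead of a row id keeps or drops all rows sharing that value together. It does not address fixed-size or biased samples, which is part of why no single sample meets all requirements.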

1 Comment:

Anonymous said...

"putting sampling primitives into a database becomes a research problem."

More commonly known as statistics. :-)

Producing an unbiased sample is very difficult, but if you manage to get one, the levels of confidence that probability theory predicts for very small samples are amazing.

7:32 AM  
