### Probabilistic Data Streams

There is a new model of probabilistic data streams by Jayram, Kale and Vee: Each item on the data steam is an independent probability distribution, specified say with a small pdf/cdf. Such a stream specifies a probability distribution over the space of all deterministic streams. Now, one can ask about the expected value of various aggregates on such streams and try to estimate them in polylog space and polylog time per new item, much as in standard data stream models.

This is a nice abstract model to study a bunch of problems that arise: sometimes input data stream is probabilistic, some times the data stream is probabilistic because the sensors that do the measurement introduce error and noise, some times the input stream is deterministic but we derive stream from the input by say tagging each item with a bunch of values it may possibly represent, and the result is a probabilistic stream.

A fundamental problem here is, are there any issues beyond just instantiating a bunch of deterministic streams by instantiating each item of the data stream with the given pdf? Other problems include: can we take into account some concentration property of the individual pdfs to get some smooth complexity for problems that goes from deterministic streams to probabilistic streams? is there a suitable model of smooth pdfs for say geometric data streams with interesting algorithmic problems?

This is a nice abstract model to study a bunch of problems that arise: sometimes input data stream is probabilistic, some times the data stream is probabilistic because the sensors that do the measurement introduce error and noise, some times the input stream is deterministic but we derive stream from the input by say tagging each item with a bunch of values it may possibly represent, and the result is a probabilistic stream.

A fundamental problem here is, are there any issues beyond just instantiating a bunch of deterministic streams by instantiating each item of the data stream with the given pdf? Other problems include: can we take into account some concentration property of the individual pdfs to get some smooth complexity for problems that goes from deterministic streams to probabilistic streams? is there a suitable model of smooth pdfs for say geometric data streams with interesting algorithmic problems?

## 3 Comments:

We studied smooth varying pdfs in an SSDBM paper. We termed them "drifting distributions". This reflects the real case scenario in which the process that produces the inputs is changing or evolving over time, sometimes dramatically thus altering the underlying distribution. The precision of our predictions over such model is inversely proportional to the rate of change in the underlying process as measured by the leading coefficient of the Lipschitz condition.

Here's the ref:

Finding Frequent Items in Sliding Windows with Multinomially-Distributed Item Frequencies, Lukasz Golab, David DeHaan, A. López-Ortiz and Erik Demaine. In Proceedings of 16th International Conference on Scientific and Statistical Database Management (SSDBM), IEEE Computer Society, pp. 425-426, 2004.

http://www.cs.uwaterloo.ca/~alopez-o/cspap/ssdbm/

Alex

Good information, I think the best best part is to know that the sensors that do the measurement introduce error and noise, that's what we really should know.m10m

How to control these data streams effectively? honestly is very complicate for me to manipulate this stuff

Post a Comment

<< Home