Friday, April 24, 2015

Streaming H

Items  I_i s (+ve integers,  I_i \in [1,U]) arrive one after another, and for any n, after having seen n items, you have to return an approximation to H_n which is the H-index of the first n items. I_1,...,I_n  (The H-index of a set of number is largest k such that there are at least k items each >= k). A student in Rutgers posed this problem.

This problem has a simple worstcase solution: sing log_{1+\eps} U space, get  (1+\eps) one-sided error approximation by maintaining an exponential histogram on the domain and counting the number of items in each bucket. (The same solution works if items arrive and depart as well. ) Now, U can be replace by max_{i <=n} I_i after n items, so it is input data-sensitive. If items are drawn from domain [L,R], you can splice out the lower buckets and the same solution works with about (log R-log L)  buckets (better than log (R-L)).  None of this is particularly interesting theory wise.

Is there an interesting version of this problem if the items are drawn iid from unknown distribution repeatedly? Curious people should see the connection to the Prophet inequality and frugal streaming results.