## Saturday, March 14, 2009

### Puzzle via blogs

Researchers in histogram construction (yes, there are several) will recognize the following problem: given an array A of size N, for each interval [l,r] in [1,N], determine the median of items in the interval A[l],A[l+1],....,A[r]. How quickly can one solve this problem? I had a few playful moments a couple of weeks ago and wrote this note, but with more play time, one should have a simpler, O(N^2) time algorithm. Further, one should be able to do this using o(N^2) space, not counting the output size. No approximations please. Enjoy!

Labels:

Anonymous said...

Form the sorted sequence of items in the array, as a doubly linked list, and create an array of pointers from positions in A to the nodes in this list.
Find the median item in the list (the median of A[1..n]).

Now do some sort of depth first traversal of a DAG in which each node represents a pair (i,j), the root node is (1,n), and each node (i,j) points to two nodes (i+1,j) and (i,j-1). (You don't need to construct and store the DAG explicitly, so this part can all be done in O(n^2) time and O(n) space, I just find it easier to think about doing it this way). When doing a forward traversal from (i,j) to (i+1,j) or (i,j+1), splice a single element out of the doubly linked list and use the value of that removed element to determine whether the median of the smaller subinterval is in the same place, one step to the left, or one step to the right in the doubly linked list. Maintain pointers to the neighbors of the spliced-out node on a stack so that, when returning to (i,j), the removed element can be added back into the doubly-linked list.

Time O(n^2), space O(n). Am I missing something?

12:09 PM
Anonymous said...

That should absolutely work. Faith Fich mentioned a similar outline (the key observation being the median moves at most one position left or right as the interval boundary moves one position). I did not realize this will just work out with linear space. Cool!

Is there an interesting question where the median can be replaced by something else, like say the convex hull (area) of each interval? (sum/min/max, may be also mode, dont seem interesting).

-- metoo

1:20 PM
Anonymous said...

I thought she was going by Faith Ellen these days?

1:26 PM
Anonymous said...

My mistake, typed off the note too quickly (also did not do any sanity check on the solution, just looked at the overall flow), and of course made a mistake. Faith Ellen. http://www.cs.toronto.edu/~faith/

-- metoo

1:48 PM
Anonymous said...

A bit off topic, but shouldn't it be possible to turn a balanced wavelet tree into an O(n \log n)-bit data structure that answers range-median queries in polylog(n) time? At each node of the wavelet tree, we can use a couple of rank queries to determine how many elements in the given interval are less than or equal to the largest number in the left subtree, and how many greater; that should tell us whether the interval's median is in the left or right subtree, no? (Sorry to disturb if this a) doesn't work, b) doesn't matter or c) has been done.)

Cheers,
Travis

1:15 AM
Anonymous said...

Hi Travis,

Check out:
http://www.cs.rutgers.edu/~muthu/medians.pdf

http://arxiv.org/abs/0901.1761

-- metoo

2:50 AM
Anonymous said...

Will do! Just thinking, though, that it seems possible to use a wavelet tree to answer general range-selection queries.

For example, say A[1..n] = [6, 3, 8, 5, 2, 1, 9, 4, 7, 0] and we want the fourth smallest element in A[3..7]. Suppose the leaves in the left subtree of the wavelet tree are labelled 0--4 and the leaves in the right subtree are labelled 5--9.

We start at the root with the query l = 3, r = 7 and k = 4. The root stores the binary string B[1..n] = 1011001010. We find the rank l_0 = 2 of the first 0 in B[l..r], the rank r_0 = 3 of the last 0 in B[l..r], the rank l_1 = 2 of the first 1 in B[l..r], and the rank r_1 = 4 of the last 1 in B[l..r].

If there were no 0s in B[l..r], then we would recurse on the right subtree immediately with the query l_1 = 2, r_1 = 4 and k' = k = 4. If r_0 - l_0 + 1 were greater than or equal to k, then we would recurse on the left subtree with the query l_0 = 2, r_0 = 3 and k' = k = 4.

Since there are some 0s and r_0 - l_0 + 1 < k, we recurse on the right subtree with the query l_1 = 2, r_1 = 4 and k' = k - r_0 - l_0 + 1 = 2. That is, we're now looking for the second smallest element in [8, 5, 9].

Does this sound reasonable?

Thanks,
Travis

9:54 AM