MapReduce Prefix Sums
Input is a set of items (i, v_i), for 1=1, ..., n, distributed in arbitrary order among m machines with memory M each. Output should be (i, s_i=\sum_{j\leq i} v_j), for all i, again distributed in arbitrary order on the machines. This is the prefix sums problem for massive, unordered, distributed (MUD) data. We need to solve this using MapReduce. I can design an algorithm that will take 3 phases of Map+Reduce with suitable assumptions about n vs M, maybe even 2. Can someone show this can not be done in one phase of Map+Reduce?
Labels: aggregator
5 Comments:
Why don't you post the question on http://cstheory.stackexchange.com
Hi Muthu! I guess the model is not 100% clear in my head.
Do you care about the work? How does your algorithm run? Does it do O(n^2) work?
Another model clarification question:
Can you do summation, i.e., compute \sum_{i=1,..,n} v_i, in a single map-reduce phase? (If so, how?)
Each reduce can compute at most M (partial) prefix sums in a round. Then if M<n, you need at least two rounds. In general, at least (\log_M n) rounds?
Dear Mihai, Hagit and Francesco,
Apologies for the late response. Someone (u know who) told me "there is no bigger impediment to scientific progress than the stroller in the hallway."
* What I had in mind was the trivial algorithm. Each machine j works on jn/M + 1 ... (j+1)n/M items and computes required partial prefix sums. Then all prefix sums for the n/M sized problem is computed (I assumed, this will fit into mem of one machine. M like 1000's, internal mem like xGB, n like xTB) and then finally finish up for each piece. So, total "work" is O(n).
* This does not take advantage of fun things Reduce can do. For example, we can assume \sum_{i=1,..,n} v_i can be done in single MR phase. Can other computations be piggybacked on the underlying tree computation?
Part of the problem is to fix some capability for MR.
-- Metoo
Post a Comment
<< Home