Monday, January 21, 2008

MapReduce Again

MapReduce is a parallel programming environment. It is used successfully at Google, and perhaps elsewhere. This has generated some inspiration, some frustration and, alas, some angst.
  • First some inspiration. The XLDB meeting at Stanford got scientists with *really* large-scale data analysis problems to meet with academic researchers, corporate customers and vendors. A somewhat optimistic view there was that these applications needed MapReduce. In principle, a parallel shared-nothing programming system will be useful, but it seems to me that high-energy physics and astronomy need sophisticated analyses, different from the kind of analyses at Google for which the design of MapReduce is optimized.
  • Next the frustration. When an engineer asks me how to find the shortest edit distance between two strings on a single-processor machine, I can immediately point them to Dynamic Programming and a classic Algorithms textbook. As a theory+algorithms researcher, I am frustrated when an engineer asks me how to solve a graph problem on MapReduce and I cannot immediately point to an upper/lower bound or a usable theory. See initial theory here.
  • Finally the angst. The database research community tends to be focused more on concepts and abstract solutions, and less on systems. A recent blog article describes some of their angst at not seeing the basic elements of a relational database in MapReduce. This angst is misplaced, as comments and articles point out.
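For contrast with the MapReduce case, here is roughly what the single-processor answer to the edit distance question looks like. This is a minimal sketch of the textbook dynamic program, assuming unit costs for insertion, deletion and substitution (Levenshtein distance); the function name edit_distance is mine, chosen for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, assuming unit edit costs."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b; row 0 is all insertions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Turning the first i characters of a into "" takes i deletions.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            substitution_cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,                      # delete ca
                curr[j - 1] + 1,                  # insert cb
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

The table has (len(a)+1) x (len(b)+1) entries and each takes constant time, so the running time is O(len(a) * len(b)), and keeping only two rows brings the space down to O(len(b)). That is the kind of clean, quotable bound I would like to be able to hand over for graph problems on MapReduce.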
MapReduce is a working system that hands-on programmers find effective. More ideas from parallel computing, algorithms, relational databases or wherever else, if they make it more powerful, more useful and more amenable to being analyzed and understood, will be good.