Tuesday, September 27, 2011

MapReduce: Simplified Data Processing on Large Clusters


MapReduce is a programming model (paradigm?) for data-intensive workloads on large clusters. It takes a generalist's approach: it defines a template that (surprisingly) many workloads fit well, and programmers only write a simple program consisting of a mapper and a reducer. In this way, it effectively hides many of the hard problems in large clusters, such as load balancing, fault tolerance, locality optimization, and cluster management.
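
To make the template concrete, here is a minimal word-count sketch in Python (my own illustration, not Google's actual C++ API): the programmer supplies map and reduce, and the framework's job is to run the map over input splits, group intermediate pairs by key, and feed each group to the reduce.

from collections import defaultdict

# Mapper: emit (word, 1) for every word in an input record.
def map_fn(record):
    for word in record.split():
        yield (word, 1)

# Reducer: sum the counts for a single word.
def reduce_fn(word, counts):
    yield (word, sum(counts))

# Toy "framework": in the real system the shuffle/group step runs
# distributed over thousands of machines; here it is a single loop.
def mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    results = []
    for key, values in groups.items():
        results.extend(reduce_fn(key, values))
    return results

print(mapreduce(["the quick fox", "the lazy dog"], map_fn, reduce_fn))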

The fundamental trade-off behind the scenes is generality versus efficiency. MapReduce is clearly not the most efficient way to solve every large-data problem, and the authors' basic assumption is this: it is fine that MapReduce takes two hours at 50% utilization to solve a problem that an optimal, hand-tuned solution could finish in one hour at 100% utilization. In most cases this holds, since most of Google's target workloads are not interactive, and the low utilization is not a problem because there are plenty of other background jobs to keep the cluster busy. At the other extreme, programmers can solve a problem in a workload-specific way where efficiency is the top priority; a good example is TritonSort (NSDI this year), which outperformed Hadoop by a factor of six in per-node efficiency.

I think the locality assumption made by MapReduce ("network bandwidth is a relatively scarce resource in our computing environment") has become much less critical. Network link speed has become 10-100x faster (from 100 Mbps or 1 Gbps to 10 Gbps), while disk bandwidth has changed little, and the bisection bandwidth problem is being actively attacked. I wonder how these trends affect the framework and what new challenges they bring.
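
A rough back-of-envelope comparison (my own assumed numbers, not from the paper) shows why: once the link is 10 GbE, shipping a typical 64 MB input split over the network is far cheaper than reading it from a single local disk, whereas at 100 Mbps the network was clearly the bottleneck. Aggregate bisection bandwidth is of course a separate question, which is exactly the part being actively worked on.

# Back-of-envelope: time to move one 64 MB split (assumed, illustrative bandwidths).
SPLIT_MB = 64
disk_MBps = 100                # one commodity disk, sequential read
net_100mbps_MBps = 100 / 8     # 100 Mbps link -> 12.5 MB/s
net_10gbps_MBps = 10_000 / 8   # 10 Gbps link  -> 1250 MB/s

for name, bw in [("local disk", disk_MBps),
                 ("100 Mbps network", net_100mbps_MBps),
                 ("10 Gbps network", net_10gbps_MBps)]:
    print(f"{name:>18}: {SPLIT_MB / bw:6.2f} s per 64 MB split")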

Also, it would be very interesting to design an in-memory MapReduce framework for large clusters (Phoenix and Metis are single-machine solutions, and I am not sure whether a cluster-scale solution already exists). As Ion said, memory is becoming the disk of the 21st century, and more and more working sets fit in memory. What would the challenges be for an in-memory MapReduce?
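
For reference, the single-machine, shared-memory style of Phoenix/Metis can be sketched roughly like this (my own simplification, not their actual API): each worker builds an in-memory partial hash table of intermediates, and the tables are merged and reduced at the end. The open question above is how to keep this model once the intermediate state is spread over many machines' memories and machines can fail.

from collections import Counter
from concurrent.futures import ProcessPoolExecutor

# Per-worker map phase with a local combiner: build an in-memory
# partial table (word -> count) for one partition of the input.
def map_partition(records):
    counts = Counter()
    for record in records:
        for word in record.split():
            counts[word] += 1
    return counts

# Driver: run workers in parallel, then merge/reduce the in-memory tables.
def inmemory_wordcount(partitions):
    merged = Counter()
    with ProcessPoolExecutor() as pool:
        for partial in pool.map(map_partition, partitions):
            merged.update(partial)   # reduce = summing the partial counts
    return merged

if __name__ == "__main__":
    parts = [["the quick fox", "the lazy dog"], ["the quick dog"]]
    print(inmemory_wordcount(parts))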
