Wednesday, October 12, 2011

DryadLINQ and FlumeJava for easy, flexible large data processing


MapReduce has been the de facto standard for data-parallel processing on datacenter clusters. Its simple model is easy to understand and fits a surprising number of common workloads. However, as datacenter computing has matured, people have found that some workloads are awkward (or at least not straightforward) to run on the MR framework. Some operations, such as "Join", turn out to be very tricky to implement on top of MR. For non-trivial, real-world computations, MR programs are no longer small and clean. Since many analytics jobs are expressed as a series of MR stages, the disk I/O between stages becomes very expensive. Also, while MR effectively hides its internal complexity from programmers, some crucial applications need to be thoroughly tuned for optimal performance, which that same abstraction makes difficult.

Many efforts have been made to remedy these issues. One approach is to build a higher-level, SQL-like query framework on top of MapReduce (e.g., Pig and Hive). Another approach is to build a more general programming interface than MR. Dryad and FlumeJava are good examples of this approach. Here "more general" means they offer a richer set of primitive operations, enough to express arbitrary data-parallel processing. For example, traditional MR can be implemented (assembled) on both frameworks from their primitives.
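
To make the "assembled from primitives" point concrete, here is a minimal in-memory sketch of a word count built from three small operations. The class MiniDataflow and its methods are hypothetical stand-ins written for this post; the names parallelDo, groupByKey and combineValues are borrowed loosely from the FlumeJava paper, but this is not the FlumeJava or Dryad API, and nothing here runs in parallel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical, single-machine stand-ins for dataflow primitives.
// NOT the FlumeJava or Dryad API; only an illustration of composition.
class MiniDataflow {

    // Primitive 1: apply fn to every element; each call may emit zero or more outputs.
    static <T, R> List<R> parallelDo(List<T> input, Function<T, List<R>> fn) {
        List<R> out = new ArrayList<>();
        for (T item : input) {
            out.addAll(fn.apply(item));
        }
        return out;
    }

    // Primitive 2: gather all values that share a key (the "shuffle" step).
    static <K extends Comparable<K>, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> groups = new TreeMap<>();
        for (Map.Entry<K, V> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // Primitive 3: fold each group's values into a single value.
    static <K extends Comparable<K>, V> Map<K, V> combineValues(
            Map<K, List<V>> groups, BinaryOperator<V> combiner) {
        Map<K, V> out = new TreeMap<>();
        for (Map.Entry<K, List<V>> e : groups.entrySet()) {
            List<V> values = e.getValue(); // never empty: groupByKey adds at least one value
            V acc = values.get(0);
            for (int i = 1; i < values.size(); i++) {
                acc = combiner.apply(acc, values.get(i));
            }
            out.put(e.getKey(), acc);
        }
        return out;
    }

    // A classic MapReduce job expressed as map -> shuffle -> reduce.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = parallelDo(lines, line -> {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(Map.entry(word, 1));
                }
            }
            return out;
        });
        return combineValues(groupByKey(mapped), Integer::sum);
    }

    public static void main(String[] args) {
        // prints {a=2, b=2, c=1}
        System.out.println(wordCount(List.of("a b a", "b c")));
    }
}
```

The point of the sketch is only that the "map" and "reduce" of MR fall out as one particular arrangement of smaller pieces; a pipeline with several groupings, or none, would use the same pieces in a different order.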

DryadLINQ is a friendlier but less flexible layer on top of Dryad. LINQ is a SQL-like query syntax integrated into .NET languages that looks like:

var results =  from c in SomeCollection
               where c.SomeProperty < someValue * 2
               select new {c.SomeProperty, c.OtherProperty};

and DryadLINQ seamlessly parallelizes LINQ queries on top of Dryad. Since raw Dryad programming can be very painful even for experienced programmers, this kind of abstraction is priceless. On the other hand, it is not clear whether DryadLINQ will be influential, since the approach relies on specialized language support and is very closed. I haven't heard of any open-source DryadLINQ clone, and it is still not available even in Azure.

FlumeJava takes a library approach, so it does not need any modification to Java. The implementation details of FlumeJava are not clear at all, partly because the work was published at PLDI, a programming-languages venue rather than a systems one. The only mention of the implementation is "We have implemented the FlumeJava library, optimizer, and executor, building on MapReduce and other lower-level services available at Google", which is quite disappointing.
