Monday, October 10, 2011

Dremel: Interactive Analysis of Web-Scale Datasets


Dremel is Google's interactive query system for highly structured data. At Google, structured data is mostly stored in a nested form, somewhat similar to YAML. This kind of data is hard to handle efficiently with traditional row-oriented databases. Interactive query processing under a row-oriented model is virtually impossible because disk I/O becomes the bottleneck: most queries touch only a few columns, yet the record-oriented representation forces the system to read entire records.
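
To make the I/O argument concrete, here is a minimal sketch that counts how many values are touched when summing a single field under each layout. The records, field names, and helper functions are hypothetical and purely illustrative; real systems read disk blocks rather than Python objects.

```python
# Hypothetical flat records; the field names are illustrative only.
records = [
    {"doc_id": 1, "url": "http://a", "size": 10},
    {"doc_id": 2, "url": "http://b", "size": 20},
    {"doc_id": 3, "url": "http://c", "size": 30},
]


def sum_row_oriented(rows, field):
    """Sum one field when each record is stored (and read) as a whole."""
    values_read = 0
    total = 0
    for row in rows:
        values_read += len(row)  # every field of the record is fetched
        total += row[field]
    return total, values_read


def to_columns(rows):
    """Pivot the records into one list of values per field."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_column_oriented(columns, field):
    """Sum one field when only that field's column is read."""
    column = columns[field]
    return sum(column), len(column)


row_total, row_reads = sum_row_oriented(records, "size")
col_total, col_reads = sum_column_oriented(to_columns(records), "size")
print(row_total, row_reads)
print(col_total, col_reads)
```

Both functions compute the same sum, but the row-oriented version touches every field of every record, while the column-oriented version touches only the values of the requested field.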

Dremel addresses this problem with a columnar data layout for nested data. This layout enables fast column-oriented aggregation, which is common in Google's interactive queries. Dremel further reduces its disk footprint by compressing field values and compactly encoding repetition and definition levels. Combined with multi-level execution trees, Dremel can answer interactive queries over billions of records, mostly in under 10 seconds. However, the paper does not thoroughly explore the query execution process, leaving many implementation details open.
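
As a rough illustration of repetition and definition levels, the sketch below stripes a single nested path, name.language.code, into a column of (value, repetition level, definition level) triples. It assumes a simplified schema in which `name` and `language` are repeated groups and `code` is always present inside a `language`; the function `stripe_codes` and the sample records are hypothetical and hard-coded for this one path, not Dremel's general algorithm.

```python
def stripe_codes(records):
    """Return (value, repetition_level, definition_level) triples
    for the path name.language.code.

    Assumed schema: `name` and `language` are repeated groups and
    `code` is required inside `language`, so both levels max out at 2.
    """
    column = []
    for record in records:
        names = record.get("name", [])
        if not names:
            # Nothing on the path is defined; a new record starts (r = 0).
            column.append((None, 0, 0))
            continue
        for i, name in enumerate(names):
            # The first name starts a new record; later ones repeat at level 1.
            r_name = 0 if i == 0 else 1
            languages = name.get("language", [])
            if not languages:
                # Only `name` is defined, so the definition level is 1.
                column.append((None, r_name, 1))
                continue
            for j, language in enumerate(languages):
                # A later language inside the same name repeats at level 2.
                r = r_name if j == 0 else 2
                column.append((language["code"], r, 2))
    return column


# Hypothetical sample records for the assumed schema.
r1 = {
    "name": [
        {"language": [{"code": "en-us"}, {"code": "en"}]},
        {},
        {"language": [{"code": "en-gb"}]},
    ]
}
r2 = {"name": [{}]}

for triple in stripe_codes([r1, r2]):
    print(triple)
```

The repetition level records at which repeated field the value starts a new entry, and the definition level records how much of the path actually exists, which is what lets missing values be stored without losing the record structure.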
