Sunday, December 11, 2011

Bigtable: A Distributed Storage System for Structured Data

Bigtable is Google's in-house solution for petabyte-scale data storage that scales to thousands of commodity servers. Despite its title, Bigtable's data model is not very structured (or sophisticated). Basically, it provides a sparse table of uninterpreted arrays of bytes, where each data item is mapped as:

(table, row, column, time) --> string
The column and timestamp can be omitted to widen the read, e.g., to iterate over an entire row.
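This mapping can be sketched as a nested dictionary; the following is a minimal illustration of the logical model, not Google's actual API (the row key and column name echo the webtable example from the paper):

```python
# Sketch of Bigtable's logical data model: each cell is addressed by
# (row, column, timestamp) and holds an uninterpreted byte string.
table = {}  # {row_key: {column: {timestamp: bytes}}}

def put(row, column, timestamp, value):
    table.setdefault(row, {}).setdefault(column, {})[timestamp] = value

def get(row, column=None, timestamp=None):
    """Omitting column/timestamp widens the read, as in the model above."""
    cols = table.get(row, {})
    if column is None:
        return cols                      # the whole row, for iteration
    versions = cols.get(column, {})
    if timestamp is None:
        return versions                  # every version of the cell
    return versions.get(timestamp)       # one specific version

put("com.cnn.www", "contents:", 1, b"<html>...v1")
put("com.cnn.www", "contents:", 2, b"<html>...v2")
print(get("com.cnn.www", "contents:", 2))  # b'<html>...v2'
```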

Bigtable adds some seasoning to its simple data model for some cool features. The concept of column families gives better management of data, allows access control, and serves as the unit of compression. Timestamping (each cell in a Bigtable can contain multiple versions of the same data) can be used for data history or concurrency control, depending on the application's needs.
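Per-cell versioning can be sketched for a single cell as follows; this is a hypothetical illustration of the paper's behavior, where reads default to the newest version and a per-column-family setting garbage-collects all but the last n versions:

```python
# One (row, column) cell holding multiple timestamped versions.
cell = {}  # {timestamp: value}

def write(ts, value):
    cell[ts] = value

def latest():
    # Reads default to the most recent version.
    return cell[max(cell)]

def gc_keep_last(n):
    # Per-column-family garbage collection: keep only the newest n versions.
    for ts in sorted(cell)[:-n]:
        del cell[ts]

write(1, b"v1"); write(2, b"v2"); write(3, b"v3")
print(latest())      # b'v3'
gc_keep_last(2)
print(sorted(cell))  # [2, 3]
```

(The paper also allows garbage collection by age, e.g., keeping only versions written within the last seven days; the same structure applies.)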

The internal implementation of Bigtable relies heavily on Google's other in-house service components. Chubby is used to handle many issues common to distributed storage systems, such as master election, bootstrapping, and access control management. GFS is used as Bigtable's underlying storage layer.

Maybe I am too cynical, but I would like to ask a (meta) question: Why did Google publish this paper at an academic conference? Why is this valuable in terms of research? How is this meaningful outside Google? What lessons can a reader learn from this work? I am not saying that this is not a good paper; I am just wondering, as a newbie systems researcher, what the difference is between good engineering work and good research (two notions that are very vague in this field).
