Monday, December 12, 2011

Erlang - A survey of the language and its industrial applications

Erlang is a parallel functional programming language. Its most distinctive feature is "massive concurrency": an Erlang program can consist of millions of (lightweight) processes, and communication between those processes is done with message passing, not shared memory. Message passing makes locks unnecessary, which is why Erlang is generally believed to scale well.
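
As a rough analogy, and in Go rather than Erlang (with toy names of my own), the pattern looks like this: each "process" owns its state privately and is reachable only through its mailbox:

    package main

    import "fmt"

    // A rough analogy to Erlang's model: each "process" is a goroutine that
    // owns its own state and is reachable only through its mailbox (a
    // channel). No memory is shared, so no locks are needed.
    type msg struct {
    	value int
    	reply chan int
    }

    // counter owns its state privately; others can only send it messages.
    func counter(mailbox <-chan msg) {
    	total := 0
    	for m := range mailbox {
    		total += m.value
    		m.reply <- total // respond with the new running total
    	}
    }

    func main() {
    	mailbox := make(chan msg)
    	go counter(mailbox)

    	for i := 1; i <= 3; i++ {
    		reply := make(chan int)
    		mailbox <- msg{value: i, reply: reply}
    		fmt.Println("total:", <-reply)
    	}
    }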



Erlang for many-core CPUs

The good news about multi-core CPUs is that we have abundant computing power. The bad news about many-core CPUs is that we do not know how to utilize them. We are used to sequential programming languages that hinder us from thinking about problems in parallel, which limits our programs' ability to exploit the parallelism in our data. Erlang (even though it predates multi-core CPUs) is inherently parallel in its language design, and its concept of lightweight processes naturally fits multi-core CPUs. In theory, a well-written Erlang program should scale automatically with the number of cores.


Erlang for cloud computing

If Erlang programs naturally scale to tens of CPU cores, why not to millions of CPU cores in a data center? The asynchronous message passing concept maps naturally onto network communication, and the absence of shared memory between processes matches the shared-nothing architecture of data center servers exactly. Cool features of Erlang such as fault tolerance and hot code swapping make it a very attractive candidate for the Language for Clouds. Conceptually yes, in theory, but a lot of work remains to be done, as the current Erlang implementation is not designed to run in an inter-machine manner.


Framework vs Language

Maybe we do not know how to utilize many-core CPUs within a machine, but we certainly know how to utilize a cluster of servers. We have decent frameworks for many different workloads: MapReduce, Pregel, and Dryad, to name a few. Many data center workloads are very data-intensive but consist of relatively small programs (usually no more than a few thousand lines of code). It is not clear that we need to invite Erlang to our party.

Sunday, December 11, 2011

Bigtable: A Distributed Storage System for Structured Data

Bigtable is Google's in-house solution for serving petabyte-scale data storage that scales to thousands of commodity servers. Despite its title, Bigtable's data model is not very structured (or sophisticated). Basically, it provides a sparse table of uninterpreted byte arrays, where each data item is mapped as:

(table, row, column, time) --> string
The column and timestamp can be omitted to iterate over a row.

Bigtable adds some seasoning to this simple data model to enable some cool features. The concept of column families gives better management of data, allows access control, and serves as the unit of compression. Timestamping (each cell in a Bigtable can contain multiple versions of the same data) can be used for data history or concurrency control, depending on the application's needs.
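
To make the model concrete, here is a toy Go sketch of the (row, column, timestamp) --> bytes mapping with multiple versions per cell. The types and the put helper are mine, not Bigtable's API:

    package main

    import "fmt"

    // A toy model of Bigtable's map: (row, column, timestamp) -> bytes.
    // Each cell keeps multiple timestamped versions of the same data.
    type cell map[int64][]byte // timestamp -> value
    type row map[string]cell   // "family:qualifier" -> versions
    type table map[string]row  // row key -> row

    func put(t table, r, col string, ts int64, v []byte) {
    	if t[r] == nil {
    		t[r] = row{}
    	}
    	if t[r][col] == nil {
    		t[r][col] = cell{}
    	}
    	t[r][col][ts] = v
    }

    func main() {
    	t := table{}
    	// Two versions of the same cell, distinguished by timestamp.
    	put(t, "com.cnn.www", "contents:", 1, []byte("<html>v1"))
    	put(t, "com.cnn.www", "contents:", 2, []byte("<html>v2"))
    	for ts, v := range t["com.cnn.www"]["contents:"] {
    		fmt.Printf("t=%d %s\n", ts, v)
    	}
    }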

The internal implementation of Bigtable relies heavily on Google's other in-house service components. Chubby handles many issues common to distributed storage systems, such as master election, bootstrapping, and access control management. GFS is used as Bigtable's underlying storage layer.

Maybe I am too cynical, but I would like to ask a (meta) question: Why did Google publish this paper at an academic conference? Why is it valuable in terms of research? How is it meaningful outside Google? What lessons can a reader learn from this work? I am not saying that this is not a good paper; I am just wondering, as a newbie systems researcher (a field where those two are hard to tell apart), what the difference is between good engineering work and good research.

An Operating System for Multicore and Clouds: Mechanisms and Implementation

This paper introduces a new operating system called fos (Factored Operating System). fos can be summarized as follows, with as few buzzwords as I could manage:
  • Microkernel: most features of the kernel are distributed into user space
  • Single system image across cores and machines
  • Communication between cores (intra- or inter-machine) is done by message passing (a rough sketch follows this list)
  • Name mapping for addressing
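
As a rough sketch of the last two points (my own toy types, not fos's actual interface), name-based message passing boils down to resolving a service name to a mailbox that could in principle live on another core or machine:

    package main

    import "fmt"

    // A minimal sketch of name-based messaging: clients address services by
    // name; a name service resolves the name to a mailbox. In a real system
    // the mailbox could live on another core or another machine.
    var nameService = map[string]chan string{}

    func register(name string) chan string {
    	mbox := make(chan string, 8)
    	nameService[name] = mbox
    	return mbox
    }

    func send(name, m string) {
    	nameService[name] <- m // could cross the network transparently
    }

    func main() {
    	fsMbox := register("fileserver")
    	done := make(chan struct{})
    	go func() {
    		m := <-fsMbox
    		fmt.Println("fileserver got:", m)
    		close(done)
    	}()
    	send("fileserver", "open /etc/motd")
    	<-done
    }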
As far as I understand, the novelty of fos over traditional microkernel research comes from inter-machine message passing. This idea seems very similar to the multikernel model of the Barrelfish project, but the only mention of Barrelfish in this paper is: "but fos is focusing more on how to parallelize the system servers as well as addresses the scalability on chip and in the cloud". I wonder how the Barrelfish project team would comment on that.

Let us review some "facts" mentioned by the paper:
  1. Servers are getting more and more CPU cores. Traditional operating systems do not scale well with many-core CPUs.
  2. Cloud computing introduces many hard problems, such as administration complexity, system-wide resource management, etc.
Obviously both issues are real and need to be solved, but I am not sure they should be solved with a single solution. Systems research (well, at least networking research) has a good tradition called layering.

Suppose we already have the IP protocol, which provides global naming and routing over the Internet. It would not be very compelling if someone came up with a new protocol called XP, saying "XP is a new protocol that solves the limited address space of IP and supports reliable data transmission". Rather than that, TCP for reliable data transport and IPv6 for a larger address space would be much more convincing to me.

The same argument applies to fos. A scalable operating system like Corey, combined with a cloud resource management system such as Mesos or Omega, would be a much cleaner approach, in my humble opinion.

Hedera: Dynamic Flow Scheduling for Data Center Networks

Hedera begins with the following two observations:
  • Data flows in data centers are diverse. Some flows are elephants (large bulk transfers over long-lived connections) while others are not. Even worse, there is no easy way to predict the traffic matrix between hosts.
  • Typical data center topologies consist of multi-rooted trees, which provide better bisection bandwidth to work around the limited port density of switches, so there are multiple paths between any given pair of hosts. ECMP is used to utilize these multiple paths on a per-flow basis.
When these two come together, things break. The authors explain that ECMP-based flow forwarding does not work very well with diverse flows. Since the load balancing done by ECMP is essentially random (a hash of the flow 5-tuple), it accounts for neither current network utilization nor flow size. Suppose two elephant flows are forwarded onto the same link while another link sits idle. This load imbalance may underutilize the potential bisection bandwidth.
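
To make the collision concrete, here is a toy per-flow ECMP sketch in Go. The FNV hash is my arbitrary choice; real switches use their own hash functions:

    package main

    import (
    	"fmt"
    	"hash/fnv"
    )

    // Toy per-flow ECMP: pick one of k equal-cost paths by hashing the
    // 5-tuple. The hash knows nothing about flow size or link load, so two
    // elephant flows can land on the same path while another sits idle.
    func ecmpPath(srcIP, dstIP, proto string, srcPort, dstPort, k int) int {
    	h := fnv.New32a()
    	fmt.Fprintf(h, "%s|%s|%s|%d|%d", srcIP, dstIP, proto, srcPort, dstPort)
    	return int(h.Sum32()) % k
    }

    func main() {
    	const k = 4 // four equal-cost uplinks
    	// Two bulk flows; nothing prevents them from hashing to the same path.
    	fmt.Println(ecmpPath("10.0.0.1", "10.0.1.1", "tcp", 5001, 80, k))
    	fmt.Println(ecmpPath("10.0.0.2", "10.0.1.2", "tcp", 5002, 80, k))
    }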

Hedera addresses this problem with dynamic flow scheduling. A central scheduler keeps track of flow information between hosts and optimizes path assignment by placing flows on non-conflicting paths. As this optimization is very computationally intensive, Hedera comes up with two heuristics: Global First Fit (in other words, a greedy algorithm) and Simulated Annealing (WHAT??? -5 points here).
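
Global First Fit, as I understand it, reduces to a greedy scan over candidate paths. A minimal sketch with simplified types of my own:

    package main

    import "fmt"

    // A simplified sketch of Global First Fit: for each large flow with an
    // estimated demand, scan the candidate paths in order and place the flow
    // on the first one with enough spare capacity.
    type path struct {
    	id    int
    	spare float64 // remaining capacity, e.g. in Gbps
    }

    func globalFirstFit(demand float64, paths []path) int {
    	for i := range paths {
    		if paths[i].spare >= demand {
    			paths[i].spare -= demand // reserve capacity on this path
    			return paths[i].id
    		}
    	}
    	return -1 // no path fits; leave the flow to default ECMP
    }

    func main() {
    	paths := []path{{0, 0.3}, {1, 1.0}, {2, 1.0}}
    	fmt.Println(globalFirstFit(0.9, paths)) // placed on path 1
    	fmt.Println(globalFirstFit(0.9, paths)) // path 1 now full; path 2
    }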

My biggest impression of this paper is that the authors should have discussed other alternatives before digging a tunnel through the hill. They treat per-flow ECMP as the enemy and spend only one sentence on a per-packet ECMP scheme: TCP's performance is significantly reduced when packet reordering occurs, because TCP interprets reordering as a sign of packet loss due to network congestion.
I was really disappointed at this point, since that is exactly what TCP selective acknowledgment (SACK) is for. I am really curious how well TCP SACK and per-packet ECMP would work together. In fact, many ISPs already use ECMP over multiple links on a per-packet basis, and it works surprisingly well even in WAN environments. In data centers, if all TCP packets are distributed packet by packet, the loads on the links will be almost even, so the delay jitter introduced by the multiple paths should be minimal (a few microseconds). Even if TCP SACK works poorly in this situation, we could add a thin layer to the end-host network stack that reorders packets. This approach is much simpler, more elegant, independent of programmable OpenFlow switches, and fully decentralized. I will not be convinced by Hedera unless the authors give a good explanation on this point.
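
The thin reordering layer I have in mind could be as small as the following sketch (toy types of my own; a real shim would also need a timeout so that a lost packet cannot stall delivery forever):

    package main

    import "fmt"

    // A sketch of a reordering shim: hold packets that arrive out of order
    // and release them in sequence-number order.
    type reorderBuffer struct {
    	next    uint32            // next expected sequence number
    	pending map[uint32][]byte // out-of-order packets, keyed by seq
    }

    func (b *reorderBuffer) receive(seq uint32, data []byte, deliver func([]byte)) {
    	b.pending[seq] = data
    	// Release every packet that is now in order.
    	for {
    		d, ok := b.pending[b.next]
    		if !ok {
    			return
    		}
    		deliver(d)
    		delete(b.pending, b.next)
    		b.next++
    	}
    }

    func main() {
    	b := &reorderBuffer{pending: map[uint32][]byte{}}
    	deliver := func(d []byte) { fmt.Printf("%s ", d) }
    	// Packets 1 and 2 took a different path and arrived before 0.
    	b.receive(1, []byte("p1"), deliver)
    	b.receive(2, []byte("p2"), deliver)
    	b.receive(0, []byte("p0"), deliver) // prints: p0 p1 p2
    }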

RAMClouds: Scalable High-Performance Storage Entirely in DRAM


I made a presentation on RAMCloud in my class. The slides are available here.

RAMCloud is an actively ongoing project at Stanford University, aiming to build a new storage layer for data centers. RAMCloud has unique characteristics compared to previous data center storage systems, and the sharpest difference is that RAMCloud stores all data in DRAM, yet, unlike memcached, the data is permanent.

This all-data-in-DRAM feature leads to interesting consequences, such as:

  • RAMCloud has orders-of-magnitude higher performance (10us latency, 1M operations/s) than other disk-based storage systems.
  • RAMCloud should be free (at least in theory) from the tail latency problem, which is mainly caused by disk seek latency.
  • There is no concept of a replica in RAMCloud; every data item is served by a single machine. In other storage systems, replicas are used for two main reasons: load balancing and fault tolerance. Since RAMCloud has much higher per-node performance, load balancing is not a critical issue. In terms of fault tolerance, RAMCloud has a cool feature called "fast recovery", which recovers the data of a dead node in 1-2 seconds. The fast recovery process is covered in their SOSP 2011 paper.
  • Even though the data model RAMCloud supports is simple key-value storage, its low latency makes more complex data structures (e.g., relations) feasible on top of RAMCloud (a toy sketch follows this list).
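
To illustrate that last point with a toy sketch (the types and calls are mine, not RAMCloud's API): when each get/put costs on the order of 10us, even a naive secondary index built out of many small key-value operations becomes plausible:

    package main

    import "fmt"

    // A toy key-value store standing in for RAMCloud's interface. The point
    // is the access pattern, not the storage engine: richer structures are
    // composed from many cheap get/put operations.
    type kv map[string]string

    func (s kv) put(k, v string) { s[k] = v }
    func (s kv) get(k string) string { return s[k] }

    func main() {
    	store := kv{}
    	// Primary record plus an index entry, each a separate small write.
    	store.put("user:42", `{"name":"ada"}`)
    	store.put("index:name:ada", "user:42") // secondary index as plain keys
    	// Lookup by name takes two round trips, cheap when each is ~10us.
    	fmt.Println(store.get(store.get("index:name:ada")))
    }
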
RAMCloud still has hurdles to overcome:
  • Currently, the high throughput and low latency of the prototype depend heavily on InfiniBand (kind of a free ride). It is not clear whether this can scale to the size of a data center.
  • The tail latency problem also comes from overloaded networks and servers, not just from disks.
  • Unlike other storage systems from industry (whose products grew out of their own demands), it is somewhat uncertain what the killer application of RAMCloud would be.



Monday, November 7, 2011

Scalable data center networking: Portland, VL2, and c-Through


Commodity Ethernet switches offer very useful semantics to end hosts. An end host can be added or removed without any manual configuration (plug-and-play). Even if the physical port of a specific host changes, communication with that host is not disrupted. Most commercial L2 switches provide full bisection bandwidth between hosts, which is particularly important for MapReduce-like workloads.

However, these good characteristics do not scale beyond one switch. Each switch has limited port density (typically 24-48 ports) and a small SRAM that limits the maximum size of the MAC learning table. At the scale of data centers, this scalability problem introduces many issues: operational cost for manual configuration, an obsession with locality due to limited bisection bandwidth, etc. VL2 and Portland address this problem with the same goal: the data center network should be viewed as a (virtually) single layer-2 domain.

Portland and VL2 have slightly different scopes and approaches. For example, VL2 works with modified end hosts and relies on layer-3 functionalities of commodity switches such as ECMP and IP-in-IP tunneling, while Portland relies only on simple layer-2 functionalities but assumes some modifications to switches (MAC rewriting). Both have some notion of a centralized service that maintains location-mapping state.
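
The common idea behind those centralized services boils down to a directory from a host's permanent identity to its current location in the topology. A hedged sketch with made-up types (Portland's fabric manager and VL2's directory system each differ in the details):

    package main

    import "fmt"

    // A sketch of the location-mapping state both designs maintain: the
    // host's identity stays fixed while its topological locator can change.
    type directory struct {
    	loc map[string]string // host identity -> topological locator
    }

    func (d *directory) update(host, locator string) { d.loc[host] = locator }

    func (d *directory) resolve(host string) (string, bool) {
    	l, ok := d.loc[host]
    	return l, ok
    }

    func main() {
    	d := &directory{loc: map[string]string{}}
    	d.update("AA:BB:CC:DD:EE:01", "pod2/pos3/port7") // host comes up
    	fmt.Println(d.resolve("AA:BB:CC:DD:EE:01"))
    	d.update("AA:BB:CC:DD:EE:01", "pod5/pos1/port2") // host moves
    	fmt.Println(d.resolve("AA:BB:CC:DD:EE:01"))      // traffic follows
    }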

c-Through takes an interesting approach to achieving larger bandwidth between servers. Ethernet switches are suitable for low-latency, bursty, and uniform traffic. Optical switches, on the other hand, provide much more bandwidth for stable traffic. Their theoretical bandwidth is unlimited, as they do not interpret the data itself but just forward the optical signal between ports. The disadvantage of optical switches is their high circuit switching time (on the order of milliseconds), needed to mechanically adjust the mirrors. They also need external schedulers to manage the switching fabric.

In SIGCOMM 2010, c-Through and Helios independently introduced the use of optical switches to augment data center networks. While c-Through lets end hosts decide whether to offload traffic to the optical network, in Helios that decision is up to the switches. Interestingly, this difference (where to make modifications) is very similar to the relationship between Portland and VL2.

Wednesday, November 2, 2011

Improving Per-Node Efficiency in the Datacenter with New OS Abstractions

This paper introduces a new operating system, Akaros, being built by an army led by Eric Brewer. The main goal of Akaros is to fully utilize the potential peak performance of commodity large-scale SMP servers. The authors point out that most data center efforts have gone into scalability across multiple nodes, while missing many opportunities for improving per-node efficiency (I totally agree).

The key principle of Akaros is to give applications more understanding and control of the underlying hardware for optimal performance. This may sound similar to the Corey many-core operating system from MIT, but Akaros takes a much less radical approach than Corey's exokernel architecture.

I summarize my comments and concerns:
  • Many features of the MCP abstraction, such as process-core affinity and core isolation, are already available in current Linux kernels (a small example follows this list). Many applications do not adopt them because of portability concerns or limited awareness of them.
  • Giving applications more control through the MCP abstraction could lead to suboptimal performance, as each application has only an isolated view. I think the operating system should be in charge of control for system-wide optimal performance.
  • The Linux 2.6 kernel already has a unified asynchronous interface for I/O internally.
  • Zero-copy I/O is now commonly considered unnecessary, as the evolution of CPU microarchitecture and new technologies such as memory-to-memory DMA have made copying efficient enough. In most cases, the bookkeeping overhead of zero-copy I/O (e.g., page management) exceeds the cost of the copy itself.
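
To back up the first bullet: pinning a thread to a core on Linux takes only a few lines today. A minimal Go sketch using the golang.org/x/sys/unix package (Linux only; passing pid 0 means the calling thread):

    package main

    import (
    	"fmt"
    	"runtime"

    	"golang.org/x/sys/unix"
    )

    // Core affinity is already exposed by the Linux kernel: this pins the
    // calling OS thread to CPU 0. No new kernel abstraction is needed for
    // this particular feature.
    func main() {
    	runtime.LockOSThread() // keep this goroutine on one OS thread
    	var set unix.CPUSet
    	set.Zero()
    	set.Set(0) // allow CPU 0 only
    	if err := unix.SchedSetaffinity(0, &set); err != nil {
    		fmt.Println("affinity failed:", err)
    		return
    	}
    	fmt.Println("pinned to CPU 0")
    }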