Sunday, December 11, 2011

Hedera: Dynamic Flow Scheduling for Data Center Networks

Hedera begins with two following observations:
  • Data flows in data centers are diverse. Some flows are elephant (large bulk transmission over long-lived connections) while others are not. Even worse, there is no easy way to predict the traffic matrix between hosts.
  • Typical data center topologies consists of multi-rooted trees for better bisection bandwidth to workaround limited port density of switches, so there are multiple paths between any given pair of hosts. ECMP is used to utilize the multi-path on a flow basis.
When these two comes together, things are broken. The authors explains ECMP-based flow forwarding does not work very well with diverse flows. Since load balancing done by ECMP is quite random (a hash of flow 5-tuple), it does not account either current network utilization or flow size. Suppose that two elephant flows are forwarded into the same link, while another link is idle. This load imbalance may underutilize the potential bisection bandwidth.

Hedera addresses this problem with dynamic flow scheduling. The central scheduler keeps track of flow information between hosts and optimizes it by assigning flows to non-conflicting paths. As this optimization is very computationally intensive, Hedera comes up with Global First Fit (in other words, a greedy algorithm) and Simulated Annealing (WHAT??? -5 points here).

My biggest impression on this paper is that the authors should have discussed about other alternatives before digging a tunnel through the hill. In this paper, they assume ECMP on a flow basis as an enemy and make only one sentence about packet-by-packet ECMP scheme: TCP's performance is significantly reduced when packet reordering occurs because it interprets that as a sign of packet loss due to network congestion.
I was really disappointed at this point, since that is exactly what TCP selective ACK is for. I am really curious about how well TCP SACK and packet-by-packet ECMP would work. In fact, many ISPs are utilizing ECMP over multiple links on a packet basis, and it's surprisingly working well even in WAN environment. In data centers, if all TCP packets are distributed packet-by-packet, the loads of links will be almost even, so delay jitters introduced by the multi-path should be minimal (a few micro seconds). Even if TCP SACK works poorly in this situation, maybe we can add a thin layer that reorders packets into the endhost network stack. This way is much simpler, elegant, not dependent on programmable OpenFlow switches, and fully decentralized. I would not be convinced by Hedera if the authors do not make good explanation on it.

No comments:

Post a Comment