Recently, I attended the Hot Interconnects conference, which was hosted by Credit Suisse in New York City. This was an especially apt location given how hard Wall Street pushes its networks: on Wall Street, interconnect performance, and low latency in particular, is critical.
An important topic at the conference was multicast. I presented a paper, co-authored with IBM, that reported test results with an emphasis on full line-rate multicast. We stressed the switch with lots of 10GbE traffic (more traffic than most real networks could ever possibly throw at it) and the switch performance, of course, was awesome. During the talk, I also talked up the idea of parallel multicast that falls out from these results.
At the conference, the issue of fairness of information came up quite a bit. For example, if two traders subscribe to the same market ticker feed (IP multicast) and one receives the updates 1 millisecond before the other, that trader has a distinct trading advantage. As a result, Wall Street brokerages subscribing to this feed are even co-locating their servers with the exchanges that generate the ticker information, to save the latency incurred when the data has to physically leave the exchange and travel to the trader.
So how fair does this need to be? A difference of a millisecond today is described as a “lifetime,” while saving 100 microseconds can directly affect the bottom line of a trading operation. For this reason, you need a network that can deliver updates to each trader at essentially the same time, by taking the message sent from the exchange and copying it in parallel to each trader connected to the network. This is a key feature provided by our FocalPoint switch.
When a frame is multicast out of multiple ports, a copy of the frame is read out of each egress port (after being scheduled) at essentially the same time, assuming those egress ports are not congested. As a result of our shared memory architecture, the maximum skew between these copies is only 67.2 nanoseconds, which is so small that it is difficult to measure. This minimal skew allows multicast frames to be delivered through a large-scale data center network with a “skew budget” of just a few hundred nanoseconds or less, providing a high degree of fairness to the traders.
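To put the 67.2 nanosecond figure in perspective, here is a rough back-of-the-envelope sketch in Python. It shows that 67.2 ns happens to equal the time one minimum-size Ethernet frame (64 bytes plus 8 bytes of preamble and 12 bytes of inter-frame gap) occupies a 10GbE link. That match is my own observation, not a statement about how the switch arrives at the number. The second function assumes, pessimistically, that skew simply adds up at every switch hop; the function names and the five-hop path are made up for illustration.

```python
# Standard Ethernet framing overheads, in bytes.
PREAMBLE_SFD_BYTES = 8      # preamble + start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12   # minimum idle time between frames
MIN_FRAME_BYTES = 64        # minimum Ethernet frame, including FCS

TEN_GBE_BPS = 10_000_000_000


def wire_time_ns(frame_bytes, rate_bps=TEN_GBE_BPS):
    """Time one frame occupies the link, including preamble and gap."""
    total_bits = (PREAMBLE_SFD_BYTES + frame_bytes + INTERFRAME_GAP_BYTES) * 8
    return total_bits * 1_000_000_000 / rate_bps


def worst_case_skew_ns(hops, per_hop_skew_ns=67.2):
    """Assumption: per-hop skew accumulates linearly (a pessimistic bound)."""
    return hops * per_hop_skew_ns


if __name__ == "__main__":
    # One minimum-size frame slot at 10GbE: 672 bits, i.e. 67.2 ns.
    print(f"min frame slot: {wire_time_ns(MIN_FRAME_BYTES):.1f} ns")
    # Hypothetical path of five switch hops between exchange and trader.
    print(f"5-hop skew bound: {worst_case_skew_ns(5):.1f} ns")
```

Under these assumptions, even five hops stay within a few hundred nanoseconds, which is several orders of magnitude below the 1 millisecond "lifetime" mentioned earlier.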
While all the talks at the conference were excellent, I’ve highlighted a few below that I thought were closely related to the types of things we are working on:
The other face of on-chip interconnect. Probably the best way to utilize a 10GbE or 40GbE link is to put thousands of cores on one chip, which is why Tilera [http://www.tilera.com] is pretty exciting.
Data Center Switch Architecture. A paper on how to build a monstrous 64,000-port switch using a three-tier Clos architecture. The key here is to build it using off-the-shelf Ethernet switches, making network bandwidth just as commoditized and freely available as computing power has recently become. Pretty cool stuff.
Designing Next-Generation Clusters. This paper compared Nehalem processors with the previous Intel generation (Clovertown) and found that there wasn’t much of a performance leap within a small HPC cluster. Interestingly, there was very little difference between double data rate (DDR, 20G) InfiniBand and quad data rate (QDR, 40G) InfiniBand. So even a Nehalem can’t really utilize more than a DDR IB connection. If you can’t really utilize the faster data rate of IB, what is the point of using it?
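As a footnote to the Data Center Switch Architecture item above, here is a quick sketch of how a port count in the 64,000 range can fall out of commodity parts. It uses the standard three-tier fat-tree (folded Clos) arithmetic, in which identical k-port switches support k^3/4 hosts using 5k^2/4 switches. The 64-port building block is my assumption for illustration; I am not claiming it is the exact design in the paper.

```python
def fat_tree_hosts(k):
    """Host ports in a three-tier fat tree built from k-port switches."""
    if k < 2 or k % 2:
        raise ValueError("k must be an even number of ports, at least 2")
    return k ** 3 // 4


def fat_tree_switches(k):
    """Total switches: k*k/2 edge + k*k/2 aggregation + k*k/4 core."""
    if k < 2 or k % 2:
        raise ValueError("k must be an even number of ports, at least 2")
    return 5 * k * k // 4


if __name__ == "__main__":
    k = 64  # hypothetical off-the-shelf switch radix
    print(f"{k}-port switches: {fat_tree_hosts(k)} host ports, "
          f"{fat_tree_switches(k)} switches")
```

With k = 64 this gives 65,536 host ports from 5,120 switches, which is roughly the "64K" scale described.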