Tuesday, 26 April 2011

Wherefore art thou IP packet? Make haste Windows 2008 R2

How camest thou hither, tell me, and wherefore?
Lest the bard fray more, the topic is of PGM haste in a homogeneous environment, and the unfortunate absence of said haste. We take performance readings of PGM across multiple hosts and present a visual heat map of latency to provide insight into the actual performance. Testing entails transmission of a message onto a single LAN segment; the message is received by a listening application which immediately re-broadcasts it, and when the message arrives back at the source the round-trip time is calculated using a single high-precision clock source.
Performance testing configuration with a sender maintaining a reference clock to calculate message round-trip-time (RTT).
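To make the methodology concrete, here is a hedged sketch of the sender-side timing loop: plain UDP sockets stand in for the OpenPGM socket API, the reflector address and port are placeholders, and the loop is a simple serial ping-pong for clarity, whereas the real harness paces transmissions at the target packet rate and matches reflections by sequence number.

/* Sender-side RTT sketch: plain UDP stands in for the OpenPGM socket API.
 * Only the sender's monotonic clock is consulted, so no cross-host clock
 * synchronisation is required.  Address, port and payload size are
 * illustrative placeholders. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

int main (void)
{
    int sock = socket (AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in reflector;
    memset (&reflector, 0, sizeof (reflector));
    reflector.sin_family      = AF_INET;
    reflector.sin_port        = htons (7500);               /* placeholder port */
    reflector.sin_addr.s_addr = inet_addr ("192.168.0.2");  /* placeholder reflector */

    char payload[1400] = { 0 };                             /* below the 1500 byte MTU */

    for (unsigned i = 0; i < 10000; i++) {
        struct timespec t0, t1;
        clock_gettime (CLOCK_MONOTONIC, &t0);               /* single high-precision clock */
        sendto (sock, payload, sizeof (payload), 0,
                (struct sockaddr*)&reflector, sizeof (reflector));
        recv (sock, payload, sizeof (payload), 0);          /* wait for the reflected copy */
        clock_gettime (CLOCK_MONOTONIC, &t1);
        const int64_t rtt_us = (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000
                             + (t1.tv_nsec - t0.tv_nsec) / 1000;
        printf ("%u %lld\n", i, (long long)rtt_us);
    }
    close (sock);
    return 0;
}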
The baseline reading is taken from Linux to Linux; the reference hardware is an IBM BladeCenter HS20 with Broadcom BCM5704S gigabit Ethernet adapters, and the networking infrastructure is provided by a BNT fibre gigabit Ethernet switch.
Latency in microseconds from Linux to Linux at 10,000 packets-per-second one-way.
The numbers themselves are of minor consequence; by way of explanation, at 10,000 packets-per-second (pps) there is a marked grouping at 200μs round-trip-time (RTT) latency. The marketing version would be 20,000pps, as 10,000pps are being transmitted and 10,000pps are being received simultaneously, with a one-way latency of 100μs. Also note that the packet reflection is implemented at the application layer, much like any end-developer written software using the OpenPGM BSD socket API; compare this with alternative testing configurations that operate at the network layer, bypass the full latency of the networking stack, and yield disingenuous figures.

20,000 feet (6,096 meters) Sir Hillary
Onward and upward we must go: with the 96ns inter-frame gap (IFG) and standard framing overheads, a full 1500-byte MTU frame occupies 1538 bytes on the wire, so the line capacity of a gigabit network is 81,274pps. That puts the test's potential limit at 40,000pps one-way, as the original and reflected streams together total 80,000pps on the segment, leaving a little safety room above.
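For the curious, the arithmetic behind the 81,274pps figure can be reproduced in a few lines of C; the framing constants are the standard Ethernet values assumed above.

/* Gigabit Ethernet line capacity at a 1500 byte MTU: each frame on the wire
 * carries the payload plus Ethernet header, FCS, preamble/SFD and the
 * 12 byte (96ns) inter-frame gap. */
#include <stdio.h>

int main (void)
{
    const double mtu      = 1500.0;   /* IP payload per frame               */
    const double overhead = 14 + 4    /* Ethernet header + FCS              */
                          + 8         /* preamble + start-frame delimiter   */
                          + 12;       /* inter-frame gap (96 bit times)     */
    const double wire_bytes_per_frame = mtu + overhead;      /* 1538 bytes  */
    const double line_rate_bytes      = 1e9 / 8.0;           /* 1 Gb/s      */
    printf ("%.0f frames per second\n", line_rate_bytes / wire_bytes_per_frame);
    return 0;
}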

Latency in microseconds from Linux to Linux at 20,000 packets-per-second one-way.
At 20,000pps we start to see a spread of outliers but notice the grouping remains at 200μs.

Latency in microseconds from Linux to Linux at 30,000 packets-per-second one-way.
At 30,000pps outlier latency jumps to 1ms.

Latency in microseconds from Linux to Linux at 40,000 packets-per-second one-way.
At 40,000pps everything starts to break down, with the majority of packets falling between 200-600μs.  Above 40,000pps the network is saturated and packet loss starts to occur; the loss triggers PGM reliability traffic, consuming more bandwidth than is available for full-speed operation.

Windows 2008 R2 performance
Here comes Windows.  Windows!
Latency in microseconds from Windows to Linux at 10,000 packets-per-second one-way.
Non-blocking sockets at 10,000pps show a grouping at 200μs just as Linux does, but the spread of outliers reaches as far as 2ms. This highlights the artifacts of coarse scheduling granularity and an inefficient IP stack.

Spot the difference
A common call to arms on Windows IP networking hoists the flag of Input/Output Completion Ports, or IOCP, as a more efficient design for event handling: it depends upon blocking sockets, reduces the number of system calls needed to send and receive packets, and permits zero-copy memory handling.
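As a rough illustration of the pattern being referred to (not OpenPGM's actual code path), the sketch below associates a UDP socket with a completion port, posts an overlapped WSARecvFrom, and harvests the completion with GetQueuedCompletionStatus; error handling is trimmed and the port number is a placeholder.

/* Minimal IOCP datagram receive sketch: associate a UDP socket with a
 * completion port, post an overlapped receive, harvest the completion. */
#include <winsock2.h>
#include <ws2tcpip.h>
#include <stdio.h>

#pragma comment (lib, "ws2_32.lib")

int main (void)
{
    WSADATA wsa;
    WSAStartup (MAKEWORD (2, 2), &wsa);

    SOCKET s = WSASocket (AF_INET, SOCK_DGRAM, IPPROTO_UDP,
                          NULL, 0, WSA_FLAG_OVERLAPPED);
    struct sockaddr_in addr = { 0 };
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons (7500);                  /* placeholder port */
    addr.sin_addr.s_addr = htonl (INADDR_ANY);
    bind (s, (struct sockaddr*)&addr, sizeof (addr));

    HANDLE iocp = CreateIoCompletionPort ((HANDLE)s, NULL, (ULONG_PTR)s, 0);

    char buffer[1500];
    WSABUF wsabuf = { sizeof (buffer), buffer };
    WSAOVERLAPPED ov = { 0 };
    struct sockaddr_in from;
    int fromlen = sizeof (from);
    DWORD flags = 0;

    /* Post an asynchronous receive; completion is delivered to the port. */
    WSARecvFrom (s, &wsabuf, 1, NULL, &flags,
                 (struct sockaddr*)&from, &fromlen, &ov, NULL);

    DWORD bytes = 0;
    ULONG_PTR key = 0;
    LPOVERLAPPED pov = NULL;
    if (GetQueuedCompletionStatus (iocp, &bytes, &key, &pov, INFINITE))
        printf ("received %lu bytes\n", (unsigned long)bytes);

    closesocket (s);
    WSACleanup ();
    return 0;
}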

Latency in microseconds from Windows to Linux at 10,000 packets-per-second one-way using IOCP.
The only difference is a slightly lighter line at ~220μs; the spread of latencies is still rather broad.
Increasing the socket buffer sizes permits the test to run at 20,000pps, but with heavy packet loss engaging the PGM reliability engine and yielding 1-2 seconds of average latency.
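The buffer enlargement in question is the usual SO_SNDBUF/SO_RCVBUF adjustment; a hedged Winsock sketch follows, where the 4MB figure is purely illustrative and not the value used in the test.

/* Enlarging the send and receive socket buffers (illustrative 4MB value;
 * the size actually used in the test is not recorded here). */
#include <winsock2.h>

static void enlarge_buffers (SOCKET s)
{
    int size = 4 * 1024 * 1024;
    setsockopt (s, SOL_SOCKET, SO_RCVBUF, (const char*)&size, sizeof (size));
    setsockopt (s, SOL_SOCKET, SO_SNDBUF, (const char*)&size, sizeof (size));
}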

High performance Windows datagrams
All the popular test utilities use blocking sockets to yield remotely reasonable figures; these include netperf, iperf, ttcp, and ntttcp, Microsoft's port of ttcp for Windows sockets. Some of them yield higher raw numbers on Windows than iperf does on Linux, which, considering the previous results, is unexpected:
  • iperf on Linux yields ~70,000pps.
  • PCATTCP on Windows yields ~90,000pps.
  • ntttcp on Windows yields ~190,000pps.
  • iperf on Windows yields ~20,000pps.
There appears to be either a significant driver flaw or a severe Windows limitation in transmit interrupt coalescing, as the resultant bandwidths from testing reach ~800Mb/s to a Windows host but only ~400Mb/s from a Windows host. Drivers for the Broadcom Ethernet adapter have undergone many revisions from 2001 through to the present; all consistently show weak transmit performance, even with TCP transports.

Windows registry settings
To achieve these high-performance Windows results the following changes were applied; a programmatic sketch of the registry portion follows the list.
  • Disable the multimedia network throttling scheduler. By default Windows limits streams to 10,000pps when a multimedia application is running.
Under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Multimedia\SystemProfile set NetworkThrottlingIndex, type REG_DWORD, value set to 0xffffffff.
  • Two settings define the IP stack fast path for datagrams. By default, datagrams above 1024 bytes go through a slow, locked double buffer; increase the threshold to the network MTU size, i.e. 1500 bytes.
Under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters:
  • FastSendDatagramThreshold, type REG_DWORD, value set to 1500.
  • FastCopyReceiveThreshold, type REG_DWORD, value set to 1500.

  • Without hardware acceleration for high-resolution time stamping, incoming packets will be tagged with the receipt time at the expense of processing time.  Disable the time stamps on both sender and receiver by means of the following command:

netsh int tcp set global timestamps=disabled
  • A firewall will intercept all packets, causing increased latency and processing time; disable the filtering and firewall services to ensure direct network access.  Disable the following services:
  • Base Filtering Engine (BFE) 
  • Windows Firewall (MpsSvc)
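Where the registry changes need to be applied programmatically rather than through regedit, a minimal Win32 sketch might look like the following; paths and value names are as listed above, it must run with administrative rights, and the timestamp and service changes are left to netsh and the service control manager.

/* Hedged sketch: apply the registry changes listed above with the Win32
 * registry API.  Paths and value names are as given in the post; run with
 * administrative rights. */
#include <windows.h>
#include <stdio.h>

#pragma comment (lib, "advapi32.lib")

static void set_dword (HKEY root, const char* subkey, const char* name, DWORD value)
{
    HKEY key;
    if (RegOpenKeyExA (root, subkey, 0, KEY_SET_VALUE, &key) == ERROR_SUCCESS) {
        RegSetValueExA (key, name, 0, REG_DWORD,
                        (const BYTE*)&value, sizeof (value));
        RegCloseKey (key);
    } else
        printf ("cannot open %s\n", subkey);
}

int main (void)
{
    /* Disable the multimedia network throttling scheduler. */
    set_dword (HKEY_LOCAL_MACHINE,
               "SOFTWARE\\Microsoft\\Windows NT\\CurrentVersion\\Multimedia\\SystemProfile",
               "NetworkThrottlingIndex", 0xffffffff);

    /* Raise the AFD fast-path thresholds to the 1500 byte MTU. */
    set_dword (HKEY_LOCAL_MACHINE,
               "SYSTEM\\CurrentControlSet\\Services\\AFD\\Parameters",
               "FastSendDatagramThreshold", 1500);
    set_dword (HKEY_LOCAL_MACHINE,
               "SYSTEM\\CurrentControlSet\\Services\\AFD\\Parameters",
               "FastCopyReceiveThreshold", 1500);

    /* The TCP timestamp and firewall/BFE service changes are made with
     * netsh and the service control manager, not the registry. */
    return 0;
}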

Additional notes
The test hardware nodes are single-core Xeons, so Receive Side Scaling (RSS) does not assist performance.  Also, the Ethernet adapters do not support Direct Cache Access (DCA), also known as NetDMA 2.0, which should improve performance by reducing the system time required to send or receive packets.

To significantly increase the default socket buffer size you can set a multiplier under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters: BufferMultiplier, type REG_DWORD, value set to 0x400.
