Lest the bard fray more, the topic is of PGM haste in the homogeneous environment, and the unfortunate absence of said haste. We take performance readings of PGM across multiple hosts and present a visual heat map of latency to provide insight into the actual performance. Testing entails transmitting a message onto a single LAN segment; a listening application receives the message and immediately re-broadcasts it, and when the message arrives back at the source the round-trip time is calculated using a single high-precision clock source.
Performance testing configuration with a sender maintaining a reference clock to calculate message round-trip time (RTT).
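For clarity, here is a minimal sketch of the measurement itself, assuming a POSIX monotonic clock as the single high-precision time source; send_message() and wait_for_reflection() are hypothetical stand-ins for the real PGM send and receive paths:

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    /* Hypothetical stand-ins for the PGM send and the blocking wait
     * for the reflected message; the real test uses the PGM transport. */
    static void send_message(void)        { }
    static void wait_for_reflection(void) { }

    static uint64_t now_usec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
    }

    int main(void)
    {
        const uint64_t t0 = now_usec();
        send_message();
        wait_for_reflection();
        printf("RTT: %llu microseconds\n",
               (unsigned long long)(now_usec() - t0));
        return 0;
    }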
The baseline reading is taken from Linux to Linux. The reference hardware is an IBM BladeCenter HS20 with Broadcom BCM5704S gigabit Ethernet adapters, and the networking infrastructure is provided by a BNT fibre gigabit Ethernet switch.
Latency in microseconds from Linux to Linux at 10,000 packets-per-second one-way.
20,000 feet (6,096 meters), Sir Hillary
Onward and upward we must go. With an inter-frame gap (IFG) of 96ns, the line capacity of a gigabit network is 81,274pps for full-size frames, leading to a test potential limit of 40,000pps one-way: since every message is reflected back, that puts 80,000pps on the wire, leaving a little safety room below the ceiling.
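The arithmetic behind that figure, as a small C sketch: a full-size frame occupies 1538 bytes on the wire, counting the IFG, preamble, Ethernet header, payload, and FCS:

    #include <stdio.h>

    int main(void)
    {
        /* On-the-wire footprint of one full-size Ethernet frame, in bytes:
         * 12 (inter-frame gap) + 8 (preamble) + 14 (header)
         * + 1500 (payload) + 4 (FCS) = 1538. */
        const double wire_bytes = 12 + 8 + 14 + 1500 + 4;
        const double line_rate  = 1e9;               /* bits per second */
        printf("%.0f pps\n", line_rate / (wire_bytes * 8.0));  /* ~81,274 */
        return 0;
    }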
Latency in microseconds from Linux to Linux at 20,000 packets-per-second one-way.
Latency in microseconds from Linux to Linux at 30,000 packets-per-second one-way.
Latency in microseconds from Linux to Linux at 40,000 packets-per-second one-way.
Windows Server 2008 R2 performance
Here comes Windows. Windows!
Latency in microseconds from Windows to Linux at 10,000 packets-per-second one-way.
Spot the difference
The only difference is a slightly lighter line at ~220μs; the spread of latencies is still rather broad.
A common call to arms on Windows IP networking hoists the flag of Input/Output Completion Ports, or IOCP, as a more efficient design for event handling: it builds upon overlapped (asynchronous) I/O rather than readiness polling, reduces the number of system calls needed to send and receive packets, and permits zero-copy memory handling.
Latency in microseconds from Windows to Linux at 10,000 packets-per-second one-way using IOCP.
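For the curious, a minimal sketch of an IOCP-driven UDP receive loop, assuming Winsock 2 overlapped I/O; the port number is arbitrary and error handling is abbreviated, so treat it as an outline rather than the test harness itself:

    #include <winsock2.h>
    #include <ws2tcpip.h>
    #include <stdio.h>

    #pragma comment(lib, "ws2_32.lib")

    int main(void)
    {
        WSADATA wsa;
        WSAStartup(MAKEWORD(2, 2), &wsa);

        /* An overlapped UDP socket bound to an arbitrary test port. */
        SOCKET s = WSASocket(AF_INET, SOCK_DGRAM, IPPROTO_UDP,
                             NULL, 0, WSA_FLAG_OVERLAPPED);
        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(7500);               /* hypothetical port */
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(s, (struct sockaddr*)&addr, sizeof(addr));

        /* Attach the socket to a new completion port. */
        HANDLE iocp = CreateIoCompletionPort((HANDLE)s, NULL, (ULONG_PTR)s, 0);

        char buf[1500];
        WSABUF wbuf = { sizeof(buf), buf };
        WSAOVERLAPPED ov;
        struct sockaddr_in from;
        INT fromlen;
        DWORD flags, bytes;
        ULONG_PTR key;
        LPOVERLAPPED pov;

        /* Post one overlapped receive; WSA_IO_PENDING is the normal result. */
        ZeroMemory(&ov, sizeof(ov));
        flags = 0;
        fromlen = sizeof(from);
        WSARecvFrom(s, &wbuf, 1, NULL, &flags,
                    (struct sockaddr*)&from, &fromlen, &ov, NULL);

        /* Completions are dequeued here instead of per-packet select()/recv(). */
        while (GetQueuedCompletionStatus(iocp, &bytes, &key, &pov, INFINITE)) {
            printf("received %lu bytes\n", (unsigned long)bytes);
            ZeroMemory(&ov, sizeof(ov));
            flags = 0;
            fromlen = sizeof(from);
            WSARecvFrom(s, &wbuf, 1, NULL, &flags,
                        (struct sockaddr*)&from, &fromlen, &ov, NULL);
        }

        closesocket(s);
        WSACleanup();
        return 0;
    }

A production receiver would post many overlapped reads at once and recycle each buffer on completion; a single outstanding read, as here, forfeits much of the benefit.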
Increasing the socket buffer sizes permits the test to run at 20,000pps, but with heavy packet loss that engages the PGM reliability engine, yielding an average latency of 1-2 seconds.
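The buffer enlargement itself is a plain setsockopt() call; the value below is illustrative rather than the figure used in these tests:

    #include <winsock2.h>

    /* Enlarge both socket buffers; returns 0 on success. */
    static int enlarge_buffers(SOCKET s, int bytes)
    {
        if (setsockopt(s, SOL_SOCKET, SO_RCVBUF,
                       (const char*)&bytes, sizeof(bytes)) != 0)
            return -1;
        if (setsockopt(s, SOL_SOCKET, SO_SNDBUF,
                       (const char*)&bytes, sizeof(bytes)) != 0)
            return -1;
        return 0;
    }

    /* e.g. enlarge_buffers(s, 4 * 1024 * 1024); */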
High performance Windows datagrams
All the popular test utilities use blocking sockets to yield remotely reasonable figures; these include netperf, iperf, ttcp, and ntttcp, Microsoft's port of ttcp for Windows sockets. Some of these yield higher raw numbers on Windows than iperf does on Linux, which, considering the previous results, is unexpected:
- iperf on Linux yields ~70,000pps.
- PCATTCP on Windows yields ~90,000pps.
- ntttcp on Windows yields ~190,000pps.
- iperf on Windows yields ~20,000pps.
Windows registry settings
To achieve these high-performance Windows results the following changes were applied.
- Disable the multimedia network throttling scheduler. By default Windows limits streams to 10,000pps when a multimedia application is running.
Under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Multimedia\SystemProfile set NetworkThrottlingIndex, type REG_DWORD, value set to 0xffffffff.
- Two settings define the fast path through the IP stack for datagrams: by default, datagrams above 1024 bytes go through a slow locked double buffer. Increase the threshold to the network MTU size, i.e. 1500 bytes. (These registry edits can also be scripted; see the sketch at the end of this section.)
Under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters:
- FastSendDatagramThreshold, type REG_DWORD, value set to 1500.
- FastCopyReceiveThreshold, type REG_DWORD, value set to 1500.
- Without hardware acceleration for high-resolution time stamping, incoming packets will be tagged with their receipt time at the expense of processing time. Disable the time stamps on both sender and receiver by means of the following command:
netsh int tcp set global timestamps=disabled
- A firewall will intercept all packets, causing increased latency and processing time; disable the filtering and firewall services to ensure direct network access. Disable the following services:
- Base Filtering Engine (BFE)
- Windows Firewall (MpsSvc)
The test hardware nodes are single-core Xeons, so Receive Side Scaling (RSS) does not assist performance. The Ethernet adapters also do not support Direct Cache Access (DCA), also known as NetDMA 2.0, which would otherwise improve performance by reducing the system time required to send or receive packets.
To significantly increase the default socket buffer size you can set a multiplier value under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters:
BufferMultiplier, type REG_DWORD, value set to 0x400.
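All of the registry values above can be scripted with the Win32 registry API rather than edited by hand. A minimal sketch for one value, assuming administrative rights; the same pattern applies to NetworkThrottlingIndex, FastCopyReceiveThreshold, and BufferMultiplier, and the AFD parameters typically require a reboot to take effect:

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        HKEY key;
        DWORD value = 1500;   /* FastSendDatagramThreshold = MTU size */
        LONG rc = RegOpenKeyExA(HKEY_LOCAL_MACHINE,
                                "SYSTEM\\CurrentControlSet\\Services\\AFD\\Parameters",
                                0, KEY_SET_VALUE, &key);
        if (rc != ERROR_SUCCESS) {
            fprintf(stderr, "RegOpenKeyExA failed: %ld\n", rc);
            return 1;
        }
        rc = RegSetValueExA(key, "FastSendDatagramThreshold", 0, REG_DWORD,
                            (const BYTE*)&value, sizeof(value));
        if (rc != ERROR_SUCCESS)
            fprintf(stderr, "RegSetValueExA failed: %ld\n", rc);
        RegCloseKey(key);
        return rc == ERROR_SUCCESS ? 0 : 1;
    }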