Saturday, 8 October 2011


Under ideal conditions there is a constant stream of data on the network, every call to pgm_recv returns data, and there is no data loss or dropped packets.  Ideal conditions are rare though: we might see bursty data from senders, senders may close or crash, and packets may be lost.
At the most basic level we need to maintain awareness that the senders exist; senders announce their presence by repeated broadcast of SPM packets.  Packet loss, or closed or crashed applications, would cause an absence of SPM broadcasts and this situation can be caught by a timer.  If no packets are seen within, say, 30 seconds we consider the sender to no longer be operational.
The receive window extends beyond that to monitor every incoming packet, NAK elimination to prevent transmission of duplicate NAKs from the same or different receivers, and retransmission of NAKs for when the retransmit request itself was lost in the network.  Each state is driven by configurable timers or timeouts.

This means that as soon as a single packet from a sender is received the common return values expected are PGM_IO_STATUS_NORMAL for data and PGM_IO_STATUS_TIMER_PENDING for no-data or receive-state transitions.
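
The sender timeout described above can be sketched in a few lines.  This is only an illustration of the idea, not OpenPGM's implementation: the SenderTracker class, its method names, and the injectable clock are all hypothetical, and the 30 second value is simply the figure quoted above.

```python
import time

SPM_TIMEOUT_SECS = 30.0  # the "say 30 seconds" from the text, an assumed default


class SenderTracker:
    """Hypothetical helper: remembers when each sender was last heard from."""

    def __init__(self, timeout=SPM_TIMEOUT_SECS, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock      # injectable so the logic can be tested without sleeping
        self.last_seen = {}     # sender id -> timestamp of the most recent packet

    def on_packet(self, sender_id):
        # Any packet from the sender, SPM or data, shows it is still alive.
        self.last_seen[sender_id] = self.clock()

    def expire(self):
        """Forget and return the senders silent for longer than the timeout."""
        now = self.clock()
        dead = [s for s, t in self.last_seen.items() if now - t > self.timeout]
        for sender_id in dead:
            del self.last_seen[sender_id]
        return dead
```

A receive loop would call on_packet for every packet read and expire whenever its timer fires.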


Assuming non-blocking sockets, the return value PGM_IO_STATUS_WOULD_BLOCK appears when the call to pgm_recv cannot immediately return due to no available contiguous data.

The important note about this return value is that it indicates that there are no known senders and the receive state engine is waiting for starting data packets to begin processing.


When calling pgm_recv the return value indicating that data was successfully read is PGM_IO_STATUS_NORMAL.  This is obviously the ideal case, and assumes a constant stream of data to read from the network.

OpenPGM follows a reactor model: calling pgm_recv will read a packet from the underlying UDP or RAW socket.  If the packet is an original data packet, called ODATA, it will be inserted into the receive window, and if-and-only-if the sequence is contiguous to the current lead can the payload be returned to the application.
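
To make the contiguity rule concrete, here is a toy receive window.  It is a sketch under simplifying assumptions and not how OpenPGM stores packets: the ReceiveWindow class is hypothetical, sequence numbers are plain integers that never wrap, and there is no NAK or timer handling.

```python
class ReceiveWindow:
    """Toy receive window: holds out-of-order packets, releases contiguous ones."""

    def __init__(self, first_sequence=0):
        self.next_sequence = first_sequence  # next sequence the application expects
        self.pending = {}                    # sequence -> payload held behind a gap

    def insert(self, sequence, payload):
        """Insert one ODATA payload and return the payloads now deliverable in order."""
        if sequence < self.next_sequence or sequence in self.pending:
            return []  # already delivered, or a duplicate of a held packet
        self.pending[sequence] = payload
        ready = []
        # Drain every packet that is now contiguous with what was delivered before.
        while self.next_sequence in self.pending:
            ready.append(self.pending.pop(self.next_sequence))
            self.next_sequence += 1
        return ready
```

Inserting sequence 2 before sequence 1 returns nothing; the later arrival of sequence 1 releases both payloads at once.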

Friday, 8 July 2011

4. Non-operational IPv4 adapters

So the question arises: if we detect an adapter that is not "operationally up", and hence has no prefix, and we can assume that IPv4 link-local addresses always have a 16-bit prefix, what about the others?

First obvious candidate would be a static configured host IP address with the "media disconnected", i.e. no network cable.

Let's see how much information Windows grants the typical CJ.
Ethernet adapter Local Area Connection:

  Media State . . . . . . . . . . . : Media disconnected
  Connection-specific DNS Suffix  . :
  Description . . . . . . . . . . . : Broadcom NetXtreme Gigabit Ethernet
  Physical Address. . . . . . . . . : C4-2C-03-21-78-AB
  DHCP Enabled. . . . . . . . . . . : No
  Autoconfiguration Enabled . . . . : Yes
We see that the adapter exists, with absolutely no indication of an address other than that DHCP is disabled.  Let's look at the results of IPv4 adapter enumeration using the GetAdaptersInfo API.
Info: #13 name {12D5DC53-E214- IPv4
   scope 0 status UP   loop NO  b/c YES m/c YES
Info: #11 name {61F5BC1C-1D95- IPv4
   scope 0 status UP   loop NO  b/c YES m/c YES
Info: #10 name {FFF6B15A-5B5C- IPv4
   scope 0 status UP   loop NO  b/c YES m/c YES
Info: #19 name {D8ED3DA1-9FAC- IPv4
   scope 0 status UP   loop NO  b/c YES m/c YES
The Ethernet adapter is index #11 and the Windows 2000 API returns the host IP address as  Let's look at the Windows XP API, GetAdaptersAddresses, excluding IPv6 addressing.
Info: #13 name {12D5DC53-E214- IPv4
   scope 0 status DOWN loop NO  b/c NO  m/c YES
Info: #11 name {61F5BC1C-1D95- IPv4
   scope 0 status DOWN loop NO  b/c NO  m/c YES
Info: #11 name {61F5BC1C-1D95- IPv4
   scope 0 status DOWN loop NO  b/c NO  m/c YES
Info: #10 name {FFF6B15A-5B5C- IPv4
   scope 0 status UP   loop NO  b/c NO  m/c YES
Info: #19 name {D8ED3DA1-9FAC- IPv4
   scope 0 status UP   loop NO  b/c NO  m/c YES
Info: #1 name {846EE342-7039- IPv4
  scope 0 status UP   loop YES b/c NO  m/c YES
Windows is returning two different interfaces for the adapter: one is an IPv4 link-local prefixed address, and the other is the configured static host IP address.

In conclusion we find that Windows cannot provide the netmask or network prefix for any adapter that is not marked "operationally up".  The older Windows 2000 API cannot even report the IP address of such adapters; the newer Windows XP API fares a little better, but without additional information we can only determine the prefix of IPv4 link-local addresses.

Thursday, 7 July 2011

A Cupcake and a Teredo

IPv6 is starting to garner more interest around the world with a multitude of options being presented for co-operation with existing IPv4 hosts. Several schemes already deployed are targeting how to ensure the IPv6 Internet is accessible to IPv4-only users.  When you try to access an IPv6 website such as and an IPv6 address is returned the scheme will tunnel the request over the IPv4 Internet to a host that can speak both IPv4 and IPv6 that forwards the request onto the IPv6 target.

IPv6, similar to IPv4, was designed with a specific split between the part of an address that refers to the network and the part that refers to the host.
"Unicast and anycast addresses are typically composed of two logical parts: a 64-bit network prefix used for routing, and a 64-bit interface identifier used to identify a host's network interface."

However, this is purely a recommendation, not a concrete part of the protocol addressing.  As implied, non-unicast addresses may have a different network prefix size; for example, multicast addresses have an 8-bit prefix.

As it turns out different adapter types also can have different network prefixes.  A common example is a tunnel such as a VPN, typically a point-to-point communication between your computer and a VPN concentrator.  A point-to-point link has no multicast, broadcast, or sub-networking and hence will have a full 128-bit prefix.  Then, we have Teredo, and a cupcake.

A Cupcake

A raspberry tiramisu cupcake from "Cupcake Wars".

When configuring computer networking most CJs will be aware of the basic required numbers: a host IP address, a subnet mask, and a default gateway.  Here is a typical view on a Windows host with the ipconfig command.

Ethernet adapter Local Area Connection:

Connection-specific DNS Suffix  . :
IP Address. . . . . . . . . . . . :
Subnet Mask . . . . . . . . . . . :
Default Gateway . . . . . . . . . :
On OSX the ifconfig command might generate something similar.

en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
inet netmask 0xffffff00 broadcast
As you can see the tuple of address and netmask is quite visible for both.  A trip over to MSDN presents the GetAdaptersInfo function, an API to enumerate adapters.  The API returns a linked list of IP_ADAPTER_INFO objects, one for each adapter, with an IP_ADDR_STRING for each interface on the adapter containing the host IP address and subnet mask.  As you might note, the MSDN articles indicate that Windows XP developers should use the GetAdaptersAddresses function as this also presents IPv6 addresses and makes it easier to determine unidirectional adapters such as satellite Internet connections.
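
Since one API reports a netmask and another a prefix length, it helps to be able to convert between the two.  The following is a minimal sketch using Python's standard ipaddress module; the prefix_length helper is a name made up for this illustration, and the non-contiguous mask in the usage note is my own example.

```python
import ipaddress


def prefix_length(netmask):
    """Return the prefix length for an IPv4 netmask, or raise ValueError.

    Accepts dotted-quad text ("255.255.255.0") or an integer (0xffffff00).
    """
    bits = int(ipaddress.IPv4Address(netmask))
    inverted = bits ^ 0xFFFFFFFF
    # A legal mask is a run of ones followed by a run of zeros, so its
    # inverse must have the form 2**n - 1, and (x & (x + 1)) is then zero.
    if inverted & (inverted + 1):
        raise ValueError(f"non-contiguous netmask: {netmask}")
    return 32 - inverted.bit_length()
```

The hexadecimal netmask 0xffffff00 from the ifconfig output converts to a prefix length of 24, while a mask with a hole in it such as 255.0.255.0 is rejected.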

The Windows XP API generates a list of IP_ADAPTER_ADDRESSES objects which vary in size depending on which Windows version is operating.  The list of interfaces per adapter is found in the IP_ADAPTER_UNICAST_ADDRESS list but includes no netmask.  On Windows Vista and later a field called OnLinkPrefixLength is provided, the prefix indicating the length of the network IP address, presumably to reduce user errors in specifying illegal masks such as  So what about Windows XP developers?

A list of IP_ADAPTER_PREFIX objects is provided for Windows XP SP1 users, indicating that nothing is available pre-service pack.  The list is one prefix per interface and with Windows XP has been a one-to-one mapping with the unicast interface list.  With Windows Vista this list has been expanded, breaking compatibility with early assumptions.

"In addition, the linked IP_ADAPTER_UNICAST_ADDRESS structures pointed to by the FirstUnicastAddress member and the linked IP_ADAPTER_PREFIX structures pointed to by the FirstPrefix member are maintained as separate internal linked lists by the operating system. As a result, the order of linked IP_ADAPTER_UNICAST_ADDRESS structures pointed to by the FirstUnicastAddress member does not have any relationship with the order of linked IP_ADAPTER_PREFIX structures pointed to by the FirstPrefix member.
On Windows Vista and later, the linked IP_ADAPTER_PREFIX structures pointed to by the FirstPrefix member include three IP adapter prefixes for each IP address assigned to the adapter. These include the host IP address prefix, the subnet IP address prefix, and the subnet broadcast IP address prefix. In addition, for each adapter there is a multicast address prefix and a broadcast address prefix.
On Windows XP with SP1 and later prior to Windows Vista, the linked IP_ADAPTER_PREFIX structures pointed to by the FirstPrefix member include only a single IP adapter prefix for each IP address assigned to the adapter."

Notice the wording: the list "includes" prefixes but no order or existence is guaranteed.  So let's investigate some adapters and see what results are provided by this API.
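
Because the prefix list carries no ordering guarantee, one possible way to pair a unicast address with its network prefix is to search the list rather than index into it.  The sketch below is a heuristic of my own and not something MSDN documents: it works on prefixes written as text instead of IP_ADAPTER_PREFIX structures, the function name is hypothetical, and the addresses used with it are documentation-range examples.

```python
import ipaddress


def on_link_prefix_length(address, prefixes):
    """Guess the network prefix length for address from an unordered prefix list."""
    addr = ipaddress.ip_address(address)
    best = None
    for text in prefixes:
        net = ipaddress.ip_network(text)
        if net.version != addr.version:
            continue
        # Skip host and broadcast entries (full length) and default routes (zero length).
        if net.prefixlen in (0, net.max_prefixlen):
            continue
        # Keep the longest remaining prefix that actually contains the address.
        if addr in net and (best is None or net.prefixlen > best):
            best = net.prefixlen
    # With nothing better listed, fall back to a full-length prefix.
    return best if best is not None else addr.max_prefixlen
```

Given a Vista-style list of host, subnet, subnet broadcast, multicast and broadcast prefixes, this picks out the subnet entry regardless of where it sits in the list.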

1. The Loopback Adapter

Windows prefers to hide the loopback adapter and interfaces from ipconfig but other platforms aren't so shy.

lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
        inet netmask 0xff000000
        inet6 ::1 prefixlen 128
Notice OSX adds a link-local interface to the loopback adapter contrasting with Windows and Linux.

lo        Link encap:Local Loopback
          inet addr:  Mask:
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1

We have the standard IPv4 address with a network prefix of 8-bits, IPv6 comes with address ::1 and full prefix of 128-bits.  Now back to the Windows XP API.
For IPv6 there is no subnet IP address, presumably as an optimisation since there is no subnet; only the host IP address and multicast address are listed, and IPv6 does not include broadcast.  On IPv4 we have the subnet IP address, host IP address, subnet broadcast IP address, all-host multicast and broadcast IP addresses.

2. Active IPv6 adapter

In IPv6 land there are usually two interfaces per adapter, one link-local, and one global-scope.

A new entry appears that is not documented on MSDN: the IPv6 unspecified address ::/0 with a zero-length prefix, indicating that this adapter hosts the default route for IPv6.  There is no IPv4 equivalent ever listed (, and the prefix only appears when global-scope addresses are enabled.

Surprisingly the Windows Vista and later API returns a prefix length of 64-bits for a Teredo sourced address despite the routing table showing 32-bits and all documentation referring to standard prefix of 2001:0::/32.  This is assumed to be a defect in Windows 7 SP1.
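
If the reported length cannot be trusted for Teredo addresses, an application could override it.  The following is a hedged sketch using Python's ipaddress module: the function name is hypothetical, the override policy is an assumption on my part, and the sample address in the usage note is the example Teredo address I recall from RFC 4380, which merely needs to fall inside 2001::/32.

```python
import ipaddress

# The standard Teredo prefix, written 2001:0::/32 in the text above.
TEREDO = ipaddress.ip_network("2001::/32")


def corrected_prefix_length(address, reported):
    """Return 32 for a Teredo address, otherwise the length the API reported."""
    addr = ipaddress.ip_address(address)
    if addr.version == 6 and addr in TEREDO:
        return TEREDO.prefixlen
    return reported
```

Calling corrected_prefix_length("2001:0:4136:e378:8000:63bf:3fff:fdd2", 64) gives 32, while addresses outside the Teredo range pass through unchanged.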

3. Point-to-point tunnels

Continuing the optimisation of hiding the subnet IP address when there is no subnet, PTP interfaces have no support for broadcast or multicast traffic and so the only prefix listed is for the host IP address.
This address is actually an IPv4-mapped link-local IPv6 address for an IPv4 PTP VPN tunnel.

4. Non-operational IPv4 adapters

It is common to find that many hosts these days contain network adapters that are not in use, for example additional Ethernet ports, Bluetooth, and even virtual WiFi or virtual machine host-only adapters.  Surprisingly Windows has a special treatment for these adapters.

As the adapter is non-operational, Windows hides the IPv4 address from the prefix list, making it impossible to determine the prefix length and netmask.  Bizarrely though, IPv4 all-host multicast and broadcast addresses are still returned.  This may be an artifact of IPv4 link-local addressing on Windows that permits the use of IPv4 broadcast and multicast over a non-configured adapter for device discovery.  Therefore you can pull the unicast address, check whether it is a link-local address, and if so assume the standard 16-bit prefix.

With a static address interface that is non-operational it appears the only method to determine the network prefix length, and hence the netmask, is to use the older Windows 2000 GetAdaptersInfo function.
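
The link-local fallback is simple to express.  A minimal sketch with Python's ipaddress module follows; the function name is hypothetical, and returning None for everything else reflects the conclusion above that the prefix has to come from another source.

```python
import ipaddress


def guess_prefix_length(address):
    """Return 16 for an IPv4 link-local address, else None (cannot be inferred)."""
    addr = ipaddress.ip_address(address)
    # IPv4 link-local addresses live in 169.254.0.0/16.
    if addr.version == 4 and addr.is_link_local:
        return 16
    return None
```

An address such as 169.254.12.34 yields 16, whereas a statically configured address yields None.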

Tuesday, 26 April 2011

Real-Time Linux

Real-Time Linux is a set of patches to improve scheduling consistency, available in SUSE Linux Enterprise Real Time and Red Hat Enterprise MRG.

Linux latency in microseconds at 10,000 packets-per-second one-way.
Y-axis is latency in microseconds, X-axis is time in seconds. Chart records test of 10,000 packets-per-second sent out and received, a total of 20,000 datagrams-per-second. The left hand side shows Linux 2.6.36 with normal scheduling (SCHED_OTHER) with a tight grouping around 200μs; the right hand side shows Linux 2.6.26 with real-time scheduling (SCHED_FIFO) with tighter grouping at 200μs but larger spread of outliers.

Linux latency in microseconds at 20,000 packets-per-second one-way.
Normal scheduler shows a tight grouping at 250μs with minor spread of outliers; real time scheduling shows grouping at a better latency of 200μs but higher spread of outliers.

Linux latency in microseconds at 30,000 packets-per-second one-way.
Normal scheduler shows grouping at 250μs with spread from 200μs-1ms; real time scheduling shows spread of grouping 200-250μs with outliers to 1ms.

Linux latency in microseconds at 40,000 packets-per-second one-way.

Normal scheduler shows grouping 200-600μs with outliers spread to 2ms; real time scheduling shows grouping 250-300μs with outliers spread to 1.5ms with a strange gap from 1.0-1.3ms.

Wherefore art thou IP packet? Make haste Windows 2008 R2

How camest thou hither, tell me, and wherefore?
Lest the bard fray more, the topic is of PGM haste in the homogeneous environment, and the unfortunate absence of said haste. We take performance readings of PGM across multiple hosts and present a visual heat map of latency to provide insight to the actual performance. Testing entails transmission of a message onto a single LAN segment, the message is received by a listening application which immediately re-broadcasts the message, when the message is received back at the source the round-trip-time is calculated using a single high precision clock source.
Performance testing configuration with a sender maintaining a reference clock to calculate message round-trip-time (RTT).
The baseline reading is taken from Linux to Linux.  The reference hardware is an IBM BladeCenter HS20 with Broadcom BCM5704S gigabit Ethernet adapters and the networking infrastructure is provided by a BNT fibre gigabit Ethernet switch.
Latency in microseconds from Linux to Linux at 10,000 packets-per-second one-way.
The numbers themselves are of minor consequence; for explanation, at 10,000 packets-per-second (pps) there is a marked grouping at 200μs round-trip-time (RTT) latency.  The marketing version would be 20,000pps, as we consider 10,000pps being transmitted and 10,000pps being received simultaneously, with a one-way latency of 100μs.  Also note that the packet reflection is implemented at the application layer, much like any end-developer written software using the OpenPGM BSD socket API.  Compare this with alternative testing configurations that may operate at the network layer, bypass the effective full latency of the networking stack, and yield disingenuous figures.

20,000 feet (6,096 meters) Sir Hillary
Onward and upward we must go.  With an IFG of 96ns the line capacity of a gigabit network is 81,274pps, leading to the test potential limit of 40,000pps one-way with a little safety room above.
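
The 81,274pps figure can be reproduced with a little arithmetic.  The frame size is not stated above, so the sketch assumes full-size Ethernet frames with a 1500 byte payload plus the usual preamble, header and frame check sequence; under that assumption the numbers line up.

```python
LINE_RATE_BPS = 1_000_000_000  # gigabit Ethernet
IFG_NS = 96                    # inter-frame gap from the text

# At 1 bit per nanosecond a 96ns gap occupies 96 bit times, i.e. 12 bytes.
ifg_bytes = IFG_NS * LINE_RATE_BPS // 1_000_000_000 // 8

# Assumed frame layout: preamble + start delimiter, MAC header, payload, FCS.
PREAMBLE_BYTES = 8
HEADER_BYTES = 14
PAYLOAD_BYTES = 1500
FCS_BYTES = 4

wire_bytes = ifg_bytes + PREAMBLE_BYTES + HEADER_BYTES + PAYLOAD_BYTES + FCS_BYTES
line_capacity_pps = LINE_RATE_BPS // (wire_bytes * 8)

# The reflection test sends and receives every packet, so each direction gets half.
one_way_limit_pps = line_capacity_pps // 2

print(wire_bytes, line_capacity_pps, one_way_limit_pps)  # 1538 81274 40637
```

Half of the line capacity is a little over 40,000pps, which is where the one-way test limit comes from.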

Latency in microseconds from Linux to Linux at 20,000 packets-per-second one-way.
At 20,000pps we start to see a spread of outliers but notice the grouping remains at 200μs.

Latency in microseconds from Linux to Linux at 30,000 packets-per-second one-way.
At 30,000pps outlier latency jumps to 1ms.

Latency in microseconds from Linux to Linux at 40,000 packets-per-second one-way.
At 40,000pps you are starting to see everything start to break down with the majority of packets from 200-600μs.  Above 40,000pps the network is saturated and packet loss starts to occur, packet loss initiating PGM reliability and consuming more bandwidth than is available for full speed operation.

Windows 2008 R2 performance
Here comes Windows.  Windows!
Latency in microseconds from Windows to Linux at 10,000 packets-per-second one-way.
Non-blocking sockets at 10,000pps show a grouping just as Linux at 200μs but the spread of outliers reaches as far as 2ms. This highlights the artifacts of a low scheduling granularity and an inefficient IP stack.

Spot the difference
A common call to arms on Windows IP networking hoists the flag of Input/Output Completion Ports or IOCP as a more efficient design to event handling as it depends upon blocking sockets, reducing the number of system calls to send and receive packets, and permits zero-copy memory handling.

Latency in microseconds from Windows to Linux at 10,000 packets-per-second one-way using IOCP.
The only difference is a slightly lighter line at ~220μs, the spread of latencies is still rather broad.
Increasing the socket buffer sizes permits the test to run at 20,000pps but with heavy packet loss requiring the PGM reliability engine yielding 1-2 seconds average latency.

High performance Windows datagrams
All the popular test utilities use blocking sockets to yield remotely reasonable figures. These include netperf, iperf, ttcp, and ntttcp - Microsoft's port of ttcp for Windows sockets. These sometimes yield higher raw numbers than iperf on Linux which considering the previous results is unexpected.
  • iperf on Linux yields ~70,000pps.
  • PCATTCP on Windows yields ~90,000pps.
  • ntttcp on Windows yields ~190,000pps.
  • iperf on Windows yields ~20,000pps.
There appears to be either a significant driver flaw or severe Windows limitation in transmit interrupt coalescing, as the resultant bandwidths from testing yield ~800Mb/s to a Windows host but only ~400Mb/s from a Windows host. Drivers for the Broadcom Ethernet adapter have undergone many revisions from 2001 through to present; all consistently show weak transmit performance even with TCP transports.

Windows registry settings
To achieve these high performance Windows results the following changes were applied.
  • Disable the multimedia network throttling scheduler. By default Windows limits streams to 10,000pps when a multimedia application is running.
Under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Multimedia\SystemProfile set NetworkThrottlingIndex, type REG_DWORD32, value set to 0xffffffff.
  • Two settings define an IP stack path for datagrams.  By default datagrams above 1024 bytes go through a slow locked double buffer; increase this to the network MTU size, i.e. 1500 bytes.
Under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters:
  • FastSendDatagramThreshold, type REG_DWORD32, value set to 1500. 
  • FastCopyReceiveThreshold, type REG_DWORD32, value set to 1500.

  • Without hardware acceleration for high resolution time stamping incoming packets will be tagged with the receipt time at expense of processing time.  Disable the time stamps on both sender and receiver. This is performed by means of the following command:

netsh int tcp set global timestamps=disabled
  • A firewall will intercept all packets causing increased latency and processing time, disable the filtering and firewall services to ensure direct network access.  Disable the following services:
  • Base Filtering Engine (BFE) 
  • Windows Firewall (MpsSvc)

Additional notes
The test hardware nodes are single core Xeons and so Receive Side Scaling (RSS) does not assist performance.  Also, the Ethernet adapters do not support Direct Cache Access (DCA) also known as NetDMA 2.0 which should improve performance by reducing system time required to send or receive packets.

To significantly increase the default socket buffer size you can set a multiplier number under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters:
, type REG_DWORD32, value set to 0x400.