みるブログ

PGM_IO_STATUS_TIMER_PENDING

2011-10-08T02:49:00.000+08:00

Under ideal conditions there is a constant stream of data on the network, every call to pgm_recv returns data, and there is no data loss or dropped packets. Ideal conditions are rare though: we might see bursty data from senders, senders may close or crash, and packets may be lost.
At the most basic level we need to know that the senders still exist. Senders notify their presence by repeated broadcast of SPM packets. Packet loss and closed or crashed applications cause an absence of SPM broadcasts, and this situation can be caught by a timer. If no packets are seen within say 30 seconds we consider the sender to be no longer operational.
The receive window extends beyond that to monitor every incoming packet, with NAK elimination to prevent transmission of duplicate NAKs from the same or different receivers, and retransmission of NAKs for when the retransmit request itself was lost in the network. Each state is driven by configurable timers or timeouts.

This means that as soon as a single packet from a sender is received the common return values expected are PGM_IO_STATUS_NORMAL for data and PGM_IO_STATUS_TIMER_PENDING for no-data or receive-state transitions.
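To make the timer idea concrete, here is a minimal sketch of sender expiry in Python. It is not OpenPGM code: the SenderTracker class and its method names are hypothetical, and the 30 second expiry is simply the example figure used above. With no known senders there is no timer to wait on, and once a sender has been heard there is always a deadline pending.

```python
class SenderTracker:
    """Hypothetical helper: remembers when each sender was last heard."""

    def __init__(self, expiry_secs=30.0):
        self.expiry_secs = expiry_secs
        self.last_seen = {}  # sender id -> time the last packet arrived

    def on_packet(self, sender, now):
        # Any packet from the sender (an SPM for example) refreshes its timer.
        self.last_seen[sender] = now

    def next_timeout(self, now):
        """Seconds until the earliest expiry, or None with no known senders."""
        if not self.last_seen:
            return None
        deadline = min(self.last_seen.values()) + self.expiry_secs
        return max(0.0, deadline - now)

    def expire(self, now):
        """Drop and return the senders that have been silent for too long."""
        dead = [s for s, t in self.last_seen.items()
                if now - t >= self.expiry_secs]
        for sender in dead:
            del self.last_seen[sender]
        return sorted(dead)
```

The None result corresponds loosely to the "no known senders" situation, while a numeric result corresponds to a timer being pending.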

PGM_IO_STATUS_WOULD_BLOCK

2011-10-08T02:36:00.000+08:00

Assuming non-blocking sockets, the return value PGM_IO_STATUS_WOULD_BLOCK appears when the call to pgm_recv cannot immediately return due to no available contiguous data.

The important note about this return value is that it indicates that there are no known senders and the receive state engine is waiting for starting data packets to begin processing.

PGM_IO_STATUS_NORMAL

2011-10-08T02:30:00.000+08:00

When calling pgm_recv the return value indicating that data was successfully read is PGM_IO_STATUS_NORMAL. This is obviously the ideal case and assumes a constant stream of data to read from the network.

OpenPGM follows a reactor model: calling pgm_recv reads a packet from the underlying UDP or RAW socket. If the packet is an original data packet, called ODATA, it is inserted into the receive window, and if-and-only-if the sequence is contiguous to the current lead can the payload be returned to the application.
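The contiguity rule can be sketched in a few lines of Python. This is an illustration only, not the OpenPGM receive window: the ReceiveWindow class is hypothetical, and it ignores sequence number wrap-around, window trimming, and NAK generation.

```python
class ReceiveWindow:
    """Hypothetical sketch: hold packets until the sequence is contiguous."""

    def __init__(self, first_sqn):
        self.next_sqn = first_sqn  # next sequence number the application expects
        self.pending = {}          # out-of-order packets waiting for a gap to fill

    def insert(self, sqn, payload):
        """Insert one data packet, return the payloads now deliverable in order."""
        if sqn < self.next_sqn or sqn in self.pending:
            return []  # duplicate or already delivered
        self.pending[sqn] = payload
        ready = []
        while self.next_sqn in self.pending:
            ready.append(self.pending.pop(self.next_sqn))
            self.next_sqn += 1
        return ready
```

A packet arriving ahead of a gap returns nothing; once the missing packet arrives both payloads are released together.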

4. Non-operational IPv4 adapters

2011-07-08T03:56:00.000+08:00

So the question arises: if we detect an adapter that is not "operationally up", and hence has no prefix, and we can assume that IPv4 link-local addresses always have a 16-bit prefix, what about the others?

First obvious candidate would be a static configured host IP address with the "media disconnected", i.e. no network cable.

Let's see how much information Windows grants the typical CJ.

Ethernet adapter Local Area Connection:

  Media State . . . . . . . . . . . : Media disconnected
  Connection-specific DNS Suffix  . : hk.miru.hk
  Description . . . . . . . . . . . : Broadcom NetXtreme Gigabit Ethernet
  Physical Address. . . . . . . . . : C4-2C-03-21-78-AB
  DHCP Enabled. . . . . . . . . . . : No
  Autoconfiguration Enabled . . . . : Yes

We see that the adapter exists and absolutely no indication of an address other than DHCP is disabled. Let's look at the results of IPv4 adapter enumeration using the GetAdaptersInfo API.

Info: #13 name {12D5DC53-E214- IPv4 0.0.0.0
   scope 0 status UP   loop NO  b/c YES m/c YES
Info: #11 name {61F5BC1C-1D95- IPv4 0.0.0.0
   scope 0 status UP   loop NO  b/c YES m/c YES
Info: #10 name {FFF6B15A-5B5C- IPv4 10.208.0.104
   scope 0 status UP   loop NO  b/c YES m/c YES
Info: #19 name {D8ED3DA1-9FAC- IPv4 192.168.56.1
   scope 0 status UP   loop NO  b/c YES m/c YES

The Ethernet adapter is index #11 and the Windows 2000 API returns the host IP address as 0.0.0.0. Let's look at the Windows XP API, GetAdaptersAddresses, excluding IPv6 addressing.

Info: #13 name {12D5DC53-E214- IPv4 169.254.140.145
   scope 0 status DOWN loop NO  b/c NO  m/c YES
Info: #11 name {61F5BC1C-1D95- IPv4 169.254.228.116
   scope 0 status DOWN loop NO  b/c NO  m/c YES
Info: #11 name {61F5BC1C-1D95- IPv4 172.16.0.1
   scope 0 status DOWN loop NO  b/c NO  m/c YES
Info: #10 name {FFF6B15A-5B5C- IPv4 10.208.0.104
   scope 0 status UP   loop NO  b/c NO  m/c YES
Info: #19 name {D8ED3DA1-9FAC- IPv4 192.168.56.1
   scope 0 status UP   loop NO  b/c NO  m/c YES
Info: #1 name {846EE342-7039- IPv4 127.0.0.1
  scope 0 status UP   loop YES b/c NO  m/c YES

Windows is returning two different interfaces for the adapter: one is an IPv4 link-local prefixed address, 169.254.228.116, and the other is the configured static host IP address 172.16.0.1.

In conclusion we find that Windows cannot provide the netmask or network prefix for any adapter that is not marked "operationally up". The older Windows 2000 API cannot even report the IP address of such adapters. The newer Windows XP API fares a little better, but without additional information we can only determine the prefix of IPv4 link-local addresses.

A Cupcake and a Teredo

2011-07-07T02:34:00.004+08:00

IPv6 is starting to garner more interest around the world with a multitude of options being presented for co-operation with existing IPv4 hosts. Several schemes already deployed are targeting how to ensure the IPv6 Internet is accessible to IPv4-only users. When you try to access an IPv6 website such as ipv6.google.com and an IPv6 address is returned the scheme will tunnel the request over the IPv4 Internet to a host that can speak both IPv4 and IPv6 that forwards the request onto the IPv6 target.

IPv6, similar to IPv4, was designed with a specific split between the part of an address that refers to the network and the part that refers to the host.

"Unicast and anycast addresses are typically composed of two logical parts: a 64-bit network prefix used for routing, and a 64-bit interface identifier used to identify a host's network interface."

http://en.wikipedia.org/wiki/IPv6_address

However this is purely a recommendation, not a concrete part of the protocol addressing. As implied, non-unicast addresses may have a different network prefix size, for example multicast addresses have an 8-bit prefix.

As it turns out different adapter types also can have different network prefixes. A common example is a tunnel such as a VPN, typically a point-to-point communication between your computer and a VPN concentrator. A point-to-point link has no multicast, broadcast, or sub-networking and hence will have a full 128-bit prefix. Then, we have Teredo, and a cupcake.

A Cupcake

A raspberry tiramisu cupcake from "Cupcake Wars".

When configuring computer networking most CJs will be aware of the basic required numbers: a host IP address, a subnet mask, and a default gateway. Here is a typical view on a Windows host with the ipconfig command.

Ethernet adapter Local Area Connection:

Connection-specific DNS Suffix  . :
IP Address. . . . . . . . . . . . : 192.168.131.65
Subnet Mask . . . . . . . . . . . : 255.255.255.0
Default Gateway . . . . . . . . . : 192.168.131.254

On OSX the ifconfig command might generate something similar.

en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
inet 10.6.27.34 netmask 0xffffff00 broadcast 10.6.27.255

As you can see the tuple of address and netmask is quite visible for both. A trip over to MSDN finds the GetAdaptersInfo function, an API to enumerate adapters. The API returns a linked list of IP_ADAPTER_INFO objects, one for each adapter, with an IP_ADDR_STRING for each interface on the adapter containing the host IP address and subnet IP address. As you might note, the MSDN articles indicate that Windows XP developers should use the GetAdaptersAddresses function as this also presents IPv6 addresses and makes it easier to determine unidirectional adapters such as satellite Internet connections.

The Windows XP API generates a list of IP_ADAPTER_ADDRESSES objects which vary in size depending on which Windows version is operating. The list of interfaces per adapter is found in the IP_ADAPTER_UNICAST_ADDRESS list but includes no netmask. On Windows Vista and later a field called OnLinkPrefixLength is provided, the prefix indicating the length of the network IP address, presumably to reduce user errors in specifying illegal masks such as 255.0.0.255. So what about Windows XP developers?
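As an aside, the relationship between a prefix length and a dotted netmask is easy to sketch in Python, including why a mask such as 255.0.0.255 cannot be expressed as a prefix length at all. The function names here are made up for illustration and are unrelated to the Windows API.

```python
import ipaddress

def prefixlen_to_netmask(prefixlen):
    """Convert an IPv4 prefix length (0-32) to a dotted netmask string."""
    if not 0 <= prefixlen <= 32:
        raise ValueError("IPv4 prefix length must be 0-32")
    value = (0xFFFFFFFF << (32 - prefixlen)) & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(value))

def netmask_to_prefixlen(mask):
    """Convert a dotted netmask to a prefix length, rejecting illegal masks."""
    inverted = int(ipaddress.IPv4Address(mask)) ^ 0xFFFFFFFF
    # A legal mask is a run of ones followed by zeroes, so the inverted
    # value plus one must be a power of two.
    if inverted & (inverted + 1):
        raise ValueError("non-contiguous netmask: " + mask)
    return 32 - inverted.bit_length()
```

For example prefixlen_to_netmask(24) gives 255.255.255.0, while netmask_to_prefixlen("255.0.0.255") raises ValueError.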

A list of IP_ADAPTER_PREFIX objects is provided for Windows XP SP1 users, indicating that nothing is available pre-service pack. The list is one prefix per interface and with Windows XP has been a one-to-one mapping with the unicast interface list. With Windows Vista this list has been expanded, breaking compatibility with early assumptions.

"In addition, the linked IP_ADAPTER_UNICAST_ADDRESS structures pointed to by the FirstUnicastAddress member and the linked IP_ADAPTER_PREFIX structures pointed to by the FirstPrefix member are maintained as separate internal linked lists by the operating system. As a result, the order of linked IP_ADAPTER_UNICAST_ADDRESS structures pointed to by the FirstUnicastAddress member does not have any relationship with the order of linked IP_ADAPTER_PREFIX structures pointed to by the FirstPrefix member.
On Windows Vista and later, the linked IP_ADAPTER_PREFIX structures pointed to by the FirstPrefix member include three IP adapter prefixes for each IP address assigned to the adapter. These include the host IP address prefix, the subnet IP address prefix, and the subnet broadcast IP address prefix. In addition, for each adapter there is a multicast address prefix and a broadcast address prefix.
On Windows XP with SP1 and later prior to Windows Vista, the linked IP_ADAPTER_PREFIX structures pointed to by the FirstPrefix member include only a single IP adapter prefix for each IP address assigned to the adapter."

Notice the wording: the list "includes" prefixes but no order or existence is guaranteed. So let's investigate some adapters and see what results are provided by this API.

1. The Loopback Adapter

Windows prefers to hide the loopback adapter and interfaces from ipconfig but other platforms aren't so shy.

lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
        inet 127.0.0.1 netmask 0xff000000
        inet6 ::1 prefixlen 128

Notice OSX adds a link-local interface to the loopback adapter contrasting with Windows and Linux.

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1

We have the standard IPv4 address 127.0.0.1 with a network prefix of 8-bits, IPv6 comes with address ::1 and full prefix of 128-bits. Now back to the Windows XP API.

::1/128
ff00::%1/8
127.0.0.0/8
127.0.0.1/32
127.255.255.255/32
224.0.0.0/4
255.255.255.255/32

There is no IPv6 subnet IP address, presumably as an optimisation since there is no subnet; only the host IP address and the multicast address are listed, and IPv6 does not include broadcast. On IPv4 we have the subnet IP address, host IP address, subnet broadcast IP address, all-host multicast and broadcast IP addresses.

2. Active IPv6 adapter

In IPv6 land there are usually two interfaces per adapter, one link-local, and one global-scope.

::/0
2001::/32
2001:0:4137:9e76:2443:d6:ba87:1a2a/128
fe80::/64
fe80::2443:d6:ba87:1a2a/128
ff00::/8

A new entry appears that is not documented on MSDN: the IPv6 unspecified address ::/0, with a zero-length prefix indicating that this adapter hosts the default route for IPv6. No IPv4 equivalent (0.0.0.0) is ever listed, and the prefix only appears when global-scope addresses are enabled.

Surprisingly the Windows Vista and later API returns a prefix length of 64-bits for a Teredo sourced address despite the routing table showing 32-bits and all documentation referring to standard prefix of 2001:0::/32. This is assumed to be a defect in Windows 7 SP1.
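Since the prefix list has no guaranteed order relative to the unicast list, one way to pair them up is to search for the longest non-host prefix that contains each unicast address, falling back to the host prefix when no subnet entry exists. The Python sketch below illustrates that idea on the prefix lists shown above. It is an assumption about how the matching could be done, not the method OpenPGM or Windows uses, and the function name subnet_prefix is hypothetical.

```python
import ipaddress

def subnet_prefix(address, prefixes):
    """Return the most specific subnet prefix containing address, or None."""
    addr = ipaddress.ip_address(address)
    containing = []
    for text in prefixes:
        net = ipaddress.ip_network(text)
        if net.version == addr.version and addr in net:
            containing.append(net)
    # Prefer real subnets over full-length host prefixes (/32 or /128).
    subnets = [n for n in containing if n.prefixlen < n.max_prefixlen]
    pool = subnets or containing
    if not pool:
        return None
    return str(max(pool, key=lambda n: n.prefixlen))

loopback = ["::1/128", "127.0.0.0/8", "127.0.0.1/32",
            "127.255.255.255/32", "224.0.0.0/4", "255.255.255.255/32"]
teredo = ["::/0", "2001::/32", "2001:0:4137:9e76:2443:d6:ba87:1a2a/128",
          "fe80::/64", "fe80::2443:d6:ba87:1a2a/128", "ff00::/8"]

print(subnet_prefix("127.0.0.1", loopback))
print(subnet_prefix("::1", loopback))
print(subnet_prefix("2001:0:4137:9e76:2443:d6:ba87:1a2a", teredo))
```

The default route ::/0 contains every address, so taking the longest match is what keeps it from winning over 2001::/32 or fe80::/64.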

3. Point-to-point tunnels

Continuing the optimisation of hiding the subnet IP address when there is no subnet, PTP interfaces have no support for broadcast or multicast traffic, so the only prefix listed is for the host IP address.

fe80::5efe:10.203.9.30/128

This address is actually an IPv4-embedded link-local IPv6 address, in the ::5efe: form used by ISATAP, for an IPv4 PTP VPN tunnel.

4. Non-operational IPv4 adapters

It is common to find that many hosts these days contain network adapters that are not in use, for example additional Ethernet ports, Bluetooth, and even virtual WiFi or virtual machine host-only adapters. Surprisingly Windows has a special treatment for these adapters.

fe80::%13/64
fe80::d530:946d:e8df:8c91%13/128
ff00::%13/8
224.0.0.0/4
255.255.255.255/32

As the adapter is non-operational Windows is hiding the IPv4 address from the prefix list, making it impossible to determine the prefix length and netmask. Bizarrely though, the IPv4 all-host multicast and broadcast addresses are still returned. This may be an artifact of IPv4 link-local addressing on Windows that permits the use of IPv4 broadcast and multicast over a non-configured adapter for device discovery. Therefore you can pull the unicast address, check whether it is a link-local address, and if so assume the standard 16-bit prefix.
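The link-local check is simple enough to sketch in Python with the standard ipaddress module. The function name assumed_prefixlen is hypothetical; the only fact relied upon is that IPv4 link-local addresses live in 169.254.0.0/16.

```python
import ipaddress

LINK_LOCAL_V4 = ipaddress.ip_network("169.254.0.0/16")

def assumed_prefixlen(address):
    """Return 16 for an IPv4 link-local address, otherwise None (unknown)."""
    addr = ipaddress.ip_address(address)
    if addr.version == 4 and addr in LINK_LOCAL_V4:
        return 16
    return None
```

For a link-local address such as 169.254.228.116 this returns 16, while a static address such as 172.16.0.1 returns None because nothing can be assumed about its prefix.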

With a static address interface that is non-operational it appears the only method to determine the network prefix length, and hence the netmask is to use the older Windows 2000 GetAdaptersInfo function.

Real-Time Linux

2011-04-26T14:58:00.000+08:00

Real-Time Linux is a set of patches to improve scheduling consistency, available in SUSE Linux Enterprise Real Time and Red Hat Enterprise MRG.

Linux latency in microseconds at 10,000 packets-per-second one-way.

Y-axis is latency in microseconds, X-axis is time in seconds. The chart records a test of 10,000 packets-per-second sent out and received, a total of 20,000 datagrams-per-second. The left hand side shows Linux 2.6.36 with normal scheduling (SCHED_OTHER) with a tight grouping around 200μs; the right hand side shows Linux 2.6.26 with real-time scheduling (SCHED_FIFO) with tighter grouping at 200μs but larger spread of outliers.

Linux latency in microseconds at 20,000 packets-per-second one-way.

Normal scheduler shows a tight grouping at 250μs with minor spread of outliers; real time scheduling shows grouping at a better latency of 200μs but higher spread of outliers.

Linux latency in microseconds at 30,000 packets-per-second one-way.

Normal scheduler shows grouping at 250μs with spread from 200μs-1ms; real time scheduling shows spread of grouping 200-250μs with outliers to 1ms.

Linux latency in microseconds at 40,000 packets-per-second one-way.

Normal scheduler shows grouping 200-600μs with outliers spread to 2ms; real time scheduling shows grouping 250-300μs with outliers spread to 1.5ms with a strange gap from 1.0-1.3ms.

Wherefore art thou IP packet? Make haste Windows 2008 R2

2011-04-26T14:45:00.000+08:00

How camest thou hither, tell me, and wherefore?
Lest the bard fray more, the topic is of PGM haste in the homogeneous environment, and the unfortunate absence of said haste. We take performance readings of PGM across multiple hosts and present a visual heat map of latency to provide insight to the actual performance. Testing entails transmission of a message onto a single LAN segment, the message is received by a listening application which immediately re-broadcasts the message, when the message is received back at the source the round-trip-time is calculated using a single high precision clock source.

Performance testing configuration with a sender maintaining a reference clock to calculate message round-trip-time (RTT).

The baseline reading is taken from Linux to Linux. The reference hardware is an IBM BladeCenter HS20 with Broadcom BCM5704S gigabit Ethernet adapters, and the networking infrastructure is provided by a BNT fibre gigabit Ethernet switch.

Latency in microseconds from Linux to Linux at 10,000 packets-per-second one-way.

The numbers themselves are of minor consequence; for explanation, at 10,000 packets-per-second (pps) there is a marked grouping at 200μs round-trip-time (RTT) latency. The marketing version would be 20,000pps, as we consider 10,000pps being transmitted and 10,000pps being received simultaneously, with a one-way latency of 100μs. Also note that the packet reflection is implemented at the application layer, much like any end-developer written software using the OpenPGM BSD socket API. Compare this with alternative testing configurations that may operate at the network layer, bypass the effective full latency of the networking stack, and yield disingenuous figures.

20,000 feet (6,096 meters) Sir Hillary
Onward and upward we must go, with an IFG of 96ns the line capacity of a gigabit network is 81,274pps leading to the test potential limit of 40,000pps one-way with a little safety room above.
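The 81,274pps figure can be reproduced with a little arithmetic, sketched here in Python. The frame size is an assumption on my part: the number works out exactly for a full 1500 byte payload plus the usual Ethernet overheads of preamble and start delimiter (8 bytes), MAC header (14 bytes), frame check sequence (4 bytes), and the inter-frame gap (12 bytes, which is 96 bit times or 96ns at gigabit speed).

```python
def max_frames_per_second(payload_bytes, link_bps=1_000_000_000):
    """Upper bound on Ethernet frames per second for a given payload size."""
    preamble_sfd = 8   # preamble plus start frame delimiter
    mac_header = 14    # destination, source, EtherType
    fcs = 4            # frame check sequence
    ifg = 12           # inter-frame gap: 96 bit times, 96ns at 1Gb/s
    wire_bits = (payload_bytes + preamble_sfd + mac_header + fcs + ifg) * 8
    return link_bps // wire_bits

print(max_frames_per_second(1500))
```

With traffic flowing in both directions through the reflector, half of that line capacity is roughly 40,000pps each way.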

Latency in microseconds from Linux to Linux at 20,000 packets-per-second one-way.

At 20,000pps we start to see a spread of outliers but notice the grouping remains at 200μs.

Latency in microseconds from Linux to Linux at 30,000 packets-per-second one-way.

At 30,000pps outlier latency jumps to 1ms.

Latency in microseconds from Linux to Linux at 40,000 packets-per-second one-way.

At 40,000pps you are starting to see everything start to break down with the majority of packets from 200-600μs. Above 40,000pps the network is saturated and packet loss starts to occur, packet loss initiating PGM reliability and consuming more bandwidth than is available for full speed operation.

Windows 2008 R2 performance

Here comes Windows. Windows!

Latency in microseconds from Windows to Linux at 10,000 packets-per-second one-way.

Non-blocking sockets at 10,000pps show a grouping at 200μs just as on Linux, but the spread of outliers reaches as far as 2ms. This highlights the artifacts of a low scheduling granularity and an inefficient IP stack.

Spot the difference

A common call to arms on Windows IP networking hoists the flag of Input/Output Completion Ports or IOCP as a more efficient design to event handling as it depends upon blocking sockets, reducing the number of system calls to send and receive packets, and permits zero-copy memory handling.

Latency in microseconds from Windows to Linux at 10,000 packets-per-second one-way using IOCP.

The only difference is a slightly lighter line at ~220μs, the spread of latencies is still rather broad.

Increasing the socket buffer sizes permits the test to run at 20,000pps but with heavy packet loss, engaging the PGM reliability engine and yielding 1-2 seconds average latency.

High performance Windows datagrams

All the popular test utilities use blocking sockets to yield remotely reasonable figures. These include netperf, iperf, ttcp, and ntttcp - Microsoft's port of ttcp for Windows sockets. These sometimes yield higher raw numbers than iperf on Linux which considering the previous results is unexpected.

iperf on Linux yields ~70,000pps.
PCATTCP on Windows yields ~90,000pps.
ntttcp on Windows yields ~190,000pps.
iperf on Windows yields ~20,000pps.

There appears to be either a significant driver flaw or a severe Windows limitation in transmit interrupt coalescing, as the resultant bandwidths from testing yield ~800Mb/s to a Windows host but only ~400Mb/s from a Windows host. Drivers for the Broadcom Ethernet adapter have undergone many revisions from 2001 through to present, and all consistently show weak transmit performance even with TCP transports.

Windows registry settings

To achieve these high performance Windows results the following changes were applied.

Disable the multimedia network throttling scheduler. By default Windows limits streams to 10,000pps when a multimedia application is running.

Under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Multimedia\SystemProfile set NetworkThrottlingIndex, type REG_DWORD, value set to 0xffffffff.

Two settings define an IP stack path for datagrams. By default datagrams above 1024 bytes go through a slow locked double buffer; increase this to the network MTU size, i.e. 1500 bytes.

Under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters:

FastSendDatagramThreshold, type REG_DWORD, value set to 1500.

FastCopyReceiveThreshold, type REG_DWORD, value set to 1500.
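For applying the three registry values above to several test hosts, a small script that prints the equivalent reg add command lines can save some clicking. This is a sketch only: the script and its names are hypothetical, the key path is written with the space in "Windows NT" as Windows spells it, and the commands must be run from an elevated prompt and are worth checking before use.

```python
AFD = r"HKLM\SYSTEM\CurrentControlSet\Services\AFD\Parameters"
PROFILE = (r"HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion"
           r"\Multimedia\SystemProfile")

# (key, value name, DWORD data) as described in the text above.
SETTINGS = [
    (PROFILE, "NetworkThrottlingIndex", 0xFFFFFFFF),
    (AFD, "FastSendDatagramThreshold", 1500),
    (AFD, "FastCopyReceiveThreshold", 1500),
]

def reg_add_command(key, name, value):
    """Build a reg.exe command line that sets one REG_DWORD value."""
    return f'reg add "{key}" /v {name} /t REG_DWORD /d 0x{value:x} /f'

for key, name, value in SETTINGS:
    print(reg_add_command(key, name, value))
```

The data is printed in hexadecimal so that 0xffffffff appears exactly as written above; 1500 comes out as 0x5dc.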

Without hardware acceleration for high resolution time stamping incoming packets will be tagged with the receipt time at expense of processing time. Disable the time stamps on both sender and receiver. This is performed by means of the following command:

netsh int tcp set global timestamps=disabled

A firewall will intercept all packets causing increased latency and processing time, disable the filtering and firewall services to ensure direct network access. Disable the following services:

Base Filtering Engine (BFE)

Windows Firewall (MpsSvc)

Additional notes
The test hardware nodes are single core Xeons and so Receive Side Scaling (RSS) does not assist performance. Also, the Ethernet adapters do not support Direct Cache Access (DCA) also known as NetDMA 2.0 which should improve performance by reducing system time required to send or receive packets.

To significantly increase the default socket buffer size you can set a multiplier number under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters:
BufferMultiplier, type REG_DWORD, value set to 0x400.

Strawberries and a Python through the Looking Glass

2011-01-29T05:32:00.000+08:00

How do we get OpenPGM on Windows? This is an obvious question popular with new developers. The stumbling blocks tend to be that OpenPGM is written to the ANSI C99 specification and Microsoft's Visual C++ 2010 compiler only supports ANSI C89 or C++ 2003.

Convexing Cross-Compiling of C

Microsoft isn't the only C compiler vendor targeting Windows, and it also isn't necessary that the compiler actually runs on Windows: by a process called cross-compiling the Windows platform can be targeted from another. This also tends to be scary for many Windows developers as they need a second operating system, have to learn how to use a new compiler, and finally have additional compiler dependencies when bringing the library back to Windows.

Priceless Princely Patch for Progress

As it so happens cross-compilation has its pros and cons, on the plus side it means one code base and one build system; on the negative side it means the resultant library brings in additional libraries for the cross-compiler and its compatibility layers, and debugging is made rather inconvenient.

So the benefits of a native build are strong, but how much effort is required to modify C99 code to work with a C89 compiler, and how difficult is a native build system? Well, the differences from one compiler to the next can be managed with a patch cluster. The build system is similar to the Automake and SConscript build components, as the Autoconf comparables are constant to Windows XP SP3. We choose a build system that is agnostic to the Visual Studio compiler version for greatest compatibility and to minimize maintenance overhead.

CMake Make Make

With Microsoft Visual C++ 2010 installed we need to install some extra software to start. First is the build system called CMake, which has a simple Win32 installer with minimal options; it is recommended to allow it to update your system PATH so you can use it immediately. Next are two critical build dependencies for OpenPGM: we need both the Perl and Python 2 scripting languages, both of which offer x86 and x64 builds depending on your platform architecture.

Extract the source for the latest OpenPGM archive into somewhere convenient, for example C:\>

CMake generating Makefile.

By default CMake is creating a Makefile with debugging enabled, you can use cmake-gui to update the parameter for a release build.

CMake-GUI updating configuration parameters and regenerating Makefile.

With a successfully generated Makefile we can start the build process using Microsoft's standard command line tools.

Generating C89-compatible source files.

The build process continues automatically by first patching all the source files for C89 compatibility, the C89-compatible source files are fed into the compiler, finally the object files are linked together into the native library.

Generating object files and final libpgm library.

Now complete we can start to use the library to build the example applications and start development. An additional feature of the CMake system is called CPack which allows us to create a Windows installer we can use to install OpenPGM binaries on other systems.

First download the latest NSIS package and install with full options, then simply run cmake package to produce the Windows installer.

Building an OpenPGM Windows installer with CPack and NSIS.

What is PGM? What is OpenPGM?

2011-01-29T02:46:00.000+08:00

Well let's start by saying PGM is the name of an internet, lower-case "i", communications protocol, and OpenPGM is an implementation of that protocol. A communications protocol is the definition of how two or more computers or electronic devices communicate with each other. For example, you just enjoyed a good slice of cherry pie and a damn fine cup of coffee and you want to tweet it to your friends: you tap the message into your phone and hit a button, then by magic the message appears on your microblog. Between your phone's Twitter app and the Twitter website your tweet has been sent following a predefined communications protocol.

There is a subtle difference between the Internet, upper-case "I", and an internet, lower-case "i", the former is the name for the public network where we can access Facebook and Twitter, the latter is the generic term for a group of interconnected networks. This means we could find PGM on the Internet or we could find it on other non-public networks such as within a stock exchange's trading system or communicating physics of the Higgs Boson at the Large Hadron Collider.

So what is special about PGM?

The erhu is a Chinese two-stringed bowed musical instrument not commonly found in Western music. If you wanted to listen to a recital of Sanmen Capriccio a convenient choice might be to watch a performance by Yang Ying on YouTube. For a live performance you might tune in the radio and listen alongside other fans of huqin. On the Internet you may find an Internet radio station to listen to, but there is a fundamental limitation on how many fans can be plugged in. Say Girls Generation announced an online concert tonight: the stampede from eager fans may overload the site and crash the network, as the concert site has to stream a duplicate copy of the event to each and every listener. The PGM protocol allows one sent stream of data to be received by multiple recipients, the fan-out of the stream to each recipient being managed by the network instead of say the host concert site.

Why should I choose OpenPGM?

So PGM is the protocol on paper, digitally at least; you need an actual implementation for something to start looking like it will work. There is a selection of vendors providing commercial solutions, including Microsoft, IBM, and TIBCO, and a variety of open source projects as long as your preferred platform is FreeBSD. If an open source implementation that works on multiple platforms, is compatible with existing vendor solutions, and is even faster and more flexible sounds like a good choice, then have a look at OpenPGM.

Miru Announces OpenPGM 5

2010-09-07T06:56:00.001+08:00

New York - September 6, 2010 - Miru, Limited, a small development studio of enterprise middleware, announces a new BSD socket interface for its OpenPGM messaging software, an open source low latency reliable multicast solution based on the standard for broadcasting information over an internet. The Berkeley sockets API is the de facto standard for abstraction of network sockets and so reduces the learning curve necessary to implement new solutions using OpenPGM.

Performance for one-way messaging of roughly 75 to 80 microseconds with throughput of approximately 540 megabits per second to applications running on a single core commodity system.

The transport technology standard, known as Pragmatic General Multicast, enables private networks and the Internet to handle more traffic by sending critical business information in a more reliable, cost-effective and bandwidth-friendly manner.

The PGM reliable transport protocol communications technology, which was designed by Cisco Systems and TIBCO Software, is registered with the Internet Engineering Task Force (IETF), the Internet standards body.

PGM enabled network devices, such as Cisco, Juniper, or Nortel routers, enhance the scalability and reliability of the technology by eliminating redundant traffic when recovering lost messages.

The updated transport is supported on Linux and Solaris platforms on IA32, x86-64, and SPARCv9 architectures. Other platforms and architectures, such as Microsoft Windows and Mac OS X, have functional builds, with support added as customer needs dictate. OpenPGM is wire compatible with Microsoft’s PGM implementation as available in Microsoft Windows Server 2003 and Microsoft Windows XP with Microsoft Message Queuing.

About IP Multicast

In computer networking, broadcast refers to transmitting a message to every device on the network, a one-to-many paradigm similar to television or radio. Multicast is a technique to only deliver to those recipients expressing an interest in the content. A multicast source is only required to send a message once, the network infrastructure takes care of replicating to each receiver as necessary. Conventional unicast applications require the server to send copies of the same message to each recipient.

Multicast does not guarantee reliability or ordering of messages. A recipient may receive messages out of order, duplicated, or missing with no notice.

About Pragmatic General Multicast (PGM)

PGM is a reliable multicast transport protocol developed by a range of vendors including Cisco and TIBCO and described in RFC 3208.

About Miru, Limited.

Miru is a development studio specialising in building high-quality, open source multicast message orientated middleware systems. Miru also offers support, training and consulting services to its customers worldwide.

Learn more: http://miru.hk .

LINUX is a trademark of Linus Torvalds. MIRU is a trademark of Miru, Limited. All other product and company names and marks mentioned in this document are property of their respective owners and are mentioned for identification purposes only.

Miru Announces OpenPGM 3

2010-04-23T12:03:00.000+08:00

Hong Kong - April 23, 2010 - Miru, Limited, a small development studio of enterprise middleware, announces support for Solaris 10 on SPARCv9 platform for its OpenPGM messaging software, an open source low latency reliable multicast solution based on the standard for broadcasting information over an internet. Performance for one-way messaging of roughly 75 to 99 microseconds with throughput of approximately 540 megabits per second to applications running on a single core commodity system.

The transport technology standard, known as Pragmatic General Multicast, enables private networks and the Internet to handle more traffic by sending critical business information in a more reliable, cost-effective and bandwidth-friendly manner.

The PGM reliable transport protocol communications technology, which was designed by Cisco Systems and TIBCO Software, is registered with the Internet Engineering Task Force (IETF), the Internet standards body.

PGM enabled network devices, such as Cisco, Juniper, or Nortel routers, enhance the scalability and reliability of the technology by eliminating redundant traffic when recovering lost messages.

The updated transport is supported on Linux and Solaris platforms on IA32, x86-64, and SPARCv9 architectures, with other platforms and architectures added as customer needs dictate. OpenPGM is wire compatible with Microsoft’s PGM implementation as available in Microsoft Windows Server 2003 and Microsoft Windows XP with Microsoft Message Queuing.

About IP Multicast

Multicast does not guarantee reliability or ordering of messages. A recipient may receive messages out of order, duplicated, or missing with no notice.

About Pragmatic General Multicast (PGM)

PGM is a reliable multicast transport protocol developed by a range of vendors including Cisco and TIBCO and described in RFC 3208.

About Miru, Limited.

Learn more: http://miru.hk .

Miru Announces Full IPv6 and Windows Support with Sub 100-Microsecond Latency

2010-02-20T00:46:00.001+08:00

Hong Kong - February 20, 2010 - Miru, Limited, a small development studio of enterprise middleware, announces support for Internet Protocol version 6 (IPv6) and Microsoft Windows XP through Windows 7 platforms for its OpenPGM messaging software, an open source, low latency, reliable multicast solution based on the standard for broadcasting information over an internet. One-way messaging latency is roughly 82 to 107 microseconds, with throughput of approximately 270 megabits per second, for applications running on a single-core commodity system.

Miru has developed Windows platform support to remove the deficiencies in the native support for the PGM protocol, including IPv6 support and UDP encapsulation, and to provide a consistent cross platform interface for application development.

The transport technology standard, known as Pragmatic General Multicast, enables private networks and the Internet to handle more traffic by sending critical business information in a more reliable, cost-effective and bandwidth-friendly manner.

The PGM reliable transport protocol communications technology, which was designed by Cisco Systems and TIBCO Software, is registered with the Internet Engineering Task Force (IETF), the Internet standards body.

PGM enabled network devices, such as Cisco, Juniper, or Nortel routers, enhance the scalability and reliability of the technology by eliminating redundant traffic when recovering lost messages.

The updated transport is supported on Windows, Linux and Solaris platforms on IA32 and x86-64 architectures, with other platforms and architectures added as customer needs dictate. OpenPGM is wire compatible with Microsoft’s PGM implementation as available in Microsoft Windows Server 2003 and Microsoft Windows XP with Microsoft Message Queuing.

About IP Multicast

Multicast does not guarantee reliability or ordering of messages. A recipient may receive messages out of order, duplicated, or missing with no notice.

About Pragmatic General Multicast (PGM)

PGM is a reliable multicast transport protocol developed by a range of vendors including Cisco and TIBCO and described in RFC 3208.

About Miru, Limited.

Learn more: http://miru.hk .

Miru Ships Standards Based Low Latency Open Source Messaging Software

2009-05-15T12:43:00.004+08:00

Hong Kong - May 15, 2009 - Miru, Limited, a development studio of enterprise middleware and applications integration, today announced the immediate availability of OpenPGM, open source, low latency, reliable multicast messaging software based on the standard for broadcasting information over an internet.

The transport technology standard, known as Pragmatic General Multicast, enables private networks and the Internet to handle more traffic by sending critical business information in a more reliable, cost-effective and bandwidth-friendly manner.

The PGM reliable transport protocol communications technology, which was designed by Cisco Systems and TIBCO Software, is registered with the Internet Engineering Task Force (IETF), the Internet standards body.

PGM enabled network devices, such as Cisco, Juniper, or Nortel routers, enhance the scalability and reliability of the technology by eliminating redundant traffic when recovering lost messages.

The initial general release is available for Linux and Solaris platforms on IA32 and x86-64 architectures and is wire compatible with Microsoft’s PGM implementation as available in Microsoft Windows Server 2003 and Microsoft Windows XP with Microsoft Message Queuing.

About IP Multicast

In computer networking, broadcast refers to transmitting a message to every device on the network, a one-to-many paradigm similar to television or radio. Multicast is a technique to deliver only to those recipients expressing an interest in the content. A multicast source is only required to send a message once; the network infrastructure takes care of replicating it to each receiver as necessary. Conventional unicast applications require the server to send copies of the same message to each recipient.

Multicast does not guarantee reliability or ordering of messages. A recipient may receive messages out of order, duplicated, or missing with no notice.

About Pragmatic General Multicast (PGM)

PGM is a reliable multicast transport protocol developed by a range of vendors including Cisco and TIBCO and described in RFC 3208.

About Miru, Limited.

Miru is a development studio specialising in building high-quality, open source, multicast message-oriented middleware systems. Miru also offers support, training and consulting services to its customers worldwide.

Learn more: http://miru.hk .

LINUX is a trademark of Linus Torvalds. MIRU is a trademark of Miru, Limited. All other product and company names and marks mentioned in this document are property of their respective owners and are mentioned for identification purposes only.

Flavours of Multicast

2009-02-24T16:21:00.008+08:00

Multicast is not prevalent on the Internet today for two main reasons. The first is lack of support in network infrastructure: multicast is an optional part of the IPv4 protocol, so not every vendor has implemented support. The second is filtering, that is, what control there is over different parties sending data to any multicast group. This is an issue because multicast uses a separate range of IP addresses for its communication, of which there is a limited number.

As an example, imagine US President Obama’s inauguration is being multicast live on the Internet. What happens if at the same time a radio station in New Zealand is broadcasting live news, the Hong Kong Stock Exchange is publishing stock prices, and Wembley Stadium is sending live match details from London? The answer is a mess: routing and link resources are wasted forwarding packets from all around the world to parties that are simply not interested.

This method of multicast, the default operation, is called any-source multicast (ASM). It is more suited to controlled environments such as private networks, in which applications and network topology can be arranged to suit the expected usage and conflicts with other applications are not going to occur.

Source-specific multicast (SSM) was then created to limit the source of packets to a selected range of addresses. This requires end-point router support, IGMPv3 for IPv4, and MLDv2 for IPv6, together with operating system support for the matching API and filtering without router support.
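To make the difference concrete, here is a minimal Python sketch of the filtering decision a receiving host makes. It is a toy model only: the `Membership` class and the example addresses are hypothetical and illustrate the ASM and SSM rules, not any operating system API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Membership:
    """One group subscription; names here are illustrative, not a real API."""
    group: str
    sources: frozenset = frozenset()  # empty set means any-source (ASM)

    def accepts(self, source: str, group: str) -> bool:
        # A packet for another group is never delivered.
        if group != self.group:
            return False
        # ASM: any sender is accepted. SSM: only the listed senders.
        return not self.sources or source in self.sources


asm = Membership("239.192.0.1")
ssm = Membership("232.1.1.1", frozenset({"192.0.2.10"}))
```

With ASM every sender to the group reaches the application, which is exactly the "mess" described above; with SSM the unwanted sender is dropped before delivery.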

Send send send

2008-04-02T10:50:00.007+08:00

When sending messages that are larger than one TSDU in size multiple options start to appear: some are tied to the network layer properties for optimum transmission efficiency, some are more generic to simplify application development. The following chart lists the options:

The first and second options cover the basic, simplified, high level application layer function: pass one or a vector of application defined message buffers. OpenPGM will then segment those buffers to the TSDU size determined from the maximum TPDU and PGM header requirements.
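The segmentation step can be sketched in a few lines of Python. This is an illustration of the idea rather than OpenPGM's code: the function names are hypothetical, and the default header sizes (20 bytes of IPv4 header, 24 bytes of PGM data header without options) are assumptions chosen for the example.

```python
from typing import List


def max_tsdu(max_tpdu: int, ip_header: int = 20, pgm_header: int = 24) -> int:
    """Payload bytes left in one packet once the assumed headers are subtracted."""
    return max_tpdu - ip_header - pgm_header


def segment(message: bytes, tsdu: int) -> List[bytes]:
    """Split an application message into fragments of at most tsdu bytes."""
    return [message[i:i + tsdu] for i in range(0, len(message), tsdu)]


tsdu = max_tsdu(1500)
fragments = segment(b"x" * 3000, tsdu)
```

A real sender would also need room for the fragmentation option header on each fragment, which reduces the usable TSDU further.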

A traditional scatter/gather IO vector can be used with the last call, pgm_transport_sendv3(); this provides a convenient mechanism to pass an application protocol header and payload separately without copy overhead.

The sendv2() pair provides an optimised mechanism of passing PGM payload size buffers which can be directly sent on the wire with a prepended header.

Sudoku error correction

2008-03-25T10:22:00.012+08:00

Forward error correction (FEC) is a method of adding extra information (redundancy) to a message so that if any part is lost the data can be reconstructed without re-requesting from the sender. In a network protocol this is advantageous when either there is a significant number of receivers, e.g. internet radio, or the communications link to the sender is slow or expensive, e.g. deep space probes.

As a terse example of FEC, if a message comprised nine numbers, 1-9, and we added eight redundant numbers, we would end up with something like a Sudoku board:

As per the rules of Sudoku, if some numbers were missing we could determine the lost numbers from the neighbours on the same line or box.

Reed-Solomon encoding creates a graph based on a polynomial function in which each point matches a byte in the data stream: x is the location in the stream and y is the value. For example, in the polynomial graph below imagine every red point being a byte of information in a transmission group. The graph can be extended to include extra data points, here marked in green. These points are extra redundant information, called parity data. As the parity points follow the same line it is possible to use these points to reconstruct the original polynomial function. Once this function is calculated any missing real data points can be recovered by substituting the x location values.

The benefit over conventional selective "Automatic Repeat reQuest" (ARQ) is that one parity point can recover any one lost original data point. The disadvantage is the extra time to perform the calculations; however, these calculations can be implemented directly in hardware using a slightly different form called BCH code.

Both forms of code are popular in software projects; notable examples include Luigi Rizzo's RMDP, Peter Brian Clements' PAR Parity Archives, and Phil Karn's DSP and FEC library (e.g. Linux software modems). However the results are in different forms: Vandermonde calculations produce vector space coefficients, and BCH's Linear Feedback Shift Register produces polynomial space coefficients. Microsoft's PGM implementation uses Rizzo's implementation, and so for initial compatibility OpenPGM will use a Vandermonde matrix calculation.
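The polynomial picture can be tried out directly. The Python sketch below is a simplified stand-in for Reed-Solomon, not Rizzo's or OpenPGM's implementation: it works over the integers modulo the prime 257 instead of the GF(2^8) field used in practice, and uses plain Lagrange interpolation. All function names are hypothetical.

```python
P = 257  # smallest prime above 255, so every byte value is a field element


def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial through points [(xi, yi), ...], mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+).
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data, n_parity):
    """Extend the graph: parity values sit at x = k, k+1, ... on the same curve."""
    k = len(data)
    points = list(enumerate(data))
    return [lagrange_eval(points, x) for x in range(k, k + n_parity)]


def recover(received, k):
    """Rebuild the k original values from any k surviving (x, y) points."""
    points = list(received.items())[:k]
    return [lagrange_eval(points, x) for x in range(k)]
```

One caveat of the simplification: a parity value modulo 257 can be 256, which does not fit in a byte. That is one reason real implementations use GF(2^8) arithmetic.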

Network system testing

2008-01-11T11:24:00.004+08:00

Testing is always helpful in development, large projects often undergo testing at different levels: unit testing, integration testing, and performance testing. With a multicast network protocol none of these cover actual testing of the protocol between hosts, so we create a new method: network system testing. We want to test the OpenPGM stack and the API it provides to the application developer as pictured below on the top right.

Some tests need an external source to drive functionality in the stack; the Simulator is used for this task. In order to verify that the packets sent out by the stack are correct we have the Monitor.

In order to build an extensive set of tests that can be reliably re-run we want to use automated testing. This means some form of scripting of all three systems and synchronisation between how each is run. The Tester host runs a script that remotely controls and receives feedback from each of the three test systems. All communication is via stdin and stdout, including the Monitor, which is a glorified version of tcpdump that shows PGM packets in JSON form.

To make everything platform agnostic and to ease development all scripts are in Perl. Modules can be used to SSH into remote hosts, perform high resolution timing, process the JSON representation of PGM packets, etc.

Microsecond timing with millisecond clocks

2007-06-24T19:40:00.018+08:00

The standard C library used with GCC is glibc, which provides POSIX standard functions for timing, sleeping, etc. On Unix platforms such as Solaris on SPARC or HP-UX on PA-RISC these can provide very high resolution timing, nanosecond to microsecond. On Linux 2.6 the resolution is typically 4ms; earlier versions used to be 1ms but certain machine configurations would fail as the timing routine would take longer than 1ms to execute.

Special Linux kernel versions are appearing that support real time or 1ms or finer resolution, for example SUSE Linux Enterprise Real Time (SLERT) or Ubuntu Studio. Using the latter allows microsecond timing with the gettimeofday() function and usleep() to 1ms resolution. In order to get finer grained sleeps we have to create our own routines; a basic loop checking the current time until the microsecond period has expired will do. One caveat is that on single core systems the thread in the loop is likely to take all the CPU time, so we need to yield the processor to other threads if the timer hasn't expired. In Linux we can use sched_yield(); to be platform independent we would want to use pthread_yield(), however this does not exist with NPTL threads, so we can use the Glib thread API version g_thread_yield() instead.
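The loop itself is short. The following Python sketch shows the shape of it under stated assumptions: `spin_sleep_us` is a hypothetical name, `time.monotonic_ns()` stands in for gettimeofday(), and `os.sched_yield()` is used where the platform provides it, with `time.sleep(0)` as a fallback.

```python
import os
import time


def _yield_cpu():
    """Give other runnable threads a turn before checking the clock again."""
    if hasattr(os, "sched_yield"):
        os.sched_yield()
    else:
        time.sleep(0)


def spin_sleep_us(usec: int) -> None:
    """Busy-wait until at least usec microseconds have passed."""
    deadline = time.monotonic_ns() + usec * 1000
    while time.monotonic_ns() < deadline:
        _yield_cpu()
```

The loop guarantees only a lower bound: on a loaded machine the yield may return late and the sleep overshoots.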

A custom high resolution sleep function doesn't immediately help with a Glib abstracted event loop with timer management either. We need to add a new source to the event loop that can fire events at the new microsecond resolution. To implement this we can derive from the existing timer source base: if the requested sleep time has a low resolution component, e.g. 1.5ms, we can use the existing timer to sleep for 1ms then take over with our high resolution timer for the remaining 500us. The new source is an idle source, that is, it executes when no other high priority events need to be processed. Effectively Glib is going to run a select()/poll() with a timeout and then execute all the idle sources and repeat. With a low resolution timer the select()/poll() manages the timeout; for high resolution timing it runs with a zero timeout.

In a standard PGM transport we might expect hundreds to thousands of timers waiting to be fired, from sending session keep alive messages (SPMs) to re-requesting lost data (NAKs). We want to minimise the number of high resolution timers, and minimise the overhead of changing timers due to incoming data or receiver state changes. We can do that by managing all the transport timers internally and presenting one global timer to the underlying Glib event loop. The following diagram shows the two sides of the transport. The receive side has one of three timers per packet: NAK_RB_IVL for NAK request back-off, NAK_RPT_IVL to repeat sending a NAK, and NAK_RDATA_IVL to wait for RDATA after seeing a NAK confirm (NCF). The send side includes an ambient SPM keeping the session alive, and heartbeat SPMs to help flush out trailing packets that might have been lost.
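The 1.5ms example can be written down as a small Python sketch. It is not the Glib source described above, only the arithmetic behind it: `split_sleep` and `hybrid_sleep_us` are hypothetical names, the 1ms tick is an assumed timer resolution, and `time.sleep()` stands in for the existing low resolution timer.

```python
import time


def split_sleep(requested_us: int, tick_us: int = 1000):
    """Split a sleep into whole coarse ticks plus a fine remainder."""
    coarse = (requested_us // tick_us) * tick_us
    return coarse, requested_us - coarse


def hybrid_sleep_us(requested_us: int, tick_us: int = 1000) -> None:
    """Sleep coarsely for the whole ticks, then spin for whatever is left."""
    deadline = time.monotonic_ns() + requested_us * 1000
    coarse, _fine = split_sleep(requested_us, tick_us)
    if coarse:
        time.sleep(coarse / 1_000_000)  # low resolution part
    while time.monotonic_ns() < deadline:  # high resolution part
        pass
```

Spinning against an absolute deadline, rather than for a fixed 500us, absorbs any overshoot from the coarse sleep.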
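One way to present a single global timer is to keep every pending expiry in a priority queue and expose only the earliest one to the event loop. The Python sketch below assumes a binary heap, which the post does not specify; the `TimerSet` class and the timer labels are hypothetical.

```python
import heapq


class TimerSet:
    """Many internal timers, one externally visible next-expiry time."""

    def __init__(self):
        self._heap = []   # entries are (expiry_us, insertion_order, label)
        self._count = 0   # tie-breaker so equal expiries fire in insertion order

    def add(self, expiry_us: int, label: str) -> None:
        heapq.heappush(self._heap, (expiry_us, self._count, label))
        self._count += 1

    def next_expiry(self):
        """The only value the event loop needs; None when nothing is pending."""
        return self._heap[0][0] if self._heap else None

    def pop_expired(self, now_us: int):
        """Remove and return the labels of all timers due at or before now_us."""
        fired = []
        while self._heap and self._heap[0][0] <= now_us:
            fired.append(heapq.heappop(self._heap)[2])
        return fired
```

The event loop then arms one timer for `next_expiry()`, and re-arms it after each call to `pop_expired()`.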

Once Linux implements high resolution timers for select()/poll() this method is no longer required and we should expect improved CPU usage on the timer thread.

Ceci n’est pas une pipe

2007-05-11T15:29:00.010+08:00

Signals on many platforms are not at all thread safe, because delivery is by an interrupt which can halt execution mid-way through a function call. The signal(2) man page, citing POSIX.1-2003, lists the functions that are safe to call in a signal handler:

_Exit() _exit() abort() accept() access() aio_error() aio_return() aio_suspend() alarm() bind() cfgetispeed() cfgetospeed() cfsetispeed() cfsetospeed() chdir() chmod() chown() clock_gettime() close() connect() creat() dup() dup2() execle() execve() fchmod() fchown() fcntl() fdatasync() fork() fpathconf() fstat() fsync() ftruncate() getegid() geteuid() getgid() getgroups() getpeername() getpgrp() getpid() getppid() getsockname() getsockopt() getuid() kill() link() listen() lseek() lstat() mkdir() mkfifo() open() pathconf() pause() pipe() poll() posix_trace_event() pselect() raise() read() readlink() recv() recvfrom() recvmsg() rename() rmdir() select() sem_post() send() sendmsg() sendto() setgid() setpgid() setsid() setsockopt() setuid() shutdown() sigaction() sigaddset() sigdelset() sigemptyset() sigfillset() sigismember() signal() sigpause() sigpending() sigprocmask() sigqueue() sigset() sigsuspend() sleep() socket() socketpair() stat() symlink() sysconf() tcdrain() tcflow() tcflush() tcgetattr() tcgetpgrp() tcsendbreak() tcsetattr() tcsetpgrp() time() timer_getoverrun() timer_gettime() timer_settime() times() umask() uname() unlink() utime() wait() waitpid() write()

The popular method to handle signals is then through a pipe to an event loop, read "Catching Unix signals" for a Gtk example.
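The pattern is easy to show in Python on a Unix system, with the caveat that Python runs signal handlers differently from C; the point of the sketch is the shape, which carries over because write() is on the safe list above. The function names are hypothetical and the code must run in the main thread.

```python
import os
import select
import signal


def make_signal_pipe(signum):
    """Install a handler that forwards the signal number down a pipe."""
    rfd, wfd = os.pipe()
    os.set_blocking(wfd, False)  # never block inside the handler

    def handler(sig, frame):
        os.write(wfd, bytes([sig]))  # one byte per delivered signal

    signal.signal(signum, handler)
    return rfd


def wait_for_signal(rfd, timeout):
    """Event loop side: sleep in select() until the pipe becomes readable."""
    ready, _, _ = select.select([rfd], [], [], timeout)
    if not ready:
        return None
    return os.read(rfd, 1)[0]
```

All real work happens in the event loop after `wait_for_signal()` returns, so none of it runs in signal context.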

Using pipes is also a popular mechanism for multiple threads to communicate with each other. With the PGM transport the application needs to be notified only when contiguous data is available; handling of out of order sequence numbers and NAK requests should be transparent. However, the pipe need only be used as a thread-safe signalling mechanism, so for zero-copy we simply use a shared memory structure to pass the actual data, in this case a Glib asynchronous queue. A pipe can be used in a select() or poll() call so the thread can sleep until data is available; otherwise a constant loop checking shared memory would be necessary, with the side effect of starving other threads of processor time.
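A rough Python equivalent of that arrangement pairs a thread-safe queue with a pipe. Here `queue.Queue` stands in for the Glib asynchronous queue, and the `NotifyQueue` class is a hypothetical name for the sketch.

```python
import os
import queue
import select
import threading


class NotifyQueue:
    """Data travels through the queue; the pipe carries only a wake-up byte."""

    def __init__(self):
        self._items = queue.Queue()
        self._rfd, self._wfd = os.pipe()

    def fileno(self) -> int:
        """Read end of the pipe, suitable for select() or poll()."""
        return self._rfd

    def push(self, item) -> None:
        self._items.put(item)          # the object itself is not copied
        os.write(self._wfd, b"\x01")   # one byte per queued item

    def pop(self):
        os.read(self._rfd, 1)          # consume the matching wake-up byte
        return self._items.get_nowait()
```

The consumer sleeps in `select.select([q.fileno()], [], [])` and calls `pop()` once per readable byte, so it never spins on the shared structure.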

Like a jigsaw puzzle

2007-05-11T14:43:00.012+08:00

Having the pieces first obviously helps, and we have already created separate receive and transmit windows together with the necessary network socket details. We want to define a new object that incorporates both receiver and transmit side functionality and manages all the network specific details for us. Independently we can investigate what kind of API we want to see by creating new basic send and receiver tools: pgmsend and pgmrecv derived from previously created basic_recv_with_rxw and stream_send_with_nak. The following diagram shows all the components that are affected:

That's getting a bit complicated to view from a functional level so let's have a look at the combined data flow diagram:

The TX/RX queue refers to that of the operating system; the asynchronous event queue and event loop depend on the integration framework. Currently integration is with the Glib event loop, however the event hooks can easily be redirected to a Windows native, Qt, or any other event loop. It is also not necessary for the PGM event loop to be separate from the application event loop, although sharing is only recommended for low data rate applications.

Cost of Time

2007-04-23T12:24:00.018+08:00

Network protocols have a heavy dependency on time: when should a packet be resent? Will I ever receive this packet? Is the other party still running? The PGM protocol defines many timers in the receiver for determining packet state: NAK_RB_IVL, NAK_RPT_IVL and NAK_RDATA_IVL. There are also many different methods of calculating time, from POSIX gettimeofday() & clock_gettime(), Windows QueryPerformanceCounter() & _ftime() to Intel's RDTSC & RDTSCP instructions. The Glib suite defines a GTimer to provide some abstraction but uses doubles and hence potentially expensive floating-point math.

So one question is what kind of overhead can one expect with Glib timers? Here is a graph with timers:

Now removing the timers completely and re-running gives the following results:

The test series "sequence numbers in jumps" causes generation of many NAKs, each requiring its own timestamp for expiry detection; 68% of the processing is simply getting the current time!

Gimme that packet

2007-04-19T17:52:00.017+08:00

So we're sending data with a transmit window to handle reliability; how about a receive window to process, re-order, and request re-delivery of lost packets for reliable transfer? If we take a similar architecture to the transmit window we have something like this:

A fixed pointer array defines the maximum size of the receive window. At run time a container is assigned to each entry to function as a place holder for a lost packet, or as a container for received data. Memory is pooled through a slab allocator and managed with a trash stack for optimum performance. The trail refers to the trailing edge of the non-contiguous data rather than RXW_TRAIL.

When a packet is received it is inserted into the receive window, if non-contiguous a series of place holders are generated which are used to manage the sequence number receive state as per the flow chart in the draft specification:
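The insert-and-read behaviour can be sketched in Python. This is a simplified model, not rxw.c: a dictionary replaces the fixed pointer array and slab allocator, there are no timers or window limits, and the class and attribute names are hypothetical.

```python
class ReceiveWindow:
    """Toy receive window: place holders for gaps, in-order delivery only."""

    _LOST = object()  # place holder marking a sequence number not yet received

    def __init__(self, first_seq: int = 0):
        self.lead = first_seq - 1   # highest sequence number seen so far
        self.trail = first_seq      # next sequence number owed to the application
        self._slots = {}

    def insert(self, seq: int, payload: bytes):
        """Store a packet; return the sequence numbers newly found missing."""
        if seq < self.trail:
            return []               # already delivered, ignore the duplicate
        gaps = list(range(self.lead + 1, seq))
        for missing in gaps:
            self._slots[missing] = self._LOST
        self._slots[seq] = payload  # may replace an earlier place holder
        self.lead = max(self.lead, seq)
        return gaps

    def read(self):
        """Hand over payloads only while the sequence is contiguous."""
        out = []
        while self._slots.get(self.trail, self._LOST) is not self._LOST:
            out.append(self._slots.pop(self.trail))
            self.trail += 1
        return out
```

The list returned by `insert()` is what a real receiver would feed into NAK generation.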

Flow chart of receive state as per draft RFC 3208.

In order to allow rapid timer expiration a series of queues are maintained, one for each receive state. The queues are made available for external access in order to allow protocol tweaking for low latency (MDS), large object transfer (files), or broadcast (video streaming) purposes.
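A plausible reading of why per-state queues make expiry fast: if every entry in a state waits the same interval, appending at the tail keeps the queue sorted by expiry, so only the head ever needs checking. The Python sketch below works under that assumption; the `StateQueue` class is hypothetical and times are plain microsecond integers supplied by the caller.

```python
from collections import deque


class StateQueue:
    """One queue per receive state, ordered by expiry because the interval is fixed."""

    def __init__(self, interval_us: int):
        self.interval_us = interval_us
        self._queue = deque()  # entries are (expiry_us, seq), oldest first

    def push(self, seq: int, now_us: int) -> None:
        """Enter the state; assumes now_us never goes backwards."""
        self._queue.append((now_us + self.interval_us, seq))

    def pop_expired(self, now_us: int):
        """Return expired sequence numbers, looking only at the head of the queue."""
        expired = []
        while self._queue and self._queue[0][0] <= now_us:
            expired.append(self._queue.popleft()[1])
        return expired
```

Checking for expired timers then costs one comparison when nothing is due, however many entries are waiting.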

After implementation of rxw.c we can perform basic performance tests (basic_rxw.c) to compare with the transmit window implementation. In order for a fair comparison of overheads we define three tests: one a basic fill of the receive window without committing data, two to fill in the window in reverse order, and a third to skip every other sequence number to alternate between inserting data and a place holder.

This graph shows that for basic fills performance exceeds the transmit window, while worst case scenarios lag significantly behind, though not unreasonably so, and there is little difference between 100k and 200k packets.

The magnitude of difference between send and receive side underscores some important design decisions that need to be made for implementation. In many typical environments the server host would be a high speed AMD64 Linux box whilst the clients are mid-speed Intel Windows boxes amplifying any disadvantage of receive side processing. So can we improve the receive side performance, for example by removing the place holder per sequence number and grouping together ranges? The results of a profile run:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 37.10      0.27     0.27  7200000     0.00     0.00  rxw_alloc
 24.05      0.45     0.18  7200000     0.00     0.00  rxw_push
 13.74      0.55     0.10  7200000     0.00     0.00  rxw_state_foreach
  9.62      0.62     0.07  5400012     0.00     0.00  rxw_pkt_free1
  6.87      0.67     0.05  8999988     0.00     0.00  rxw_alloc0_packet
  5.50      0.71     0.04  5399940     0.00     0.00  rxw_pkt_state_unlink
  1.37      0.72     0.01       12     0.83    15.75  test_basic_rxw
  0.69      0.72     0.01  5400012     0.00     0.00  on_pgm_data
  0.69      0.73     0.01  3599964     0.00     0.00  on_send_nak
  0.00      0.73     0.00       48     0.00     0.00  rxw_window_update
  0.00      0.73     0.00       12     0.00    14.91  test_fill
  0.00      0.73     0.00       12     0.00    14.91  test_jump
  0.00      0.73     0.00       12     0.00    14.91  test_reverse

These results show more time handling packets (61%) than place holders (21%) with 14% NAK list overhead, similarly with oprofile:

Flat profile:

Each sample counts as 1 samples.
  %   cumulative   self              self     total
 time   samples   samples    calls  T1/call  T1/call  name
 24.40  72479.00 72479.00                             rxw_push
 17.14 123399.00 50920.00                             rxw_alloc
 14.47 166397.00 42998.00                             rxw_state_foreach
 13.18 205554.00 39157.00                             rxw_pkt_state_unlink
 10.98 238170.00 32616.00                             rxw_pkt_free1
  6.50 257488.00 19318.00                             rxw_alloc0_packet
  6.45 276645.00 19157.00                             rxw_ncf
  2.27 283389.00  6744.00                             on_pgm_data
  1.32 287314.00  3925.00                             _init
  0.86 289872.00  2558.00                             test_basic_rxw
  0.77 292148.00  2276.00                             test_reverse
  0.76 294413.00  2265.00                             test_jump
  0.59 296154.00  1741.00                             test_fill
  0.24 296877.00   723.00                             on_send_nak
  0.07 297081.00   204.00                             on_wait_ncf
  0.00 297084.00     3.00                             main
  0.00 297085.00     1.00                             __libc_csu_init
  0.00 297086.00     1.00                             rxw_window_update

41% time handling packets, 29% handling place holders with 15% NAK list overhead.

1 + 2 = 3

2007-04-11T17:26:00.019+08:00

In order to provide reliability the PGM protocol needs to be able to detect when packets have been corrupted. A double checksum is used: one by the operating system on the IP header, and one in the PGM header covering the entire PGM packet, similar to the scheme used for UDP and TCP packets.

The IP header is often updated, requiring the checksum to be recalculated by network elements, for example when the multicast TTL is updated in each router. For the payload, modern network cards provide hardware checksum offload for UDP and TCP packets; with PGM, however, the checksum has to run in userspace, so some tests are required to find an optimal routine. Aside from the actual calculation, which is a one's complement sum, a PGM API has to copy the payload from the application layer in order to add the PGM header (without scatter/gather I/O) and store it in the transmit window. We could calculate the checksum then memcpy() the packet, or try to implement a joint checksum and copy routine.

First on a 3.2 GHz Intel Xeon.

The red line is a C-based combined checksum and copy routine, which leads a separate memcpy() and checksum up to around 6 KB packet size. A 64-bit assembly routine from the Linux kernel performs worse above 1 KB.
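For reference, here is what the two approaches compute, written as a Python sketch of the 16-bit one's complement sum described in RFC 1071. It shows the logic only and says nothing about speed; the function names are hypothetical and are not the routines benchmarked here.

```python
def csum_partial(data: bytes) -> int:
    """One's complement sum of 16-bit big-endian words, carries folded back in."""
    if len(data) % 2:
        data = bytes(data) + b"\x00"  # pad odd lengths with a zero byte
    acc = 0
    for i in range(0, len(data), 2):
        acc += (data[i] << 8) | data[i + 1]
    while acc >> 16:
        acc = (acc & 0xFFFF) + (acc >> 16)
    return acc


def csum_fold(acc: int) -> int:
    """Final checksum field: the one's complement of the folded sum."""
    return ~acc & 0xFFFF


def csum_and_copy(src: bytes, dst: bytearray, offset: int = 0) -> int:
    """Copy src into dst at offset and accumulate the same sum in one pass."""
    acc = 0
    n = len(src)
    for i in range(0, n - 1, 2):
        hi, lo = src[i], src[i + 1]
        dst[offset + i], dst[offset + i + 1] = hi, lo
        acc += (hi << 8) | lo
    if n % 2:  # trailing odd byte counts as the high half of a word
        dst[offset + n - 1] = src[n - 1]
        acc += src[n - 1] << 8
    while acc >> 16:
        acc = (acc & 0xFFFF) + (acc >> 16)
    return acc
```

The joint routine touches each byte once instead of twice, which is the saving the benchmark is trying to measure.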

Now compare with a dual-core AMD Opteron based machine:

The separate checksum and memcpy() routines lead at 2KB, whilst the Linux assembly routines easily excel.

A quad-core Intel Xeon machine:

The assembly routine does significantly better than on the original Xeon host. We need to convert tick time into real time to compare each graph, though:

         3.2 GHz Intel Xeon   1.6 GHz Quad-core Xeon   2.4 GHz Dual-core Opteron
memcpy   2.66 ms              3.75 ms                  2.46 ms
cksum    2.66 ms              2.81 ms                  2.54 ms
linux    3.60 ms              2.12 ms                  0.63 ms

The dual-core AMD Opteron is the clear winner for this computation.

I’ve Got my Bag Lets Go!

2007-04-11T16:43:00.013+08:00

So the results show that a combination of containers is going to be useful: we can use a pre-allocated pointer array to store the details about each entry in the transmit window to gain the best access speed, and a trash stack based pointer system for the actual payload.

It's probable that performance might be boosted further by using chunks of page size aligned data and sharing them between several entries in the window. In so doing the overhead of generating or checking time stamps when inserting or purging from the window can be reduced. At this current stage of development we are I/O bound not CPU bound, and so we shall revisit this later when there is a greater surrounding framework, and a burden on the CPU that can highlight the difference.

The trash stack keeps freed packets and payloads allocated to the process for future use; a first in last out policy makes it cache friendly too. One important side effect is that memory stays in the transmit window system once allocated and will be unavailable to the application, but that is part of the rationale for choosing the maximum transmit window size, either in bytes, sequence numbers, or time duration. Returning memory to the slice allocator would still keep the memory allocated to the process for application use, but previous tests have shown this comes at a latency cost. Using the system malloc instead of the slice allocator would be even slower but on Linux would allow the memory to return to the operating system. Not all systems are the same, however; for example Solaris malloc never frees memory from the application.

Here you can see the implementation txw.c is slower than a basic singly linked list; this appears to be the overhead of using a pointer buffer instead of a byte buffer for the packet details. To test this we compare the pointer buffer implementation with a byte buffer (txw-byte.c), and a byte buffer with pointer index (txw-bytep.c) in case the multiply is slow.
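The trash stack idea fits in a few lines. The Python sketch below mirrors the behaviour of a LIFO free list rather than the Glib GTrashStack API itself; the `TrashStack` class here is a hypothetical stand-in using a list of reusable buffers.

```python
class TrashStack:
    """LIFO pool: the most recently freed buffer is the next one handed out."""

    def __init__(self, buf_size: int):
        self.buf_size = buf_size
        self._free = []

    def alloc(self) -> bytearray:
        # Reuse a freed buffer when possible, allocate only when the pool is empty.
        return self._free.pop() if self._free else bytearray(self.buf_size)

    def free(self, buf: bytearray) -> None:
        # The buffer stays with the pool; it is never returned to the allocator.
        self._free.append(buf)


pool = TrashStack(1500)
a = pool.alloc()
b = pool.alloc()
pool.free(a)
pool.free(b)
```

After the two frees, the next `alloc()` returns `b`, the buffer most likely still to be in cache, and the pool never shrinks, which is the memory side effect noted above.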

The results show that in fact the pointer array implementation is faster than a byte array.

How Big is that Bag?

2007-04-05T14:37:00.004+08:00

Standard Ethernet packets are usually 1,500 bytes long. On a typical home network this might vary because of ATM based internet connections; for high bandwidth environments this might increase with jumbo frames to 9,000 bytes, and beyond with IPv6 jumbograms. So how does the size of a packet affect container performance in the transmit window?

The graph says it all: the difference is minor.
