Wednesday 11 April 2007

1 + 2 = 3

To provide reliability, the PGM protocol needs to be able to detect when packets have been corrupted. Two checksums are used: one handled by the operating system over the IP header, and one in the PGM header covering the entire PGM packet, similar to how UDP and TCP packets are protected.


The IP header is often updated, requiring network elements to recalculate its checksum, for example when each router decrements the multicast TTL. For the payload, modern network cards provide hardware checksum offload for UDP and TCP packets; with PGM, however, the checksum has to be computed in userspace, so some tests are required to find an optimal routine.

Aside from the actual calculation, which is a one's complement sum, a PGM API has to copy the payload from the application layer in order to add the PGM header (without scatter/gather I/O) and store the packet in the transmit window. We could calculate the checksum and then memcpy() the packet, or try to implement a joint checksum and copy routine.
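To make the two options concrete, here is a minimal sketch in portable C of a one's complement checksum and of a joint checksum and copy routine. It is not the code that was benchmarked; the function names csum(), csum_and_copy() and csum_fold() are made up for illustration, and the loop works a byte pair at a time for clarity rather than speed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold the wide accumulator down to 16 bits, adding each carry back in
 * (end-around carry), then return the one's complement of the result. */
static uint16_t csum_fold(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Option 1: checksum only. The caller does a separate memcpy().
 * Bytes are summed as big-endian 16-bit words; an odd trailing byte
 * is treated as if padded with a zero byte. */
static uint16_t csum(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint64_t sum = 0;

    while (len > 1) {
        sum += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len)
        sum += (uint32_t)p[0] << 8;
    return csum_fold(sum);
}

/* Option 2: copy and checksum in a single pass over the data, so the
 * payload is only read once. */
static uint16_t csum_and_copy(void *dst, const void *src, size_t len)
{
    const uint8_t *s = src;
    uint8_t *d = dst;
    uint64_t sum = 0;

    while (len > 1) {
        d[0] = s[0];
        d[1] = s[1];
        sum += ((uint32_t)s[0] << 8) | s[1];
        s += 2;
        d += 2;
        len -= 2;
    }
    if (len) {
        d[0] = s[0];
        sum += (uint32_t)s[0] << 8;
    }
    return csum_fold(sum);
}
```

With the first option the caller would run memcpy(dst, src, len) followed by csum(dst, len); with the second a single csum_and_copy(dst, src, len) does both. Which one wins depends on the CPU, cache behaviour and packet size, which is what the tests below measure.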

First, on a 3.2 GHz Intel Xeon:


The red line is a C-based checksum and copy routine, which leads a separate memcpy() and checksum up to around 6 KB packet size. A 64-bit assembly routine from the Linux kernel performs worse above 1 KB.

Now compare with a dual-core AMD Opteron based machine:


Here the separate checksum and memcpy() routines lead at 2 KB, whilst the Linux assembly routine is easily the fastest.

A quad-core Intel Xeon machine:


The assembly routine does significantly better than on the original Xeon host. To compare the graphs, though, we need to convert tick time into real time:

Routine   3.2 GHz Intel Xeon   1.6 GHz quad-core Xeon   2.4 GHz dual-core Opteron
memcpy    2.66 ms              3.75 ms                  2.46 ms
cksum     2.66 ms              2.81 ms                  2.54 ms
linux     3.60 ms              2.12 ms                  0.63 ms
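The tick-to-time conversion itself is simple arithmetic. As a rough sketch, assuming a tick is one CPU clock cycle at the nominal clock rate (an assumption; the measurement method is not described here), a helper with the hypothetical name ticks_to_ms() could look like this:

```c
#include <stdint.h>

/* Convert a tick count to milliseconds.
 * Assumes one tick per CPU clock cycle and a fixed clock rate in Hz. */
static double ticks_to_ms(uint64_t ticks, double hz)
{
    return (double)ticks * 1e3 / hz;
}
```

For example, 3,200,000 ticks on a 3.2 GHz part come to about 1 ms, while the same tick count on a 1.6 GHz part is about 2 ms, which is why raw tick counts from different hosts cannot be compared directly.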

The dual-core AMD Opteron is the clear winner for this computation.
