Thursday 19 April 2007

Gimme that packet

So we're sending data with a transmit window to handle reliability; how about a receive window to process, re-order, and request re-delivery of lost packets for reliable transfer? If we take a similar architecture to the transmit window we have something like this:


A fixed pointer array defines the maximum size of the receive window; at run time a container is assigned to each slot, acting either as a place holder for a lost packet or as a container for received data. Memory is pooled through a slab allocator and managed with a trash stack for optimum performance. Here the trail refers to the trailing edge of the non-contiguous data rather than RXW_TRAIL.
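The layout might look something like the following sketch in C; the names (rxw_t, rxw_packet_t) and fields are illustrative rather than the actual rxw.c definitions, and the slab allocator and trash stack are elided for brevity:

#include <stdint.h>
#include <stdlib.h>

typedef enum {
    PKT_HAVE_DATA,        /* container holds received data */
    PKT_BACK_OFF_STATE,   /* place holder: waiting out the NAK back-off */
    PKT_WAIT_NCF_STATE,   /* place holder: NAK sent, waiting for NCF */
    PKT_WAIT_DATA_STATE   /* place holder: NCF seen, waiting for RDATA */
} rxw_pkt_state_t;

typedef struct {
    uint32_t         sequence_number;
    void*            data;       /* NULL for a place holder */
    size_t           length;
    rxw_pkt_state_t  state;
} rxw_packet_t;

typedef struct {
    rxw_packet_t**   pdata;      /* fixed pointer array sized to the maximum window */
    uint32_t         alloc;      /* length of the pointer array */
    uint32_t         lead;       /* leading edge sequence number */
    uint32_t         trail;      /* trailing edge of the non-contiguous data (not RXW_TRAIL) */
    uint32_t         rxw_trail;  /* trailing edge advertised by the sender */
} rxw_t;

/* map a sequence number onto its slot in the fixed pointer array */
static inline rxw_packet_t*
rxw_peek (rxw_t* r, uint32_t sequence_number)
{
    return r->pdata[ sequence_number % r->alloc ];
}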

When a packet is received it is inserted into the receive window; if it is non-contiguous, a series of place holders is generated to manage the sequence number receive state, as per the flow chart in the draft specification (a sketch of this insertion path follows the flow chart):


Flow chart of receive state as per draft RFC 3208.
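A minimal sketch of that insertion path, reusing the illustrative types from the structure sketch above; this is not the real rxw.c API, and sequence number wrap-around handling is ignored:

/* insert a packet, generating place holders for any gap behind it */
static int
rxw_push_sketch (rxw_t* r, uint32_t sequence_number, void* data, size_t length)
{
    if (sequence_number > r->lead + 1)
    {
        /* non-contiguous: create a place holder for every missing sequence
         * number so its receive state can be tracked individually.
         */
        for (uint32_t s = r->lead + 1; s != sequence_number; s++)
        {
            rxw_packet_t* ph = calloc (1, sizeof (rxw_packet_t));
            ph->sequence_number = s;
            ph->state = PKT_BACK_OFF_STATE;          /* start of the NAK state machine */
            r->pdata[ s % r->alloc ] = ph;
        }
    }

    /* a retransmission repairing an existing place holder frees the holder first */
    rxw_packet_t* slot = r->pdata[ sequence_number % r->alloc ];
    if (slot != NULL && slot->state != PKT_HAVE_DATA)
        free (slot);

    rxw_packet_t* pkt = calloc (1, sizeof (rxw_packet_t));
    pkt->sequence_number = sequence_number;
    pkt->data   = data;
    pkt->length = length;
    pkt->state  = PKT_HAVE_DATA;
    r->pdata[ sequence_number % r->alloc ] = pkt;

    if (sequence_number > r->lead)
        r->lead = sequence_number;
    return 0;
}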

In order to allow rapid timer expiration a series of queues is maintained, one for each receive state. The queues are made available for external access to allow protocol tweaking for low latency (MDS), large object transfer (files), or broadcast (video streaming) purposes.
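Roughly how the per-state queues might hang together, assuming GLib's GQueue; the real queue layout and field names in rxw.c may well differ:

#include <glib.h>

/* one entry per place holder awaiting a timer; fields are illustrative */
typedef struct {
    guint64 nak_rb_expiry;     /* when the back-off timer fires */
} rxw_timer_entry_t;

typedef struct {
    GQueue* backoff_queue;     /* place holders waiting out the NAK back-off */
    GQueue* wait_ncf_queue;    /* NAK sent, waiting for an NCF */
    GQueue* wait_data_queue;   /* NCF seen, waiting for RDATA */
} rxw_state_queues_t;

/* each queue is kept in expiry order, so expiration only touches the head of
 * the queue instead of scanning the whole window.
 */
static void
rxw_expire_backoff (rxw_state_queues_t* q, guint64 now,
                    void (*send_nak)(rxw_timer_entry_t*))
{
    rxw_timer_entry_t* entry;
    while ((entry = g_queue_peek_head (q->backoff_queue)) != NULL
           && entry->nak_rb_expiry <= now)
    {
        g_queue_pop_head (q->backoff_queue);
        send_nak (entry);
        g_queue_push_tail (q->wait_ncf_queue, entry);   /* advance to WAIT_NCF */
    }
}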

After implementing rxw.c we can perform basic performance tests (basic_rxw.c) to compare with the transmit window implementation. For a fair comparison of overheads we define three tests: one, a basic fill of the receive window without committing data; two, a fill of the window in reverse order; and three, skipping every other sequence number to alternate between inserting data and a place holder.
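The three tests could be sketched along these lines, again using the illustrative push function from the earlier sketch rather than the actual basic_rxw.c harness (window initialisation is elided):

#define TEST_SIZE  200000

static char dummy_payload[1500];     /* stand-in payload, size is arbitrary */

/* one: basic in-order fill, data only */
static void
test_fill (rxw_t* r)
{
    for (uint32_t i = 0; i < TEST_SIZE; i++)
        rxw_push_sketch (r, i, dummy_payload, sizeof (dummy_payload));
}

/* two: fill the window in reverse order, worst case for place holders */
static void
test_reverse (rxw_t* r)
{
    for (uint32_t i = TEST_SIZE; i > 0; i--)
        rxw_push_sketch (r, i - 1, dummy_payload, sizeof (dummy_payload));
}

/* three: every other sequence number, alternating data and place holder */
static void
test_jump (rxw_t* r)
{
    for (uint32_t i = 0; i < TEST_SIZE; i += 2)
        rxw_push_sketch (r, i, dummy_payload, sizeof (dummy_payload));
}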


This graph shows that for basic fills performance exceeds the transmit window, whilst the worst-case scenarios lag significantly behind, though not unreasonably so, and there is little difference between 100k and 200k packets.

The magnitude of the difference between the send and receive sides underscores some important design decisions that need to be made for implementation. In many typical environments the server host would be a high-speed AMD64 Linux box whilst the clients are mid-speed Intel Windows boxes, amplifying any disadvantage of receive-side processing. So can we improve the receive-side performance, for example by removing the place holder per sequence number and grouping together ranges (see the sketch after the second profile)? The results of a profile run:

Flat profile:

Each sample counts as 0.01 seconds.
%   cumulative   self              self     total
time   seconds   seconds    calls  ms/call  ms/call  name
37.10      0.27     0.27  7200000     0.00     0.00  rxw_alloc
24.05      0.45     0.18  7200000     0.00     0.00  rxw_push
13.74      0.55     0.10  7200000     0.00     0.00  rxw_state_foreach
9.62      0.62     0.07  5400012     0.00     0.00  rxw_pkt_free1
6.87      0.67     0.05  8999988     0.00     0.00  rxw_alloc0_packet
5.50      0.71     0.04  5399940     0.00     0.00  rxw_pkt_state_unlink
1.37      0.72     0.01       12     0.83    15.75  test_basic_rxw
0.69      0.72     0.01  5400012     0.00     0.00  on_pgm_data
0.69      0.73     0.01  3599964     0.00     0.00  on_send_nak
0.00      0.73     0.00       48     0.00     0.00  rxw_window_update
0.00      0.73     0.00       12     0.00    14.91  test_fill
0.00      0.73     0.00       12     0.00    14.91  test_jump
0.00      0.73     0.00       12     0.00    14.91  test_reverse


These results show more time handling packets (61%) than place holders (21%), with 14% NAK list overhead; similarly with oprofile:

Flat profile:

Each sample counts as 1 samples.
%   cumulative   self              self     total
time   samples   samples    calls  T1/call  T1/call  name
24.40  72479.00 72479.00                             rxw_push
17.14 123399.00 50920.00                             rxw_alloc
14.47 166397.00 42998.00                             rxw_state_foreach
13.18 205554.00 39157.00                             rxw_pkt_state_unlink
10.98 238170.00 32616.00                             rxw_pkt_free1
6.50 257488.00 19318.00                             rxw_alloc0_packet
6.45 276645.00 19157.00                             rxw_ncf
2.27 283389.00  6744.00                             on_pgm_data
1.32 287314.00  3925.00                             _init
0.86 289872.00  2558.00                             test_basic_rxw
0.77 292148.00  2276.00                             test_reverse
0.76 294413.00  2265.00                             test_jump
0.59 296154.00  1741.00                             test_fill
0.24 296877.00   723.00                             on_send_nak
0.07 297081.00   204.00                             on_wait_ncf
0.00 297084.00     3.00                             main
0.00 297085.00     1.00                             __libc_csu_init
0.00 297086.00     1.00                             rxw_window_update


41% of the time is spent handling packets and 29% handling place holders, with 15% NAK list overhead.
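For the range-grouping idea raised above, a hedged sketch of what a range-based place holder might look like, reusing the illustrative state enum from the first sketch; this is a thought experiment rather than how rxw.c is implemented:

/* one descriptor covers a whole run of missing packets, so a gap of n packets
 * costs one allocation and one timer instead of n.
 */
typedef struct {
    uint32_t         first;    /* first missing sequence number in the run */
    uint32_t         last;     /* last missing sequence number in the run */
    rxw_pkt_state_t  state;    /* the whole run shares one receive state */
    uint64_t         expiry;   /* one timer per run instead of per packet */
} rxw_range_t;

/* the price: when a retransmission lands inside the run, the range has to be
 * split into a lower and an upper half around the filled sequence number.
 */
static void
rxw_range_split (rxw_range_t* r, uint32_t filled, rxw_range_t* out_upper)
{
    out_upper->first  = filled + 1;
    out_upper->last   = r->last;
    out_upper->state  = r->state;
    out_upper->expiry = r->expiry;
    r->last = filled - 1;       /* lower half keeps the original descriptor */
}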
