Difference: UnrolledContents (1 vs. 25)

Revision 25 - 2007-08-17 - TobyRodwell


Unrolled PERTKB Contents

Delay Variation ("Jitter")

Delay variation or "jitter" is a metric that describes the level of disturbance of packet arrival times with respect to an "ideal" pattern, typically the pattern in which the packets were sent. Such disturbances can be caused by competing traffic (i.e. queueing), or by contention on processing resources in the network.

RFC 3393 defines an IP Delay Variation Metric (IPDV). This particular metric only compares the delays experienced by packets of equal size, on the grounds that delay is naturally dependent on packet size, because of serialization delay.

Delay variation is an issue for real-time applications such as audio/video conferencing systems. They usually employ a jitter buffer to eliminate the effects of delay variation.

Delay variation is related to packet reordering. But note that the RFC 3393 IPDV of a network can be arbitrarily low, even zero, even though that network reorders packets, because the IPDV metric only compares delays of equal-sized packets.

Decomposition

Jitter is usually introduced in network nodes (routers), as an effect of queueing or contention for forwarding resources, especially on CPU-based router architectures. Some types of links can also introduce jitter, for example through collision avoidance (shared Ethernet) or link-level retransmission (802.11 wireless LANs).

Measurement

Contrary to one-way delay, one-way delay variation can be measured without requiring precisely synchronized clocks at the measurement endpoints. Many tools that measure one-way delay also provide delay variation measurements.
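As a sketch of why synchronized clocks are not needed: a constant offset between the sender's and receiver's clocks appears in every one-way delay sample, and cancels out when consecutive delays are subtracted. The following Python fragment (illustrative only, not taken from any particular tool) computes RFC 3393-style delay differences:

```python
def ipdv(send_times, recv_times):
    """One-way delay variation (RFC 3393 style) between consecutive packets.

    Each delay sample recv_i - send_i includes any constant clock offset c
    between the two hosts, but the difference of consecutive delays
    cancels it: (d2 + c) - (d1 + c) = d2 - d1.
    """
    delays = [r - s for s, r in zip(send_times, recv_times)]
    return [b - a for a, b in zip(delays, delays[1:])]

# The receiver's clock is a full 1000 seconds ahead of the sender's,
# yet the variation values come out correct:
send = [0.00, 0.02, 0.04, 0.06]
recv = [1000.010, 1000.031, 1000.050, 1000.072]
print([round(v, 3) for v in ipdv(send, recv)])
```

Only clock *drift* during the measurement interval would bias the result; a constant offset never does.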

References

The IETF IPPM (IP Performance Metrics) Working Group has formalized metrics for IPDV, and more recently started work on an "applicability statement" that explains how IPDV can be used in practice and what issues have to be considered.

  • RFC 3393, IP Packet Delay Variation Metric for IP Performance Metrics (IPPM), C. Demichelis, P. Chimento. November 2002.
  • draft-morton-ippm-delay-var-as-03.txt, Packet Delay Variation Applicability Statement, A. Morton, B. Claise, July 2007.

-- SimonLeinen - 28 Oct 2004 - 24 Jul 2007

Packet Loss

Packet loss is the probability that a packet is lost in transit from a source to a destination.

A One-way Packet Loss Metric for IPPM is defined in RFC 2680. RFC 3357 contains One-way Loss Pattern Sample Metrics.

Decomposition

There are two main reasons for packet loss:

Congestion

When the offered load exceeds the capacity of a part of the network, packets are buffered in queues. Since these buffers are also of limited capacity, severe congestion can lead to queue overflows, which lead to packet drops. In this context, severe congestion could mean that a moderate overload condition holds for an extended amount of time, but could also consist of the sudden arrival of a very large amount of traffic (burst).
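Both overload patterns can be illustrated with a toy tail-drop queue simulation (hypothetical numbers, not modelled on any specific router):

```python
def simulate_queue(arrivals, buffer_size, service_rate):
    """Slot-by-slot FIFO queue: in each time slot, `arrivals[i]` packets
    arrive, up to `service_rate` packets are transmitted, and packets
    that do not fit into `buffer_size` are dropped (tail drop)."""
    queued, dropped = 0, 0
    for a in arrivals:
        queued = max(0, queued - service_rate)   # serve first
        space = buffer_size - queued
        accepted = min(a, space)
        dropped += a - accepted                  # overflow is lost
        queued += accepted
    return dropped

# A sudden burst of 30 packets into a 10-packet buffer served at
# 2 packets per slot loses most of the burst:
print(simulate_queue([30, 0, 0], buffer_size=10, service_rate=2))
```

A sustained moderate overload (arrivals slightly above the service rate, slot after slot) produces the same effect once the buffer fills up; only the time until the first drop differs.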

Errors

Another reason for loss of packets is corruption, where parts of the packet are modified in-transit. When such corruptions happen on a link (due to noisy lines etc.), this is usually detected by a link-layer checksum at the receiving end, which then discards the packet.

Impact on end-to-end performance

Bulk data transfers usually require reliable transmission, so lost packets must be retransmitted. In addition, congestion-sensitive protocols such as TCP must assume that packet loss is due to congestion, and reduce their transmission rate accordingly (although recently there have been some proposals to allow TCP to identify non-congestion related losses and treat those differently).

For real-time applications such as conversational audio/video, it usually doesn't make much sense to retransmit lost packets, because the retransmitted copy would arrive too late (see delay variation). The result of packet loss is usually a degradation in sound or image quality. Some modern audio/video codecs provide a level of robustness to loss, so that the effect of occasional lost packets is benign. On the other hand, some of the most effective image compression methods are very sensitive to loss, in particular those that use relatively rare "anchor frames" and represent the intermediate frames as compressed differences to these anchor frames: when such an anchor frame is lost, many other frames cannot be reconstructed.

Measurement

Packet loss can be actively measured by sending a set of packets from a source to a destination and comparing the number of received packets against the number of packets sent.
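In its simplest form (a sketch, not any specific tool), the comparison boils down to counting sequence numbers that never arrived:

```python
def loss_rate(sent_seqnos, received_seqnos):
    """Fraction of probe packets that never arrived.  Duplicates among
    the received packets are ignored; a packet counts as received if its
    sequence number was seen at least once."""
    lost = set(sent_seqnos) - set(received_seqnos)
    return len(lost) / len(sent_seqnos)

# 100 probes sent; numbers 17 and 42 never arrive, and 3 arrives twice:
received = [n for n in range(100) if n not in (17, 42)] + [3]
print(loss_rate(range(100), received))   # 2 of 100 lost -> 0.02
```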

Network elements such as routers also contain counters for events such as checksum errors or queue drops, which can be retrieved through protocols such as SNMP. When this kind of access is available, this can point to the location and cause of packet losses.

Reducing packet loss

Congestion-induced packet loss can be avoided by proper provisioning of link capacities. Depending on the probability of bursts (which is somewhat difficult to estimate, taking into account both link capacities in the network, and the traffic rates and patterns of a possibly large number of hosts at the edge of the network), buffers in network elements such as routers must also be sufficient. Note that large buffers can be detrimental to other aspects of network performance, in particular one-way delay (and thus round-trip time) and delay variation.

Quality-of-Service mechanisms such as DiffServ or IntServ can be used to protect some subset of traffic against loss, but this necessarily increases packet loss for the remaining traffic.

Lastly, Active Queue Management (AQM) and Explicit Congestion Notification (ECN) can be used to mitigate both packet loss and queueing delay (and thus one-way delay, round-trip time and delay variation).

References

  • A One-way Packet Loss Metric for IPPM, G. Almes, S. Kalidindi, M. Zekauskas, September 1999, RFC 2680
  • Improving Accuracy in End-to-end Packet Loss Measurement, J. Sommers, P. Barford, N. Duffield, A. Ron, August 2005, SIGCOMM'05 (PDF)
-- SimonLeinen - 01 Nov 2004

Packet Reordering

The Internet Protocol (IP) does not guarantee that packets are delivered in the order in which they were sent. This was a deliberate design choice that distinguishes IP from protocols such as ATM and IEEE 802.3 Ethernet.

Decomposition

A network usually reorders packets because of some kind of parallelism, either a choice of alternative routes (Equal Cost Multipath, ECMP) or internal parallelism inside switching elements such as routers. One particular kind of packet reordering concerns packets of different sizes. Because a larger packet takes longer to transfer over a serial link (or a limited-width backplane inside a router), larger packets may be "overtaken" by smaller packets that were sent subsequently. This is usually not a concern for high-speed bulk transfers, where the segments tend to be equal-sized (ideally Path MTU-sized), but it may pose problems for naive implementations of multi-media (audio/video) transport.

Impact on end-to-end performance

In principle, applications that use a transport protocol such as TCP or SCTP don't have to worry about packet reordering, because the transport protocol is responsible for reassembling the byte stream (message stream(s) in the case of SCTP) into the original ordering. However, reordering can have a severe performance impact on some implementations of TCP. Recent TCP implementations, in particular those that support Selective Acknowledgements (SACK), can exhibit robust performance even in the face of reordering in the network.

Real-time media applications such as audio/video conferencing tools often experience problems when run over networks that reorder packets. This is somewhat remarkable in that all of these applications have jitter buffers to eliminate the effects of delay variation on the real-time media streams. Obviously, the code that manages these jitter buffers is often not written in a way to accommodate reordered packets sensibly, although this could be done with moderate effort.

Measurement

Packet reordering is measured by sending a numbered sequence of packets and comparing the order in which they are received with the order in which they were sent. There are many possible ways of quantifying reordering, some of which are described in the IPPM Working Group's documents (see the references below).

The measurements here can be done differently, depending on the measurement purpose:

a) measuring reordering for a particular application can be done by capturing the application traffic (e.g. using the Wireshark/Ethereal tool), injecting the same traffic pattern via a traffic generator and calculating the reordering.

b) measuring the maximal reordering introduced by the network can be done by injecting a relatively small amount of traffic, shaped as short bursts of long packets immediately followed by short bursts of short packets, sent at line rate. After capture and calculation at the other end of the path, the results will reflect the worst packet reordering that can occur on the particular path.
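One simple way to quantify reordering from a captured sequence is to count "late" arrivals (a sketch; the IPPM documents define several more refined metrics):

```python
def reordered_packets(arrival_order):
    """Count packets that arrive late: their sequence number is smaller
    than the highest sequence number already received."""
    highest = -1
    late = 0
    for seq in arrival_order:
        if seq < highest:
            late += 1          # overtaken by a later-sent packet
        else:
            highest = seq
    return late

print(reordered_packets([0, 1, 2, 3, 4]))     # in-order stream
print(reordered_packets([0, 2, 1, 3, 5, 4]))  # packets 1 and 4 arrive late
```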

For more information please refer to the measurement page, available at the packet reordering measurement site. Note that although the tool is based on older versions of the reordering drafts, the metrics used are compatible with the definitions in the newer ones.

In particular, Reorder Density and Reorder Buffer-Occupancy Density can be obtained from tcpdump by subjecting the dump files to RATtool, available at the Reordering Analysis Tool website.

Reducing reordering

The probability of reordering can be reduced by avoiding parallelism in the network, or by using network nodes that take care to use parallel paths in such a way that packets belonging to the same end-to-end flow are kept on a single path. This can be ensured by using a hash on the destination address, or the (source address, destination address) pair to select from the set of available paths. For certain types of parallelism this is hard to achieve, for example when a simple striping mechanism is used to distribute packets from a very high-speed interface to multiple forwarding paths.
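The per-flow hashing idea can be sketched as follows (SHA-256 is used here purely for illustration; real forwarding hardware uses much simpler hash functions):

```python
import hashlib

def select_path(src, dst, n_paths):
    """Pick one of n_paths by hashing the (source, destination) address
    pair, so that all packets of the same flow stay on a single path."""
    key = f"{src}-{dst}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths

# Every packet of the same flow maps to the same path, so no
# reordering can be introduced by the parallel paths:
p1 = select_path("192.0.2.1", "198.51.100.7", 4)
p2 = select_path("192.0.2.1", "198.51.100.7", 4)
print(p1 == p2, 0 <= p1 < 4)
```

Different flows spread over the available paths statistically, which is the trade-off: load balancing is only as even as the traffic mix.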

References

-- SimonLeinen - 2004-10-28 - 2014-12-29
-- MichalPrzybylski - 2005-07-19
-- BartoszBelter - 2006-03-28

Maximum Transmission Unit (MTU)

The MTU (or to be exact, 'protocol MTU') of a link (logical IP subnet) describes the maximum size of an IP packet that can be transferred over the link without fragmentation. Common MTUs include

  • 1500 bytes (Ethernet, 802.11 WLAN)
  • 4470 bytes (FDDI, common default for POS and serial links)
  • 9000 bytes (Internet2 and GÉANT convention, limit of some Gigabit Ethernet adapters)
  • 9180 bytes (ATM, SMDS)

For entire network paths, see PathMTU. For specific information on configuring large (Jumbo) MTUs, see JumboMTU.

The term 'media MTU' refers to the maximum-sized Layer 2 PDU that a given interface can support. The media MTU must be equal to or greater than the sum of the protocol MTU and the Layer 2 header size.

References

-- SimonLeinen - 27 Oct 2004
-- TobyRodwell - 24 Jan 2005
-- MarcoMarletta - 20 Jan 2006

 

Path MTU

The Path MTU is the Maximum Transmission Unit (MTU) supported by a network path. It is the minimum of the MTUs of the links (segments) that make up the path. Larger Path MTUs generally allow for more efficient data transfers, because source and destination hosts, as well as the switching devices (routers) along the network path, have to process fewer packets. However, it should be noted that modern high-speed network adapters have mechanisms such as LSO (Large Send Offload) and Interrupt Coalescence that diminish the influence of MTUs on performance. Furthermore, routers are typically dimensioned to sustain very high packet loads (so that they can resist denial-of-service attacks), so the packet rates caused by high-speed transfers are not normally an issue for today's high-speed networks.

The prevalent Path MTU on the Internet is now 1500 bytes, the Ethernet MTU. There are some initiatives to support larger MTUs (JumboMTU) in networks, in particular on research networks. But their usability is hampered by last-mile issues and lack of robustness of RFC 1191 Path MTU Discovery.
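The minimum rule and the efficiency argument can be made concrete in a few lines of Python (the 40-byte figure assumes IPv4 + TCP headers without options; all numbers are illustrative):

```python
def path_mtu(link_mtus):
    """The Path MTU is simply the smallest link MTU along the path."""
    return min(link_mtus)

def efficiency(mtu, header_bytes=40):
    """Fraction of each full-sized packet that is payload, assuming
    40 bytes of IPv4+TCP headers (no options)."""
    return (mtu - header_bytes) / mtu

# A path crossing a jumbo-capable core, a standard Ethernet segment
# and an FDDI-style link is limited by the Ethernet MTU:
print(path_mtu([9000, 1500, 4470]))    # 1500
print(round(efficiency(1500), 4))      # ~0.9733
print(round(efficiency(9000), 4))      # ~0.9956
```

The header-overhead gain from jumbo frames is modest (a couple of percent); the larger win historically came from the reduced per-packet processing rate.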

Path MTU Discovery Mechanisms

Traditional (RFC1191) Path MTU Discovery

RFC 1191 describes a method for a sender to detect the Path MTU to a given receiver. (RFC 1981 describes the equivalent for IPv6.) The method works as follows:

  • The sending host will send packets to off-subnet destinations with the "Don't Fragment" bit set in the IP header.
  • The sending host keeps a cache containing Path MTU estimates per destination host address. This cache is often implemented as an extension of the routing table.
  • The Path MTU estimate for a new destination address is initialized to the MTU of the outgoing interface over which the destination is reached according to the local routing table.
  • When the sending host receives an ICMP "Too Big" (or "Fragmentation Needed and Don't Fragment Bit Set") destination-unreachable message, it learns that the Path MTU to that destination is smaller than previously assumed, and updates the estimate accordingly.
  • Normally, an ICMP "Too Big" message contains the next-hop MTU, and the sending host will use that as the new Path MTU estimate. The estimate can still be wrong because a subsequent link on the path may have an even smaller MTU.
  • For destination addresses with a Path MTU estimate lower than the outgoing interface MTU, the sending host will occasionally attempt to raise the estimate, in case the path has changed to support a larger MTU.
  • When trying ("probing") a larger Path MTU, the sending host can use a list of "common" MTUs, such as the MTUs associated with popular link layers, perhaps combined with popular tunneling overheads. This list can also be used to guess a smaller MTU in case an ICMP "Too Big" message is received that doesn't include any information about the next-hop MTU (maybe from a very very old router).
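The bookkeeping described above can be sketched in a few lines of Python (class name and plateau values are illustrative, not from any real stack):

```python
class PMTUDCache:
    """Per-destination Path MTU estimates as in RFC 1191 (sketch).

    Estimates start at the outgoing interface MTU and are lowered when
    an ICMP "Too Big" message arrives."""

    # Common "plateau" MTUs to fall back on when the ICMP message
    # carries no next-hop MTU (very old routers).
    PLATEAUS = [9000, 4470, 1500, 1006, 576]

    def __init__(self, if_mtu):
        self.if_mtu = if_mtu
        self.estimates = {}

    def estimate(self, dst):
        return self.estimates.get(dst, self.if_mtu)

    def icmp_too_big(self, dst, next_hop_mtu=None):
        current = self.estimate(dst)
        if next_hop_mtu:                 # use the reported next-hop MTU
            self.estimates[dst] = min(current, next_hop_mtu)
        else:                            # guess the next smaller plateau
            smaller = [m for m in self.PLATEAUS if m < current]
            self.estimates[dst] = smaller[0] if smaller else current

cache = PMTUDCache(if_mtu=9000)
cache.icmp_too_big("192.0.2.99", next_hop_mtu=1500)
print(cache.estimate("192.0.2.99"))   # lowered to 1500
cache.icmp_too_big("192.0.2.99")      # no MTU info in the ICMP message
print(cache.estimate("192.0.2.99"))   # falls to the next plateau, 1006
```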

This method is widely implemented, but is not robust in today's Internet because it relies on ICMP packets sent by routers along the path. Such packets are often suppressed either at the router that should generate them (to protect its resources) or on the way back to the source, because of firewalls and other packet filters or rate limitations. These problems are described in RFC2923. When packets are lost due to MTU issues without any ICMP "Too Big" message, this is sometimes called a (MTU) black hole. Some operating systems have added heuristics to detect such black holes and work around them. Workarounds can include lowering the MTU estimate or disabling PMTUD for certain destinations.

Packetization Layer Path MTU Discovery (PLPMTUD, RFC 4821)

An IETF Working Group (pmtud) was chartered to define a new mechanism for Path MTU Discovery to solve these issues. This process resulted in RFC 4821, Packetization Layer Path MTU Discovery ("PLPMTUD"), which was published in March 2007. This scheme requires cooperation from a network layer above IP, namely the layer that performs "packetization". This could be TCP, but could also be a layer above UDP, let's say an RPC or file transfer protocol. PLPMTUD does not require ICMP messages. The sending packetization layer starts with small packets, and probes progressively larger sizes. When there's an indication that a larger packet was successfully transmitted to the destination (presumably because some sort of ACK was received), the Path MTU estimate is raised accordingly.

When a large packet was lost, this might have been due to an MTU limitation, but it might also be due to other causes, such as congestion or a transmission error - or maybe it's just the ACK that was lost! PLPMTUD recommends that the first time this happens, the sending packetization layer should assume an MTU issue, and try smaller packets. An isolated incident need not be interpreted as an indication of congestion.
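The probing logic can be sketched as a search over packet sizes. RFC 4821 leaves the search strategy to the implementation, so the binary search below is just one plausible choice; the key property is that the sender only learns from ACKs, never from ICMP:

```python
def plpmtud_search(path_mtu, low=1024, high=9000):
    """Binary-search sketch of RFC 4821 probing.  A probe "succeeds"
    (is ACKed) iff it fits the real path MTU, which the sender never
    observes directly.  Returns the final estimate and the probe sizes
    that were sent."""
    probes = []
    while low < high:
        probe = (low + high + 1) // 2
        probes.append(probe)
        if probe <= path_mtu:      # ACK received: raise the estimate
            low = probe
        else:                      # probe lost: assume an MTU limit
            high = probe - 1
    return low, probes

mtu, probes = plpmtud_search(path_mtu=1500)
print(mtu, len(probes))
```

In a real stack, a lost probe is retried before concluding anything, since the loss may have been congestion rather than an MTU limit, exactly the ambiguity discussed above.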

An implementation of the new scheme for the Linux kernel was integrated into version 2.6.17. It is controlled by a "sysctl" value that can be observed and set through /proc/sys/net/ipv4/tcp_mtu_probing. Possible values are:

  • 0: Don't perform PLPMTUD
  • 1: Perform PLPMTUD only after detecting a "blackhole" in old-style PMTUD
  • 2: Always perform PLPMTUD, and use the value of tcp_base_mss as the initial MSS.

A user-space implementation over UDP is included in the VFER bulk transfer tool.

References

Implementations

  • for Linux 2.6 - integrated in mainstream kernel as of 2.6.17. However, it is disabled by default (see net.ipv4.tcp_mtu_probing sysctl)
  • for NetBSD

-- Hank Nussbacher - 2005-07-03 -- Simon Leinen - 2006-07-19 - 2018-07-02

Network Protocols

This section contains information about a few common Internet protocols, with a focus on performance questions.

For information about higher-level protocols, see under application protocols.

-- SimonLeinen - 2004-10-31 - 2015-04-26

TCP (Transmission Control Protocol)

The Transmission Control Protocol (TCP) is the prevalent transport protocol used on the Internet today. It uses window-based transmission to provide the service of a reliable byte stream, and adapts the rate of transfer to the state (of congestion) of the network and the receiver. Basic mechanisms include:

  • Segments that fit into IP packets, into which the byte-stream is split by the sender,
  • a checksum for each segment,
  • a Window, which bounds the amount of data "in flight" between the sender and the receiver,
  • Acknowledgements, by which the receiver tells the sender about segments that were received successfully,
  • a flow-control mechanism, which regulates data rate as a function of network and receiver conditions.
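The window mechanism directly bounds throughput: at most one window of data can be in flight per round-trip time. A quick back-of-the-envelope calculation (illustrative numbers):

```python
def max_throughput(window_bytes, rtt_seconds):
    """Window-based transmission cannot exceed one window per RTT:
    throughput <= window / RTT."""
    return window_bytes / rtt_seconds

# The classic 64 KiB maximum window on a 100 ms path caps TCP at
# roughly 5 Mbit/s, which is why the RFC 7323 window scaling
# option matters on long fat networks:
bps = max_throughput(65535, 0.100) * 8
print(round(bps / 1e6, 2))   # ~5.24 Mbit/s
```

Conversely, filling a 1 Gbit/s path at 100 ms RTT requires a window of about 12.5 MB, far beyond the unscaled 64 KiB limit.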

Originally specified in RFC 793 (September 1981), TCP was clarified, refined and extended in many documents. Perhaps most importantly, congestion control was introduced in "TCP Tahoe" in 1988, described in Van Jacobson's 1988 SIGCOMM article on "Congestion Avoidance and Control". It can be said that TCP's congestion control is what keeps the Internet working when links are overloaded. In today's Internet, the enhanced "Reno" variant of congestion control is probably the most widespread.

RFC 7323 (formerly RFC 1323) specifies a set of "TCP Extensions for High Performance", namely the Window Scaling Option, which provides for much larger windows than the original 64K, the Timestamp Option and the PAWS (Protection Against Wrapped Sequence numbers) mechanism. These extensions are supported by most contemporary TCP stacks, although they frequently must be activated explicitly (or implicitly by configuring large TCP windows).

Another widely implemented performance enhancement to TCP are selective acknowledgements (SACK, RFC 2018). In TCP as originally specified, the acknowledgements ("ACKs") sent by a receiver were always "cumulative", that is, they specified the last byte of the part of the stream that was completely received. This is advantageous with large TCP windows, in particular where chances are high that multiple segments within a single window are lost. RFC 2883 describes an extension to SACK which makes TCP more robust in the presence of packet reordering in the network.

In addition, RFC 3449 - TCP Performance Implications of Network Path Asymmetry provides excellent information since a vast majority of Internet connections are asymmetrical.

Further Reading

References

  • RFC 7414, A Roadmap for Transmission Control Protocol (TCP) Specification Documents, M. Duke, R. Braden, W. Eddy, E. Blanton, A. Zimmermann, February 2015
  • RFC 793, Transmission Control Protocol, J. Postel, September 1981
  • draft-ietf-tcpm-rfc793bis-10, Transmission Control Protocol Specification, Wesley M. Eddy, July 2018
  • RFC 7323, TCP Extensions for High Performance, D. Borman, B. Braden, V. Jacobson, R. Scheffenegger, September 2014 (obsoletes RFC 1323 from May 1993)
  • RFC 2018, TCP Selective Acknowledgment Options, M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, October 1996
  • RFC 5681, TCP Congestion Control, M. Allman, V. Paxson, E. Blanton, September 2009
  • RFC 3449, TCP Performance Implications of Network Path Asymmetry, H. Balakrishnan, V. Padmanabhan, G. Fairhurst, M. Sooriyabandara, December 2002

There is ample literature on TCP, in particular research literature on its performance and large-scale behaviour.

Juniper has a nice white paper, Supporting Differentiated Service Classes: TCP Congestion Control Mechanisms (PDF format) explaining TCP's congestion control as well as many of the enhancements proposed over the years.

-- Simon Leinen - 2004-10-27 - 2018-07-02 -- Ulrich Schmid - 2005-05-31

Revision 24 - 2007-01-25 - SimonLeinen


SmokePing

SmokePing is a software-based measurement framework that uses various software modules (called probes) to measure round-trip times, packet loss and availability at layer 3 (IPv4 and IPv6), and even application latencies. Layer 3 measurements are based on the ping tool, and for the analysis of applications there are probes for measuring DNS lookup and RADIUS authentication latencies. The measurements are centrally controlled on a single host from which the software probes are started, which in turn emit active measurement flows. The load impact of these streams on the network is usually negligible.

As with MRTG, the results are stored in RRD databases at the original polling intervals for 24 hours and are then aggregated over time; the peak and mean values over 24-hour intervals are stored for more than one year. Like MRTG, the results are usually displayed in a web browser, using HTML-embedded graphs with daily, weekly, monthly and yearly timeframes. A particular strength of SmokePing is the graphical manner in which it displays the statistical distribution of latency values over time.

The tool also has an alarm feature that, based on flexible threshold rules, either sends out emails or runs external scripts.

The following picture shows example output of a weekly graph. The background colour indicates the link availability, and the foreground lines display the mean round-trip times. The shadows around the lines graphically indicate the statistical distribution of the measured round-trip times.

  • Example SmokePing graph:
    Example SmokePing graph

The 2.4* versions of Smokeping included SmokeTrace, an AJAX-based traceroute tool similar to mtr, but browser/server based. This was removed in 2.5.0 in favor of a separate, more general, tool called remOcular.

In August 2013, Tobi Oetiker announced that he received funding for the development of SmokePing 3.0. This will use Extopus as a front-end. This new version will be developed in public on Github. Another plan is to move to an event-based design that will make Smokeping more efficient and allow it to scale to a large number of probes.

Related Tools

The OpenNMS network management platform includes a tool called StrafePing which is heavily inspired by SmokePing.

References

-- SimonLeinen - 2006-04-07 - 2013-11-03

OWAMP (One-Way Active Measurement Tool)

OWAMP is a command line client application and a policy daemon used to determine one way latencies between hosts. It is an implementation of the OWAMP protocol (see references) that is currently going through the standardization process within the IPPM WG in the IETF.

With roundtrip-based measurements, it is hard to isolate the direction in which congestion is experienced. One-way measurements solve this problem and make the direction of congestion immediately apparent. Since traffic can be asymmetric at many sites that are primarily producers or consumers of data, this allows for more informative measurements. One-way measurements allow the user to better isolate the effects of specific parts of a network on the treatment of traffic.

The current OWAMP implementation (V3.1) supports IPv4 and IPv6.

Example using owping (OWAMP v3.1) on a linux host:

welti@atitlan:~$ owping ezmp3
Approximately 13.0 seconds until results available

--- owping statistics from [2001:620:0:114:21b:78ff:fe30:2974]:52887 to [ezmp3]:41530 ---
SID:    feae18e9cd32e3ee5806d0d490df41bd
first:  2009-02-03T16:40:31.678
last:   2009-02-03T16:40:40.894
100 sent, 0 lost (0.000%), 0 duplicates
one-way delay min/median/max = 2.4/2.5/3.39 ms, (err=0.43 ms)
one-way jitter = 0 ms (P95-P50)
TTL not reported
no reordering


--- owping statistics from [ezmp3]:41531 to [2001:620:0:114:21b:78ff:fe30:2974]:42315 ---
SID:    fe302974cd32e3ee7ea1a5004fddc6ff
first:  2009-02-03T16:40:31.479
last:   2009-02-03T16:40:40.685
100 sent, 0 lost (0.000%), 0 duplicates
one-way delay min/median/max = 1.99/2.1/2.06 ms, (err=0.43 ms)
one-way jitter = 0 ms (P95-P50)
TTL not reported
no reordering

References

-- ChrisWelti - 03 Apr 2006 - 03 Feb 2009 -- SimonLeinen - 29 Mar 2006 - 01 Dec 2008

Active Measurement Boxes

-- FrancoisXavierAndreu & SimonLeinen - 06 Jun 2005 - 06 Jul 2005

HADES

Hades Active Delay Evaluation System (HADES) devices (previously called IPPM devices) were developed by the WiN-Labor at RRZE (Regional Computing Centre Erlangen) to provide QoS measurements in DFN's G-WiN infrastructure, based on the metrics developed by the IETF IPPM WG. In addition to DFN's backbone, HADES devices have been deployed in numerous GEANT/GEANT2 Points of Presence (PoPs).

Example Output

The examples that follow were generated using the new (experimental) HADES front-end in early June 2006.

One-Way Delay Plots

A typical one-way delay plot looks like this.

  • IPPM Example: FCCN-GARR one-way delays:
    IPPM Example: FCCN-GARR one-way delays

The default view selects the scale on the y (delay) axis so that all values fit in. In this case we see that there were two cases (around 12:30 and around 18:10) where there were "outliers" with delays up to 190ms, much higher than the average of about 30ms.

In the presence of such outliers, the auto-scaling feature makes it hard to discern variations in the non-outlying values. Fortunately, the new interface allows us to select the y axis scale by hand, so that we can zoom in to the delay range that we are interested in:

  • IPPM Example: FCCN-GARR one-way delay, narrower delay scale:
    IPPM Example: FCCN-GARR one-way delay, narrower delay scale

It now becomes evident that most of the samples lie in a very narrow band between 29.25 and 29.5 milliseconds. Interestingly, there are a few outliers towards lower delays. These could be artifacts of clock inaccuracies, or they could point to some kind of routing anomaly, although the latter seems less probable, because routing anomalies normally don't lead to shorter delays.

Concluding Remarks

Note that the IPPM system is being developed for the NREN community, hence its feature developments focus on that specific community needs; for example, it implements measurement of out of order packets as well as metric measurements for different IP Precedence values.

References

-- FrancoisXavierAndreu & SimonLeinen - 06 Jun 2005 - 01 Jun 2006

 

RIPE TTM Boxes

Summary

The Test Traffic Measurements service has been offered by the RIPE NCC as a service since the year 2000. The system continuously records one-way delay and packet-loss measurements as well as router-level paths ("traceroutes") between a large set of probe machines ("test boxes"). The test boxes are hosted by many different organisations, including NRENs, commercial ISPs, universities and others, and usually maintained by RIPE NCC as part of the service. While the vast majority of the test boxes is in Europe, there are a couple of machines in other continents, including outside the RIPE NCC's service area. Every test box includes a precise time source - typically a GPS receiver - so that accurate one-way delay measurements can be provided. Measurement data are entered in a central database at RIPE NCC's premises every night. The database is based on CERN's ROOT system. Measurement results can be retrieved over the Web using various presentations, both pre-generated and "on-demand".

Applicability

RIPE TTM data is often useful to find out "historical" quality information about a network path of interest, provided that TTM test boxes are deployed near (in the topological sense, not just geographically) the ends of the path. For research network paths throughout Europe, the coverage of the RIPE TTM infrastructure is not complete, but quite comprehensive. When one suspects that there is (non-temporary) congestion on a path covered by the RIPE TTM measurement infrastructure, the TTM graphs can be used to easily verify that, because such congestion will show up as delay and/or loss if present.

The RIPE TTM system can also be used by operators to precisely measure - but only after the fact - the impact of changes in the network. These changes can include disturbances such as network failures or misconfigurations, but also things such as link upgrades or routing improvements.

Notes

The TTM project has provided extremely high-quality and stable delay and loss measurements for a large set of network paths (IPv4 and recently also IPv6) throughout Europe. These paths cover an interesting mix of research and commercial networks. The Web interface to the collected measurements supports in-depth exploration quite well, such as looking at the delay/loss evolution of a specific path over both a wide range of intervals from the very short to the very long. On the other hand, it is hard to get at useful "overview" pictures. The RIPE NCC's DNS Server Monitoring (DNSMON) service provides such overviews for specific sets of locations.

The raw data collected by the RIPE TTM infrastructure is not generally available, in part because of restrictions on data disclosure. This is understandable in that RIPE NCC's main member/customer base consists of commercial ISPs, who presumably wouldn't allow full disclosure for competitive reasons. But it also means that in practice, only the RIPE NCC can develop new tools that make use of these data, and their resources for doing so are limited, in part due to lack of support from their ISP membership. On a positive note, the RIPE NCC has, on several occasions, given scientists access to the measurement infrastructure for research. The TTM team also offers to extract raw data (in ROOT format) from the TTM database manually on demand.

Another drawback of RIPE TTM is that the central TTM database is only updated once a day. This makes the system impossible to use for near-real-time diagnostic purposes. (For test box hosts, it is possible to access the data collected by one's hosted test boxes in near real time, but this provides only very limited functionality, because only information about "inbound" packets is available locally, and therefore one can neither correlate delays with path changes, nor can one compute loss figures without access to the sending host's data.) In contrast, RIPE NCC's Routing Information Service (RIS) infrastructure provides several ways to get at the collected data almost as it is collected. And because RIS' raw data is openly available in a well-defined and relatively easy-to-use format, it is possible for third parties to develop innovative and useful ways to look at that data.

Examples

The following screenshot shows an on-demand graph of the delays over a 24-hour period from one test box hosted by SWITCH (tt85 at ETH Zurich) to the other one (tt86 at the University of Geneva). The delay range has been narrowed so that the fine-grained distribution can be seen. Five packets are listed as "overflow" under "STATISTICS - Delay & Hops" because their one-way delays exceeded the range of the graph. This is an example of an almost completely uncongested path, showing no packet loss and the vast majority of delays very close to the "baseline" value.

ripe-ttm-sample-delay-on-demand.png

The following plot shows the impact of a routing change on a path to a test box in New Zealand (tt47). The hop count decreases by one, but the base one-way delay increases by about 20ms. The old and new routes can be retrieved from the Web interface too. Note also that the delay values are spread out over a fairly large range, which indicates that there is some congestion (and thus queueing) on the path. The loss rate over the entire day is 1.1%. An interesting question is how much of this is due to the routing change and how much (if any) is due to "normal" congestion. The lower right graph shows "arrived/lost" packets over time and indicates that most packets were in fact lost during the path change - so although some congestion is visible in the delays, it is not severe enough to cause visible loss. On the other hand, given the total number of probe packets sent (about 2850 per day), congestion loss would probably go unnoticed until it reached a level of 0.1% or so.

day.tt85.tt47.gif

References

-- FrancoisXavierAndreu & SimonLeinen - 06 Jun 2005 - 06 Jul 2005

QoSMetrics Boxes

This measurement infrastructure was built as part of RENATER's Métrologie efforts. The infrastructure performs continuous delay measurement between a mesh of measurement points on the RENATER backbone. The measurements are sent to a central server every minute, and are used to produce a delay matrix and individual delay histories (IPv4 and IPv6 measurements, soon for each CoS).

Please note that all results (tables and graphs) presented on this page are generated by RENATER's scripts from the QoSMetrics MySQL database.

Screendump from RENATER site (http://pasillo.renater.fr/metrologie/get_qosmetrics_results.php)

new_Renater_IPPM_results_screendump.jpg

Example of asymmetric path:

(after the problem was resolved, only one path had its hop count decreased)

Graphs:

Paris_Toulouse_delay Toulouse_Paris_delay
Paris_Toulouse_jitter Toulouse_Paris_jitter
Paris_Toulouse_hop_number Toulouse_Paris_hop_number
Paris_Toulouse_pktsLoss Toulouse_Paris_pktsLoss

Traceroute results:

traceroute to 193.51.182.xx (193.51.182.xx), 30 hops max, 38 byte packets
1 209.renater.fr (195.98.238.209) 0.491 ms 0.440 ms 0.492 ms
2 gw1-renater.renater.fr (193.49.159.249) 0.223 ms 0.219 ms 0.252 ms
3 nri-c-g3-0-50.cssi.renater.fr (193.51.182.6) 0.726 ms 0.850 ms 0.988 ms
4 nri-b-g14-0-0-101.cssi.renater.fr (193.51.187.18) 11.478 ms 11.336 ms 11.102 ms
5 orleans-pos2-0.cssi.renater.fr (193.51.179.66) 50.084 ms 196.498 ms 92.930 ms
6 poitiers-pos4-0.cssi.renater.fr (193.51.180.29) 11.471 ms 11.459 ms 11.354 ms
7 bordeaux-pos1-0.cssi.renater.fr (193.51.179.254) 11.729 ms 11.590 ms 11.482 ms
8 toulouse-pos1-0.cssi.renater.fr (193.51.180.14) 17.471 ms 17.463 ms 17.101 ms
9 xx.renater.fr (193.51.182.xx) 17.598 ms 17.555 ms 17.600 ms

[root@CICT root]# traceroute 195.98.238.xx
traceroute to 195.98.238.xx (195.98.238.xx), 30 hops max, 38 byte packets
1 toulouse-g3-0-20.cssi.renater.fr (193.51.182.202) 0.200 ms 0.189 ms 0.111 ms
2 bordeaux-pos2-0.cssi.renater.fr (193.51.180.13) 16.850 ms 16.836 ms 16.850 ms
3 poitiers-pos1-0.cssi.renater.fr (193.51.179.253) 16.728 ms 16.710 ms 16.725 ms
4 nri-a-pos5-0.cssi.renater.fr (193.51.179.17) 22.969 ms 22.956 ms 22.972 ms
5 nri-c-g3-0-0-101.cssi.renater.fr (193.51.187.21) 22.972 ms 22.961 ms 22.844 ms
6 gip-nri-c.cssi.renater.fr (193.51.182.5) 17.603 ms 17.582 ms 17.616 ms
7 250.renater.fr (193.49.159.250) 17.836 ms 17.718 ms 17.719 ms
8 xx.renater.fr (195.98.238.xx) 17.606 ms 17.707 ms 17.226 ms

-- FrancoisXavierAndreu & SimonLeinen - 06 Jun 2005 - 06 Jul 2005 - 07 Apr 2006

Passive Measurement Tools

Network Usage:

  • SNMP-based tools retrieve network information such as state and utilization of links, router CPU loads, etc.: MRTG, Cricket
  • Netflow-based tools use flow-based accounting information from routers for traffic analysis, detection of routing and security problems, denial-of-service attacks etc.

Traffic Capture and Analysis Tools

-- FrancoisXavierAndreu - 06 Jun 2005 -- SimonLeinen - 05 Jan 2006

Revision 232006-11-17 - PekkaSavola

Line: 1 to 1
 
META TOPICPARENT name="AutoToc"

Unrolled PERTKB Contents

Line: 112 to 112
 

Operating-System Specific Configuration Hints

This topic points to configuration hints for specific operating systems.

Note that the main perspective of these hints is how to achieve good performance in a big bandwidth-delay-product environment, typically with a single stream. See the ServerScaling topic for some guidance on tuning servers for many concurrent connections.

-- SimonLeinen - 27 Oct 2004 - 02 Nov 2008
-- PekkaSavola - 02 Oct 2008

OS-Specific Configuration Hints: BSD Variants

Performance tuning

By default, earlier BSD versions use very modest buffer sizes and don't enable Window Scaling. See the references below for how to do so.

In contrast, FreeBSD 7.0 introduced TCP buffer auto-tuning, and thus should provide good TCP performance out of the box even over LFNs. This release also implements large-send offload (LSO) and large-receive offload (LRO) for some Ethernet adapters. FreeBSD 7.0 also announces the following in its presentation of technological advances:

10Gbps network optimization: With optimized device drivers from all major 10gbps network vendors, FreeBSD 7.0 has seen extensive optimization of the network stack for high performance workloads, including auto-scaling socket buffers, TCP Segment Offload (TSO), Large Receive Offload (LRO), direct network stack dispatch, and load balancing of TCP/IP workloads over multiple CPUs on supporting 10gbps cards or when multiple network interfaces are in use simultaneously. Full vendor support is available from Chelsio, Intel, Myricom, and Neterion.

Recent TCP Work

The FreeBSD Foundation has awarded a grant for a project "Improvements to the FreeBSD TCP Stack" to Lawrence Stewart at Swinburne University. Goals for this project include support for Appropriate Byte Counting (ABC, RFC 3465), merging SIFTR into the FreeBSD codebase, and improving the implementation of the reassembly queue. Information is available at http://caia.swin.edu.au/urp/newtcp/. Other improvements by this group include support for modular congestion control, implementations of CUBIC, H-TCP and TCP Vegas, the SIFTR TCP instrumentation tool, and a testing framework including improvements to iperf for better buffer size control.

The CUBIC implementation for FreeBSD was announced on the ICCRG mailing list in September 2009. Currently available as patches for the 7.0 and 8.0 kernels, it is planned to be merged "into the mainline FreeBSD source tree in the not too distant future".

In February 2010, a set of Software for FreeBSD TCP R&D was announced by the Swinburne group: This includes modular TCP congestion control, Hamilton TCP (H-TCP), a newer "Hamilton Delay" Congestion Control Algorithm v0.1, Vegas Congestion Control Algorithm v0.1, as well as a kernel helper/hook framework ("Khelp") and a module (ERTT) to improve RTT estimation.

Another release in August 2010 added a "CAIA-Hamilton Delay" congestion control algorithm as well as revised versions of the other components.

QoS tools

On BSD systems ALTQ implements a couple of queueing/scheduling algorithms for network interfaces, as well as some other QoS mechanisms.

To use ALTQ on a FreeBSD 5.x or 6.x box, the following are the necessary steps:

  1. build a kernel with ALTQ
    • options ALTQ and some others beginning with ALTQ_ should be added to the kernel config. Please refer to the ALTQ(4) man page.
  2. define QoS settings in /etc/pf.conf
  3. use pfctl to apply those settings
    • Set pf_enable to YES in /etc/rc.conf (set as well the other variables related to pf according to your needs) in order to apply the QoS settings every time the host boots.

References

-- SimonLeinen - 27 Jan 2005 - 18 Aug 2010
-- PekkaSavola - 17 Nov 2006

Linux-Specific Network Performance Tuning Hints

Linux has its own implementation of the TCP/IP Stack. With recent kernel versions, the TCP/IP implementation contains many useful performance features. Parameters can be controlled via the /proc interface, or using the sysctl mechanism. Note that although some of these parameters have ipv4 in their names, they apply equally to TCP over IPv6.

A typical configuration for high TCP throughput over paths with high bandwidth*delay product would include the following in /etc/sysctl.conf:

A description of each parameter listed below can be found in section Linux IP Parameters.

Basic tuning

TCP Socket Buffer Tuning

See the EndSystemTcpBufferSizing topic for general information about sizing TCP buffers.

Since the 2.6.17 kernel, buffers are automatically tuned to sensible values for most uses. Unless the path has a very high RTT, significant loss, or high performance requirements (200+ Mbit/s), the buffer settings may not need to be tuned at all.

Nonetheless, the following values may be used:

net/core/rmem_max=16777216
net/core/wmem_max=16777216
net/ipv4/tcp_rmem="8192 87380 16777216"
net/ipv4/tcp_wmem="8192 65536 16777216"
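As a sanity check on the 16 MiB maxima above, the window a transfer needs equals the bandwidth-delay product of the path. A quick sketch with illustrative numbers (the 1 Gbit/s rate and 100 ms RTT are assumed example values, not from the text):

```python
# Hypothetical sketch: estimate the TCP window needed for a given path.

def bdp_bytes(rate_bits_per_s, rtt_s):
    """Bandwidth-delay product in bytes."""
    return int(rate_bits_per_s * rtt_s / 8)

# A 1 Gbit/s path with 100 ms RTT needs about 12.5 MB of window,
# so a 16 MiB maximum (16777216 bytes, ~134 ms worth at 1 Gbit/s)
# leaves some headroom.
needed = bdp_bytes(1e9, 0.100)
print(needed)  # 12500000
```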

With kernels older than 2.4.27 or 2.6.7, receive-side autotuning may not be implemented, and the default (middle) value should be increased (at the cost of higher default memory consumption):

net/ipv4/tcp_rmem="8192 16777216 16777216"

NOTE: If you have a server with hundreds of connections, you might not want to use a large default value for TCP buffers, as memory may quickly run out.

There is a subtle but important implementation detail in the socket buffer management of Linux. When setting either the send or receive buffer size via the SO_SNDBUF or SO_RCVBUF socket options with setsockopt(2), the value passed in the system call is doubled by the kernel to accommodate buffer management overhead. Reading the value back with getsockopt(2) returns this modified value, but the effective buffer available to TCP payload is still the original value.

The values net/core/rmem_max and net/core/wmem_max apply to the argument to setsockopt(2).

In contrast, the maximum values of net/ipv4/tcp_rmem and net/ipv4/tcp_wmem apply to the total buffer sizes, including the factor of 2 for the buffer management overhead. As a consequence, those values must be chosen twice as large as required by a particular BandwidthDelayProduct. Also note that net/core/rmem_max and net/core/wmem_max do not apply to the TCP autotuning mechanism.
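The doubling behaviour described above can be observed directly; a minimal sketch, assuming a Linux host (other kernels may report the value unchanged):

```python
# Demonstrate Linux doubling the requested socket buffer size.
import socket

requested = 16384
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
# On Linux, getsockopt() reports twice the requested value
# (as long as the request is within net.core.rmem_max).
effective = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
s.close()
print(requested, effective)
```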

Interface queue lengths

InterfaceQueueLength describes how to adjust interface transmit and receive queue lengths. This tuning is typically needed with GE or 10GE transfers.

Host/adapter architecture implications

When going for 300 Mbit/s performance, it is worth verifying that the host architecture (e.g., PCI bus) is fast enough. PCI Express is usually fast enough to no longer be the bottleneck in 1 Gb/s and even 10 Gb/s applications.

For the older PCI/PCI-X buses, when going for 2+ Gbit/s performance, the Maximum Memory Read Byte Count (MMRBC) usually needs to be increased using setpci.

Many network adapters support features such as checksum offload. In some cases, however, these may even decrease performance. In particular, TCP Segment Offload may need to be disabled, with:

ethtool -K eth0 tso off

Advanced tuning

Sharing congestion information across connections/hosts

2.4-series kernels have a TCP/IP weakness in that the maximum window size is cached based on the experience of previous connections: if you have loss at any point (or a slow end host on the same route), future TCP connections to that destination are limited. In that case, you may have to flush the route cache to improve performance.

net.ipv4.route.flush=1

2.6 kernels also remember some performance characteristics across connections. In benchmarks and other tests, this might not be desirable.

# don't cache ssthresh from previous connection
net.ipv4.tcp_no_metrics_save=1

Other TCP performance variables

If there is packet reordering in the network, the reordering can too easily be interpreted as packet loss. Increasing the tcp_reordering parameter might help in that case:

net/ipv4/tcp_reordering=20   # (default=3)

Several variables already have good default values, but it may make sense to check that these defaults haven't been changed:

net/ipv4/tcp_timestamps=1
net/ipv4/tcp_window_scaling=1
net/ipv4/tcp_sack=1
net/ipv4/tcp_moderate_rcvbuf=1

TCP Congestion Control algorithms

Linux 2.6.13 introduced pluggable congestion control modules, which allow you to select one of the high-speed TCP congestion control variants, e.g. CUBIC:

net/ipv4/tcp_congestion_control = cubic

Alternative values include highspeed (HS-TCP), scalable (Scalable TCP), htcp (Hamilton TCP), bic (BIC), reno ("Reno" TCP), and westwood (TCP Westwood).

Note that on Linux 2.6.19 and later, CUBIC is already used as the default algorithm.
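The per-socket equivalent of this sysctl is the TCP_CONGESTION socket option (exposed by Python 3.6+ on Linux); a minimal sketch, assuming a Linux host where the "reno" module is available (it is always compiled in):

```python
# Select a congestion control algorithm for a single socket on Linux.
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# "reno" is always built in; unprivileged processes may only choose
# algorithms listed in net.ipv4.tcp_allowed_congestion_control.
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"reno")
# Read it back; the kernel returns a NUL-padded name buffer.
raw = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
name = raw.split(b"\x00")[0]
s.close()
print(name)
```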

Web100 kernel tuning

If you are using a web100 kernel, the following parameters seem to improve networking performance even further:

# web100 tuning
# turn off using txqueuelen as part of congestion window computation
net/ipv4/WAD_IFQ = 1

QoS tools

Modern Linux kernels have flexible traffic shaping built in.

See the Linux traffic shaping example for an illustration of how these mechanisms can be used to solve a real performance problem.

References

See also the reference to Congestion Control in Linux in the FlowControl topic.

-- SimonLeinen - 27 Oct 2004 - 23 Jul 2009
-- ChrisWelti - 27 Jan 2005
-- PekkaSavola - 17 Nov 2006
-- AlexGall - 28 Nov 2016

Changed:
<
<
Warning: Can't find topic PERTKB.TxQueueLength
>
>

Interface queue lengths

txqueuelen

The txqueuelen parameter limits the number of packets in the transmission queue of an interface's device driver in the Linux kernel. In 2.6-series kernels and e.g. the RHEL3 (2.4.21) kernel, the default is 1000. In earlier kernels, the default is 100. '100' is often too low to support line-rate transfers over Gigabit Ethernet interfaces, and in some cases, even '1000' is too low.

For Gigabit Ethernet interfaces, it is suggested to use a txqueuelen of at least 1000 (values of up to 8000 have been used successfully to further improve performance), e.g.:

ifconfig eth0 txqueuelen 1000

On a low-performance host, or one with slow links, too large a txqueuelen may degrade interactive performance.
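On Linux, the current value can also be read from sysfs without ifconfig; a small sketch (the loopback interface is used here only because it exists on any Linux host):

```python
# Read an interface's transmit queue length from sysfs on Linux
# (the same value that ifconfig displays as "txqueuelen").
iface = "lo"  # assumption: a Linux host with a loopback interface

with open(f"/sys/class/net/{iface}/tx_queue_len") as f:
    txqueuelen = int(f.read())

print(iface, txqueuelen)
```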

netdev_max_backlog

The sysctl net.core.netdev_max_backlog defines the queue size for received packets. In recent kernels (like 2.6.18), the default is 1000; in older ones, it is 300. If the interface receives packets (e.g., in a burst) faster than the kernel can process them, this queue could overflow. A value on the order of thousands should be reasonable for GE, tens of thousands for 10GE.

For example,

net/core/netdev_max_backlog=2500

References

-- SimonLeinen - 06 Jan 2005
-- PekkaSavola - 17 Nov 2006

 

OS-Specific Configuration Hints: Mac OS X

As Mac OS X is mainly a BSD derivative, you can use similar mechanisms to tune the TCP stack - see under BsdOSSpecific.

TCP Socket Buffer Tuning

See the EndSystemTcpBufferSizing topic for general information about sizing TCP buffers.

For testing temporary improvements, you can directly use sysctl in a terminal window: (you have to be root to do that)

sysctl -w kern.ipc.maxsockbuf=8388608
sysctl -w net.inet.tcp.rfc1323=1
sysctl -w net.inet.tcp.sendspace=1048576
sysctl -w net.inet.tcp.recvspace=1048576
sysctl -w kern.maxfiles=65536
sysctl -w net.inet.udp.recvspace=147456
sysctl -w net.inet.udp.maxdgram=57344
sysctl -w net.local.stream.recvspace=65535
sysctl -w net.local.stream.sendspace=65535

For permanent changes that persist across reboots, insert the appropriate settings into /etc/sysctl.conf. If this file does not exist, you must create it. So, for the above, just add the following lines to sysctl.conf:

kern.ipc.maxsockbuf=8388608
net.inet.tcp.rfc1323=1
net.inet.tcp.sendspace=1048576
net.inet.tcp.recvspace=1048576
kern.maxfiles=65536
net.inet.udp.recvspace=147456
net.inet.udp.maxdgram=57344
net.local.stream.recvspace=65535
net.local.stream.sendspace=65535

Note: This only works for Mac OS X 10.3 or later! For earlier versions you need to use /etc/rc, where you can enter whole sysctl commands.

Users that are unfamiliar with terminal windows can also use the GUI tool "TinkerTool System" and use its Network Tuning option to set the TCP buffers.

TinkerTool System is available from:

-- ChrisWelti - 30 Jun 2005

OS-Specific Configuration Hints: Solaris (Sun Microsystems)

Planned Features

Pluggable Congestion Control for TCP and SCTP

A proposed OpenSolaris project foresees the implementation of pluggable congestion control for both TCP and SCTP. This includes implementations of the HighSpeed (HS-TCP), CUBIC, Westwood+, and Vegas congestion control algorithms, as well as ipadm subcommands and socket options to get and set congestion control parameters.

On 15 December 2009, Artem Kachitchkine posted an initial draft of a work-in-progress design specification for this feature on the OpenSolaris networking-discuss forum. According to this proposal, the API for setting the congestion control mechanism for a specific TCP socket will be compatible with Linux: There will be a TCP_CONGESTION socket option to set and retrieve a socket's congestion control algorithm, as well as a TCP_INFO socket option to retrieve various kinds of information about the current congestion control state.

The entire pluggable congestion control mechanism will be implemented for SCTP in addition to TCP. For example, there will also be an SCTP_CONGESTION socket option. Note that congestion control in SCTP is somewhat trickier than in TCP, because a single SCTP socket can have multiple underlying paths through SCTP's "multi-homing" feature. Congestion control state must be kept separately for each path (address pair). This also means that there is no direct SCTP equivalent to TCP_INFO. The current proposal adds a subset of TCP_INFO's information to the result of the existing SCTP_GET_PEER_ADDR_INFO option for getsockopt().

The internal structure of the code will be somewhat different from what is in the Linux kernel. In particular, the general TCP code will only make calls to the algorithm-specific congestion control modules, not vice versa. The proposed Solaris mechanism also contains ipadm properties that can be used to set the default congestion control algorithm either globally or for a specific zone. The proposal also suggests "observability" features; for example, pfiles output should include the congestion algorithm used for a socket, and there are new kstat statistics that count certain congestion-control events.

Useful Features

TCP Multidata Transmit (MDT, aka LSO)

Solaris 10, and Solaris 9 with patches, supports TCP Multidata Transmit (MDT), which is Sun's name for (software-only) Large Send Offload (LSO). In Solaris 10, this is enabled by default, but in Solaris 9 (with the required patches for MDT support), the kernel and driver have to be reconfigured to be able to use MDT. See the following pointers for more information from docs.sun.com:

Solaris 10 "FireEngine"

The TCP/IP stack in Solaris 10 has been largely rewritten from previous versions, mostly to improve performance. In particular, it supports Interrupt Coalescence, integrates TCP and IP more closely in the kernel, and provides multiprocessing enhancements to distribute work more efficiently over multiple processors. Ongoing work includes UDP/IP integration for better performance of UDP applications, and a new driver architecture that can make use of flow classification capabilities in modern network adapters.

Solaris 10: New Network Device Driver Architecture

Solaris 10 introduces GLDv3 (project "Nemo"), a new driver architecture that generally improves performance, and adds support for several performance features. Some, but not all, Ethernet device drivers were ported over to the new architecture and benefit from those improvements. Notably, the bge driver was ported early, and the new "Neptune" adapters ("multithreaded" dual-port 10GE and four-port GigE with on-board connection de-multiplexing hardware) used it from the start.

Darren Reed has posted a small C program that lists the active acceleration features for a given interface. Here's some sample output:

$ sudo ./ifcapability
lo0 inet
bge0 inet +HCKSUM(version=1 +full +ipv4hdr) +ZEROCOPY(version=1 flags=0x1) +POLL
lo0 inet6
bge0 inet6 +HCKSUM(version=1 +full +ipv4hdr) +ZEROCOPY(version=1 flags=0x1)

Displaying and setting link parameters with dladm

Another OpenSolaris project called Brussels unifies many aspects of network driver configuration under the dladm command. For example, link MTUs (for "Jumbo Frames") can be configured using

dladm set-linkprop -p mtu=9000 web1

The command can also be used to look at current physical parameters of interfaces:

$ sudo dladm show-phys
LINK         MEDIA                STATE      SPEED DUPLEX   DEVICE
bge0         Ethernet             up         1000 full      bge0

Note that Brussels is still being integrated into Solaris. Driver support was added since SXCE (Solaris Express Community Edition) build 83 for some types of adapters. Eventually this should be integrated into regular Solaris releases.

Setting TCP buffers

# To increase the maximum tcp window
# Rule-of-thumb: max_buf = 2 x cwnd_max (congestion window)
ndd -set /dev/tcp tcp_max_buf 4194304
ndd -set /dev/tcp tcp_cwnd_max 2097152

# To increase the DEFAULT tcp window size
ndd -set /dev/tcp tcp_xmit_hiwat 65536
ndd -set /dev/tcp tcp_recv_hiwat 65536

Pitfall when using asymmetric send and receive buffers

The documented default behaviour (tunable TCP parameter tcp_wscale_always = 0) of Solaris is to include the TCP window scaling option in an initial SYN packet when either the send or the receive buffer is larger than 64KiB. From the tcp(7P) man page:

          For all applications, use ndd(1M) to modify the  confi-
          guration      parameter      tcp_wscale_always.      If
          tcp_wscale_always is set to 1, the window scale  option
          will  always be set when connecting to a remote system.
          If tcp_wscale_always is 0, the window scale option will
          be set only if the user has requested a send or receive
          window  larger  than  64K.   The   default   value   of
          tcp_wscale_always is 0.

However, Solaris 8, 9 and 10 do not take the send window into account. This results in an unexpected behaviour for a bulk transfer from node A to node B when the bandwidth-delay product is larger than 64KiB and

  • A's receive buffer (tcp_recv_hiwat) < 64KiB
  • A's send buffer (tcp_xmit_hiwat) > 64KiB
  • B's receive buffer > 64KiB

A will not advertise the window scaling option, and per RFC 1323 B will not do so either. As a consequence, throughput will be limited by a congestion window of 64KiB.
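The practical impact of being stuck at a 64 KiB window follows from the usual throughput bound of window size divided by RTT; a small worked example (the 100 ms RTT is an assumed illustrative value):

```python
# Throughput bound imposed by a fixed TCP window: window / RTT.

def max_throughput_bps(window_bytes, rtt_s):
    return window_bytes * 8 / rtt_s

# A 64 KiB window on a 100 ms path caps throughput at about 5.24 Mbit/s,
# no matter how fast the underlying links are.
limit = max_throughput_bps(65536, 0.100)
print(round(limit / 1e6, 2), "Mbit/s")  # 5.24 Mbit/s
```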

As a workaround, the window scaling option can be forcibly advertised by setting

# ndd -set /dev/tcp tcp_wscale_always 1

A bug report has been filed with Sun Microsystems.

References

-- ChrisWelti - 11 Oct 2005, added section for setting default and maximum TCP buffers
-- AlexGall - 26 Aug 2005, added section on pitfall with asymmetric buffers
-- SimonLeinen - 27 Jan 2005 - 16 Dec 2009

Windows-Specific Host Tuning

Next Generation TCP/IP Stack in Windows Vista / Windows Server 2008 / Windows 7

According to http://www.microsoft.com/technet/community/columns/cableguy/cg0905.mspx, the new version of Windows, Vista, features a redesigned TCP/IP stack. Besides unifying IPv4/IPv6 dual-stack support, this new stack also claims much-improved performance for high-speed and asymmetric networks, as well as several auto-tuning features.

The Microsoft Windows Vista Operating System enables the TCP Window Scaling option by default (previous Windows OSes had this option disabled). This causes problems with various middleboxes, see WindowScalingProblems. As a consequence, the scaling factor is limited to 2 for HTTP traffic in Vista.

Another new feature of Vista is Compound TCP, a high-performance TCP variant that uses delay information to adapt its transmission rate (in addition to loss information). On Windows Server 2008 it is enabled by default, but it is disabled in Vista. You can enable it in Vista by running the following command as an Administrator on the command line.

netsh interface tcp set global congestionprovider=ctcp
(see CompoundTCP for more information).

Although the new TCP/IP stack has integrated auto-tuning for receive buffers, the TCP send buffer still seems to be limited to 16KB by default. This means you will not be able to get good upload rates for connections with higher RTTs. However, using the registry hack (see further below) you can manually change the default value to something higher.

Performance tuning for earlier Windows versions

Enabling TCP Window Scaling

The references below detail the various "whys" and "hows" to tune Windows network performance. Of particular note however is the setting of scalable TCP receive windows (see LargeTcpWindows).

It appears that, by default, not only does Windows not support RFC 1323 scalable windows, but the required key is not even in the Windows registry. The key (Tcp1323Opts) can be added to one of two places, depending on the particular system:

[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\VxD\MSTCP] (Windows'95)
[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters] (Windows 2000/XP)

Value Name: Tcp1323Opts
Data Type: REG_DWORD (DWORD Value)
Value Data: 0, 1, 2 or 3

  • 0 = disable RFC 1323 options
  • 1 = window scale enabled only
  • 2 = time stamps enabled only
  • 3 = both options enabled

This can also be adjusted using DrTCP GUI tool (see link below)

TCP Buffer Sizes

See the EndSystemTcpBufferSizing topic for general information about sizing TCP buffers.

Buffers for TCP Send Windows

Inquiry at Microsoft (thanks to Larry Dunn) has revealed that the default send window is 8KB and that there is no official support for configuring a system-wide default. However, the current Winsock implementation uses the following undocumented registry key for this purpose

[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\AFD\Parameters]

Value Name: DefaultSendWindow
Data Type: REG_DWORD (DWORD Value)
Value Data: The window size in bytes. The maximum value is unknown.

According to Microsoft, this parameter may not be supported in future Winsock releases. However, this parameter is confirmed to be working with Windows XP, Windows 2000, Windows Vista and even the Windows 7 Beta 1.

This value needs to be manually adjusted (e.g., DrTCP can't do it), or the application needs to set it (e.g., iperf with '-l' option). It may be difficult to detect this as a bottleneck because the host will advertise large windows but will only have about 8.5KB of data in flight.

Buffers for TCP Receive Windows

(From this TechNet article.)

Assuming Window Scaling is enabled, then the following parameter can be set to between 1 and 1073741823 bytes

[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters]

Value Name: TcpWindowSize
Data Type: REG_DWORD (DWORD Value)
Value Data: The window size in bytes. 1 to 65535 (or 1 to 1073741823 if Window scaling set).

DrTCP can also be used to adjust this parameter.

Windows XP SP2/SP3 simultaneous connection limit

Windows XP SP2 and SP3 limit the number of simultaneous TCP connection attempts (the so-called half-open state) to 10 by default. Exactly how this limit is enforced is not clear, but it may affect some kinds of applications; in particular, file-sharing applications such as BitTorrent might suffer from this behaviour. You can see whether your performance is affected by looking into your event logs:

EventID 4226: TCP/IP has reached the security limit imposed on the number of concurrent TCP connect attempts

A patcher has been developed to adjust this limit.

References

-- TobyRodwell - 28 Feb 2005 (initial version)
-- SimonLeinen - 06 Apr 2005 (removed dead pointers; added new ones; descriptions)
-- AlexGall - 17 Jun 2005 (added send window information)
-- HankNussbacher - 19 Jun 2005 (added new MS webcast for 2003 Server)
-- SimonLeinen - 12 Sep 2005 (added information and pointer about new TCP/IP in Vista)
-- HankNussbacher - 22 Oct 2006 (added Cisco IOS firewall upgrade)
-- AlexGall - 31 Aug 2007 (replaced IOS issue with a link to WindowScalingProblems, added info for scaling limitation in Vista)
-- SimonLeinen - 04 Nov 2007 (information on how to enable Compound TCP on Vista)
-- PekkaSavola - 05 Jun 2008 (major reorganization to improve the clarity wrt send buffer adjustment; add a link to DrTCP; add XP SP3 connection limiting discussion)
-- ChrisWelti - 02 Feb 2009 (added new information regarding the send window in Windows Vista / Server 2008 / 7)

Revision 222006-07-06 - SimonLeinen

Line: 1 to 1
 
META TOPICPARENT name="AutoToc"

Unrolled PERTKB Contents

Line: 86 to 86
 

Netflow-based Tools

Netflow: analysis tools to characterize the traffic in the network and detect routing problems, security problems (denial-of-service attacks), etc.

There are different versions of NDE (NetFlow Data Export): v1 (the first), v5, v7, v8 and the latest version (v9), which is based on templates. This latest version makes it possible to analyze IPv6, MPLS and multicast flows.

  • NDE v9 description

References

-- FrancoisXavierAndreu - 2005-06-06 - 2006-04-07
-- SimonLeinen - 2006-01-05 - 2015-07-18

Packet Capture and Analysis Tools:

Detect protocol problems via the analysis of packets, trouble shooting

  • tcpdump: Packet capture tool based on the libpcap library. http://www.tcpdump.org/
  • snoop: tcpdump equivalent in Sun's Solaris operating system
  • Wireshark (formerly called Ethereal): Extended tcpdump with a user-friendly GUI. Its plugin architecture allowed many protocol-specific "dissectors" to be written. Also includes some analysis tools useful for performance debugging.
  • libtrace: A library for trace processing; works with libpcap (tcpdump) and other formats.
  • Netdude: The Network Dump data Displayer and Editor is a framework for inspection, analysis and manipulation of tcpdump trace files.
  • jnettop: Captures traffic coming across the host it is running on and displays streams sorted by the bandwidth they use.
  • tcptrace: a statistics/analysis tool for TCP sessions using tcpdump capture files
  • capturing packets with Endace DAG cards (dagsnap, dagconvert, ...)

General Hints for Taking Packet Traces

Capture enough data - you can always throw away stuff later. Note that tcpdump's default capture length is small (96 bytes), so use -s 0 (or something like -s 1540) if you are interested in payloads. Wireshark and snoop capture entire packets by default.

Seemingly unrelated traffic can impact performance. For example, Web pages from foo.example.com may load slowly because of the images from adserver.example.net. In situations of high background traffic, it may however be necessary to filter out unrelated traffic.

It can be extremely useful to collect packet traces from multiple points in the network

  • On the endpoints of the communication
  • Near "suspicious" intermediate points such as firewalls

Synchronized clocks (e.g. by NTP) are very useful for matching traces.

Address-to-name resolution can slow down display and causes additional traffic which confuses the trace. With tcpdump, consider using -n or tracing to file (-w file).

Request remote packet traces

When a packet trace from a remote site is required, this often means having to ask someone at that site to provide it. When requesting such a trace, consider making this as easy as possible for the person having to do it. Try to use a packet tracing tool that is already available - tcpdump for most BSD or Linux systems, snoop for Solaris machines. Windows does not come bundled with a packet capturing program, but you can direct the user to Wireshark, which is reasonably easy to install and use under Windows. Try to give clear instructions on how to call the packet capture program. It is usually best to ask the user to capture to a file, and then send you the capture file, e.g. as an e-mail attachment.

References

  • Presentation slides from a short talk about packet capturing techniques given at the network performance section of the January 2006 GEANT2 technical workshop

-- FrancoisXavierAndreu - 06 Jun 2005
-- SimonLeinen - 05 Jan 2006-09 Apr 2006
-- PekkaSavola - 26 Oct 2006

tcpdump

One of the early diagnostic tools for TCP/IP, written by Van Jacobson, Craig Leres, and Steven McCanne. Tcpdump can be used to capture and decode packets in real time, or to capture packets to files (in "libpcap" format, see below) and analyze (decode) them later.

There are now more elaborate and, in some ways, more user-friendly packet capturing programs, such as Wireshark (formerly called Ethereal), but tcpdump is widely available and widely used, so it is very useful to know how to use it.

Tcpdump/libpcap is still actively being maintained, although not by its original authors.

libpcap

Libpcap is a library that is used by tcpdump, and also names a file format for packet traces. This file format - usually used in files with the extension .pcap - is widely supported by packet capture and analysis tools.

Selected options

Some useful options to tcpdump include:

  • -s snaplen capture snaplen bytes of each frame. By default, tcpdump captures only the first 68 bytes (96 on some platforms), which is sufficient to capture IP/UDP/TCP/ICMP headers, but usually not payload or higher-level protocols. If you are interested in more than just headers, use -s 0 to capture packets without truncation.
  • -r filename read from a previously created capture file (see -w)
  • -w filename dump to a file instead of analyzing on-the-fly
  • -i interface capture on an interface other than the default (first "up" non-loopback interface). Under Linux, -i any can be used to capture on all interfaces, albeit with some restrictions.
  • -c count stop the capture after count packets
  • -n don't resolve addresses, port numbers, etc. to symbolic names - this avoids additional DNS traffic when analyzing a live capture.
  • -v verbose output
  • -vv more verbose output
  • -vvv even more verbose output

Also, a filter expression (in the pcap/BPF filter syntax) can be appended to the command to select which packets are captured. An expression is made up of one or more of "type", "direction" and "protocol".

  • type Can be host, net or port. host is presumed unless otherwise specified
  • dir Can be src, dst, src or dst or src and dst
  • proto (for protocol) Common types are ether, ip, tcp, udp, arp ... If none is specified then all protocols for which the value is valid are considered.
Example expressions:
  • dst host <address>
  • src host <address>
  • udp dst port <number>
  • host <host> and not port ftp and not port ftp-data

Usage examples

Capture a single (-c 1) udp packet to file test.pcap:

: root@diotima[tmp]; tcpdump -c 1 -w test.pcap udp
tcpdump: listening on bge0, link-type EN10MB (Ethernet), capture size 96 bytes
1 packets captured
3 packets received by filter
0 packets dropped by kernel

This produces a binary file containing the captured packet as well as a small file header and a timestamp:

: root@diotima[tmp]; ls -l test.pcap
-rw-r--r-- 1 root root 114 2006-04-09 18:57 test.pcap
: root@diotima[tmp]; file test.pcap
test.pcap: tcpdump capture file (big-endian) - version 2.4 (Ethernet, capture length 96)

Analyze the contents of the previously created capture file:

: root@diotima[tmp]; tcpdump -r test.pcap
reading from file test.pcap, link-type EN10MB (Ethernet)
18:57:28.732789 2001:630:241:204:211:43ff:fee1:9fe0.32832 > ff3e::beac.10000: UDP, length: 12

Display the same capture file in verbose mode:

: root@diotima[tmp]; tcpdump -v -r test.pcap
reading from file test.pcap, link-type EN10MB (Ethernet)
18:57:28.732789 2001:630:241:204:211:43ff:fee1:9fe0.32832 > ff3e::beac.10000: [udp sum ok] UDP, length: 12 (len 20, hlim 118)

More examples with some advanced tcpdump use cases.

References

-- SimonLeinen - 2006-03-04 - 2016-03-17

Wireshark®

Wireshark is a packet capture/analysis tool, similar to tcpdump but much more elaborate. It has a graphical user interface (GUI) which allows "drilling down" into the header structure of captured packets. In addition, it has a "plugin" architecture that allows decoders ("dissectors" in Wireshark terminology) to be written with relative ease. This and the general user-friendliness of the tool have resulted in Wireshark supporting an abundance of network protocols - presumably writing Wireshark dissectors is often part of the work of developing/implementing new protocols. Lastly, Wireshark includes some nice graphical analysis/statistics tools such as much of the functionality of tcptrace and xplot.

One of the main attractions of Wireshark is that it works nicely under Microsoft Windows, although it requires a third-party library (WinPcap) to implement the equivalent of the libpcap packet capture library.

Wireshark used to be called Ethereal™, but was renamed in June 2006, when its principal maintainer changed employers. Version 1.8 adds support for capturing on multiple interfaces in parallel, simplified management of decryption keys (for 802.11 WLANs and IPsec/ISAKMP), and geolocation for IPv6 addresses. Some TCP events (fast retransmits and TCP Window updates) are no longer flagged as warnings/errors. The default format for saving capture files is changing from pcap to pcap-ng. Wireshark 1.10, announced in June 2013, adds many features including Windows 8 support and response-time analysis for HTTP requests.

Usage examples

The following screenshot shows Ethereal 0.10.14 under Linux/Gnome when called as ethereal -r test.pcap, reading the single-packet example trace generated in the tcpdump example. The "data" portion of the UDP part of the packet has been selected by clicking on it in the middle pane, and the corresponding bytes are highlighted in the lower pane.

ethereal -r test.pcap screendump

tshark

The package includes a command-line tool called tshark, which can be used in a similar (but not quite compatible) way to tcpdump. Through complex command-line options, it can give access to some of the more advanced decoding functionality of Wireshark. Because it generates text, it can be used as part of analysis scripts.

Scripting

Wireshark can be extended using scripting languages. The Lua language has been supported for several years. Version 1.4.0 (released in September 2010) added preliminary support for Python as an extension language.

CloudShark

In a cute and possibly even useful application of tshark, QA Cafe (an IP testing solutions vendor) has put up "Wireshark as a Service" under www.cloudshark.org. This tool lets users upload packet dumps without registration, and provides the familiar Wireshark interface over the Web. Uploads are limited to 512 Kilobytes, and there are no guarantees about confidentiality of the data, so it should not be used on privacy-sensitive data.

References

-- SimonLeinen - 2006-03-04 - 2013-06-09

 

libtrace

libtrace is a library for trace processing. It supports multiple input methods, including device capture, raw and gz-compressed traces, and sockets, and multiple input formats, including pcap and DAG.

References

-- SimonLeinen - 04 Mar 2006

Netdude

Netdude (Network Dump data Displayer and Editor) is a framework for inspection, analysis and manipulation of tcpdump (libpcap) trace files.

References

-- SimonLeinen - 04 Mar 2006

jnettop

Captures traffic coming across the host it is running on and displays streams sorted by the bandwidth they use. The result is a nice listing of communication on network grouped by stream, which shows transported bytes and consumed bandwidth.

References

-- FrancoisXavierAndreu - 06 Jun 2005 -- SimonLeinen - 04 Mar 2006

Welcome to the eduPERT Knowledge Base!

If you'd like to know more about us and what the eduPERT community is up to, visit our portal.

This is a wiki that collects useful information on how to get good performance out of networks, in particular research networks.

This KB is open and public and anybody can benefit and contribute to its content. In the past years many topics have been collected with regard to performance and general knowledge of the network. To make navigation easier they have been grouped in five main categories:

NETWORK: Network protocols, tuning and more...

END HOST: Application protocols, End-Host tuning and more...

TOOLS: Active and passive measurement tools, traceroute and more...

GENERAL KNOWLEDGE: Wizard gap, performance people, evil middlebox and more...

PERFORMANCE CASE STUDIES: History of the PERT cases (solved or still open).

If you're looking for topics to work on, check out the ToDo list! You can also add suggestions for improvement there.

Much of the material written here was published in August 2006 as GN2-06-135v2 (DS3.3.3): PERT Performance Guides.

Note that you can subscribe to update notifications to this Knowledge Base through an RSS Feed.



How to report a performance problem

Latest News

Windows 7 / Vista / 2008 TCP/IP tweaks

On Speedguide.net there is a great article about the TCP/IP Stack tuning possibilities in Windows 7/Vista/2008.

NDT Protocol Documented

A description of the protocol used by the NDT tool has recently been written up. Check this out if you want to implement a new client (or server), or if you simply want to understand the mechanisms.

-- SimonLeinen - 02 Aug 2011

What does the "+1" mean?

I would like to experiment with social web techniques, so I have added a (Google) "+1" button to the template for viewing PERT KB topics. It allows people to publicly share interesting Web pages. Please share what you think about this via the SocialWebExperiment topic!

-- SimonLeinen - 02 Jul 2011

News from Web10G

The Web10G project has released its first code in May 2011. If you are interested in this followup project of the incredibly useful web100, check out their site. I have added a short summary on the project to the WebTenG topic here.

-- SimonLeinen - 30 Jun 2011

FCC Open Internet Apps Challenge - Vote for Measurement Tools by 15 July

The U.S. Federal Communications Commission (FCC) is currently holding a challenge for applications that can help assess the "openness" of Internet access offerings. There's a page on challenge.gov where people can vote for their favorite tool. A couple of candidate applications are familiar to our community, for example NDT and NPAD. But there are some new ones that look interesting. Some applications are specific for mobile data networks ("3G" etc.). There are eight submissions - be sure to check them out. Voting is possible until 15 July 2011.

-- SimonLeinen - 29 Jun 2011

Web10G

Catching up with the discussion@web100.org mailing list, I noticed an announcement by Matt Mathis (formerly PSC, now at Google) from last September (yes, I'm a bit behind with my mail reading :-): A follow-on project called Web10G has received funding from the NSF. I'm all excited about this because Web100 produced so many influential results. So congratulations to the proposers, good luck for the new project, and let's see what Web10G will bring!

-- SimonLeinen - 25 Mar 2011

eduPERT Training Workshop in Zurich, 18/19 November 2010

We held another successful training event in November, with 17 attendees from various NRENs including some countries that have joined GÉANT recently, or are in the process of doing so. The program was based on the one used for the workshops held in 2008 and 2007. But this time we added some case studies, where the instructors walked through some actual cases they had encountered in their PERT work, and at many points asked the audience for their interpretation of the symptoms seen so far, and for suggestions on what to do next. Thanks to the active participation of several attendees, this interactive part turned out to be both enjoyable and instructive, and was well received by the audience. Program and slides are available on the PERT Training Workshop page at TERENA.

-- SimonLeinen - 18 Dec 2010

Efforts to increase TCP's initial congestion window (updated)

At the 78th IETF meeting in Maastricht this July, one of the many topics discussed there was a proposal to further increase TCP's initial congestion window, possibly to something as big as ten maximum-size segments (10 MSS). This caused me to notice that the PERT KB was missing a topic on TCP's initial congestion window, so I created one. A presentation by Matt Mathis on the topic is scheduled for Friday's ICCRG meeting. The slides, as well as some other materials and minutes, can be found on the IETF Proceedings page. Edited to add: Around IETF 79 in November 2010, the discussion on the initial window picked up steam, and there are now three different proposals in the form of Internet-Drafts. I have updated the TcpInitialWindow topic to briefly describe these.

-- SimonLeinen - 28 Jul 2010 - 09 Dec 2010

PERT meeting at TNC 2010

We held a PERT meeting on Sunday 30 May (1400-1730), just before the official start of the TNC 2010 conference in Vilnius. The attendees discussed technical and organizational issues facing PERTs in National Research and Education Networks (NRENs), large campus networks, and supporting network-intensive projects such as the multi-national research efforts in (Radio-) Astronomy, Particle Physics, or Grid/Cloud research. Slides are available on TNC2010 website here.

-- SimonLeinen - 15 Jun 2010

Web100 Project Re-Opening Soon?

In a mail to the ndt-users list, Matt Mathis has issued a call for input from people who use Web100, as part of NDT or otherwise. If you use this, please send Matt an e-mail. There's a possibility that the project will be re-opened "for additional development and some bug fixes".

-- SimonLeinen - 20 Feb 2010

ICSI releases Netalyzr Internet connectivity tester

The International Computer Science Institute (ICSI) at Berkeley has released an awesome new Java applet called Netalyzr which will let you do a whole bunch of connectivity tests. A very nice tool to get information about your Internet connectivity, to see if your provider blocks ports, your DNS works properly, and much more. Read about it on our page here or go directly to their homepage.

-- ChrisWelti - 18 Jan 2010

News Archive:

Notes About Performance

This part of the PERT Knowledgebase contains general observations and hints about the performance of networked applications.

The mapping between network performance metrics and user-perceived performance is complex and often non-intuitive.

-- SimonLeinen - 13 Dec 2004 - 07 Apr 2006

Why Latency Is Important

Traditionally, the metric of focus in networking has been bandwidth. As more and more parts of the Internet have their capacity upgraded, bandwidth is often not the main problem anymore. However, network-induced latency, as measured in One-Way Delay (OWD) or Round-Trip Time, often has a noticeable impact on performance.

It's not just for gamers...

The one group of Internet users today that is most aware of the importance of latency are online gamers. It is intuitively obvious that in real-time multiplayer games over the network, players don't want to be put at a disadvantage because their actions take longer to reach the game server than their opponents'.

However, latency impacts the other users of the Internet as well, probably much more than they are aware of. At a given bottleneck bandwidth, a connection with lower latency will reach its achievable rate faster than one with higher latency. The effect of this should not be underestimated, since most connections on the Internet are short - in particular connections associated with Web browsing.

But even for long connections, latency often has a big impact on performance (throughput), because many of those long connections have their throughput limited by the window size that is available for TCP. And when the window size is the bottleneck, throughput is inversely proportional to round-trip time. Furthermore, for a given (small) loss rate, RTT places an upper limit on the achievable throughput of a TCP connection, as shown by the Mathis Equation.
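As a back-of-the-envelope illustration of these two limits, here is a small Python sketch. The window, RTT and loss figures below are made-up examples, and the constant in the Mathis formula (~1.22 here) depends on the loss model and ACK behaviour assumed:

```python
from math import sqrt

def window_limited_throughput(window_bytes, rtt_s):
    """When the TCP window is the bottleneck, throughput = window / RTT."""
    return window_bytes * 8 / rtt_s          # bits per second

def mathis_limit(mss_bytes, rtt_s, loss_rate, c=1.22):
    """Upper bound on TCP throughput for a given loss rate:
    rate <= (MSS/RTT) * C/sqrt(p). The constant C (~1.22 here)
    depends on the assumed loss model and ACK strategy."""
    return (mss_bytes * 8 / rtt_s) * (c / sqrt(loss_rate))

# A 64 KiB window over a 100 ms RTT path caps throughput at ~5.2 Mb/s,
# regardless of link capacity; halving the RTT doubles the rate.
print(window_limited_throughput(65536, 0.100) / 1e6)   # ~5.24 Mb/s
print(window_limited_throughput(65536, 0.050) / 1e6)   # ~10.5 Mb/s

# Mathis bound with a 1460-byte MSS, 100 ms RTT and 0.01% loss:
print(mathis_limit(1460, 0.100, 1e-4) / 1e6)           # ~14.2 Mb/s
```

Note how both limits scale with 1/RTT: shaving latency off a path helps window-limited and loss-limited transfers alike.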

But I thought latency was only important for multimedia!?

Common wisdom is that latency (and also jitter) is important for audio/video ("multimedia") applications. This is only partly true: Many applications of audio/video involve "on-demand" unidirectional transmission. In those applications, the real-time concerns can often be mitigated by clever buffering or transmission ahead of time. For conversational audio/video, such as "Voice over IP" or videoconferencing, the latency issue is very real. The principal sources of latency in these applications are not backbone-related, but related to compression/sampling rates (see packetization delay) and to transcoding devices such as H.323 MCUs (Multipoint Control Units).

References

-- SimonLeinen - 13 Dec 2004

The "Wizard Gap"

The Wizard Gap is an expression coined by Matt Mathis (then PSC) in 1999. It designates the difference between the performance that is "theoretically" possible on today's high-speed networks (in particular, research networks), and the performance that most users actually perceive. The idea is that today, the "theoretical" performance can only be (approximately) obtained by "wizards" with superior knowledge and skills concerning system tuning. Good examples for "Wizard" communities are the participants in Internet2 Land-Speed Record or SC Bandwidth Challenge competitions.

The Internet2 end-to-end performance initiative strives to reduce the Wizard Gap by user education as well as improved instrumentation (see e.g. Web100) of networking stacks. In the GÉANT community, PERTs focus on assistance to users (case management), as well as user education through resources such as this knowledge base. Commercial players are also contributing to closing the wizard gap, by improving "out-of-the-box" performance of hardware and software, so that their customers can benefit from faster networking.

References

  • M. Mathis, Pushing up Performance for Everyone, Presentation at the NLANR/Internet2 Joint Techs Workshop, 1999 (PPT)

-- SimonLeinen - 2006-02-28 - 2016-04-27

 

User-Perceived Performance

User-perceived performance of network applications is made up of a number of mainly qualitative metrics, some of which are in conflict with each other. In the case of some applications, a single metric will outweigh the others, such as responsiveness for video services or throughput for bulk transfer applications. More commonly, a combination of factors usually determines the experience of network performance for the end-user.

Responsiveness

One of the most important user experiences in networking applications is the perception of responsiveness. If end-users feel that an application is slow, it is often the case that it is slow to respond to them, rather than being directly related to network speed. This is a particular issue for real-time applications such as audio/video conferencing systems and must be prioritised in applications such as remote medical services and off-campus teaching facilities. It can be difficult to quantitatively define an acceptable figure for response times as the requirements may vary from application to application.

However, some applications have relatively well-defined "physiological" bounds beyond which the responsiveness feeling vanishes. For example, for voice conversations, a (round-trip) delay of 150ms is practically unnoticeable, but an only slightly larger delay is typically felt as very intrusive.

Throughput/Capacity/"Bandwidth"

Throughput per se is not directly perceived by the user, although a lack of throughput will increase waiting times and reduce the impression of responsiveness. However, "bandwidth" is widely used as a "marketing" metric to differentiate "fast" connections from "slow" ones, and many applications display throughput during long data transfers. Therefore users often have specific performance expectations in terms of "bandwidth", and are disappointed when the actual throughput figures they see are significantly lower than the advertised capacity of their network connection.

Reliability

Reliability is often the most important performance criterion for a user: The application must be available when the user requires it. Note that this doesn't necessarily mean that it is always available, although there are some applications - such as the public Web presence of a global corporation - that have "24x7" availability requirements.

The impression of reliability will also be heavily influenced by what happens (or is expected to happen) in cases of unavailability: Does the user have a possibility to build a workaround? In case of provider problems: Does the user have someone competent to call - or can they even be sure that the provider will notice the problem themselves, and fix it in due time? How is the user informed during the outage, in particular concerning the estimated time to repair?

Another aspect of reliability is the predictability of performance. It can be profoundly disturbing to a user to see large performance variations over time, even if the varying performance is still within the required performance range - who can guarantee that variations won't increase beyond the tolerable during some other time when the application is needed? E.g., a 10 Mb/s throughput that remains rock-stable over time can feel more reliable than throughput figures that vary between 200 and 600 Mb/s.

-- SimonLeinen - 07 Apr 2006
-- AlessandraScicchitano - 10 Feb 2012


 

Network Performance Metrics

There are many metrics that are commonly used to characterize the performance of networks and parts of networks. We present the most important of these, explain what influences them, how they can be measured, how they influence end-to-end performance, and what can be done to improve them.

A framework for network performance metrics has been defined by the IETF's IP Performance Metrics (IPPM) Working Group in RFC 2330. The group also developed definitions for several specific performance metrics; those are referenced from the respective sub-topics.

References

-- SimonLeinen - 31 Oct 2004

One-way Delay (OWD)

One-way delay is the time it takes for a packet to reach its destination. It is considered a property of network links or paths. RFC 2679 contains the definition of the one-way delay metric of the IETF's IPPM (IP Performance Metrics) working group.

Decomposition

One-way delay along a network path can be decomposed into per-hop one-way delays, and these in turn into per-link and per-node delay components.

Per-link delay components: propagation delay and serialization delay

The link-specific component of one-way delay consists of two sub-components:

Propagation Delay is the time it takes for signals to move from the sending to the receiving end of the link. On simple links, this is the link's physical length divided by the characteristic propagation speed. The velocity of propagation (VoP) is similar for copper and fibre optics (copper's is slightly faster), approximately 2/3 of the speed of light in a vacuum.

Serialization delay is the time it takes for a packet to be serialized into link transmission units (typically bits). It is the packet size (in bits) divided by the link's capacity (in bits per second).

In addition to the propagation and serialization delays, some types of links may introduce additional delays, for example to avoid collisions on a shared link, or when the link-layer transparently retransmits packets that were damaged during transmission.
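The two per-link components are straightforward to compute. The following Python sketch uses illustrative numbers (a 1000 km fibre run and two common link speeds), with the propagation speed approximated as 2/3 c:

```python
def propagation_delay(length_m, speed_m_s=2e8):
    """Signal travel time over the link; ~2/3 c in copper or fibre."""
    return length_m / speed_m_s

def serialization_delay(packet_bytes, capacity_bps):
    """Time to clock the packet's bits onto the link:
    packet size (bits) divided by link capacity (bits/s)."""
    return packet_bytes * 8 / capacity_bps

# 1000 km of fibre contributes ~5 ms of propagation delay:
print(propagation_delay(1_000_000) * 1e3)        # 5.0 ms

# A 1500-byte packet serializes in 12 us at 1 Gb/s, but in 12 ms
# at 1 Mb/s - packet size dominates delay only on slow links:
print(serialization_delay(1500, 1e9) * 1e6)      # 12.0 us
print(serialization_delay(1500, 1e6) * 1e3)      # 12.0 ms
```

This also shows why RFC 3393 (see the jitter topic) only compares delays of equal-sized packets: serialization delay varies with packet size.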

Per-node delay components: forwarding delay, queueing delay

Within a network node such as a router, a packet experiences different kinds of delay between its arrival on one link and its departure on another (or the same) link:

Forwarding delay is the time it takes for the node to read forwarding-relevant information (typically the destination address and other headers) from the packet, to compute the "forwarding decision" based on routing tables and other information, and to actually forward the packet towards the destination. The latter involves copying the packet to a different interface inside the node, rewriting parts of it (such as the IP TTL and any media-specific headers), and possibly other processing such as fragmentation, accounting, or checking access control lists.

Depending on router architecture, forwarding can compete for resources with other activities of the router. In this case, packets can be held up until the router's forwarding resource is available, which can take many milliseconds on a router with CPU-based forwarding. Routers with dedicated hardware for forwarding don't have this problem, although there may be delays when a packet arrives as the hardware forwarding table is being reprogrammed due to a routing change.

Queueing delay is the time a packet has to wait inside the node for the output link to become available. Queueing delay depends on the amount of competing traffic towards the output link, and on the priorities of the packet itself and the competing traffic. The amount of queueing that a given packet may encounter is hard to predict, but it is bounded by the buffer size available for queueing.

There can be causes for queueing other than contention on the outgoing link, for example contention on the node's backplane interconnect.

Impact on end-to-end performance

When studying end-to-end performance, it is usually more interesting to look at the following metrics that are derived from one-way delay:

  • Round-trip time (RTT) is the time from node A to B and back to A. It is the sum of the one-way delays from A to B and from B to A, plus the response time in B.
  • Delay Variation represents the variation in one-way delay. It is important for real-time applications. People often call this "jitter".

Measurement

One-way delays from a node A to a node B can be measured by sending timestamped packets from A, and recording the reception times at B. The difficulty is that A and B need clocks that are synchronized to each other. This is typically achieved by having clocks synchronized to a standard reference time such as UTC (Universal Time Coordinated) using techniques such as Global Positioning System (GPS)-derived time signals or the Network Time Protocol (NTP).
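The principle can be sketched as follows (illustrative Python; the probe format is hypothetical, and the computed delay is only meaningful if sender and receiver clocks are synchronized, e.g. via GPS or NTP as described above):

```python
import struct
import time

def make_probe(seq):
    # Hypothetical probe format: 32-bit sequence number + send time (UTC seconds)
    return struct.pack("!Id", seq, time.time())

def one_way_delay(probe, receive_time):
    # Receiver side: one-way delay = receive time - embedded send time.
    # Only valid if the two clocks are synchronized to a common reference.
    seq, send_time = struct.unpack("!Id", probe)
    return receive_time - send_time
```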

There are several infrastructures that continuously measure one-way delays and packet loss: The HADES Boxes in DFN and GÉANT2; RIPE TTM Boxes between various research and commercial networks, and RENATER's QoSMetrics boxes.

OWAMP and RUDE/CRUDE are examples of tools that can be used to measure one-way delay.

Improving delay

Shortest-(Physical)-Path Routing and Proper Provisioning

On high-speed wide-area network paths, delay is usually dominated by propagation times. Therefore, physical routing of network links plays an important role, as well as the topology of the network and the selection of routing metrics. Ensuring minimal delays is simply a matter of

  • using an internal routing protocol with "shortest-path routing" (such as OSPF or IS-IS) and a metric proportional to per-link delay
  • provisioning the network so that these shortest paths aren't congested.

This could be paraphrased as "use proper provisioning instead of traffic engineering".

The node contributions to delay can be addressed by:

  • using nodes with fast forwarding
  • avoiding queueing by provisioning links to accommodate typical traffic bursts
  • reducing the number of forwarding nodes

References

-- SimonLeinen - 31 Oct 2004 - 25 Jan 2007


Propagation Delay

The propagation delay is the time it takes for a signal to propagate. It depends on the distance traveled, and the specific propagation speed of the medium. For instance, information transmitted via radio or through copper cables will travel at a speed close to c (speed of light in vacuum, ~300000 km/s). The prevalent medium for long-distance digital transmission is now light in optical fibers, where the propagation speed is about 2/3 c, i.e. 200000 km/s.

Propagation delay, along with serialization delay and processing delays in nodes such as routers, is a component of overall delay/RTT. For uncongested long-distance network paths, it is usually the dominant component.

Examples

Here are a few examples for propagation delay components of one-way and round-trip delays over selected distances in fiber.

Fibre length   One-way delay   Round-trip time
1 m            5 ns            10 ns
1 km           5 µs            10 µs
10 km          50 µs           100 µs
100 km         500 µs          1 ms
1000 km        5 ms            10 ms
10000 km       50 ms           100 ms
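The table values follow directly from the ~200000 km/s propagation speed in fiber; a minimal check in Python:

```python
FIBER_SPEED_KM_S = 200_000  # roughly 2/3 of the speed of light in vacuum

def propagation_delay_s(distance_km):
    """One-way propagation delay in seconds over a fiber path."""
    return distance_km / FIBER_SPEED_KM_S

# 1000 km of fiber: 5 ms one way, 10 ms round trip
assert propagation_delay_s(1000) == 0.005
assert 2 * propagation_delay_s(1000) == 0.010
```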

-- SimonLeinen - 28 Feb 2006

Serialization Delay (or Transmission Delay)

Serialization delay is the time it takes for a unit of data, such as a packet, to be serialized for transmission on a narrow (e.g. serial) channel such as a cable. Serialization delay is dependent on size, which means that longer packets experience longer delays over a given network path. Serialization delay is also dependent on channel capacity ("bandwidth"), which means that for equal-size packets, the faster the link, the lower the serialization delay.

Serialization delays are incurred at processing nodes, when packets are stored-and-copied between links and (router/switch) buffers. This includes the copying over internal links in processing nodes, such as router backplanes/switching fabrics.

In the core of the Internet, serialization delay has largely become a non-issue, because link speeds have increased much faster over the past years than packet sizes. As a consequence, the "hop count" as shown by e.g. traceroute is a bad predictor for delay today.

Example Serialization Delays

To illustrate the effects of link rates and packet sizes on serialization delay, here is a table of some representative values. Note that the maximum packet size for most computers is 1500 bytes today, but 9000-byte "jumbo frames" are already supported by many research networks.

Link Rate:    64 kb/s    1 Mb/s     10 Mb/s    100 Mb/s   1 Gb/s     10 Gb/s
Packet Size
64 bytes      8 ms       0.512 ms   51.2 µs    5.12 µs    0.512 µs   51.2 ns
512 bytes     64 ms      4.096 ms   409.6 µs   40.96 µs   4.096 µs   409.6 ns
1500 bytes    187.5 ms   12 ms      1.2 ms     120 µs     12 µs      1.2 µs
9000 bytes    1125 ms    72 ms      7.2 ms     720 µs     72 µs      7.2 µs

-- SimonLeinen - 28 Oct 2004 - 17 Jun 2010

 

Round-Trip Time (RTT)

Round-trip time (RTT) is the total time for a packet sent by a node A to reach its destination B, and for a response sent back by B to reach A.

Decomposition

The round-trip time is the sum of the one-way delays from A to B and from B to A, and of the time it takes B to send the response.

Impact on end-to-end performance

For window-based transport protocols such as TCP, the round-trip time influences the achievable throughput at a given window size, because there can only be a window's worth of unacknowledged data in the network, and the RTT is a lower bound on the time until a packet is acknowledged.

For interactive applications such as conversational audio/video, instrument control, or interactive games, the RTT represents a lower bound on response time, and thus impacts responsiveness directly.

Measurement

The round-trip time is often measured with tools such as ping (Packet InterNet Groper) or one of its cousins such as fping, which send ICMP Echo requests to a destination, and measure the time until the corresponding ICMP Echo Response messages arrive.

However, please note that while the round-trip time reported by ping is a relatively precise measurement, some network devices handle ICMP with a different (often lower) priority than ordinary forwarded traffic, so the measured values may not correspond to the delays experienced by real traffic.

-- MichalPrzybylski - 19 Jul 2005

Improving the round-trip time

The network components of RTT are the one-way delays in both directions (which can use different paths), so see the OneWayDelay topic on how those can be improved. The speed of response generation can be improved through upgrades of the responding host's processing power, or by optimizing the responding program.

-- SimonLeinen - 31 Oct 2004

Bandwidth-Delay Product (BDP)

The BDP of a path is the product of the (bottleneck) bandwidth and the delay of the path. Its dimension is "information", because bandwidth (here) expresses information per time, and delay is a time (duration). Typically, one uses bytes as a unit, and it is often useful to think of BDP as the "memory capacity" of a path, i.e. the amount of data that fits entirely into the path (between two end-systems).
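A back-of-the-envelope sketch in Python (using the RTT as the delay, as is common when sizing TCP windows):

```python
def bdp_bytes(bottleneck_bandwidth_bps, delay_s):
    """Bandwidth-delay product: how much data 'fits' into the path, in bytes."""
    return bottleneck_bandwidth_bps * delay_s / 8

# A 1 Gb/s path with 100 ms round-trip time holds about 12.5 MB of data in flight
assert abs(bdp_bytes(1e9, 0.1) - 12_500_000) < 1
```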

BDP is an important parameter for the performance of window-based protocols such as TCP. Network paths with a large bandwidth-delay product are called Long Fat Networks or "LFNs".

References

-- SimonLeinen - 14 Apr 2005

"Long Fat Networks" (LFNs)

Long Fat Networks (LFNs, pronounced like "elephants") are networks with a high bandwidth-delay product.

One of the issues with this type of network is that it can be challenging to achieve high throughput for individual data transfers with window-based transport protocols such as TCP. LFNs are thus a main focus of research on high-speed improvements for TCP.

-- SimonLeinen - 27 Oct 2004 - 17 Jun 2005


Path MTU

The Path MTU is the Maximum Transmission Unit (MTU) supported by a network path. It is the minimum of the MTUs of the links (segments) that make up the path. Larger Path MTUs generally allow for more efficient data transfers, because source and destination hosts, as well as the switching devices (routers) along the network path, have to process fewer packets. However, it should be noted that modern high-speed network adapters have mechanisms such as LSO (Large Send Offload) and Interrupt Coalescence that diminish the influence of MTUs on performance. Furthermore, routers are typically dimensioned to sustain very high packet loads (so that they can resist denial-of-service attacks), so the packet rates caused by high-speed transfers are not normally an issue for today's high-speed networks.

The prevalent Path MTU on the Internet is now 1500 bytes, the Ethernet MTU. There are some initiatives to support larger MTUs (JumboMTU) in networks, in particular on research networks. But their usability is hampered by last-mile issues and lack of robustness of RFC 1191 Path MTU Discovery.

Path MTU Discovery Mechanisms

Traditional (RFC1191) Path MTU Discovery

RFC 1191 describes a method for a sender to detect the Path MTU to a given receiver. (RFC 1981 describes the equivalent for IPv6.) The method works as follows:

  • The sending host will send packets to off-subnet destinations with the "Don't Fragment" bit set in the IP header.
  • The sending host keeps a cache containing Path MTU estimates per destination host address. This cache is often implemented as an extension of the routing table.
  • The Path MTU estimate for a new destination address is initialized to the MTU of the outgoing interface over which the destination is reached according to the local routing table.
  • When the sending host receives an ICMP "Too Big" (or "Fragmentation Needed and Don't Fragment Bit Set") destination-unreachable message, it learns that the Path MTU to that destination is smaller than previously assumed, and updates the estimate accordingly.
  • Normally, an ICMP "Too Big" message contains the next-hop MTU, and the sending host will use that as the new Path MTU estimate. The estimate can still be wrong because a subsequent link on the path may have an even smaller MTU.
  • For destination addresses with a Path MTU estimate lower than the outgoing interface MTU, the sending host will occasionally attempt to raise the estimate, in case the path has changed to support a larger MTU.
  • When trying ("probing") a larger Path MTU, the sending host can use a list of "common" MTUs, such as the MTUs associated with popular link layers, perhaps combined with popular tunneling overheads. This list can also be used to guess a smaller MTU in case an ICMP "Too Big" message is received that doesn't include any information about the next-hop MTU (perhaps from a very old router).

This method is widely implemented, but is not robust in today's Internet because it relies on ICMP packets sent by routers along the path. Such packets are often suppressed either at the router that should generate them (to protect its resources) or on the way back to the source, because of firewalls and other packet filters or rate limitations. These problems are described in RFC2923. When packets are lost due to MTU issues without any ICMP "Too Big" message, this is sometimes called a (MTU) black hole. Some operating systems have added heuristics to detect such black holes and work around them. Workarounds can include lowering the MTU estimate or disabling PMTUD for certain destinations.

Packetization Layer Path MTU Discovery (PLPMTUD, RFC 4821)

An IETF Working Group (pmtud) was chartered to define a new mechanism for Path MTU Discovery to solve these issues. This process resulted in RFC 4821, Packetization Layer Path MTU Discovery ("PLPMTUD"), which was published in March 2007. This scheme requires cooperation from a network layer above IP, namely the layer that performs "packetization". This could be TCP, but could also be a layer above UDP, let's say an RPC or file transfer protocol. PLPMTUD does not require ICMP messages. The sending packetization layer starts with small packets, and probes progressively larger sizes. When there's an indication that a larger packet was successfully transmitted to the destination (presumably because some sort of ACK was received), the Path MTU estimate is raised accordingly.

When a large packet is lost, this might be due to an MTU limitation, but it might also have other causes, such as congestion or a transmission error - or maybe it was just the ACK that was lost! PLPMTUD recommends that the first time this happens, the sending packetization layer should assume an MTU issue and try smaller packets; an isolated incident need not be interpreted as an indication of congestion.
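The probing idea can be sketched as a search for the largest size that gets through (illustrative Python; `probe_succeeds` stands in for actually sending a probe segment and seeing it acknowledged, and RFC 4821 leaves the actual search strategy to the implementation):

```python
def plpmtud_search(probe_succeeds, low=512, high=9000):
    """Find the largest packet size delivered end-to-end.

    Assumes probe results are monotone: sizes up to the Path MTU
    succeed, larger ones are lost without any ICMP feedback.
    """
    while low < high:
        mid = (low + high + 1) // 2
        if probe_succeeds(mid):
            low = mid       # probe acknowledged: raise the Path MTU estimate
        else:
            high = mid - 1  # probe lost: assume an MTU limitation
    return low

# Simulated path with a 1500-byte bottleneck MTU:
assert plpmtud_search(lambda size: size <= 1500) == 1500
```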

An implementation of the new scheme for the Linux kernel was integrated into version 2.6.17. It is controlled by a "sysctl" value that can be observed and set through /proc/sys/net/ipv4/tcp_mtu_probing. Possible values are:

  • 0: Don't perform PLPMTUD
  • 1: Perform PLPMTUD only after detecting a "blackhole" in old-style PMTUD
  • 2: Always perform PLPMTUD, and use the value of tcp_base_mss as the initial MSS.

A user-space implementation over UDP is included in the VFER bulk transfer tool.

References

Implementations

  • for Linux 2.6 - integrated in mainstream kernel as of 2.6.17. However, it is disabled by default (see net.ipv4.tcp_mtu_probing sysctl)
  • for NetBSD

-- Hank Nussbacher - 2005-07-03 -- Simon Leinen - 2006-07-19 - 2018-07-02

Network Protocols

This section contains information about a few common Internet protocols, with a focus on performance questions.

For information about higher-level protocols, see under application protocols.

-- SimonLeinen - 2004-10-31 - 2015-04-26

TCP (Transmission Control Protocol)

The Transmission Control Protocol (TCP) is the prevalent transport protocol used on the Internet today. It uses window-based transmission to provide the service of a reliable byte stream, and adapts the rate of transfer to the state (of congestion) of the network and the receiver. Basic mechanisms include:

  • Segments that fit into IP packets, into which the byte-stream is split by the sender,
  • a checksum for each segment,
  • a Window, which bounds the amount of data "in flight" between the sender and the receiver,
  • Acknowledgements, by which the receiver tells the sender about segments that were received successfully,
  • a flow-control mechanism, which regulates data rate as a function of network and receiver conditions.

Originally specified in RFC 793 (September 1981), TCP was clarified, refined and extended in many documents. Perhaps most importantly, congestion control was introduced in "TCP Tahoe" in 1988, described in Van Jacobson's 1988 SIGCOMM article on "Congestion Avoidance and Control". It can be said that TCP's congestion control is what keeps the Internet working when links are overloaded. In today's Internet, the enhanced "Reno" variant of congestion control is probably the most widespread.

RFC 7323 (formerly RFC 1323) specifies a set of "TCP Extensions for High Performance", namely the Window Scaling Option, which provides for much larger windows than the original 64K, the Timestamp Option and the PAWS (Protection Against Wrapped Sequence numbers) mechanism. These extensions are supported by most contemporary TCP stacks, although they frequently must be activated explicitly (or implicitly by configuring large TCP windows).

Another widely implemented performance enhancement to TCP is selective acknowledgements (SACK, RFC 2018). In TCP as originally specified, the acknowledgements ("ACKs") sent by a receiver were always "cumulative", that is, they specified the last byte of the part of the stream that was completely received. SACK, in contrast, lets the receiver acknowledge non-contiguous blocks of data, which is advantageous with large TCP windows, in particular when chances are high that multiple segments within a single window are lost. RFC 2883 describes an extension to SACK which makes TCP more robust in the presence of packet reordering in the network.

In addition, RFC 3449, TCP Performance Implications of Network Path Asymmetry, provides useful guidance, since the vast majority of Internet connections are asymmetrical.

Further Reading

References

  • RFC 7414, A Roadmap for Transmission Control Protocol (TCP) Specification Documents, M. Duke, R. Braden, W. Eddy, E. Blanton, A. Zimmermann, February 2015
  • RFC 793, Transmission Control Protocol, J. Postel, September 1981
  • draft-ietf-tcpm-rfc793bis-10, Transmission Control Protocol Specification, Wesley M. Eddy, July 2018
  • RFC 7323, TCP Extensions for High Performance, D. Borman, B. Braden, V. Jacobson, B. Scheffenegger, September 2014 (obsoletes RFC 1323 from May 1993)
  • RFC 2018, TCP Selective Acknowledgement Options, M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, October 1996
  • RFC 5681, TCP Congestion Control, M. Allman, V. Paxson, E. Blanton, September 2009
  • RFC 3449, TCP Performance Implications of Network Path Asymmetry, H. Balakrishnan, V. Padmanabhan, G. Fairhurst, M. Sooriyabandara, December 2002

There is ample literature on TCP, in particular research literature on its performance and large-scale behaviour.

Juniper has a nice white paper, Supporting Differentiated Service Classes: TCP Congestion Control Mechanisms (PDF format) explaining TCP's congestion control as well as many of the enhancements proposed over the years.

-- Simon Leinen - 2004-10-27 - 2018-07-02 -- Ulrich Schmid - 2005-05-31

-- TobyRodwell - 06 Apr 2005

  • Active open connection A TCP connection that a host has initiated i.e it sent the SYN
  • Passive open connection A TCP connection that is in state LISTENING
  • Pure ACK A TCP segment containing no data, just acknowledging sent data
  • Nagle algorithm The Nagle algorithm prevents the sending of small segments while there is any outstanding unacknowledged data. This reduces the number of tinygrams sent. It is particularly elegant because on a fast LAN, ACKs arrive quickly (so the algorithm has minimal effect), whereas on a slow WAN the ACKs arrive more slowly, so not as many tinygrams are sent. Users of high-speed networks, where capacity is not a significant issue but OWD is, may choose to disable the Nagle algorithm.
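On most systems, the Nagle algorithm can be disabled per-socket with the standard TCP_NODELAY option, for example in Python:

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable the Nagle algorithm for this socket; small segments will then
# be sent immediately instead of being coalesced with later data.
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
assert s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
s.close()
```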
 

Window-Based Transmission

TCP is a sliding-window protocol. The receiver tells the sender the available buffer space at the receiver (TCP header field "window"). The total window size is the minimum of sender buffer size, advertised receiver window size and congestion window size.

The sender may transmit up to this amount of data before having to wait for a further window update from the receiver, and should not have more than this amount of data in transit in the network. The sender must buffer sent data until it has been ACKed by the receiver, so that the data can be retransmitted if necessary. With each ACK, acknowledged segments leave the window, and new segments can be sent as long as they fit within the (possibly updated) window.

Due to TCP's flow control mechanism, the TCP window size can limit the maximum achievable throughput regardless of the bandwidth of the network path. Too small a TCP window can keep performance well below what the path could deliver, while an excessively large window can cause problems of its own, for example during error recovery.

The TCP window size is the most important parameter for achieving maximum throughput across high-performance networks. To reach the maximum transfer rate, the TCP window should be no smaller than the bandwidth-delay product.

Window size (bytes) ≥ Bandwidth (bytes/sec) × Round-trip time (sec)

Example:

window size: 8192 bytes
round-trip time: 100 ms
maximum throughput: 8192 bytes × 8 / 0.1 s ≈ 0.66 Mbit/sec
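The relationship between window, RTT and throughput is simple enough to compute directly (illustrative Python):

```python
def max_throughput_bps(window_bytes, rtt_s):
    """Upper bound on window-limited TCP throughput, in bits per second."""
    return window_bytes * 8 / rtt_s

def required_window_bytes(target_bps, rtt_s):
    """Smallest window (the bandwidth-delay product) that sustains target_bps."""
    return target_bps * rtt_s / 8

# The example above: an 8192-byte window over a 100 ms RTT path
assert abs(max_throughput_bps(8192, 0.1) - 655_360) < 1  # ~0.66 Mbit/s
```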

References

  • Calculating TCP goodput, Javascript-based on-line calculator by V. Reijs, https://webmail.heanet.ie/~vreijs/tcpcalculations.htm

-- UlrichSchmid & SimonLeinen - 31 May-07 Jun 2005

Large TCP Windows

In order to achieve high data rates with TCP over "long fat networks", i.e. network paths with a large bandwidth-delay product, TCP sinks (that is, hosts receiving data transported by TCP) must advertise a large TCP receive window (referred to as just 'the window', since there is not an equivalent advertised 'send window').

The window is a 16-bit value (bytes 15 and 16 in the TCP header) and so, in TCP as originally specified, it is limited to a value of 65535 (64K). The receive window sets an upper limit on the sustained throughput achievable over a TCP connection, since it represents the maximum amount of unacknowledged data (in bytes) there can be on the TCP path. Mathematically, achievable throughput can never be more than WINDOW_SIZE/RTT, so for a trans-Atlantic link with, say, an RTT (Round Trip Time) of 150 ms, throughput is limited to a maximum of about 3.5 Mbps. With the emergence of "long fat networks", the limit of 64K bytes (on some systems even just 32K bytes!) was clearly insufficient, and so RFC 7323 laid down (amongst other things) a way of scaling the advertised window, such that the 16-bit window value can represent numbers larger than 64K.

RFC 7323 extensions

RFC 7323, TCP Extensions for High Performance, (and formerly RFC 1323) defines several mechanisms to enable high-speed transfers over LFNs: Window Scaling, TCP Timestamps, and Protection Against Wrapped Sequence numbers (PAWS).

The TCP window scaling option increases the maximum window size from 64 KB to 1 GB, by shifting the window field left by up to 14 bits. The window scale option is used only during the TCP 3-way handshake (both sides must set the window scale option in their SYN segments if scaling is to be used in either direction).

It is important to use the TCP timestamps option with large TCP windows. With this option, each segment carries a timestamp; the receiver returns that timestamp in each ACK, which allows the sender to estimate the RTT continuously. The timestamps also enable Protection Against Wrapped Sequences (PAWS), which guards against the sequence-number wrap-around that can occur with large windows.

(Auto) Configuration

In the past, most operating systems required manual tuning to use large TCP windows. The OS-specific tuning section contains information on how to do this for a set of operating systems.

Since around 2008-2010, many popular operating systems will use large windows and the necessary protocol options by default, thanks to TCP Buffer Auto-Tuning.

Can TCP Windows ever be too large?

There are several potential issues when TCP Windows are larger than necessary:

  1. When there are many active TCP connection endpoints (sockets) on a system - such as a popular Web or file server - then a large TCP window size will lead to high consumption of system (kernel) memory. This can have a number of negative consequences: The system may run out of buffer space so that no new connections can be opened, or the high occupation of kernel memory (which typically must reside in actual RAM and cannot be "paged out" to disk) can "starve" other processes of access to fast memory (cache and RAM)
  2. Large windows can cause large "bursts" of consecutive segments/packets. When there is a bottleneck in the path - because of a slower link or because of cross-traffic - these bursts will fill up buffers in the network device (router or switch) in front of that bottleneck. The larger these bursts, the higher are the risks that this buffer overflows and causes multiple segments to be dropped. So a large window can lead to "sawtooth" behavior and worse link utilisation than with a just-big-enough window where TCP could operate at a steady rate.

Several methods for automatic TCP buffer tuning have been developed to resolve these issues, some of which have been implemented in (at least) recent Linux versions.

References

-- SimonLeinen - 27 Oct 2004 - 27 Sep 2014

TCP Acknowledgements

In TCP's sliding-window scheme, the receiver acknowledges the data it receives, so that the sender can advance the window and send new data. As originally specified, TCP's acknowledgements ("ACKs") are cumulative: the receiver tells the sender how much consecutive data it has received. More recently, selective acknowledgements were introduced to allow more fine-grained acknowledgements of received data.

Delayed Acknowledgements and "ack every other"

RFC 813 first suggested a delayed acknowledgement (Delayed ACK) strategy, where a receiver doesn't always immediately acknowledge segments as it receives them. This recommendation was carried forth and specified in more detail in RFC 1122 and RFC 5681 (formerly known as RFC 2581). RFC 5681 mandates that an acknowledgement be sent for at least every other full-size segment, and that no more than 500ms expire before any segment is acknowledged.

The resulting behavior is that, for longer transfers, acknowledgements are only sent for every two segments received ("ack every other"). This is in order to reduce the amount of reverse flow traffic (and in particular the number of small packets). For transactional (request/response) traffic, the delayed acknowledgement strategy often makes it possible to "piggy-back" acknowledgements on response data segments.

A TCP segment which is only an acknowledgement, i.e. carries no payload, is termed a pure ACK.

Delayed Acknowledgments should be taken into account when doing RTT estimation. As an illustration, see this change note for the Linux kernel from Gavin McCullagh.

Critique on Delayed ACKs

John Nagle nicely explains problems with current implementations of delayed ACK in a comment to a thread on Hacker News:

Here's how to think about that. A delayed ACK is a bet. You're betting that there will be a reply, upon which an ACK can be piggybacked, before the fixed timer runs out. If the fixed timer runs out, and you have to send an ACK as a separate message, you lost the bet. Current TCP implementations will happily lose that bet forever without turning off the ACK delay. That's just wrong.

The right answer is to track wins and losses on delayed and non-delayed ACKs. Don't turn on ACK delay unless you're sending a lot of non-delayed ACKs closely followed by packets on which the ACK could have been piggybacked. Turn it off when a delayed ACK has to be sent.

Duplicate Acknowledgements

A duplicate acknowledgement (DUPACK) is one with the same acknowledgement number as its predecessor - it signifies that the TCP receiver has received a segment newer than the one it was expecting, i.e. that it has missed a segment. The missed segment might not be lost; it might just have been reordered. For this reason, the TCP sender will not assume data loss on the first DUPACK but (by default) on the third DUPACK, when it will, as per RFC 5681, perform a "Fast Retransmit", sending the segment again without waiting for a timeout. DUPACKs are never delayed; they are sent as soon as the TCP receiver detects an out-of-order segment.
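A simplified model of the receiver's ACK behaviour illustrates this (Python sketch; segments are numbered abstractly, and a real receiver would also buffer the out-of-order data and ACK past the hole once it is filled):

```python
def ack_stream(arriving_segments, first_expected=0):
    """Cumulative ACK emitted for each arriving segment number."""
    expected = first_expected
    acks = []
    for seg in arriving_segments:
        if seg == expected:
            expected += 1      # in-order: advance the cumulative ACK point
        acks.append(expected)  # out-of-order arrivals repeat the last ACK
    return acks

# Segment 2 is lost; segments 3, 4 and 5 each trigger a DUPACK for 2.
# The third DUPACK would trigger Fast Retransmit of segment 2.
assert ack_stream([0, 1, 3, 4, 5]) == [1, 2, 2, 2, 2]
```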

"Quickack mode" in Linux

The TCP implementation in Linux has a special receive-side feature that temporarily disables delayed acknowledgements/ack-every-other when the receiver thinks that the sender is in slow-start mode. This speeds up the slow-start phase when RTT is high. This "quickack mode" can also be explicitly enabled or disabled with setsockopt() using the TCP_QUICKACK option. The Linux modification has been criticized because it makes Linux' slow-start more aggressive than other TCPs' (that follow the SHOULDs in RFC 5681), without sufficient validation of its effects.

"Stretch ACKs"

Techniques such as LRO and GRO (and to some level, Delayed ACKs as well as ACKs that are lost or delayed on the path) can cause ACKs to cover many more than the two segments suggested by the historic TCP standards. Although ACKs in TCP have always been defined as "cumulative" (with the exception of SACKs), some congestion control algorithms have trouble with ACKs that are "stretched" in this way. In January 2015, a patch set was submitted to the Linux kernel network development with the goal to improve the behavior of the Reno and CUBIC congestion control algorithms when faced with stretch ACKs.

References

  • RFC 5681, TCP Congestion Control, M. Allman, V. Paxson, E. Blanton, September 2009
  • RFC 813, Window and Acknowledgement Strategy in TCP, D. Clark, July 1982
  • RFC 1122, Requirements for Internet Hosts -- Communication Layers, R. Braden (Ed.), October 1989

-- TobyRodwell - 2005-04-05
-- SimonLeinen - 2007-01-07 - 2015-01-28

Many enhancements have been made to TCP over the years to improve its performance, e.g. improved flow-control algorithms in TCP stacks (larger default buffer sizes, fast retransmit and recovery support, path MTU discovery) and the development of high-speed TCP variants. Additional TCP options were also introduced to increase overall TCP throughput.

-- UlrichSchmid - 10 Jun 2005

 

TCP Window Scaling Option

TCP as originally specified only allows for windows up to 65536 bytes, because the window-related protocol fields all have a width of 16 bits. This prevents the use of large TCP windows, which are necessary to achieve high transmission rates over paths with a large bandwidth-delay product.

The Window Scaling Option is one of several options defined in RFC 7323, "TCP Extensions for High Performance". It allows TCP to advertise and use a window larger than 65536 bytes. It works as follows: in the initial handshake, a TCP speaker announces a scaling factor, which is a power of two between 2^0 (no scaling) and 2^14, to allow for an effective window of up to 2^30 bytes (one gigabyte). Window Scaling only comes into effect when both ends of the connection advertise the option (even if only with a scaling factor of 2^0).
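The arithmetic is straightforward (illustrative Python):

```python
def effective_window(window_field, scale_shift):
    """Effective receive window under RFC 7323 window scaling.

    window_field is the 16-bit value from the TCP header;
    scale_shift is the negotiated shift count (0..14).
    """
    if not (0 <= window_field <= 0xFFFF and 0 <= scale_shift <= 14):
        raise ValueError("value out of range")
    return window_field << scale_shift

# Maximum scaled window: just under 2**30 bytes (one gigabyte)
assert effective_window(0xFFFF, 14) == 65535 * 2**14
```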

As explained under the large TCP windows topic, Window Scaling is typically used in connection with the TCP Timestamp option and the PAWS (Protection Against Wrapped Sequence numbers), both defined in RFC 7323.

32K limitations on some systems when Window Scaling is not used

Some systems, notably Linux, limit the usable window in the absence of Window Scaling to 32 Kilobytes. The reasoning behind this is that some (other) systems used in the past were broken by larger windows, because they erroneously interpreted the 16-bit values related to TCP windows as signed, rather than unsigned, integers.

This limitation can have the effect that users (or applications) that request window sizes between 32K and 64K, but don't have Window Scaling enabled, do not actually benefit from the desired window sizes.

The artificial limitation was finally removed from the Linux kernel sources on March 21, 2006. The old behavior (window limited to 32KB when Window Scaling isn't used) can still be enabled through a sysctl tuning parameter, in case one really wants to interoperate with TCP implementations that are broken in the way described above.

Problems with window scaling

Middlebox issues

MiddleBoxes which do not understand window scaling may cause very poor performance, as described in WindowScalingProblems. Windows Vista limits the scaling factor for HTTP transactions to 2 to avoid some of the problems, see WindowsOSSpecific.

Diagnostic issues

The Window Scaling option has an impact on how the offered window in TCP ACKs should be interpreted. This can cause problems when one is faced with incomplete packet traces that lack the initial SYN/SYN+ACK handshake where the options are negotiated: The SEQ/ACK sequence may look incorrect because scaling cannot be taken into account. One effect of this is that tcpgraph or similar analysis tools (included in tools such as WireShark) may produce incoherent results.

References

  • RFC 7323, TCP Extensions for High Performance, D. Borman, B. Braden, V. Jacobson, B. Scheffenegger, September 2014 (obsoletes RFC 1323 from May 1993)
  • git note about the Linux kernel modification that got rid of the 32K limit, R. Jones and D.S. Miller, March 2006. Can also be retrieved as a patch.

-- SimonLeinen - 21 Mar 2006 - 27 September 2014
-- PekkaSavola - 07 Nov 2006
-- AlexGall - 31 Aug 2007 (added reference to scaling limitation in Windows Vista)

TCP Flow Control

Note: This topic describes the Reno enhancement of classical "Van Jacobson" or Tahoe congestion control. There have been many suggestions for improving this mechanism - see the topic on high-speed TCP variants.

TCP flow control and window size adjustment is mainly based on two key mechanisms: Slow Start and Additive Increase/Multiplicative Decrease (AIMD), also known as Congestion Avoidance. (RFC 793 and RFC 5681)

Slow Start

To prevent a starting TCP connection from flooding the network, a Slow Start mechanism was introduced into TCP. This mechanism effectively probes to find the available bandwidth.

In addition to the window advertised by the receiver, a Congestion Window (cwnd) value is used and the effective window size is the lesser of the two. The starting value of the cwnd window is set initially to a value that has been evolving over the years, the TCP Initial Window. After each acknowledgment, the cwnd window is increased by one MSS. By this algorithm, the data rate of the sender doubles each round-trip time (RTT) interval (actually, taking into account Delayed ACKs, rate increases by 50% every RTT). For a properly implemented version of TCP this increase continues until:

  • the advertised window size is reached
  • congestion (packet loss) is detected on the connection.
  • there is no traffic waiting to take advantage of an increased window (i.e. cwnd should only grow if it needs to)

When congestion is detected, the TCP flow-control mode is changed from Slow Start to Congestion Avoidance. Note that some TCP implementations maintain cwnd in units of bytes, while others use units of full-sized segments.

Congestion Avoidance

Once congestion is detected (through timeout and/or duplicate ACKs), the data rate is reduced in order to let the network recover.

Slow Start uses an exponential increase in window size and thus also in data rate. Congestion Avoidance uses a linear growth function (additive increase). This is achieved by introducing - in addition to the cwnd window - a slow start threshold (ssthresh).

As long as cwnd is less than ssthresh, Slow Start applies. Once ssthresh is reached, cwnd is increased by at most one segment per RTT. The cwnd window continues to open with this linear rate until a congestion event is detected.

When congestion is detected, ssthresh is set to half the cwnd (or to be strictly accurate, half the "Flight Size". This distinction is important if the implementation lets cwnd grow beyond rwnd (the receiver's declared window)). cwnd is either set to 1 if congestion was signalled by a timeout, forcing the sender to enter Slow Start, or to ssthresh if congestion was signalled by duplicate ACKs and the Fast Recovery algorithm has terminated. In either case, once the sender enters Congestion Avoidance, its rate has been reduced to half the value at the time of congestion. This multiplicative decrease causes the cwnd to close exponentially with each detected loss event.
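The Slow Start and AIMD rules above can be sketched in a few lines of C. This is a simplification that counts cwnd in whole segments and omits Fast Recovery; the type and function names are ours, not taken from any particular TCP stack:

```c
/* Simplified Reno congestion-window update, in units of
 * full-sized segments (real stacks often track bytes). */
typedef struct {
    double cwnd;     /* congestion window (segments) */
    double ssthresh; /* slow-start threshold (segments) */
} tcp_cc;

/* Called for each new ACK received. */
void on_ack(tcp_cc *c) {
    if (c->cwnd < c->ssthresh)
        c->cwnd += 1.0;           /* Slow Start: doubles per RTT */
    else
        c->cwnd += 1.0 / c->cwnd; /* Congestion Avoidance: ~1 segment per RTT */
}

/* Loss detected via retransmission timeout. */
void on_timeout(tcp_cc *c) {
    c->ssthresh = c->cwnd / 2;    /* multiplicative decrease */
    c->cwnd = 1.0;                /* re-enter Slow Start */
}
```

Adding one segment per ACK while below ssthresh yields the exponential growth of Slow Start, while adding 1/cwnd per ACK yields the linear (one segment per RTT) growth of Congestion Avoidance.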

Fast Retransmit

In Fast Retransmit, the arrival of three duplicate ACKs is interpreted as packet loss, and retransmission starts before the retransmission timer (RTO) expires.

The missing segment will be retransmitted immediately without going through the normal retransmission queue processing. This improves performance by eliminating delays that would suspend effective data flow on the link.

Fast Recovery

Fast Recovery is used to react quickly to a single packet loss. In Fast Recovery, the receipt of three duplicate ACKs, while taken to mean the loss of a segment, does not result in a full Slow Start: since later segments evidently got through, congestion has not stopped everything. Instead, ssthresh is set to half of the current send window size, the missing segment is retransmitted (Fast Retransmit), and cwnd is set to ssthresh plus three segments. Each additional duplicate ACK indicates that one segment has left the network at the receiver, so cwnd is increased by one segment, allowing the transmission of another segment if permitted by the new cwnd. When an ACK is received for new data, cwnd is reset to ssthresh, and TCP enters Congestion Avoidance mode.

References

-- UlrichSchmid - 07 Jun 2005 -- SimonLeinen - 27 Jan 2006 - 23 Jun 2011

High-Speed TCP Variants

There have been numerous ideas for improving TCP over the years, some of which have been adopted by mainstream implementations after thorough review. Recently there has been an upsurge of work on improving TCP's behavior over LongFatNetworks, because the classical congestion control algorithms have been shown to limit the efficiency of network resource utilization on such paths. The various new TCP implementations change how the size of the congestion window is determined, and some of them require extra feedback from the network. Such congestion control protocols are generally classified by the congestion signal they rely on, e.g. packet loss, delay, or explicit network feedback.

An orthogonal technique for improving TCP's performance is automatic buffer tuning.

Comparative Studies

All papers about individual TCP-improvement proposals contain comparisons against older TCP to quantify the improvement. There are several studies that compare the performance of the various new TCP variants, including

Other References

  • Gigabit TCP, G. Huston, The Internet Protocol Journal, Vol. 9, No. 2, June 2006. Contains useful descriptions of many modern TCP enhancements.
  • Faster, G. Huston, June 2005. This article from ISP Column looks at various approaches to TCP transfers at very high rates.
  • Congestion Control in the RFC Series, M. Welzl, W. Eddy, July 2006, Internet-Draft (work in progress)
  • ICCRG Wiki, IRTF (Internet Research Task Force) ICCRG (Internet Congestion Control Research Group), includes bibliography on congestion control.
  • RFC 6077: Open Research Issues in Internet Congestion Control, D. Papadimitriou (Ed.), M. Welzl, M. Scharf, B. Briscoe, February 2011

Related Work

  • PFLDnet (International Workshop on Protocols for Fast Long-Distance Networks): 2003, 2004, 2005, 2006, 2007, 2009, 2010. (The 2008 pages seem to have been removed from the Net.)

-- Simon Leinen - 2005-05-28–2016-09-18 -- Orla Mc Gann - 2005-10-10 -- Chris Welti - 2008-02-25


Network Emulation Tools

In general, it is difficult to assess performance of distributed applications and protocols before they are deployed on the real network, because the interaction with network impairments such as delay is hard to predict. Therefore, researchers and practitioners often use emulation to mimic deployment (typically wide-area) networks for use in laboratory testbeds.

The emulators listed below implement logical interfaces with configurable parameters such as delay, bandwidth (capacity) and packet loss rates.

-- TobyRodwell - 06 Apr 2005 -- SimonLeinen - 15 Dec 2005

Linux netem (introduced in recent 2.6 kernels)

NISTnet works only on 2.4.x Linux kernels, so for the best support of recent GE cards I would recommend kernel 2.6.10 or 2.6.11 together with the new netem (network emulator), which provides much of the same functionality as a special QoS queueing discipline. It is found in menuconfig under:

 Networking -->
   Networking Options -->
     QoS and/or fair queuing -->
        Network emulator

You will need a recent iproute2 package that supports netem. The tc command is used to configure netem characteristics such as loss rates and distribution, delay and delay variation, and packet corruption rate (packet corruption was added in Linux version 2.6.16).

References

-- TobyRodwell - 14 Apr 2005
-- SimonLeinen - 21 Mar 2006 - 07 Jan 2007 -- based on information from David Martinez Moreno (RedIRIS)

Larry Dunn from Cisco has set up a portable network emulator based on NISTnet. He wrote this about it:

My systems started as 2x100BT, but not for the (cost) reasons you suspect. I have 3 cheap Linux systems in a mobile rack (3x1RU, <$5k USD total). They are attached in a "row", 2 with web100, 1 with NISTNet like this: web100 - NISTNet - web100. The NISTNet node acts as a variable delay-line (emulates different fiber propagation delays), and as a place to inject loss, if desired. DummyNet/BSD could work as well.

I was looking for a very "shallow" (front-to-back) 1RU system, so I could fit it into a normal-sized Anvil case (for easy shipping to Networkers, etc). Penguin Computing was one of 2 companies I found that was then selling such "shallow" rack-mount systems. The shallow units were "low-end" servers, so they only had 2x100BT ports.

Their current low-end systems have two 1xGE ports, and are (still) around $1100 USD each. So a set of 3 systems (inc. rack), Linux preloaded, 3-yr warranty, is still easily under $5k (Actually, closer to $4k, including a $700 Anvil 2'x2'x2' rolling shock-mount case, and a cheap 8-port GE switch). Maybe $5k if you add more memory & disk...

Here's a reference to their current low-end stuff:

http://penguincomputing.com/products/servers/relion1XT.php?PHPSESSID=f90f88b4d54aad4eedc2bdae33796032

I have no affiliation with Penguin, just happened to buy their stuff, and it's working OK (though the fans are pretty loud). I suppose there are comparable vendors in Europe.

-- TobyRodwell - 14 Apr 2005


dummynet

A flexible tool for testing networking protocols, dummynet has been included in FreeBSD since 1998. A description of dummynet can be found on its homepage.

References

-- SimonLeinen - 15 Dec 2005

Sun hxbt STREAMS Module

From an announcement on the OpenSolaris Networking community site:

Hxbt is a Stream module/driver which emulates a WAN environment. It captures packets from IP and manipulates the packets according to the WAN setup. There are five parameters to control the environment. They are network propagation delay, bandwidth, drop rate, reordering, and corruption. Hxbt acts like a "pipe" between two hosts. And the pipe is the emulated WAN.

For more information download the source or take a look at the README.

-- SimonLeinen - 15 Dec 2005

 

End-System (Host) Tuning

This section contains hints for tuning end systems (hosts) for maximum performance.

References

-- SimonLeinen - 31 Oct 2004 - 23 Jul 2009
-- PekkaSavola - 17 Nov 2006

Calculating required TCP buffer sizes

Regardless of operating system, TCP buffers must be appropriately dimensioned if they are to support long distance, high data rate flows. Victor Reijs's TCP Throughput calculator is able to determine the maximum achievable TCP throughput based on RTT, expected packet loss, segment size, re-transmission timeout (RTO) and buffer size. It can be found at:

http://people.heanet.ie/~vreijs/tcpcalculations.htm

Tolerance to Packet Re-ordering

As explained in the section on packet re-ordering, packets can be re-ordered wherever there is parallelism in a network. If such parallelism exists (e.g. as caused by the internal architecture of the Juniper M160s used in the GEANT network), then TCP should be set to tolerate a larger number of re-ordered packets than the default, which is 3 (see here for an example for Linux). For large flows across GEANT a value of 20 should be sufficient; an alternative workaround is to use multiple parallel TCP flows.

Modifying TCP Behaviour

TCP congestion control algorithms are often configured conservatively, to avoid Slow Start overwhelming a network path. In a capacity-rich network this may not be necessary, and the defaults may be overridden.

-- TobyRodwell - 24 Jan 2005


[The following information is courtesy of Jeff Boote, from Internet2, and was forwarded to me by Nicolas Simar.]

Operating Systems (OSs) have a habit of reconfiguring network interface settings back to their default, even when the correct values have been written in a specific config file. This is the result of bugs, and they appear in almost all OSs. Sometimes they get fixed in a given release but then get broken again in a later release. It is not known why this is the case but it may be partly that driver programmers don't test their products under conditions of large latency. It is worth noting that experience shows 'ifconfig' works well for ''txqueuelength' and 'MTU'.


"Chatty" Protocols

A common problem with naively designed application protocols is that they are too "chatty", i.e. they imply too many "round-trip" cycles where one party has to wait for a response from the other. It is an easy mistake to make, because when testing such a protocol locally, these round-trips usually don't have much of an impact on overall performance. But when used over network paths with large RTTs, chattiness can dramatically impact perceived performance.

Example: SMTP (Simple Mail Transfer Protocol)

The Simple Mail Transfer Protocol (SMTP) is used to transport most e-mail messages over the Internet. In its original design (RFC 821, now superseded by RFC 2821), the protocol consisted of a strict sequence of request/response transactions, some of them very small. Taking an example from RFC 2920, a typical SMTP conversation between a client "C" that wants to send a message, and a server "S" that receives it, would look like this:

   S: <wait for open connection>
   C: <open connection to server>
   S: 220 Innosoft.com SMTP service ready
   C: HELO dbc.mtview.ca.us
   S: 250 Innosoft.com
   C: MAIL FROM:<mrose@dbc.mtview.ca.us>
   S: 250 sender <mrose@dbc.mtview.ca.us> OK
   C: RCPT TO:<ned@innosoft.com>
   S: 250 recipient <ned@innosoft.com> OK
   C: RCPT TO:<dan@innosoft.com>
   S: 250 recipient <dan@innosoft.com> OK
   C: RCPT TO:<kvc@innosoft.com>
   S: 250 recipient <kvc@innosoft.com> OK
   C: DATA
   S: 354 enter mail, end with line containing only "."
    ...
   C: .
   S: 250 message sent
   C: QUIT
   S: 221 goodbye

This simple conversation contains nine places where the client waits for a response from the server.

In order to improve this, the PIPELINING extension (RFC 2920) was later defined. When the server supports it - as signaled through the ESMTP extension mechanism in the response to an EHLO request - the client is allowed to send multiple requests in a row, and collect the responses later. The previous conversation becomes the following one with PIPELINING:

   S: <wait for open connection>
   C: <open connection to server>
   S: 220 innosoft.com SMTP service ready
   C: EHLO dbc.mtview.ca.us
   S: 250-innosoft.com
   S: 250 PIPELINING
   C: MAIL FROM:<mrose@dbc.mtview.ca.us>
   C: RCPT TO:<ned@innosoft.com>
   C: RCPT TO:<dan@innosoft.com>
   C: RCPT TO:<kvc@innosoft.com>
   C: DATA
   S: 250 sender <mrose@dbc.mtview.ca.us> OK
   S: 250 recipient <ned@innosoft.com> OK
   S: 250 recipient <dan@innosoft.com> OK
   S: 250 recipient <kvc@innosoft.com> OK
   S: 354 enter mail, end with line containing only "."
    ...
   C: .
   C: QUIT
   S: 250 message sent
   S: 221 goodbye

There are still a couple of places where the client has to wait for responses, notably during initial negotiation; but the number of these situations has been reduced to those where the response has an impact on further processing. The PIPELINING extension reduces the number of "turn-arounds" from nine to four. This speeds up the overall mail submission process when the RTT is high, reduces the number of packets that have to be sent (because several requests, or several responses, can be sent as a single TCP segment), and significantly decreases the risk of timeouts (and consequent loss of connection) when the connectivity between client and server is really bad.

The X Window System protocol (X11) is an example of a protocol that has been designed from the start to reduce turn-arounds.

References

-- SimonLeinen - 05 Jul 2005

Performance-friendly I/O interfaces

read()/write() vs. mmap()/write() vs. sendfile()

For applications with high input/output performance requirements (including network I/O), it is worthwhile to look at operating system support for efficient I/O routines.

As an example, here is simple pseudo-code that reads the contents of an open file in and writes them to an open socket out - this code could be part of a file server. A straightforward way of coding this uses the read()/write() system calls to copy the bytes through a memory buffer:

#define BUFSIZE 4096
long send_file (int in, int out) {
  unsigned char buffer[BUFSIZE];
  int result; long written = 0;
  while ((result = read (in, buffer, BUFSIZE)) > 0) {
    if (write (out, buffer, result) != result)
      return -1;
    written += result;
  }
  return (result == 0 ? written : result);
}

Unfortunately, this common programming paradigm results in high memory traffic and inefficient use of a system's caches. Also, if a small buffer is used, the number of system operations and, in particular, of user/kernel context switches will be quite high.

On systems that support memory mapping of files using mmap(), the following is more efficient if the source is an actual file:

#define BUFSIZE 4096
long send_file (int in, int out) {
  unsigned char *b;
  struct stat st;
  if (fstat (in, &st) == -1) return -1;
  if ((b = mmap (0, st.st_size, PROT_READ, MAP_SHARED, in, 0)) == MAP_FAILED)
    return -1;
  madvise (b, st.st_size, MADV_SEQUENTIAL);
  return write (out, b, st.st_size);
}

An even more efficient - and also more concise - variant is the sendfile() call, which directly copies the bits from the file to the network.

long send_file (int in, int out) {
  struct stat st;
  if (fstat (in, &st) == -1) return -1;
  off_t offset = 0;
  return sendfile (out, in, &offset, st.st_size);
}

Note that an operating system could optimize this internally up to the point where data blocks are copied directly from the disk controller to the network controller without any involvement of the CPU.

For more complex situations, the sendfilev() interface can be used to send data from multiple files and memory buffers to construct complex protocol units with a single call.

-- SimonLeinen - 26 Jun 2005

Note that the above usage of write() and sendfile() is simplified: these system calls may stop in the middle and return the number of bytes actually written, so real-world code should loop around them to send the rest of the data and handle interruption by signals.

The first loop should be written as:

#include <errno.h>
#include <unistd.h>

#define BUFSIZE 4096
long send_file (int in, int out) {
  unsigned char buffer[BUFSIZE];
  int result; long written = 0;
  while ((result = read (in, buffer, BUFSIZE)) > 0) {
    ssize_t tosend = result;
    ssize_t offset = 0;
    while (tosend > 0) {
      result = write (out, buffer + offset, tosend);
      if (result == -1) {
        if (errno == EINTR || errno == EAGAIN)
          continue;
        else
          return -1;
      }
      written += result;
      offset += result;
      tosend -= result;
    }
  }
  return (result == 0 ? written : result);
}

References

  • Project Volo Design Document, OpenSolaris, 2008. This document contains, in section 4.5 ("zero-copy interface") a description of how sendfilev is implemented in current Solaris, as well as suggestions on generalizing the internal kernel interfaces that support it.

-- BaruchEven - 05 Jan 2006
-- SimonLeinen - 12 Sep 2008

Network Tuning

This section describes a few ways that can be used to improve performance of a network.

-- SimonLeinen - 01 Nov 2004 - 30 Mar 2006
-- PekkaSavola - 13 Jun 2008 (addition of WAN accelerators)
-- Alessandra Scicchitano - 11 Jun 2012 (Buffer Bloat)


Router Architectures

Basic Functions

Basic functions of an IP router can be divided into two main groups:

  • The forwarding plane is responsible for packet forwarding (or packet switching), which is the act of receiving packets on the router's interfaces and sending them out on (usually) other interfaces.
  • The control plane gathers and maintains network topology information, and passes it to the forwarding plane so that it knows where to forward the received packets.

Architectures

CPU-based vs. hardware forwarding support

Small routers usually have architectures similar to general-purpose computers: a CPU, operational memory, some persistent storage device (to store configuration settings and the operating system software), and network interfaces. Both forwarding and control functions are carried out by the CPU; the only separation between the forwarding and the control plane is that they run as different processes. Network interfaces in these routers are treated just like NICs in general-purpose computers, or any other peripheral device.

High performance routers however, in order to achieve multi-Gbps throughput, use specialized hardware to forward packets. (The control plane is still very similar to a general-purpose computer architecture.) This way the forwarding and the control plane are far more separated. They do not contend for shared resources like processing power, because they both have their own.

Centralized vs. distributed forwarding

Another difference between low-end and high-end routers is that in low-end routers, packets from all interfaces are forwarded using a central forwarding engine, while some high-end routers decentralize forwarding across line cards, so that packets received on an interface are handled by a forwarding engine local to that line card. Distributed forwarding is especially attractive where one wants to scale routers to large numbers of line cards (and thus interfaces). Some routers allow line cards with and without forwarding engines to be mixed in the same chassis - packets arriving on engine-less line cards are simply handled by the central engine.

On distributed architectures, the line cards typically have their own copy of the forwarding table, and run very few "control-plane" functions - usually just what's necessary to communicate with the (central) control-plane engine. So even when forwarding is CPU-based, a router with distributed forwarding behaves much like a hardware-based router, in that forwarding performance is decoupled from control-plane performance.

There are examples for all kinds of combinations of CPU-/hardware-based and centralized/distributed forwarding, even within the products of a single vendor, for example Cisco:

  CPU-based hardware forw.
centralized 7200 7600 OSR
distributed 7500 w/VIP 7600 w/DFCs

Effects on performance analysis

Because of the separation of the forwarding and the control plane in high performance routers, the performance characteristics of traffic passing through the router and of traffic destined to the router may differ significantly. Transit traffic may be handled completely by the dedicated forwarding plane, which is optimized for exactly this task, while traffic destined to the router is passed to the control-plane CPU. That CPU may have more important tasks at the moment (e.g. running routing protocols, calculating routes), and even when free it cannot process traffic as efficiently as the forwarding plane.

This means that performance metrics of intermediate hops obtained from ping or traceroute-like tools may be misleading, as they may show significantly worse metrics than those of the analyzed "normal" transit traffic. In other words, it may happen easily that a router is forwarding transit traffic correctly, fast, without any packet loss, etc., while a traceroute through the router or pinging the router shows packet loss, high round-trip time, high delay variation, or some other bad things.

However, the control plane CPU of a high performance router is not always that busy, and in quiet periods it can deal well with probe traffic directed at it. Probe measurements of intermediate hops should therefore usually be interpreted by discarding the occasional bad values, on the assumption that the router's control-plane CPU had more important things to do than dealing with the probe traffic.

Beware the slow path

Routers with hardware support for forwarding often restrict this support to a common subset of traffic and configurations. "Unusual" packets may be handled in a "slow path" using the general-purpose CPU which is otherwise mostly dedicated to control-plane tasks. This often includes packets with IP options.

Some routers also revert from hardware-forwarding to CPU-based when certain complex functions are configured on the device - for example when the router has to do NAT or other payload-inspecting features. Another reason for reverting to CPU-based forwarding is when some resources in the forwarding hardware become exhausted, such as the forwarding table ("hardware" equivalent of the routing table) or access control tables.

To effectively use routers with hardware-based forwarding, it is therefore essential to know the restrictions of the hardware, and to ensure that the majority of traffic can indeed be hardware-switched. Where this is supported, it may be a good idea to limit the amount of "slow-path" traffic using rate-limits such as Cisco's "Control Plane Policing" (CoPP).

-- AndrasJako - 06 Mar 2006

 

Active Queue Management (AQM)

Packet-switching nodes such as routers usually need queues to buffer packets when incoming traffic exceeds the available outbound capacity. Traditionally, these buffers have been organised as tail-drop queues: packets are queued until the buffer is full, and once it is full, newly arriving packets are dropped until the queue empties out. With bursty traffic (as is typical with TCP/IP), this can lead to entire bursts of arriving packets being dropped because of a full queue. The effects of this are synchronisation of flows and a decrease in aggregate throughput. Another effect of the tail-drop queueing strategy is that, when congestion is long-lived, the queue grows to fill the buffer and remains large until congestion eventually subsides. With large router buffers, this leads to increased one-way delay and round-trip times, which impacts network performance in various ways - see BufferBloat.

This has led to the idea of "active queue management", where network nodes send congestion signals as soon as they sense the onset of congestion, to avoid buffers filling up completely.

Active Queue Management is a precondition for Explicit Congestion Notification (ECN), which helps performance by reducing or eliminating packet loss during times of (light) congestion.

The earliest and best-known form of AQM on the Internet is Random Early Detection (RED). This is now supported by various routers and switches, although it is not typically activated by default. One possible reason is that RED, as originally specified, must be "tuned" depending on the given traffic mix (and optimisation goals) to be maximally effective. Various alternative methods have been proposed as improvements to RED, but none of them have enjoyed widespread use.

CoDel (May 2012) has been proposed as a promising practical alternative to RED. PIE was later proposed as an alternative to CoDel, claiming to be easier to implement efficiently, in particular in "hardware" implementations.

AQM in the IETF

RFC 2309, Recommendations on Queue Management and Congestion Avoidance in the Internet, (1998) recommended "testing, standardization, and widespread deployment" of AQM, and specifically RED, on the Internet. The testing part was certainly followed, in the sense that a huge number of academic papers was published about RED, its perceived shortcomings, proposed alternative AQMs, and so on. There was no standardization, and very little actual deployment. While RED is implemented in most routers today, it is generally not enabled by default, and very few operators explicitly enable it. There are many reasons for this, but an important one is that optimal configuration parameters for RED depend on traffic load and tradeoffs between various optimization goals (e.g. throughput and delay). RFC 3819, Advice for Internet Subnetwork Designers, also discusses questions of AQM, in particular RED and its configuration parameters.

In March 2013, a new AQM mailing list ( archive) was announced to discuss a possible replacement for RFC 2309. This evolved into the AQM Working Group. Fred Baker issued draft-baker-aqm-recommendation as a starting point. This became a Working Group document ( draft-ietf-aqm-recommendation) and was submitted to the IESG in February 2015 with Gorry Fairhurst as a co-editor.

References

  • RFC 2309, Recommendations on Queue Management and Congestion Avoidance in the Internet, B. Braden, D. Clark, J. Crowcroft, B. Davie, S. Deering, D. Estrin, S. Floyd, V. Jacobson, G. Minshall, C. Partridge, L. Peterson, K. Ramakrishnan, S. Shenker, J. Wroclawski, L. Zhang. April 1998
  • RFC 3819, Advice for Internet Subnetwork Designers, P. Karn, Ed., C. Bormann, G. Fairhurst, D. Grossman, R. Ludwig, J. Mahdavi, G. Montenegro, J. Touch, L. Wood. July 2004
  • RFC 7567, IETF Recommendations Regarding Active Queue Management, F. Baker, Ed., G. Fairhurst, Ed., July 2015
  • RFC 7806, On Queuing, Marking, and Dropping, F. Baker, R. Pan, April 2016
  • Advice on network buffering, G. Fairhurst, B. Briscoe, slides presented to ICCRG at IETF-86, March 2013
  • RFC 7928, Characterization Guidelines for Active Queue Management (AQM), N. Kuhn, Ed., P. Natarajan, Ed., N. Khademi, Ed., D. Ros, July 2016
  • draft-lauten-aqm-gsp-03, Global Synchronization Protection for Packet Queues, Wolfram Lautenschlaeger, May 2016
  • draft-briscoe-tsvwg-aqm-tcpm-rmcat-l4s-problem-01, Low Latency, Low Loss, Scalable Throughput (L4S) Internet Service: Problem Statement, Bob Briscoe, Koen De Schepper, Marcelo Bagnulo, June 2016
  • draft-ietf-tsvwg-aqm-dualq-coupled-05, DualQ Coupled AQMs for Low Latency, Low Loss and Scalable Throughput (L4S), Koen De Schepper, Bob Briscoe, Olga Bondarenko, Ing-jyh Tsang, July 2018
  • draft-briscoe-tsvwg-l4s-diffserv-01, Interactions between Low Latency, Low Loss, Scalable Throughput (L4S) and Differentiated Services, Bob Briscoe, July 2018

-- Simon Leinen - 2005-01-05 - 2018-07-04

Random Early Detection (RED)

RED is the first and most widely used Active Queue Management (AQM) mechanism. The basic idea is this: a router queue controlled by RED samples its occupancy (queue size) over time and computes a characteristic value, often a decaying average such as an Exponentially Weighted Moving Average (EWMA), although several researchers propose that using the instantaneous queue length works better. This sampled queue length is then used to compute the loss probability for an incoming packet, according to a pre-defined profile. The decision whether to drop an incoming packet is made using a random number and the drop probability. In this way, a point of (light) congestion in the network can run with a short queue, which is beneficial for applications because it keeps one-way delay and round-trip time low. When the congested link is shared by only a few TCP streams, RED also prevents synchronization effects that would otherwise degrade throughput.
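A minimal sketch of RED's drop decision follows. The threshold and weight values are made up for illustration; real implementations add refinements such as counting packets since the last drop, and the "black art" mentioned below is precisely the choice of these parameters:

```c
/* Sketch of classic RED's per-packet drop decision
 * (illustrative parameter values, not recommendations). */
#include <stdlib.h>

#define MIN_TH 5.0    /* min average-queue threshold (packets) */
#define MAX_TH 15.0   /* max average-queue threshold (packets) */
#define MAX_P  0.1    /* drop probability reached at MAX_TH */
#define WQ     0.002  /* EWMA weight for queue-length averaging */

static double avg = 0.0; /* EWMA of the queue length */

/* Returns 1 if the arriving packet should be dropped, 0 otherwise. */
int red_enqueue_decision(double instantaneous_qlen) {
    avg = (1.0 - WQ) * avg + WQ * instantaneous_qlen;
    if (avg < MIN_TH) return 0;               /* no congestion: accept */
    if (avg >= MAX_TH) return 1;              /* heavy congestion: drop */
    double p = MAX_P * (avg - MIN_TH) / (MAX_TH - MIN_TH);
    return ((double)rand() / RAND_MAX) < p;   /* probabilistic early drop */
}
```

Between the two thresholds the drop probability rises linearly with the average queue length, so early, random drops signal congestion to a few flows at a time instead of to all of them at once.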

Although RED is widely implemented in networking equipment, it is usually "off by default", and not often activated by operators. One reason is that it has various parameters (queue thresholds, EWMA parameters etc.) that need to be configured, and finding good values for these parameters is considered as something of a black art. Also, a huge number of research papers have been published that point out shortcomings of RED in particular scenarios, which discourages network operators from using it. Unfortunately, none of the many alternatives to RED that have been proposed in these publications, such as EXPRED, REM, or BLUE, has gained any more traction.

The BufferBloat activity has rekindled interest in queue-management strategies, and Kathleen Nichols and Van Jacobson, who have a long history of research on practical AQM algorithms, published CoDel in 2012, which claims to improve upon RED in several key points using a fresh approach.

References

-- SimonLeinen - 2005-01-07 - 2012-07-12

Differentiated Services (DiffServ)

From the IETF's Differentiated Services (DiffServ) Work Group's site:

The differentiated services approach to providing quality of service in networks employs a small, well-defined set of building blocks from which a variety of aggregate behaviors may be built. A small bit-pattern in each packet, in the IPv4 TOS octet or the IPv6 Traffic Class octet, is used to mark a packet to receive a particular forwarding treatment, or per-hop behavior, at each network node.

Examples of DiffServ-based services include the Premium IP and "Less-than Best Effort" (LBE) services available on GEANT/GN2 and some NRENs.
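As a sketch of how an application can request a particular per-hop behavior, the following marks a UDP socket with the Expedited Forwarding code point (DSCP 46, RFC 3246). The DS field occupies the upper six bits of the former TOS octet, hence the shift; the socket option shown is the Linux/POSIX one, and this is illustrative rather than specific to any of the services above:

```python
import socket

EF_DSCP = 46  # Expedited Forwarding code point

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# The DSCP occupies the six most significant bits of the TOS octet,
# so the byte written via IP_TOS is the code point shifted left by two.
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, EF_DSCP << 2)
```

Whether routers along the path honor the marking depends entirely on the provisioned DiffServ policy.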

References

-- TobyRodwell - 28 Feb 2005 -- SimonLeinen - 15 Jul 2005


Specific Network Technologies

This section of the knowledge base treats some (sub)network technologies with their specific performance implications.

-- SimonLeinen - 06 Dec 2005 -- BlazejPietrzak - 05 Sep 2007

Ethernet

Ethernet is now widely prevalent as a link-layer technology for local area/campus network, and is making inroads in other market segments as well, for example as a framing technique for wide-area connections (replacing ATM or SDH/SONET in some applications), or as fabric for storage networks (replacing Fibre Channel etc.) or clustered HPC systems (replacing special-purpose networks such as Myrinet etc.).

From the original media access protocol based on CSMA/CD used on shared coaxial cabling at speeds of 10 Mb/s (originally 3 Mb/s), Ethernet has evolved to much higher speeds, from Fast Ethernet (100 Mb/s) through Gigabit Ethernet (1 Gb/s) to 10 Gb/s. Shared media access and bus topologies have been replaced by star-shaped topologies connected by switches. Additional extensions include speed, duplex-mode and cable-crossing auto-negotiation, virtual LANs (VLANs), flow control and other quality-of-service enhancements, port-based access control and many more.

The topics treated here are mostly relevant for the "traditional" use of Ethernet in local (campus) networks. Some of these topics are relevant for non-Ethernet networks, but are mentioned here nevertheless because Ethernet is so widely used that they have become associated with it.

-- SimonLeinen - 15 Dec 2005

Duplex Modes and Auto-Negotiation

A point-to-point Ethernet segment (typically between a switch and an end-node, or between two directly connected end-nodes) can operate in one of two duplex modes: half duplex means that only one station can send at a time, and full duplex means that both stations can send at the same time. Of course full-duplex mode is preferable for performance reasons if both stations support it.

Duplex Mismatch

Duplex mismatch describes the situation where one station on a point-to-point Ethernet link (typically between a switch and a host or router) uses full-duplex mode, and the other uses half-duplex mode. A link with duplex mismatch will seem to work fine as long as there is little traffic. But when there is traffic in both directions, it will experience packet loss and severely decreased performance. Note that the performance in the duplex mismatch case will be much worse than when both stations operate in half-duplex mode.

Work in the Internet2 "End-to-End Performance Initiative" suggests that duplex mismatch is one of the most common causes of bad bulk throughput. Rich Carlson's NDT (Network Diagnostic Tool) uses heuristics to try to determine whether the path to a remote host suffers from duplex mismatch.

Duplex Auto-Negotiation

In early versions of Ethernet, only half-duplex mode existed, mostly because point-to-point Ethernet segments weren't all that common - typically an Ethernet would be shared by many stations, with the CSMA/CD (Carrier Sense Multiple Access/Collision Detection) protocol used to arbitrate access to the shared channel.

When "Fast Ethernet" (100 Mb/s Ethernet) over twisted-pair cable (100BaseT) was introduced, an auto-negotiation procedure was added to allow the two stations at the ends of an Ethernet cable to agree on the duplex mode (and also to detect whether the stations support 100 Mb/s at all - otherwise communication falls back to traditional 10 Mb/s Ethernet). Gigabit Ethernet over twisted pair (1000BaseT) had speed, duplex, and even "crossed-cable" (Auto MDI-X) auto-negotiation from the start.

Why people turn off auto-negotiation

Unfortunately, some early products supporting Fast Ethernet didn't include the auto-negotiation mechanism, and those that did sometimes failed to interoperate with each other. So many knowledgeable people recommended avoiding duplex auto-negotiation, because it introduced more problems than it solved. The common recommendation was thus to manually configure the desired duplex mode - typically full duplex.

Problems with turning auto-negotiation off

There are two main problems with turning off auto-negotiation:

  1. You have to remember to configure both ends consistently. Even when the initial configuration is consistent on both ends, it often turns into an inconsistent one as devices and connections are moved around.
  2. Hardcoding one side to full duplex while the other performs auto-negotiation causes duplex mismatch. In situations where one side must use auto-negotiation (maybe because it is a non-manageable switch), it is never right to manually configure full-duplex mode on the other: the auto-negotiation mechanism requires that, when the other side doesn't respond to auto-negotiation, the local side set itself to half-duplex mode.

Both situations result in duplex mismatches, with the associated performance issues.

Recommendation: Leave auto-negotiation on

In the light of these problems with hard-coded duplex modes, it is generally preferable to rely on auto-negotiation of duplex mode. Recent equipment handles auto-negotiation in a reliable and interoperable way, with very few exceptions.

References

-- SimonLeinen - 12 Jun 2005 - 4 Sep 2006


CSMA/CD

CSMA/CD stands for Carrier Sense Multiple Access/Collision Detection. It is a central part of the original Ethernet design and describes how multiple stations connected to a shared medium (such as a coaxial cable) can share the transmission channel without centralized control, using "carrier sensing" to avoid most collisions, and using collision detection with randomized backoff for those collisions that could not be avoided.

-- SimonLeinen - 15 Dec 2005

 

LAN Collisions

In some legacy networks, workstations or other devices may still be connected into a LAN segment using hubs. All traffic arriving at a hub is repeated out of every port, often resulting in a collision when two or more devices attempt to send data at the same time. After each collision, the original information must be resent, reducing performance.

Operationally, this can lead to up to 100% of 5000-byte packets being lost when sending traffic off network and 31%-60% packet loss within a single subnet. It should be noted that common applications (e.g. email, FTP, WWW) on LANs produce packets close to 1500 bytes in size, and that packet loss rates >1% render applications such as video conferencing unusable.

To prevent collisions from traveling to every workstation in the entire network, bridges or switches should be installed. These devices will not forward collisions, but will permit broadcasts to all users and multicasts to specific groups to pass through.

When only a single system is connected to a single switch port, each collision domain is made up of only one system, and full-duplex communication also becomes possible.

-- AnnRHarding - 18 Jul 2005

LAN Broadcast Domains

While switches help network performance by reducing collision domains, they still permit broadcasts to all users and multicasts to specific groups to pass through. In a switched network with a lot of broadcast traffic, network congestion can occur despite high-speed backbones. As universities and colleges were often early adopters of Internet technologies, they may have large address allocations, perhaps even deployed as big flat networks which generate a lot of broadcast traffic. In some cases these networks can be as large as a /16, and putting up to sixty-five thousand hosts on a single network segment could be disastrous for performance.

The main purpose of subnetting is to help relieve network congestion caused by broadcast traffic. A successful subnetting plan is one where most of the network traffic will be isolated to the subnet in which it originated and broadcast domains are of a manageable size. This may be possible based on physical location, or it may be better to use VLANs. VLANs allow you to segment a LAN into different broadcast domains regardless of physical location. Users and devices on different floors or buildings have the ability to belong to the same LAN, since the segmentation is handled virtually and not via the physical layout.
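The arithmetic behind such a subnetting plan is easy to check. This sketch, using a hypothetical 10.10.0.0/16 allocation, splits a /16 into /24 subnets, each a broadcast domain of a manageable size:

```python
import ipaddress

# Hypothetical campus allocation; any /16 behaves the same way.
campus = ipaddress.ip_network("10.10.0.0/16")

# Roughly 65,000 potential hosts in one flat broadcast domain...
flat_hosts = campus.num_addresses - 2  # network + broadcast excluded

# ...versus 256 broadcast domains of at most 254 hosts each after
# subnetting to /24 boundaries.
subnets = list(campus.subnets(new_prefix=24))
per_subnet_hosts = subnets[0].num_addresses - 2
```

The same `subnets()` call works for any target prefix length, so a plan mixing /23s and /25s for differently sized departments is just as easy to enumerate.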

References and Further Reading

RIPE Subnet Mask Information http://www.ripe.net/rs/ipv4/subnets.html

Priscilla Oppenheimer, Top-Down Network Design http://www.topdownbook.com/

-- AnnRHarding - 18 Jul 2005

Wireless LAN

Wireless Local Area Networks according to the IEEE 802.11 standards have become extremely widespread in recent years: in campus networks, for home networking, for convenient network access at conferences, and to a certain extent for commercial Internet access provision in hotels, public places, and even planes.

While wireless LANs are usually built more for convenience (or profit) than for performance, there are some interesting performance issues specific to WLANs. As an example, it is still a big challenge to build WLANs using multiple access points so that they can scale to large numbers of simultaneous users, e.g. for large events.

Common problems with 802.11 wireless LANs

Interference

In the 2.4 GHz band, the number of usable channels (frequencies) is low. Adjacent channels use overlapping frequencies, so there are typically only three truly non-overlapping channels in this band - channels 1, 6, and 11 are frequently used. In campus networks requiring many access points, care must be taken to avoid interference between same-channel access points. The problem is even more severe in areas where access points are deployed without coordination (such as in residential areas). Some modern access points can sense the radio environment during initialization, and try to use a channel that doesn't suffer from much interference. The 2.4 GHz band is also used by other technologies such as Bluetooth, and by microwave ovens.

Capacity loss due to backwards compatibility or slow stations

The radio link in 802.11 can work at many different data rates below the nominal rate. For example, the 802.11g (54 Mb/s) access point to which I am presently connected supports operation at 1, 2, 5.5, 6, 9, 11, 12, 18, 24, 36, 48, or 54 Mb/s. The lower speeds are useful under adverse radio transmission conditions. In addition, they allow backwards compatibility - for example, 802.11g equipment interoperates with older 802.11b equipment, albeit at most at the 11 Mb/s supported by 802.11b.

When lower-rate and higher-rate stations coexist on the same access point, note that a lower-rate station occupies disproportionately more of the medium's capacity, because of increased serialization times at lower rates. A single station operating at 1 Mb/s and transferring data at 500 kb/s consumes as much of the access point's airtime as 54 stations each also transferring 500 kb/s, but at a 54 Mb/s wireless rate.
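The airtime arithmetic behind this comparison can be verified in a few lines, using a simplified model that ignores MAC and PHY overhead:

```python
def airtime_share(data_rate_mbps, phy_rate_mbps):
    """Fraction of the medium occupied when moving data_rate_mbps of
    traffic at the given wireless (PHY) rate; overhead is ignored."""
    return data_rate_mbps / phy_rate_mbps

slow = airtime_share(0.5, 1)        # one 1 Mb/s station: half the airtime
fast = 54 * airtime_share(0.5, 54)  # 54 stations at 54 Mb/s: also half
```

Because real frames carry fixed per-access overhead, the slow station's actual share is even worse than this model suggests.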

Multicast/broadcast

Wireless is a "natural" broadcast medium, so broadcast and multicast should be relatively efficient. But the access point normally sends multicast and broadcast frames at a low rate, to increase the probability that all stations can actually receive them. Thus, multicast traffic streams can quickly consume a large fraction of an access point's capacity as per the considerations in the preceding section.

This is a reason why wireless networks often aren't multicast-enabled even in environments that typically have multicast connectivity (such as campus networks). Note that broadcast and multicast cannot easily be disabled completely, because they are required for lower-layer protocols such as ARP (broadcast) or IPv6 Neighbor Discovery (multicast) to work.

802.11n Performance Features

IEEE 802.11n is a recent addition to the wireless LAN standards, offering higher performance in terms of both capacity ("bandwidth") and reach. The standard supports both the 2.4 GHz and 5 GHz bands, although "consumer" 802.11n products often work with 2.4 GHz only unless marked "dual-band". Within each band, 802.11n equipment is normally backwards compatible with the respective prior standards, i.e. 802.11b/g for 2.4 GHz and 802.11a for 5 GHz. 802.11n achieves performance increases by

  • physical diversity using multiple antennas in a "MIMO" multiple in/multiple out scheme
  • the option to use wider channels (spaced 40 MHz rather than 20 MHz)
  • frame aggregation options at the MAC level, allowing larger packets or bundling of multiple frames into a single radio frame, in order to better amortize the (relatively high) link access overhead.

References

-- Simon Leinen - 2005-12-15 - 2018-02-04

  -- SimonLeinen - 05 Feb 2006


Latest News

Windows 7 / Vista / 2008 TCP/IP tweaks

On Speedguide.net there is a great article about the TCP/IP Stack tuning possibilities in Windows 7/Vista/2008.

NDT Protocol Documented

A description of the protocol used by the NDT tool has recently been written up. Check this out if you want to implement a new client (or server), or if you simply want to understand the mechanisms.

-- SimonLeinen - 02 Aug 2011

What does the "+1" mean?

I would like to experiment with social web techniques, so I have added a (Google) "+1" button to the template for viewing PERT KB topics. It allows people to publicly share interesting Web pages. Please share what you think about this via the SocialWebExperiment topic!

-- SimonLeinen - 02 Jul 2011

News from Web10G

The Web10G project has released its first code in May 2011. If you are interested in this followup project of the incredibly useful web100, check out their site. I have added a short summary on the project to the WebTenG topic here.

-- SimonLeinen - 30 Jun 2011

FCC Open Internet Apps Challenge - Vote for Measurement Tools by 15 July

The U.S. Federal Communications Commission (FCC) is currently holding a challenge for applications that can help assess the "openness" of Internet access offerings. There's a page on challenge.gov where people can vote for their favorite tool. A couple of candidate applications are familiar to our community, for example NDT and NPAD. But there are some new ones that look interesting. Some applications are specific for mobile data networks ("3G" etc.). There are eight submissions - be sure to check them out. Voting is possible until 15 July 2011.

-- SimonLeinen - 29 Jun 2011

Web10G

Catching up with the discussion@web100.org mailing list, I noticed an announcement by Matt Mathis (formerly PSC, now at Google) from last September (yes, I'm a bit behind with my mail reading :-): A follow-on project called Web10G has received funding from the NSF. I'm all excited about this because Web100 produced so many influential results. So congratulations to the proposers, good luck for the new project, and let's see what Web10G will bring!

-- SimonLeinen - 25 Mar 2011

eduPERT Training Workshop in Zurich, 18/19 November 2010

We held another successful training event in November, with 17 attendees from various NRENs including some countries that have joined GÉANT recently, or are in the process of doing so. The program was based on the one used for the workshops held in 2008 and 2007. But this time we added some case studies, where the instructors walked through some actual cases they had encountered in their PERT work, and at many points asked the audience for their interpretation of the symptoms seen so far, and for suggestions on what to do next. Thanks to the active participation of several attendees, this interactive part turned out to be both enjoyable and instructive, and was well received by the audience. Program and slides are available on the PERT Training Workshop page at TERENA.

-- SimonLeinen - 18 Dec 2010

Efforts to increase TCP's initial congestion window (updated)

At the 78th IETF meeting in Maastricht this July, one of the many topics discussed there was a proposal to further increase TCP's initial congestion window, possibly to something as big as ten maximum-size segments (10 MSS). This caused me to notice that the PERT KB was missing a topic on TCP's initial congestion window, so I created one. A presentation by Matt Mathis on the topic is scheduled for Friday's ICCRG meeting. The slides, as well as some other materials and minutes, can be found on the IETF Proceedings page. Edited to add: Around IETF 79 in November 2010, the discussion on the initial window picked up steam, and there are now three different proposals in the form of Internet-Drafts. I have updated the TcpInitialWindow topic to briefly describe these.

-- SimonLeinen - 28 Jul 2010 - 09 Dec 2010

PERT meeting at TNC 2010

We held a PERT meeting on Sunday 30 May (1400-1730), just before the official start of the TNC 2010 conference in Vilnius. The attendees discussed technical and organizational issues facing PERTs in National Research and Education Networks (NRENs), large campus networks, and supporting network-intensive projects such as the multi-national research efforts in (Radio-) Astronomy, Particle Physics, or Grid/Cloud research. Slides are available on TNC2010 website here.

-- SimonLeinen - 15 Jun 2010

Web100 Project Re-Opening Soon?

In a mail to the ndt-users list, Matt Mathis has issued a call for input from people who use Web100, as part of NDT or otherwise. If you use this, please send Matt an e-mail. There's a possibility that the project will be re-opened "for additional development and some bug fixes".

-- SimonLeinen - 20 Feb 2010

ICSI releases Netalyzr Internet connectivity tester

The International Computer Science Institute (ICSI) at Berkeley has released an awesome new Java applet called Netalyzr which will let you do a whole bunch of connectivity tests. A very nice tool to get information about your Internet connectivity, to see if your provider blocks ports, your DNS works properly, and much more. Read about it on our page here or go directly to their homepage.

-- ChrisWelti - 18 Jan 2010

News Archive:

Notes About Performance

This part of the PERT Knowledgebase contains general observations and hints about the performance of networked applications.

The mapping between network performance metrics and user-perceived performance is complex and often non-intuitive.

-- SimonLeinen - 13 Dec 2004 - 07 Apr 2006

Why Latency Is Important

Traditionally, the metric of focus in networking has been bandwidth. As more and more parts of the Internet have their capacity upgraded, bandwidth is often not the main problem anymore. However, network-induced latency as measured in One-Way Delay (OWD) or Round-Trip Time often has noticeable impact on performance.

It's not just for gamers...

The one group of Internet users today that is most aware of the importance of latency is online gamers. It is intuitively obvious that in real-time multiplayer games over the network, players don't want to be put at a disadvantage because their actions take longer to reach the game server than their opponents'.

However, latency impacts the other users of the Internet as well, probably much more than they are aware of. At a given bottleneck bandwidth, a connection with lower latency will reach its achievable rate faster than one with higher latency. The effect of this should not be underestimated, since most connections on the Internet are short - in particular connections associated with Web browsing.

But even for long connections, latency often has a big impact on performance (throughput), because many of those long connections have their throughput limited by the window size that is available for TCP. When the window size is the bottleneck, throughput is inversely proportional to round-trip time. Furthermore, for a given (small) loss rate, the RTT places an upper limit on the achievable throughput of a TCP connection, as shown by the Mathis Equation.
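The Mathis et al. approximation can be evaluated directly. This sketch uses C ≈ sqrt(3/2), the constant for standard TCP without delayed ACKs; it gives a rough upper bound, not a prediction:

```python
import math

def mathis_throughput(mss_bytes, rtt_s, loss_rate):
    """Upper bound on TCP throughput in bytes/s: C * MSS / (RTT * sqrt(p))."""
    C = math.sqrt(3.0 / 2.0)  # ~1.22 for standard TCP without delayed ACKs
    return C * mss_bytes / (rtt_s * math.sqrt(loss_rate))

# 1460-byte MSS, 100 ms RTT, 0.01% loss: roughly 1.8 MB/s (~14 Mb/s),
# no matter how much capacity the path actually has.
rate = mathis_throughput(1460, 0.100, 1e-4)
```

Note how halving the RTT doubles the bound, which is exactly why latency matters even for bulk transfers.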

But I thought latency was only important for multimedia!?

Common wisdom is that latency (and also jitter) is important for audio/video ("multimedia") applications. This is only partly true: many applications of audio/video involve "on-demand" unidirectional transmission. In those applications, the real-time concerns can often be mitigated by clever buffering or transmission ahead of time. For conversational audio/video, such as "Voice over IP" or videoconferencing, the latency issue is very real. The principal sources of latency in these applications are not backbone-related, but related to compression/sampling rates (see packetization delay) and to transcoding devices such as H.323 MCUs (Multipoint Control Units).

References

-- SimonLeinen - 13 Dec 2004


The "Wizard Gap"

The Wizard Gap is an expression coined by Matt Mathis (then PSC) in 1999. It designates the difference between the performance that is "theoretically" possible on today's high-speed networks (in particular, research networks), and the performance that most users actually perceive. The idea is that today, the "theoretical" performance can only be (approximately) obtained by "wizards" with superior knowledge and skills concerning system tuning. Good examples for "Wizard" communities are the participants in Internet2 Land-Speed Record or SC Bandwidth Challenge competitions.

The Internet2 end-to-end performance initiative strives to reduce the Wizard Gap by user education as well as improved instrumentation (see e.g. Web100) of networking stacks. In the GÉANT community, PERTs focus on assistance to users (case management), as well as user education through resources such as this knowledge base. Commercial players are also contributing to closing the wizard gap, by improving "out-of-the-box" performance of hardware and software, so that their customers can benefit from faster networking.

References

  • M. Mathis, Pushing up Performance for Everyone, Presentation at the NLANR/Internet2 Joint Techs Workshop, 1999 (PPT)

-- SimonLeinen - 2006-02-28 - 2016-04-27

User-Perceived Performance

User-perceived performance of network applications is made up of a number of mainly qualitative metrics, some of which are in conflict with each other. In the case of some applications, a single metric will outweigh the others, such as responsiveness from video services or throughput for bulk transfer applications. More commonly, a combination of factors usually determines the experience of network performance for the end-user.

Responsiveness

One of the most important user experiences in networking applications is the perception of responsiveness. If end-users feel that an application is slow, it is often the case that it is slow to respond to them, rather than being directly related to network speed. This is a particular issue for real-time applications such as audio/video conferencing systems and must be prioritised in applications such as remote medical services and off-campus teaching facilities. It can be difficult to quantitatively define an acceptable figure for response times as the requirements may vary from application to application.

However, some applications have relatively well-defined "physiological" bounds beyond which the responsiveness feeling vanishes. For example, for voice conversations, a (round-trip) delay of 150ms is practically unnoticeable, but an only slightly larger delay is typically felt as very intrusive.

Throughput/Capacity/"Bandwidth"

Throughput per se is not directly perceived by the user, although a lack of throughput will increase waiting times and reduce the impression of responsiveness. However, "bandwidth" is widely used as a "marketing" metric to differentiate "fast" connections from "slow" ones, and many applications display throughput during long data transfers. Therefore users often have specific performance expectations in terms of "bandwidth", and are disappointed when the actual throughput figures they see are significantly lower than the advertised capacity of their network connection.

Reliability

Reliability is often the most important performance criterion for a user: The application must be available when the user requires it. Note that this doesn't necessarily mean that it is always available, although there are some applications - such as the public Web presence of a global corporation - that have "24x7" availability requirements.

The impression of reliability will also be heavily influenced by what happens (or is expected to happen) in cases of unavailability: Does the user have a possibility to build a workaround? In case of provider problems: Does the user have someone competent to call - or can they even be sure that the provider will notice the problem themselves, and fix it in due time? How is the user informed during the outage, in particular concerning the estimated time to repair?

Another aspect of reliability is the predictability of performance. It can be profoundly disturbing to a user to see large performance variations over time, even if the varying performance is still within the required performance range - who can guarantee that variations won't increase beyond the tolerable during some other time when the application is needed? E.g., a 10 Mb/s throughput that remains rock-stable over time can feel more reliable than throughput figures that vary between 200 and 600 Mb/s.

-- SimonLeinen - 07 Apr 2006
-- AlessandraScicchitano - 10 Feb 2012

Responsiveness

One of the most important user experiences in networking applications is the perception of responsiveness. If end-users feel that an application is slow, it is often the case that it is slow to respond to them, rather than being directly related to network speed. This is a particular issue for real-time applications such as audio/video conferencing systems and must be prioritised in applications such as remote medical services and off-campus teaching facilities. It can be difficult to quantitatively define an acceptable figure for response times as the requirements may vary from application to application.

However, some applications have relatively well-defined "physiological" bounds beyond which the responsiveness feeling vanishes. For example, for voice conversations, a (round-trip) delay of 150ms is practically unnoticeable, but an only slightly larger delay is typically felt as very intrusive.

-- SimonLeinen - 07 Apr 2006 - 23 Aug 2007

Throughput/Capacity/"Bandwidth"

Throughput per se is not directly perceived by the user, although a lack of throughput will increase waiting times and reduce the impression of responsiveness. However, "bandwidth" is widely used as a "marketing" metric to differentiate "fast" connections from "slow" ones, and many applications display throughput during long data transfers. Therefore users often have specific performance expectations in terms of "bandwidth", and are disappointed when the actual throughput figures they see are significantly lower than the advertised capacity of their network connection.

-- SimonLeinen - 07 Apr 2006

Reliability

Reliability is often the most important performance criterion for a user: The application must be available when the user requires it. Note that this doesn't necessarily mean that it is always available, although there are some applications - such as the public Web presence of a global corporation - that have "24x7" availability requirements.

The impression of reliability will also be heavily influenced by what happens (or is expected to happen) in cases of unavailability: Does the user have a possibility to build a workaround? In case of provider problems: Does the user have someone competent to call - or can they even be sure that the provider will notice the problem themselves, and fix it in due time? How is the user informed during the outage, in particular concerning the estimated time to repair?

Another aspect of reliability is the predictability of performance. It can be profoundly disturbing to a user to see large performance variations over time, even if the varying performance is still within the required performance range - who can guarantee that variations won't increase beyond the tolerable during some other time when the application is needed? E.g., a 10 Mb/s throughput that remains rock-stable over time can feel more reliable than throughput figures that vary between 200 and 600 Mb/s.

-- SimonLeinen - 07 Apr 2006

 

Network Performance Metrics

There are many metrics that are commonly used to characterize the performance of networks and parts of networks. We present the most important of these, explain what influences them, how they can be measured, how they influence end-to-end performance, and what can be done to improve them.

A framework for network performance metrics has been defined by the IETF's IP Performance Metrics (IPPM) Working Group in RFC 2330. The group also developed definitions for several specific performance metrics; those are referenced from the respective sub-topics.

References

-- SimonLeinen - 31 Oct 2004

One-way Delay (OWD)

One-way delay is the time it takes for a packet to reach its destination. It is considered a property of network links or paths. RFC 2679 contains the definition of the one-way delay metric of the IETF's IPPM (IP Performance Metrics) working group.

Decomposition

One-way delay along a network path can be decomposed into per-hop one-way delays, and these in turn into per-link and per-node delay components.

Per-link delay components: propagation delay and serialization delay

The link-specific component of one-way delay consists of two sub-components:

Propagation Delay is the time it takes for signals to move from the sending to the receiving end of the link. On simple links, this is the product of the link's physical length and the characteristic propagation speed. The velocities of propagation (VoP) of copper and optical fibre are similar (copper's is slightly higher), both being approximately 2/3 of the speed of light in a vacuum.

Serialization delay is the time it takes for a packet to be serialized into link transmission units (typically bits). It is the packet size (in bits) divided by the link's capacity (in bits per second).

In addition to the propagation and serialization delays, some types of links may introduce additional delays, for example to avoid collisions on a shared link, or when the link-layer transparently retransmits packets that were damaged during transmission.
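Both per-link components can be computed directly; the sketch below assumes a velocity of propagation of 2/3 c, as stated above:

```python
SPEED_OF_LIGHT_KM_S = 299_792.458

def propagation_delay_s(length_km, vop=2/3):
    """Propagation delay over a link whose signal travels at vop * c."""
    return length_km / (vop * SPEED_OF_LIGHT_KM_S)

def serialization_delay_s(packet_bytes, capacity_bps):
    """Time to clock a packet's bits onto the link."""
    return packet_bytes * 8 / capacity_bps

# A 1500-byte packet over 1000 km of fibre at 1 Gb/s:
prop = propagation_delay_s(1000)        # ~5 ms
ser = serialization_delay_s(1500, 1e9)  # 12 microseconds
```

On long high-speed paths, propagation dominates serialization by several orders of magnitude, as the example shows.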

Per-node delay components: forwarding delay, queueing delay

Within a network node such as a router, a packet experiences different kinds of delay between its arrival on one link and its departure on another (or the same) link:

Forwarding delay is the time it takes for the node to read forwarding-relevant information (typically the destination address and other headers) from the packet, compute the "forwarding decision" based on routing tables and other information, and to actually forward the packet towards the destination, which involves copying the packet to a different interface inside the node, rewriting parts of it (such as the IP TTL and any media-specific headers) and possibly other processing such as fragmentation, accounting, or checking access control lists.

Depending on router architecture, forwarding can compete for resources with other activities of the router. In this case, packets can be held up until the router's forwarding resource is available, which can take many milliseconds on a router with CPU-based forwarding. Routers with dedicated hardware for forwarding don't have this problem, although there may be delays when a packet arrives as the hardware forwarding table is being reprogrammed due to a routing change.

Queueing delay is the time a packet spends inside the node waiting for the output link to become available. It depends on the amount of competing traffic towards the output link, and on the priorities of the packet itself and the competing traffic. The amount of queueing that a given packet may encounter is hard to predict, but it is bounded by the buffer size available for queueing.

There can be causes for queueing other than contention on the outgoing link, for example contention on the node's backplane interconnect.

Impact on end-to-end performance

When studying end-to-end performance, it is usually more interesting to look at the following metrics that are derived from one-way delay:

  • Round-trip time (RTT) is the time from node A to B and back to A. It is the sum of the one-way delays from A to B and from B to A, plus the response time in B.
  • Delay Variation represents the variation in one-way delay. It is important for real-time applications. People often call this "jitter".

Measurement

One-way delays from a node A to a node B can be measured by sending timestamped packets from A, and recording the reception times at B. The difficulty is that A and B need clocks that are synchronized to each other. This is typically achieved by having clocks synchronized to a standard reference time such as UTC (Coordinated Universal Time) using techniques such as Global Positioning System (GPS)-derived time signals or the Network Time Protocol (NTP).
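As a toy illustration of why clock synchronization matters (all numbers below are invented), any unknown clock offset between A and B goes straight into the measured delay:

```python
# Timestamps in seconds; in practice these come from the two hosts' clocks.
send_time_a = 100.000    # packet leaves A, by A's clock
recv_time_b = 100.050    # packet arrives at B, by B's clock
offset_b_vs_a = 0.030    # B's clock runs 30 ms ahead of A's (unknown to the tools)

measured_owd = recv_time_b - send_time_a    # 50 ms apparent one-way delay
actual_owd = measured_owd - offset_b_vs_a   # only 20 ms of it is real
print(round(measured_owd, 3), round(actual_owd, 3))  # 0.05 0.02
```

A 30 ms synchronization error here inflates the measured delay by 150%, which is why one-way delay infrastructures invest in GPS-disciplined clocks.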

There are several infrastructures that continuously measure one-way delays and packet loss: the HADES boxes in DFN and GÉANT2, the RIPE TTM boxes deployed in various research and commercial networks, and RENATER's QoSMetrics boxes.

OWAMP and RUDE/CRUDE are examples of tools that can be used to measure one-way delay.

Improving delay

Shortest-(Physical)-Path Routing and Proper Provisioning

On high-speed wide-area network paths, delay is usually dominated by propagation times. Therefore, physical routing of network links plays an important role, as well as the topology of the network and the selection of routing metrics. Ensuring minimal delays is simply a matter of

  • using an internal routing protocol with "shortest-path routing" (such as OSPF or IS-IS) and a metric proportional to per-link delay
  • provisioning the network so that these shortest paths aren't congested.

This could be paraphrased as "use proper provisioning instead of traffic engineering".

The node contributions to delay can be addressed by:

  • using nodes with fast forwarding
  • avoiding queueing by provisioning links to accommodate typical traffic bursts
  • reducing the number of forwarding nodes

References

-- SimonLeinen - 31 Oct 2004 - 25 Jan 2007

Round-Trip Time (RTT)

Round-trip time (RTT) is the total time for a packet sent by a node A to reach its destination B, and for a response sent back by B to reach A.

Decomposition

The round-trip time is the sum of the one-way delays from A to B and from B to A, and of the time it takes B to send the response.

Impact on end-to-end performance

For window-based transport protocols such as TCP, the round-trip time influences the achievable throughput at a given window size, because there can only be a window's worth of unacknowledged data in the network, and the RTT is the lower bound for a packet to be acknowledged.
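The window-imposed ceiling mentioned above can be computed directly, since at most one window's worth of data can be delivered per round trip:

```python
def window_limited_throughput(window_bytes, rtt_seconds):
    """Upper bound on throughput (bit/s) for a window-based protocol:
    at most one full window of data per round-trip time."""
    return window_bytes * 8 / rtt_seconds

# A classic 64 KiB window on a 150 ms transatlantic path caps out near
# 3.5 Mbit/s, no matter how fast the underlying links are.
print(window_limited_throughput(65536, 0.150) / 1e6)
```

This is why either the RTT must be reduced (see below) or the window enlarged (see the large-windows topics) to get high throughput over long paths.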

For interactive applications such as conversational audio/video, instrument control, or interactive games, the RTT represents a lower bound on response time, and thus impacts responsiveness directly.

Measurement

The round-trip time is often measured with tools such as ping (Packet InterNet Groper) or one of its cousins such as fping, which send ICMP Echo requests to a destination, and measure the time until the corresponding ICMP Echo Response messages arrive.

However, note that while the round-trip time reported by ping is a relatively precise measurement, some network devices handle ICMP at a different priority than ordinary forwarded traffic, so the measured values may not correspond to the delays that application traffic actually experiences.

-- MichalPrzybylski - 19 Jul 2005

Improving the round-trip time

The network components of RTT are the one-way delays in both directions (which can use different paths), so see the OneWayDelay topic on how those can be improved. The speed of response generation can be improved through upgrades of the responding host's processing power, or by optimizing the responding program.

-- SimonLeinen - 31 Oct 2004


Path MTU

The Path MTU is the Maximum Transmission Unit (MTU) supported by a network path. It is the minimum of the MTUs of the links (segments) that make up the path. Larger Path MTUs generally allow for more efficient data transfers, because source and destination hosts, as well as the switching devices (routers) along the network path, have to process fewer packets. However, it should be noted that modern high-speed network adapters have mechanisms such as LSO (Large Send Offload) and Interrupt Coalescence that diminish the influence of MTUs on performance. Furthermore, routers are typically dimensioned to sustain very high packet loads (so that they can resist denial-of-service attacks), so the packet rates caused by high-speed transfers are not normally an issue for today's high-speed networks.

The prevalent Path MTU on the Internet is now 1500 bytes, the Ethernet MTU. There are some initiatives to support larger MTUs (JumboMTU) in networks, in particular on research networks. But their usability is hampered by last-mile issues and lack of robustness of RFC 1191 Path MTU Discovery.
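Since the Path MTU is simply the minimum over the links of the path, it can be sketched in a line (the example MTUs are illustrative):

```python
def path_mtu(link_mtus):
    """Path MTU: the smallest link MTU along the path."""
    return min(link_mtus)

# e.g. a jumbo-frame LAN (9000), a PPPoE access link (1492), standard Ethernet (1500)
print(path_mtu([9000, 1492, 1500]))  # 1492
```

The hard part, of course, is that a sender cannot see the per-link MTUs directly; that is what the discovery mechanisms below are for.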

Path MTU Discovery Mechanisms

Traditional (RFC1191) Path MTU Discovery

RFC 1191 describes a method for a sender to detect the Path MTU to a given receiver. (RFC 1981 describes the equivalent for IPv6.) The method works as follows:

  • The sending host will send packets to off-subnet destinations with the "Don't Fragment" bit set in the IP header.
  • The sending host keeps a cache containing Path MTU estimates per destination host address. This cache is often implemented as an extension of the routing table.
  • The Path MTU estimate for a new destination address is initialized to the MTU of the outgoing interface over which the destination is reached according to the local routing table.
  • When the sending host receives an ICMP "Too Big" (or "Fragmentation Needed and Don't Fragment Bit Set") destination-unreachable message, it learns that the Path MTU to that destination is smaller than previously assumed, and updates the estimate accordingly.
  • Normally, an ICMP "Too Big" message contains the next-hop MTU, and the sending host will use that as the new Path MTU estimate. The estimate can still be wrong because a subsequent link on the path may have an even smaller MTU.
  • For destination addresses with a Path MTU estimate lower than the outgoing interface MTU, the sending host will occasionally attempt to raise the estimate, in case the path has changed to support a larger MTU.
  • When trying ("probing") a larger Path MTU, the sending host can use a list of "common" MTUs, such as the MTUs associated with popular link layers, perhaps combined with popular tunneling overheads. This list can also be used to guess a smaller MTU in case an ICMP "Too Big" message is received that doesn't include any information about the next-hop MTU (perhaps from a very old router).
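A much-simplified sketch of the estimate updates described above (the class and plateau list are illustrative, not any real stack's code):

```python
COMMON_MTUS = [65535, 9000, 4352, 1500, 1492, 1280, 576]  # typical "plateau" values

class PmtuEstimate:
    """Toy per-destination Path MTU estimate, as kept in a sender's cache."""
    def __init__(self, outgoing_if_mtu):
        self.value = outgoing_if_mtu          # initialized to the interface MTU

    def on_icmp_too_big(self, next_hop_mtu=None):
        if next_hop_mtu:                      # modern router: next-hop MTU included
            self.value = next_hop_mtu
        else:                                 # very old router: guess next plateau down
            self.value = max((m for m in COMMON_MTUS if m < self.value), default=68)

est = PmtuEstimate(9000)     # destination reached over a jumbo-frame interface
est.on_icmp_too_big(1500)    # "Too Big" carrying the next-hop MTU
print(est.value)             # 1500
est.on_icmp_too_big()        # "Too Big" without next-hop MTU information
print(est.value)             # 1492
```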

This method is widely implemented, but is not robust in today's Internet because it relies on ICMP packets sent by routers along the path. Such packets are often suppressed either at the router that should generate them (to protect its resources) or on the way back to the source, because of firewalls and other packet filters or rate limitations. These problems are described in RFC2923. When packets are lost due to MTU issues without any ICMP "Too Big" message, this is sometimes called a (MTU) black hole. Some operating systems have added heuristics to detect such black holes and work around them. Workarounds can include lowering the MTU estimate or disabling PMTUD for certain destinations.

Packetization Layer Path MTU Discovery (PLPMTUD, RFC 4821)

An IETF Working Group (pmtud) was chartered to define a new mechanism for Path MTU Discovery to solve these issues. This process resulted in RFC 4821, Packetization Layer Path MTU Discovery ("PLPMTUD"), which was published in March 2007. This scheme requires cooperation from a network layer above IP, namely the layer that performs "packetization". This could be TCP, but could also be a layer above UDP, let's say an RPC or file transfer protocol. PLPMTUD does not require ICMP messages. The sending packetization layer starts with small packets, and probes progressively larger sizes. When there's an indication that a larger packet was successfully transmitted to the destination (presumably because some sort of ACK was received), the Path MTU estimate is raised accordingly.

When a large packet was lost, this might have been due to an MTU limitation, but it might also be due to other causes, such as congestion or a transmission error - or maybe it's just the ACK that was lost! PLPMTUD recommends that the first time this happens, the sending packetization layer should assume an MTU issue, and try smaller packets. An isolated incident need not be interpreted as an indication of congestion.
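The probing logic can be sketched as a search over packet sizes. RFC 4821 leaves the exact search strategy open, so the binary search below is just one simple possibility, with `link_accepts()` standing in for "the probe was ACKed":

```python
def plpmtud_search(link_accepts, low=512, high=9000):
    """Toy PLPMTUD: raise the estimate when a probe is ACKed, lower the
    ceiling when it is not. Real stacks must also discount ordinary loss."""
    while high - low > 1:
        probe = (low + high) // 2
        if link_accepts(probe):   # a probe of this size got through
            low = probe
        else:                     # probe lost: assume it exceeded the Path MTU
            high = probe
    return low

true_pmtu = 1492  # hypothetical bottleneck, e.g. a PPPoE link
print(plpmtud_search(lambda size: size <= true_pmtu))  # 1492
```

Because no ICMP messages are involved, this approach keeps working across firewalls and ICMP-filtering middleboxes, which is PLPMTUD's main advantage.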

An implementation of the new scheme for the Linux kernel was integrated into version 2.6.17. It is controlled by a "sysctl" value that can be observed and set through /proc/sys/net/ipv4/tcp_mtu_probing. Possible values are:

  • 0: Don't perform PLPMTUD
  • 1: Perform PLPMTUD only after detecting a "blackhole" in old-style PMTUD
  • 2: Always perform PLPMTUD, and use the value of tcp_base_mss as the initial MSS.
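For example, on a Linux host with a sufficiently recent kernel, the setting can be inspected and changed as follows (changing it requires root privileges):

```shell
# Show the current PLPMTUD mode (0, 1 or 2 as listed above)
cat /proc/sys/net/ipv4/tcp_mtu_probing

# Enable PLPMTUD only after an MTU black hole has been detected
sysctl -w net.ipv4.tcp_mtu_probing=1
```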

A user-space implementation over UDP is included in the VFER bulk transfer tool.

References

Implementations

  • for Linux 2.6 - integrated in mainstream kernel as of 2.6.17. However, it is disabled by default (see net.ipv4.tcp_mtu_probing sysctl)
  • for NetBSD

-- Hank Nussbacher - 2005-07-03 -- Simon Leinen - 2006-07-19 - 2018-07-02

Network Protocols

This section contains information about a few common Internet protocols, with a focus on performance questions.

For information about higher-level protocols, see under application protocols.

-- SimonLeinen - 2004-10-31 - 2015-04-26

TCP (Transmission Control Protocol)

The Transmission Control Protocol (TCP) is the prevalent transport protocol used on the Internet today. It uses window-based transmission to provide the service of a reliable byte stream, and adapts the rate of transfer to the state (of congestion) of the network and the receiver. Basic mechanisms include:

  • Segments that fit into IP packets, into which the byte-stream is split by the sender,
  • a checksum for each segment,
  • a Window, which bounds the amount of data "in flight" between the sender and the receiver,
  • Acknowledgements, by which the receiver tells the sender about segments that were received successfully,
  • a flow-control mechanism, which regulates data rate as a function of network and receiver conditions.

Originally specified in RFC 793 (September 1981), TCP was clarified, refined and extended in many documents. Perhaps most importantly, congestion control was introduced in "TCP Tahoe" in 1988, described in Van Jacobson's 1988 SIGCOMM article on "Congestion Avoidance and Control". It can be said that TCP's congestion control is what keeps the Internet working when links are overloaded. In today's Internet, the enhanced "Reno" variant of congestion control is probably the most widespread.

RFC 7323 (formerly RFC 1323) specifies a set of "TCP Extensions for High Performance", namely the Window Scaling Option, which provides for much larger windows than the original 64K, the Timestamp Option and the PAWS (Protection Against Wrapped Sequence numbers) mechanism. These extensions are supported by most contemporary TCP stacks, although they frequently must be activated explicitly (or implicitly by configuring large TCP windows).

Another widely implemented performance enhancement to TCP is selective acknowledgements (SACK, RFC 2018). In TCP as originally specified, the acknowledgements ("ACKs") sent by a receiver were always "cumulative", that is, they specified the last byte of the part of the stream that was completely received. With SACK, the receiver can additionally report non-contiguous blocks of data that arrived beyond a hole. This is advantageous with large TCP windows, in particular where chances are high that multiple segments within a single window are lost. RFC 2883 describes an extension to SACK which makes TCP more robust in the presence of packet reordering in the network.

In addition, RFC 3449 - TCP Performance Implications of Network Path Asymmetry provides excellent information since a vast majority of Internet connections are asymmetrical.

Further Reading

References

  • RFC 7414, A Roadmap for Transmission Control Protocol (TCP) Specification Documents, M. Duke, R. Braden, W. Eddy, E. Blanton, A. Zimmermann, February 2015
  • RFC 793, Transmission Control Protocol, J. Postel, September 1981
  • draft-ietf-tcpm-rfc793bis-10, Transmission Control Protocol Specification, Wesley M. Eddy, July 2018
  • RFC 7323, TCP Extensions for High Performance, D. Borman, B. Braden, V. Jacobson, B. Scheffenegger, September 2014 (obsoletes RFC 1323 from May 1993)
  • RFC 2018, TCP Selective Acknowledgement Options, M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, October 1996
  • RFC 5681, TCP Congestion Control, M. Allman, V. Paxson, E. Blanton, September 2009
  • RFC 3449, TCP Performance Implications of Network Path Asymmetry, H. Balakrishnan, V. Padmanabhan, G. Fairhurst, M. Sooriyabandara, December 2002

There is ample literature on TCP, in particular research literature on its performance and large-scale behaviour.

Juniper has a nice white paper, Supporting Differentiated Service Classes: TCP Congestion Control Mechanisms (PDF format) explaining TCP's congestion control as well as many of the enhancements proposed over the years.

-- Simon Leinen - 2004-10-27 - 2018-07-02 -- Ulrich Schmid - 2005-05-31

-- TobyRodwell - 06 Apr 2005

  • Active open connection A TCP connection that a host has initiated, i.e. it sent the SYN
  • Passive open connection A TCP connection that is in state LISTENING
  • Pure ACK A TCP segment containing no data, just acknowledging sent data
  • Nagle algorithm The Nagle algorithm prevents the sending of small segments while there is any outstanding unacknowledged data. This reduces the number of tinygrams being sent. It is particularly elegant because on a fast LAN, ACKs arrive quickly (so the algorithm has minimal effect), whereas on a slow WAN the ACKs arrive more slowly, so fewer tinygrams are generated. Users of high-speed networks, where capacity is not a significant issue but OWD is, may choose to disable the Nagle algorithm.
 

Window-Based Transmission

TCP is a sliding-window protocol. The receiver tells the sender the available buffer space at the receiver (TCP header field "window"). The total window size is the minimum of sender buffer size, advertised receiver window size and congestion window size.

The sender can transmit up to this amount of data before having to wait for a further buffer update from the receiver, and should not have more than this amount of data in transit in the network. The sender must buffer the sent data until it has been ACKed by the receiver, so that the data can be retransmitted if necessary. As ACKs arrive, acknowledged segments leave the window, and new segments can be sent as long as they fit within the (possibly updated) window.

Due to TCP's flow control mechanism, the TCP window size can limit the maximum achievable throughput regardless of the bandwidth of the network path. Too small a TCP window can keep throughput well below what the network could deliver, while an overly large window can cause problems of its own during error recovery.

The TCP window size is the most important parameter for achieving maximum throughput across high-performance networks. To reach the maximum transfer rate, the TCP window should be no smaller than the bandwidth-delay product.

Window size (bytes) ≥ Bandwidth (bytes/sec) × Round-trip time (sec)

Example:

window size: 8192 bytes
round-trip time: 100ms
maximum throughput: 8192 bytes × 8 / 0.1 s = 655,360 bit/s ≈ 0.62 Mibit/s
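The arithmetic of the example can be checked, and the relation turned around to size a window for a target rate (a minimal sketch of the formula above):

```python
def window_limited_throughput(window_bytes, rtt_seconds):
    """Maximum rate in bit/s: one full window per round-trip time."""
    return window_bytes * 8 / rtt_seconds

def required_window(bandwidth_bps, rtt_seconds):
    """Bandwidth-delay product: the smallest window that can fill the path."""
    return bandwidth_bps * rtt_seconds / 8

# The example above: an 8192-byte window over a 100 ms path
print(window_limited_throughput(8192, 0.100))  # 655360.0 bit/s (~0.62 Mibit/s)
# Conversely, filling a 1 Gbit/s path at 100 ms RTT needs a 12.5 MB window
print(required_window(1e9, 0.100))             # 12500000.0 bytes
```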

References


-- UlrichSchmid & SimonLeinen - 31 May-07 Jun 2005

Large TCP Windows

In order to achieve high data rates with TCP over "long fat networks", i.e. network paths with a large bandwidth-delay product, TCP sinks (that is, hosts receiving data transported by TCP) must advertise a large TCP receive window (referred to as just 'the window', since there is not an equivalent advertised 'send window').

The window is a 16 bit value (bytes 15 and 16 in the TCP header) and so, in TCP as originally specified, it is limited to a value of 65535 (64K). The receive window sets an upper limit on the sustained throughput achievable over a TCP connection, since it represents the maximum amount of unacknowledged data (in bytes) there can be on the TCP path. Mathematically, achievable throughput can never be more than WINDOW_SIZE/RTT, so for a trans-Atlantic link with an RTT (Round-Trip Time) of, say, 150ms, the throughput is limited to a maximum of 3.4Mbps. With the emergence of "long fat networks", the limit of 64K bytes (on some systems even just 32K bytes!) was clearly insufficient, and so RFC 7323 laid down (amongst other things) a way of scaling the advertised window, such that the 16-bit window value can represent numbers larger than 64K.

RFC 7323 extensions

RFC 7323, TCP Extensions for High Performance, (and formerly RFC 1323) defines several mechanisms to enable high-speed transfers over LFNs: Window Scaling, TCP Timestamps, and Protection Against Wrapped Sequence numbers (PAWS).

The TCP window scaling option increases the maximum window size from 64KB to 1GByte, by shifting the window field left by up to 14 bits. The window scale option is used only during the TCP 3-way handshake (both sides must set the window scale option in their SYN segments if scaling is to be used in either direction).
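In other words, the effective window is the 16-bit header field shifted left by the negotiated scale factor, which can be sketched as:

```python
def effective_window(window_field, scale_shift):
    """Effective receive window with RFC 7323 Window Scaling.
    window_field is the 16-bit TCP header value; scale_shift is the
    shift count (0..14) announced in the SYN."""
    if not (0 <= window_field <= 0xFFFF and 0 <= scale_shift <= 14):
        raise ValueError("out of range")
    return window_field << scale_shift

print(effective_window(0xFFFF, 0))   # 65535: the unscaled classic maximum
print(effective_window(0xFFFF, 14))  # 1073725440: just under 1 GiB
```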

It is important to use the TCP timestamps option with large TCP windows. With the timestamps option, each segment contains a timestamp; the receiver returns that timestamp in each ACK, which allows the sender to estimate the RTT continuously. The timestamps also enable PAWS (Protection Against Wrapped Sequences), which guards against old segments being mistaken for new ones when the sequence number wraps, a risk that grows with large windows.

(Auto) Configuration

In the past, most operating systems required manual tuning to use large TCP windows. The OS-specific tuning section contains information on how to do this for a set of operating systems.

Since around 2008-2010, many popular operating systems will use large windows and the necessary protocol options by default, thanks to TCP Buffer Auto-Tuning.

Can TCP Windows ever be too large?

There are several potential issues when TCP Windows are larger than necessary:

  1. When there are many active TCP connection endpoints (sockets) on a system - such as a popular Web or file server - then a large TCP window size will lead to high consumption of system (kernel) memory. This can have a number of negative consequences: The system may run out of buffer space so that no new connections can be opened, or the high occupation of kernel memory (which typically must reside in actual RAM and cannot be "paged out" to disk) can "starve" other processes of access to fast memory (cache and RAM)
  2. Large windows can cause large "bursts" of consecutive segments/packets. When there is a bottleneck in the path - because of a slower link or because of cross-traffic - these bursts will fill up buffers in the network device (router or switch) in front of that bottleneck. The larger these bursts, the higher are the risks that this buffer overflows and causes multiple segments to be dropped. So a large window can lead to "sawtooth" behavior and worse link utilisation than with a just-big-enough window where TCP could operate at a steady rate.

Several methods for automatic TCP buffer tuning have been developed to resolve these issues, some of which have been implemented in (at least) recent Linux versions.

References

-- SimonLeinen - 27 Oct 2004 - 27 Sep 2014

TCP Acknowledgements

In TCP's sliding-window scheme, the receiver acknowledges the data it receives, so that the sender can advance the window and send new data. As originally specified, TCP's acknowledgements ("ACKs") are cumulative: the receiver tells the sender how much consecutive data it has received. More recently, selective acknowledgements were introduced to allow more fine-grained acknowledgements of received data.

Delayed Acknowledgements and "ack every other"

RFC 813 first suggested a delayed acknowledgement (Delayed ACK) strategy, where a receiver doesn't always immediately acknowledge segments as it receives them. This recommendation was carried forth and specified in more detail in RFC 1122 and RFC 5681 (formerly known as RFC 2581). RFC 5681 mandates that an acknowledgement be sent for at least every other full-size segment, and that no more than 500ms expire before any segment is acknowledged.

The resulting behavior is that, for longer transfers, acknowledgements are only sent for every two segments received ("ack every other"). This is in order to reduce the amount of reverse flow traffic (and in particular the number of small packets). For transactional (request/response) traffic, the delayed acknowledgement strategy often makes it possible to "piggy-back" acknowledgements on response data segments.

A TCP segment which is only an acknowledgement i.e. has no payload, is termed a pure ack.

Delayed Acknowledgments should be taken into account when doing RTT estimation. As an illustration, see this change note for the Linux kernel from Gavin McCullagh.

Critique on Delayed ACKs

John Nagle nicely explains problems with current implementations of delayed ACK in a comment to a thread on Hacker News:

Here's how to think about that. A delayed ACK is a bet. You're betting that there will be a reply, upon which an ACK can be piggybacked, before the fixed timer runs out. If the fixed timer runs out, and you have to send an ACK as a separate message, you lost the bet. Current TCP implementations will happily lose that bet forever without turning off the ACK delay. That's just wrong.

The right answer is to track wins and losses on delayed and non-delayed ACKs. Don't turn on ACK delay unless you're sending a lot of non-delayed ACKs closely followed by packets on which the ACK could have been piggybacked. Turn it off when a delayed ACK has to be sent.

Duplicate Acknowledgements

A duplicate acknowledgement (DUPACK) is one with the same acknowledgement number as its predecessor - it signifies that the TCP receiver has received a segment newer than the one it was expecting, i.e. it has missed a segment. The missed segment might not be lost; it might just be re-ordered. For this reason the TCP sender will not assume data loss on the first DUPACK but (by default) on the third DUPACK, when it will, as per RFC 5681, perform a "Fast Retransmit", sending the segment again without waiting for a timeout. DUPACKs are never delayed; they are sent as soon as the TCP receiver detects an out-of-order segment.

"Quickack mode" in Linux

The TCP implementation in Linux has a special receive-side feature that temporarily disables delayed acknowledgements/ack-every-other when the receiver thinks that the sender is in slow-start mode. This speeds up the slow-start phase when RTT is high. This "quickack mode" can also be explicitly enabled or disabled with setsockopt() using the TCP_QUICKACK option. The Linux modification has been criticized because it makes Linux' slow-start more aggressive than other TCPs' (that follow the SHOULDs in RFC 5681), without sufficient validation of its effects.

"Stretch ACKs"

Techniques such as LRO and GRO (and to some level, Delayed ACKs as well as ACKs that are lost or delayed on the path) can cause ACKs to cover many more than the two segments suggested by the historic TCP standards. Although ACKs in TCP have always been defined as "cumulative" (with the exception of SACKs), some congestion control algorithms have trouble with ACKs that are "stretched" in this way. In January 2015, a patch set was submitted to the Linux kernel network development with the goal to improve the behavior of the Reno and CUBIC congestion control algorithms when faced with stretch ACKs.

References

  • TCP Congestion Control, RFC 5681, M. Allman, V. Paxson, E. Blanton, September 2009
  • RFC 813, Window and Acknowledgement Strategy in TCP, D. Clark, July 1982
  • RFC 1122, Requirements for Internet Hosts -- Communication Layers, R. Braden (Ed.), October 1989

-- TobyRodwell - 2005-04-05
-- SimonLeinen - 2007-01-07 - 2015-01-28

Many enhancements have been made to TCP over the years to improve its performance, for example improved flow-control algorithms in the TCP stack (larger default buffer sizes, fast retransmit and recovery support, path MTU discovery) and the development of high-speed TCP variants. Additional TCP options were also introduced to increase overall TCP throughput.

-- UlrichSchmid - 10 Jun 2005

 

TCP Window Scaling Option

TCP as originally specified only allows for windows up to 65536 bytes, because the window-related protocol fields all have a width of 16 bits. This prevents the use of large TCP windows, which are necessary to achieve high transmission rates over paths with a large bandwidth*delay product.

The Window Scaling Option is one of several options defined in RFC 7323, "TCP Extensions for High Performance". It allows TCP to advertise and use a window larger than 65536 bytes. In the initial handshake, a TCP speaker announces a scaling factor, which is a power of two between 2^0 (no scaling) and 2^14, allowing an effective window of up to 2^30 bytes (one gigabyte). Window Scaling only comes into effect when both ends of the connection advertise the option (even if only with a scaling factor of 2^0).

As explained under the large TCP windows topic, Window Scaling is typically used in connection with the TCP Timestamp option and the PAWS (Protection Against Wrapped Sequence numbers), both defined in RFC 7323.

32K limitations on some systems when Window Scaling is not used

Some systems, notably Linux, limit the usable window in the absence of Window Scaling to 32 Kilobytes. The reasoning behind this is that some (other) systems used in the past were broken by larger windows, because they erroneously interpreted the 16-bit values related to TCP windows as signed, rather than unsigned, integers.

This limitation can have the effect that users (or applications) that request window sizes between 32K and 64K, but don't have Window Scaling enabled, do not actually benefit from the desired window sizes.

The artificial limitation was finally removed from the Linux kernel sources on March 21, 2006. The old behavior (window limited to 32KB when Window Scaling isn't used) can still be enabled through a sysctl tuning parameter, in case one really wants to interoperate with TCP implementations that are broken in the way described above.

Problems with window scaling

Middlebox issues

MiddleBoxes which do not understand window scaling may cause very poor performance, as described in WindowScalingProblems. Windows Vista limits the scaling factor for HTTP transactions to 2 to avoid some of the problems, see WindowsOSSpecific.

Diagnostic issues

The Window Scaling option has an impact on how the offered window in TCP ACKs should be interpreted. This can cause problems when one is faced with incomplete packet traces that lack the initial SYN/SYN+ACK handshake where the options are negotiated: The SEQ/ACK sequence may look incorrect because scaling cannot be taken into account. One effect of this is that tcpgraph or similar analysis tools (included in tools such as WireShark) may produce incoherent results.

References

  • RFC 7323, TCP Extensions for High Performance, D. Borman, B. Braden, V. Jacobson, B. Scheffenegger, September 2014 (obsoletes RFC 1323 from May 1993)
  • git note about the Linux kernel modification that got rid of the 32K limit, R. Jones and D.S. Miller, March 2006. Can also be retrieved as a patch.

-- SimonLeinen - 21 Mar 2006 - 27 September 2014
-- PekkaSavola - 07 Nov 2006
-- AlexGall - 31 Aug 2007 (added reference to scaling limitation in Windows Vista)

TCP Flow Control

Note: This topic describes the Reno enhancement of classical "Van Jacobson" or Tahoe congestion control. There have been many suggestions for improving this mechanism - see the topic on high-speed TCP variants.

TCP flow control and window size adjustment is mainly based on two key mechanisms: Slow Start and Additive Increase/Multiplicative Decrease (AIMD), also known as Congestion Avoidance (RFC 793 and RFC 5681).

Slow Start

To prevent a starting TCP connection from flooding the network, a Slow Start mechanism was introduced in TCP. This mechanism effectively probes to find the available bandwidth.

In addition to the window advertised by the receiver, a Congestion Window (cwnd) value is used, and the effective window size is the lesser of the two. The initial value of cwnd, the TCP Initial Window, has evolved over the years. For each acknowledgment received, cwnd is increased by one MSS. Under this algorithm, the data rate of the sender doubles each round-trip time (RTT) interval (actually, taking Delayed ACKs into account, the rate increases by 50% every RTT). For a properly implemented version of TCP this increase continues until:

  • the advertised window size is reached
  • congestion (packet loss) is detected on the connection.
  • there is no traffic waiting to take advantage of an increased window (i.e. cwnd should only grow if it needs to)

When congestion is detected, the TCP flow-control mode is changed from Slow Start to Congestion Avoidance. Note that some TCP implementations maintain cwnd in units of bytes, while others use units of full-sized segments.

Congestion Avoidance

Once congestion is detected (through timeout and/or duplicate ACKs), the data rate is reduced in order to let the network recover.

Slow Start uses an exponential increase in window size and thus also in data rate. Congestion Avoidance uses a linear growth function (additive increase). This is achieved by introducing - in addition to the cwnd window - a slow start threshold (ssthresh).

As long as cwnd is less than ssthresh, Slow Start applies. Once ssthresh is reached, cwnd is increased by at most one segment per RTT. The cwnd window continues to open with this linear rate until a congestion event is detected.

When congestion is detected, ssthresh is set to half the cwnd (or to be strictly accurate, half the "Flight Size". This distinction is important if the implementation lets cwnd grow beyond rwnd (the receiver's declared window)). cwnd is either set to 1 if congestion was signalled by a timeout, forcing the sender to enter Slow Start, or to ssthresh if congestion was signalled by duplicate ACKs and the Fast Recovery algorithm has terminated. In either case, once the sender enters Congestion Avoidance, its rate has been reduced to half the value at the time of congestion. This multiplicative decrease causes the cwnd to close exponentially with each detected loss event.

Fast Retransmit

In Fast Retransmit, the arrival of three duplicate ACKs is interpreted as packet loss, and retransmission starts before the retransmission timer (RTO) expires.

The missing segment will be retransmitted immediately without going through the normal retransmission queue processing. This improves performance by eliminating delays that would suspend effective data flow on the link.

Fast Recovery

Fast Recovery is used to react quickly to a single packet loss. In Fast Recovery, the receipt of 3 duplicate ACKs, while taken to mean the loss of a segment, does not result in a full Slow Start: since later segments evidently got through, congestion has not stopped everything. Instead, ssthresh is set to half of the current send window size, the missing segment is retransmitted (Fast Retransmit), and cwnd is set to ssthresh plus three segments. Each additional duplicate ACK indicates that one more segment has left the network at the receiver, and cwnd is increased by one segment, allowing the transmission of another segment if permitted by the new cwnd. When an ACK is received for new data, cwnd is reset to ssthresh, and TCP enters Congestion Avoidance mode.
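The dupACK-driven window arithmetic of Fast Retransmit and Fast Recovery can be sketched similarly (a simplified model only; the three-dupACK threshold follows the description above, and the names are illustrative, not taken from any real TCP stack):

```python
DUPACK_THRESHOLD = 3  # three duplicate ACKs trigger Fast Retransmit

class FastRecoveryModel:
    """Toy model of the Fast Retransmit / Fast Recovery window arithmetic."""

    def __init__(self, cwnd):
        self.cwnd = cwnd
        self.ssthresh = cwnd
        self.dupacks = 0
        self.in_recovery = False

    def on_dupack(self):
        self.dupacks += 1
        if self.dupacks == DUPACK_THRESHOLD:
            # Fast Retransmit: the missing segment is resent immediately,
            # then Fast Recovery begins.
            self.ssthresh = max(self.cwnd // 2, 2)
            self.cwnd = self.ssthresh + 3   # inflate by the three dupACKs
            self.in_recovery = True
        elif self.in_recovery:
            self.cwnd += 1                  # each dupACK: one segment left the net

    def on_new_ack(self):
        if self.in_recovery:
            self.cwnd = self.ssthresh       # deflate; enter Congestion Avoidance
            self.in_recovery = False
            self.dupacks = 0

m = FastRecoveryModel(cwnd=20)
for _ in range(3):
    m.on_dupack()          # third dupACK: ssthresh becomes 10, cwnd becomes 13
```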

References

-- UlrichSchmid - 07 Jun 2005 -- SimonLeinen - 27 Jan 2006 - 23 Jun 2011

High-Speed TCP Variants

There have been numerous ideas for improving TCP over the years. Some of those ideas have been adopted by mainstream implementations (after thorough review). Recently there has been an uptick in work towards improving TCP's behavior on LongFatNetworks, where the standard congestion control algorithms have been shown to limit the efficiency of network resource utilization. The various new TCP implementations change how the size of the congestion window is computed; some of them require extra feedback from the network. Broadly, such congestion control protocols can be classified by the congestion signal they use, such as packet loss, delay, or explicit network feedback.

An orthogonal technique for improving TCP's performance is automatic buffer tuning.

Comparative Studies

All papers about individual TCP-improvement proposals contain comparisons against older TCP to quantify the improvement. There are several studies that compare the performance of the various new TCP variants, including

Other References

  • Gigabit TCP, G. Huston, The Internet Protocol Journal, Vol. 9, No. 2, June 2006. Contains useful descriptions of many modern TCP enhancements.
  • Faster, G. Huston, June 2005. This article from ISP Column looks at various approaches to TCP transfers at very high rates.
  • Congestion Control in the RFC Series, M. Welzl, W. Eddy, July 2006, Internet-Draft (work in progress)
  • ICCRG Wiki, IRTF (Internet Research Task Force) ICCRG (Internet Congestion Control Research Group), includes bibliography on congestion control.
  • RFC 6077: Open Research Issues in Internet Congestion Control, D. Papadimitriou (Ed.), M. Welzl, M. Scharf, B. Briscoe, February 2011

Related Work

  • PFLDnet (International Workshop on Protocols for Fast Long-Distance Networks): 2003, 2004, 2005, 2006, 2007, 2009, 2010. (The 2008 pages seem to have been removed from the Net.)

-- Simon Leinen - 2005-05-28–2016-09-18 -- Orla Mc Gann - 2005-10-10 -- Chris Welti - 2008-02-25


Network Emulation Tools

In general, it is difficult to assess performance of distributed applications and protocols before they are deployed on the real network, because the interaction with network impairments such as delay is hard to predict. Therefore, researchers and practitioners often use emulation to mimic deployment (typically wide-area) networks for use in laboratory testbeds.

The emulators listed below implement logical interfaces with configurable parameters such as delay, bandwidth (capacity) and packet loss rates.

-- TobyRodwell - 06 Apr 2005 -- SimonLeinen - 15 Dec 2005

Linux netem (introduced in recent 2.6 kernels)

NISTnet works only on 2.4.x Linux kernels, so for the best support of recent GE cards, kernel 2.6.10 or later with the new netem (network emulator) module is recommended. Much of the same functionality is built on this special QoS queueing discipline. It is buried in menuconfig:

 Networking -->
   Networking Options -->
     QoS and/or fair queuing -->
        Network emulator

You will need a recent iproute2 package that supports netem. The tc command is used to configure netem characteristics such as loss rates and distribution, delay and delay variation, and packet corruption rate (packet corruption was added in Linux version 2.6.16).
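For example, the following tc invocations set up a netem emulation (the interface name eth0 and all parameter values are only illustrative; the commands require root privileges and a netem-enabled kernel):

```shell
# Emulate a WAN path: 100 ms delay with 10 ms variation and 0.1% packet loss
tc qdisc add dev eth0 root netem delay 100ms 10ms loss 0.1%

# Additionally corrupt 0.01% of packets (supported since Linux 2.6.16)
tc qdisc change dev eth0 root netem delay 100ms 10ms loss 0.1% corrupt 0.01%

# Inspect, and finally remove, the emulation
tc qdisc show dev eth0
tc qdisc del dev eth0 root
```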

References

-- TobyRodwell - 14 Apr 2005
-- SimonLeinen - 21 Mar 2006 - 07 Jan 2007 -- based on information from David Martinez Moreno (RedIRIS)

Larry Dunn from Cisco has set up a portable network emulator based on NISTnet. He wrote this about it:

My systems started as 2x100BT, but not for the (cost) reasons you suspect. I have 3 cheap Linux systems in a mobile rack (3x1RU, <$5k USD total). They are attached in a "row", 2 with web100, 1 with NISTNet like this: web100 - NISTNet - web100. The NISTNet node acts as a variable delay-line (emulates different fiber propagation delays), and as a place to inject loss, if desired. DummyNet/BSD could work as well.

I was looking for a very "shallow" (front-to-back) 1RU system, so I could fit it into a normal-sized Anvil case (for easy shipping to Networkers, etc). Penguin Computing was one of 2 companies I found that was then selling such "shallow" rack-mount systems. The shallow units were "low-end" servers, so they only had 2x100BT ports.

Their current low-end systems have two 1xGE ports, and are (still) around $1100 USD each. So a set of 3 systems (inc. rack), Linux preloaded, 3-yr warranty, is still easily under $5k (actually, closer to $4k, including a $700 Anvil 2'x2'x2' rolling shock-mount case, and a cheap 8-port GE switch). Maybe $5k if you add more memory & disk...

Here's a reference to their current low-end stuff:

http://penguincomputing.com/products/servers/relion1XT.php

I have no affiliation with Penguin, just happened to buy their stuff, and it's working OK (though the fans are pretty loud). I suppose there are comparable vendors in Europe.

-- TobyRodwell - 14 Apr 2005


dummynet

A flexible tool for testing networking protocols, dummynet has been included in FreeBSD since 1998. A description of dummynet can be found on its homepage.

References

-- SimonLeinen - 15 Dec 2005

Sun hxbt STREAMS Module

From an announcement on the OpenSolaris Networking community site:

Hxbt is a Stream module/driver which emulates a WAN environment. It captures packets from IP and manipulates the packets according to the WAN setup. There are five parameters to control the environment. They are network propagation delay, bandwidth, drop rate, reordering, and corruption. Hxbt acts like a "pipe" between two hosts. And the pipe is the emulated WAN.

For more information download the source or take a look at the README.

-- SimonLeinen - 15 Dec 2005

 

End-System (Host) Tuning

This section contains hints for tuning end systems (hosts) for maximum performance.

References

-- SimonLeinen - 31 Oct 2004 - 23 Jul 2009
-- PekkaSavola - 17 Nov 2006

Calculating required TCP buffer sizes

Regardless of operating system, TCP buffers must be appropriately dimensioned if they are to support long-distance, high-data-rate flows. Victor Reijs's TCP Throughput calculator is able to determine the maximum achievable TCP throughput based on RTT, expected packet loss, segment size, re-transmission timeout (RTO) and buffer size. It can be found at:

http://people.heanet.ie/~vreijs/tcpcalculations.htm
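The rule of thumb behind such calculations is that the send and receive buffers must hold at least one bandwidth-delay product (BDP) of data to keep the pipe full. A minimal sketch of that arithmetic (Python; not a replacement for the full calculator, which also accounts for loss and RTO):

```python
def required_tcp_buffer(bandwidth_bps, rtt_seconds):
    """Minimum TCP buffer in bytes to sustain a given rate over a given RTT:
    the bandwidth-delay product, bandwidth * RTT, converted from bits to bytes."""
    return bandwidth_bps * rtt_seconds / 8

# Example: a 1 Gb/s path with 100 ms RTT needs about 12.5 MB of buffer.
buf = required_tcp_buffer(1e9, 0.100)
print(f"{buf / 1e6:.1f} MB")   # -> 12.5 MB
```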

Tolerance to Packet Re-ordering

As explained in the section on packet re-ordering, packets can be re-ordered wherever there is parallelism in a network. If such parallelism exists (e.g. as caused by the internal architecture of the Juniper M160s used in the GEANT network), then TCP should be set to tolerate a larger number of re-ordered packets than the default, which is 3 (see here for an example for Linux). For large flows across GEANT a value of 20 should be sufficient; an alternative workaround is to use multiple parallel TCP flows.

Modifying TCP Behaviour

TCP congestion control algorithms are often configured conservatively, to avoid 'slow start' overwhelming a network path. In a capacity-rich network this may not be necessary, and the defaults may be overridden.

-- TobyRodwell - 24 Jan 2005


[The following information is courtesy of Jeff Boote, from Internet2, and was forwarded to me by Nicolas Simar.]

Operating Systems (OSs) have a habit of reconfiguring network interface settings back to their defaults, even when the correct values have been written in a specific config file. This is the result of bugs, and they appear in almost all OSs. Sometimes they get fixed in a given release but then get broken again in a later one. It is not known why this is the case, but it may be partly because driver programmers don't test their products under conditions of large latency. It is worth noting that experience shows 'ifconfig' works well for 'txqueuelen' and 'MTU'.


Specific Network Technologies

This section of the knowledge base treats some (sub)network technologies with their specific performance implications.

-- SimonLeinen - 06 Dec 2005 -- BlazejPietrzak - 05 Sep 2007

Ethernet

Ethernet is now widely prevalent as a link-layer technology for local area/campus networks, and is making inroads into other market segments as well, for example as a framing technique for wide-area connections (replacing ATM or SDH/SONET in some applications), or as a fabric for storage networks (replacing Fibre Channel etc.) or clustered HPC systems (replacing special-purpose networks such as Myrinet etc.).

From the original media access protocol based on CSMA/CD, used on shared coaxial cabling at speeds of 10 Mb/s (originally 3 Mb/s), Ethernet has evolved to much higher speeds, from Fast Ethernet (100 Mb/s) through Gigabit Ethernet (1 Gb/s) to 10 Gb/s. Shared media access and bus topologies have been replaced by star-shaped topologies connected by switches. Additional extensions include speed, duplex-mode and cable-crossing auto-negotiation, virtual LANs (VLANs), flow-control and other quality-of-service enhancements, port-based access control and many more.

The topics treated here are mostly relevant for the "traditional" use of Ethernet in local (campus) networks. Some of these topics are relevant for non-Ethernet networks, but are mentioned here nevertheless because Ethernet is so widely used that they have become associated with it.

-- SimonLeinen - 15 Dec 2005

Duplex Modes and Auto-Negotiation

A point-to-point Ethernet segment (typically between a switch and an end-node, or between two directly connected end-nodes) can operate in one of two duplex modes: half duplex means that only one station can send at a time, and full duplex means that both stations can send at the same time. Of course full-duplex mode is preferable for performance reasons if both stations support it.

Duplex Mismatch

Duplex mismatch describes the situation where one station on a point-to-point Ethernet link (typically between a switch and a host or router) uses full-duplex mode, and the other uses half-duplex mode. A link with duplex mismatch will seem to work fine as long as there is little traffic. But when there is traffic in both directions, it will experience packet loss and severely decreased performance. Note that the performance in the duplex mismatch case will be much worse than when both stations operate in half-duplex mode.

Work in the Internet2 "End-to-End Performance Initiative" suggests that duplex mismatch is one of the most common causes of bad bulk throughput. Rich Carlson's NDT (Network Diagnostic Tester) uses heuristics to try to determine whether the path to a remote host suffers from duplex mismatch.

Duplex Auto-Negotiation

In early versions of Ethernet, only half-duplex mode existed, mostly because point-to-point Ethernet segments weren't all that common - typically an Ethernet would be shared by many stations, with the CSMA/CD (Carrier Sense Multiple Access/Collision Detection) protocol used to arbitrate the sending channel.

When "Fast Ethernet" (100 Mb/s Ethernet) over twisted pair cable (100BaseT) was introduced, an auto-negotiation procedure was added to allow the two stations at the ends of an Ethernet cable to agree on the duplex mode (and also to detect whether the stations support 100 Mb/s at all - otherwise communication falls back to traditional 10 Mb/s Ethernet). Gigabit Ethernet over twisted pair (1000BaseT) had speed, duplex, and even "crossed-cable" (MDI-X) auto-negotiation from the start.

Why people turn off auto-negotiation

Unfortunately, some early products supporting Fast Ethernet didn't include the auto-negotiation mechanism, and those that did sometimes failed to interoperate with each other. So many knowledgeable people recommended avoiding duplex auto-negotiation, because it introduced more problems than it solved. The common recommendation was thus to manually configure the desired duplex mode - typically full duplex.

Problems with turning auto-negotiation off

There are two main problems with turning off auto-negotiation

  1. You have to remember to configure both ends consistently. Even when the initial configuration is consistent on both ends, it often turns into an inconsistent one as devices and connections are moved around.
  2. Hardcoding one side to full duplex when the other does autoconfiguration causes duplex mismatch. In situations where one side must use auto-negotiation (maybe because it is a non-manageable switch), it is never right to manually configure full-duplex mode on the other. This is because the auto-negotiation mechanism requires that, when the other side doesn't perform auto-negotiation, the local side must set itself to half-duplex mode.

Both situations result in duplex mismatches, with the associated performance issues.

Recommendation: Leave auto-negotiation on

In the light of these problems with hard-coded duplex modes, it is generally preferable to rely on auto-negotiation of duplex mode. Recent equipment handles auto-negotiation in a reliable and interoperable way, with very few exceptions.

References

-- SimonLeinen - 12 Jun 2005 - 4 Sep 2006


CSMA/CD

CSMA/CD stands for Carrier Sense Multiple Access/Collision Detection. It is a central part of the original Ethernet design and describes how multiple stations connected to a shared medium (such as a coaxial cable) can share the transmission channel without centralized control, using "carrier sensing" to avoid most collisions, and using collision detection with randomized backoff for those collisions that could not be avoided.
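The randomized backoff is the standard's truncated binary exponential backoff: after the n-th successive collision, a station waits a random number of slot times drawn from [0, 2^min(n,10) - 1], giving up after 16 attempts. A sketch (Python; the constants are those of classic 10 Mb/s Ethernet):

```python
import random

SLOT_TIME_US = 51.2   # slot time of 10 Mb/s Ethernet, in microseconds
MAX_EXPONENT = 10     # backoff exponent is capped at 10 ("truncated")
MAX_ATTEMPTS = 16     # after 16 successive collisions the frame is discarded

def backoff_slots(collision_count):
    """Pick the random backoff (in slot times) after the n-th collision."""
    if collision_count > MAX_ATTEMPTS:
        raise RuntimeError("excessive collisions: frame discarded")
    k = min(collision_count, MAX_EXPONENT)
    return random.randint(0, 2 ** k - 1)

# After the 3rd collision, a station waits between 0 and 7 slot times:
delay_us = backoff_slots(3) * SLOT_TIME_US
```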

-- SimonLeinen - 15 Dec 2005

 

LAN Collisions

In some legacy networks, workstations or other devices may still be connected into a LAN segment using hubs. All incoming and outgoing traffic is propagated to every port of the hub, often resulting in a collision when two or more devices attempt to send data at the same time. After each collision, the original information needs to be resent, reducing performance.

Operationally, this can lead to up to 100% of 5000-byte packets being lost when sending traffic off network and 31%-60% packet loss within a single subnet. It should be noted that common applications (e.g. email, FTP, WWW) on LANs produce packets close to 1500 bytes in size, and that packet loss rates >1% render applications such as video conferencing unusable.

To prevent collisions from traveling to every workstation in the entire network, bridges or switches should be installed. These devices will not forward collisions, but will permit broadcasts to all users and multicasts to specific groups to pass through.

When only a single system is connected to a single switch port, each collision domain is made up of only one system, and full-duplex communication also becomes possible.

-- AnnRHarding - 18 Jul 2005

LAN Broadcast Domains

While switches help network performance by reducing collision domains, they will still permit broadcasts to all users, and multicasts to specific groups, to pass through. In a switched network with a lot of broadcast traffic, network congestion can occur despite high-speed backbones. As universities and colleges were often early adopters of Internet technologies, they may have large address allocations, perhaps even deployed as big flat networks which generate a lot of broadcast traffic. In some cases these networks can be as large as a /16, and the potential to put up to sixty-five thousand hosts on a single network segment could be disastrous for performance.

The main purpose of subnetting is to help relieve network congestion caused by broadcast traffic. A successful subnetting plan is one where most of the network traffic will be isolated to the subnet in which it originated and broadcast domains are of a manageable size. This may be possible based on physical location, or it may be better to use VLANs. VLANs allow you to segment a LAN into different broadcast domains regardless of physical location. Users and devices on different floors or buildings have the ability to belong to the same LAN, since the segmentation is handled virtually and not via the physical layout.

References and Further Reading

RIPE Subnet Mask Information http://www.ripe.net/rs/ipv4/subnets.html

Priscilla Oppenheimer, Top-Down Network Design http://www.topdownbook.com/

-- AnnRHarding - 18 Jul 2005

Wireless LAN

Wireless Local Area Networks according to IEEE 802.11 standards have become extremely widespread in recent years: in campus networks, for home networking, for convenient network access at conferences, and to a certain extent for commercial Internet access provision in hotels, public places, and even planes.

While wireless LANs are usually built more for convenience (or profit) than for performance, there are some interesting performance issues specific to WLANs. As an example, it is still a big challenge to build WLANs using multiple access points so that they can scale to large numbers of simultaneous users, e.g. for large events.

Common problems with 802.11 wireless LANs

Interference

In the 2.4 GHz band, the number of usable channels (frequencies) is low. Adjacent channels use overlapping frequencies, so there are typically only three truly non-overlapping channels in this band - channels 1, 6, and 11 are frequently used. In campus networks requiring many access points, care must be taken to avoid interference between same-channel access points. The problem is even more severe in areas where access points are deployed without coordination (such as in residential areas). Some modern access points can sense the radio environment during initialization, and try to use a channel that doesn't suffer from much interference. The 2.4 GHz band is also used by other technologies such as Bluetooth or microwave ovens.

Capacity loss due to backwards compatibility or slow stations

The radio link in 802.11 can work at many different data rates below the nominal rate. For example, the 802.11g (54 Mb/s) access point to which I am presently connected supports operation at 1, 2, 5.5, 6, 9, 11, 12, 18, 24, 36, 48, or 54 Mb/s. Using the lower rates can be useful under adverse radio transmission conditions. In addition, it allows backwards compatibility - for example, 802.11g equipment interoperates with older 802.11b equipment, albeit at most at the 11 Mb/s supported by 802.11b.

When lower-rate and higher-rate stations coexist on the same access point, note that a lower-rate station occupies disproportionately more of the medium's capacity, because of increased serialization times at lower rates. So a single station operating at 1 Mb/s and transferring data at 500 kb/s consumes as much of the access point's capacity as 54 stations each also transferring 500 kb/s, but at a 54 Mb/s wireless rate.
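The airtime arithmetic behind this example is simple: the fraction of the medium a station occupies is its throughput divided by its wireless rate (a sketch that ignores per-frame overhead, which only makes the imbalance worse):

```python
def airtime_fraction(throughput_bps, phy_rate_bps):
    """Fraction of the medium's time a station needs to move
    throughput_bps of data when transmitting at phy_rate_bps."""
    return throughput_bps / phy_rate_bps

slow = airtime_fraction(500e3, 1e6)        # one station at 1 Mb/s: 50% of airtime
fast = 54 * airtime_fraction(500e3, 54e6)  # 54 stations at 54 Mb/s: also 50%
```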

Multicast/broadcast

Wireless is a "natural" broadcast medium, so broadcast and multicast should be relatively efficient. But the access point normally sends multicast and broadcast frames at a low rate, to increase the probability that all stations can actually receive them. Thus, multicast traffic streams can quickly consume a large fraction of an access point's capacity as per the considerations in the preceding section.

This is a reason why wireless networks often aren't multicast-enabled even in environments that typically have multicast connectivity (such as campus networks). Note that broadcast and multicast cannot easily be disabled completely, because they are required for lower-layer protocols such as ARP (broadcast) or IPv6 Neighbor Discovery (multicast) to work.

802.11n Performance Features

IEEE 802.11n is a recent addition to the wireless LAN standards, offering higher performance in terms of both capacity ("bandwidth") and reach. The standard supports both the 2.4 GHz and 5 GHz bands, although "consumer" 802.11n products often work with 2.4 GHz only unless marked "dual-band". Within each band, 802.11n equipment is normally backwards compatible with the respective prior standards, i.e. 802.11b/g for 2.4 GHz and 802.11a for 5 GHz. 802.11n achieves its performance increases by

  • physical diversity, using multiple antennas in a "MIMO" (multiple-input/multiple-output) scheme
  • the option to use wider channels (spaced 40 MHz rather than 20 MHz)
  • frame aggregation options at the MAC level, allowing larger packets or the bundling of multiple frames into a single radio frame, in order to better amortize the (relatively high) link access overhead.

References

-- Simon Leinen - 2005-12-15 - 2018-02-04

Revision 18 - 2006-04-07 - SimonLeinen

 

Active Measurement Tools

Active measurement injects traffic into a network to measure properties about that network.

  • ping - a simple RTT and loss measurement tool using ICMP ECHO messages
  • fping - ping variant supporting concurrent measurement to multiple destinations
  • SmokePing - nice graphical representation of RTT distribution and loss rate from periodic "pings"
  • OWAMP - a protocol and tool for one-way measurements

There are several permanent infrastructures performing active measurements:

  • HADES DFN-developed HADES Active Delay Evaluation System, deployed in GEANT/GEANT2 and DFN
  • RIPE TTM
  • QoSMetricsBoxes used in RENATER's Métrologie project

-- SimonLeinen - 29 Mar 2006

ping

Ping sends ICMP echo request messages to a remote host and waits for ICMP echo replies to determine the latency between those hosts. The output shows the Round Trip Time (RTT) between the host machine and the remote target. Ping is often used to determine whether a remote host is reachable. Unfortunately, it is quite common these days for ICMP traffic to be blocked by packet filters / firewalls, so a ping timing out does not necessarily mean that a host is unreachable.

(Another method for checking remote host availability is to telnet to a port that you know to be accessible; such as port 80 (HTTP) or 25 (SMTP). If the connection is still timing out, then the host is probably not reachable - often because of a filter/firewall silently blocking the traffic. When there is no service listening on that port, this usually results in a "Connection refused" message. If a connection is made to the remote host, then it can be ended by typing the escape character (usually 'CTRL ^ ]') then quit.)

The -f flag may be specified to send ping packets as fast as they come back, or 100 times per second - whichever is more frequent. As this option can be very hard on the network, only a super-user (the root account on *NIX machines) is allowed to specify this flag in a ping command.

The -c flag specifies the number of Echo request messages sent to the remote host by ping. If this flag isn't used, then ping continues to send echo request messages until the user types CTRL-C. If the ping is cancelled after only a few messages have been sent, the RTT summary statistics displayed at the end of the ping output are based on a small sample and may not be representative. To gain an accurate representation of the RTT, it is recommended to set a count of 100 pings. The MS Windows implementation of ping sends just 4 echo request messages by default.

Ping6 is the IPv6 implementation of the ping program. It works in the same way, but sends ICMPv6 Echo Request packets and waits for ICMPv6 Echo Reply packets to determine the RTT between two hosts. There are no discernible differences between the Unix implementations and the MS Windows implementation.

As with traceroute, Solaris integrates IPv4 and IPv6 functionality in a single ping binary, and provides options to select between them (-A <AF>).

See fping for a ping variant that supports concurrent measurements to multiple destinations and is easier to use in scripts.

Implementation-specific notes

-- TobyRodwell - 06 Apr 2005
-- MarcoMarletta - 15 Nov 2006 (Added the difference in size between Juniper and Cisco)

fping

fping is a ping-like program which uses the Internet Control Message Protocol (ICMP) echo request to determine if a host is up. fping is different from ping in that you can specify any number of hosts on the command line, or specify a file containing a list of hosts to ping. Instead of trying one host until it times out or replies, fping will send out a ping packet and move on to the next host in a round-robin fashion. If a host replies, it is noted and removed from the list of hosts to check. If a host does not respond within a certain time limit and/or retry limit, it will be considered unreachable.

Unlike ping, fping is meant to be used in scripts and its output is easy to parse. It is often used as a probe for packet loss and round-trip time in SmokePing.

fping version 3

The original fping was written by Roland Schemers in 1992, and stopped being updated in 2002. In December 2011, David Schweikert decided to take up maintainership of fping, and increased the major release number to 3, mainly to reflect the change of maintainer. Changes from earlier versions include:

  • integration of all Debian patches, including IPv6 support (2.4b2-to-ipv6-16)
  • an optimized main loop that is claimed to bring performance close to the theoretical maximum
  • patches to improve SmokePing compatibility

The Web site location has changed to fping.org, and the source is now maintained on GitHub.

Since July 2013, there is also a Mailing List, which will be used to announce new releases.

Examples

Simple example of usage:

# fping -c 3 -s www.man.poznan.pl www.google.pl
www.google.pl     : [0], 96 bytes, 8.81 ms (8.81 avg, 0% loss)
www.man.poznan.pl : [0], 96 bytes, 37.7 ms (37.7 avg, 0% loss)
www.google.pl     : [1], 96 bytes, 8.80 ms (8.80 avg, 0% loss)
www.man.poznan.pl : [1], 96 bytes, 37.5 ms (37.6 avg, 0% loss)
www.google.pl     : [2], 96 bytes, 8.76 ms (8.79 avg, 0% loss)
www.man.poznan.pl : [2], 96 bytes, 37.5 ms (37.6 avg, 0% loss)

www.man.poznan.pl : xmt/rcv/%loss = 3/3/0%, min/avg/max = 37.5/37.6/37.7
www.google.pl     : xmt/rcv/%loss = 3/3/0%, min/avg/max = 8.76/8.79/8.81

       2 targets
       2 alive
       0 unreachable
       0 unknown addresses

       0 timeouts (waiting for response)
       6 ICMP Echos sent
       6 ICMP Echo Replies received
       0 other ICMP received

 8.76 ms (min round trip time)
 23.1 ms (avg round trip time)
 37.7 ms (max round trip time)
        2.039 sec (elapsed real time)

IPv6 Support

Jeroen Massar has added IPv6 support to fping. This has been implemented as a compile-time variant, so that there are separate fping (for IPv4) and fping6 (for IPv6) binaries. The IPv6 patch has been partially integrated into the fping version on www.fping.com as of release "2.4b2_to-ipv6" (thus also integrated in fping 3.0). Unfortunately his modifications to the build routine seem to have been lost in the integration, so that the fping.com version only installs the IPv6 version as fping. Jeroen's original version doesn't have this problem, and can be downloaded from his IPv6 Web page.

ICMP Sequence Number handling

Older versions of fping used the Sequence Number field in ICMP ECHO requests in a peculiar way: they used a different sequence number for each destination host, but the same sequence number for all requests to a given host. There have been reports of specific systems that suppress (or rate-limit) ICMP ECHO requests with repeated sequence numbers, which causes high loss rates to be reported by tools that use fping, such as SmokePing. Another issue is that fping could not distinguish a perfect link from one that drops every other packet and duplicates every other one.

Newer fping versions such as 3.0 or 2.4.2b2_to (on Debian GNU/Linux) include a change to sequence number handling attributed to Stephan Fuhrmann. These versions increment sequence numbers for every probe sent, which should solve both of these problems.

References

-- BartoszBelter - 2005-07-14 - 2005-07-26
-- SimonLeinen - 2008-05-19 - 2013-07-26


SmokePing

SmokePing is a software-based measurement framework that uses various software modules (called probes) to measure round-trip times, packet loss and availability at layer 3 (IPv4 and IPv6), and even application latencies. Layer 3 measurements are based on the ping tool, and for the analysis of applications there are probes for measuring DNS lookup and RADIUS authentication latencies. The measurements are centrally controlled on a single host, from which the software probes are started; these in turn emit active measurement flows. The load impact of these streams on the network is usually negligible.

As with MRTG, the results are stored in RRD databases at the original polling intervals for 24 hours and then aggregated over time; peak and mean values over 24h intervals are stored for more than one year. Like MRTG, the results are usually displayed in a web browser, using HTML-embedded graphs in daily, weekly, monthly and yearly timeframes. A particular strength of SmokePing is the graphical manner in which it displays the statistical distribution of latency values over time.

The tool also has an alarm feature that, based on flexible threshold rules, either sends out emails or runs external scripts.

The following picture shows example output of a weekly graph. The background colour indicates the link availability, and the foreground lines display the mean round-trip times. The shadows around the lines give a graphical indication of the statistical distribution of the measured round-trip times.

  • Example SmokePing graph:
    Example SmokePing graph

The 2.4* versions of Smokeping included SmokeTrace, an AJAX-based traceroute tool similar to mtr, but browser/server based. This was removed in 2.5.0 in favor of a separate, more general, tool called remOcular.

In August 2013, Tobi Oetiker announced that he received funding for the development of SmokePing 3.0. This will use Extopus as a front-end. This new version will be developed in public on Github. Another plan is to move to an event-based design that will make Smokeping more efficient and allow it to scale to a large number of probes.

Related Tools

The OpenNMS network management platform includes a tool called StrafePing which is heavily inspired by SmokePing.

References

-- SimonLeinen - 2006-04-07 - 2013-11-03

 

OWAMP (One-Way Active Measurement Tool)

OWAMP is a command line client application and a policy daemon used to determine one way latencies between hosts. It is an implementation of the OWAMP protocol (see references) that is currently going through the standardization process within the IPPM WG in the IETF.

With roundtrip-based measurements, it is hard to isolate the direction in which congestion is experienced. One-way measurements solve this problem and make the direction of congestion immediately apparent. Since traffic can be asymmetric at many sites that are primarily producers or consumers of data, this allows for more informative measurements. One-way measurements allow the user to better isolate the effects of specific parts of a network on the treatment of traffic.

The current OWAMP implementation (V3.1) supports IPv4 and IPv6.

Example using owping (OWAMP v3.1) on a linux host:

welti@atitlan:~$ owping ezmp3
Approximately 13.0 seconds until results available

--- owping statistics from [2001:620:0:114:21b:78ff:fe30:2974]:52887 to [ezmp3]:41530 ---
SID:    feae18e9cd32e3ee5806d0d490df41bd
first:  2009-02-03T16:40:31.678
last:   2009-02-03T16:40:40.894
100 sent, 0 lost (0.000%), 0 duplicates
one-way delay min/median/max = 2.4/2.5/3.39 ms, (err=0.43 ms)
one-way jitter = 0 ms (P95-P50)
TTL not reported
no reordering


--- owping statistics from [ezmp3]:41531 to [2001:620:0:114:21b:78ff:fe30:2974]:42315 ---
SID:    fe302974cd32e3ee7ea1a5004fddc6ff
first:  2009-02-03T16:40:31.479
last:   2009-02-03T16:40:40.685
100 sent, 0 lost (0.000%), 0 duplicates
one-way delay min/median/max = 1.99/2.1/2.06 ms, (err=0.43 ms)
one-way jitter = 0 ms (P95-P50)
TTL not reported
no reordering
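The statistics in the output above can be derived from raw one-way delay samples. The following sketch is an illustration, not owping's actual implementation; the sample values and the nearest-rank percentile helper are made up for the example, though owping does report "jitter" as the P95-P50 difference of the one-way delays:

```python
# Sketch: derive owping-style statistics from one-way delay samples.
# Illustration only, not OWAMP's actual code; sample values are made up.

def percentile(sorted_vals, p):
    """Nearest-rank percentile of an already-sorted list."""
    idx = min(len(sorted_vals) - 1, int(round(p / 100.0 * (len(sorted_vals) - 1))))
    return sorted_vals[idx]

def owping_stats(delays_ms):
    d = sorted(delays_ms)
    return {
        "min": d[0],
        "median": percentile(d, 50),
        "max": d[-1],
        # owping reports "jitter" as P95 - P50 of the one-way delays
        "jitter": percentile(d, 95) - percentile(d, 50),
    }

delays = [2.4, 2.5, 2.5, 2.6, 3.39, 2.5, 2.45, 2.5, 2.55, 2.5]
stats = owping_stats(delays)
print(stats["min"], stats["median"], stats["max"])  # 2.4 2.5 3.39
```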

References

-- ChrisWelti - 03 Apr 2006 - 03 Feb 2009 -- SimonLeinen - 29 Mar 2006 - 01 Dec 2008

Active Measurement Boxes

-- FrancoisXavierAndreu & SimonLeinen - 06 Jun 2005 - 06 Jul 2005

Revision 172006-04-05 - SimonLeinen


Public Tools

Private Tools

Contact FCCN NOC for information from any of the following tools:

  • Netsaint - reachability and latency alarms
  • Multicast Beacon IPv4 - IPv4 multicast reachability matrix
  • RRDTool
  • Iperf (IPv4/ IPv6)

-- MonicaDomingues - 31 May 2005

Public Tools

Private Tools

Contact IUCC NOC (nocplus@noc.ilan.net.il) for information from any of the following tools:

IUCC is willing for PERT CMs to have access to non-public IUCC network information. If, as part of an investigation, a PERT CM needs more information about the IUCC network, they should contact Hank (or the IUCC NOC) for further assistance.

-- HankNussbacher - 02 Jun 2005 -- RafiSadowsky - 25 Mar 2006

Public Tools

Private Tools

  • IPv6 cricket -- IPv6 cricket
  • Cricket -- Traffic statistics for core and access circuits
  • Weathermap -- Hungarnet Weathermap
  • Router configuration database IPv4/IPv6 -- CVS database of router configurations
  • IPv6 looking glass -- allows ping, traceroute and BGP commands
  • Nagios -- latency and service availability alarms (both IPv6/IPv4)

-- MihalyMeszaros - 16 Mar 2007


RENATER - Network Monitoring Tools

Public Tools

-- FrancoisXavierAndreu - 05 Apr 2006

 

Network Emulation Tools

In general, it is difficult to assess performance of distributed applications and protocols before they are deployed on the real network, because the interaction with network impairments such as delay is hard to predict. Therefore, researchers and practitioners often use emulation to mimic deployment (typically wide-area) networks for use in laboratory testbeds.

The emulators listed below implement logical interfaces with configurable parameters such as delay, bandwidth (capacity) and packet loss rates.

-- TobyRodwell - 06 Apr 2005 -- SimonLeinen - 15 Dec 2005

Linux netem (introduced in recent 2.6 kernels)

NISTnet works only on 2.4.x Linux kernels, so for the best support of recent GE cards I would recommend kernel 2.6.10 or 2.6.11 and the new netem (network emulator). Much of the same functionality is built on this special QoS queue. It is found in menuconfig under:

 Networking -->
   Networking Options -->
     QoS and/or fair queuing -->
        Network emulator

You will need a recent iproute2 package that supports netem. The tc command is used to configure netem characteristics such as loss rates and distribution, delay and delay variation, and packet corruption rate (packet corruption was added in Linux version 2.6.16).

References

-- TobyRodwell - 14 Apr 2005
-- SimonLeinen - 21 Mar 2006 - 07 Jan 2007 -- based on information from David Martinez Moreno (RedIRIS). Larry Dunn from Cisco has set up a portable network emulator based on NISTnet. He wrote this on it:

My systems started as 2x100BT, but not for the (cost) reasons you suspect. I have 3 cheap Linux systems in a mobile rack (3x1RU, <$5k USD total). They are attached in a "row", 2 with web100, 1 with NISTNet like this: web100 - NISTNet - web100. The NISTNet node acts as a variable delay-line (emulates different fiber propagation delays), and as a place to inject loss, if desired. DummyNet/BSD could work as well.

I was looking for a very "shallow" (front-to-back) 1RU system, so I could fit it into a normal-sized Anvil case (for easy shipping to Networkers, etc). Penguin Computing was one of 2 companies I found that was then selling such "shallow" rack-mount systems. The shallow units were "low-end" servers, so they only had 2x100BT ports.

Their current low-end systems have two 1xGE ports, and are (still) around $1100 USD each. So a set of 3 systems (inc. rack), Linux preloaded, 3-yr warranty, is still easily under $5k (Actually, closer to $4k, including a $700 Anvil 2'x2'x2' rolling shock-mount case, and a cheap 8-port GE switch). Maybe $5k if you add more memory & disk...

Here's a reference to their current low-end stuff:

http://penguincomputing.com/products/servers/relion1XT.php

I have no affiliation with Penguin, just happened to buy their stuff, and it's working OK (though the fans are pretty loud). I suppose there are comparable vendors in Europe.

-- TobyRodwell - 14 Apr 2005

Revision 162006-04-04 - SimonLeinen


Delay Variation ("Jitter")

Delay variation or "jitter" is a metric that describes the level of disturbance of packet arrival times with respect to an "ideal" pattern, typically the pattern in which the packets were sent. Such disturbances can be caused by competing traffic (i.e. queueing), or by contention on processing resources in the network.

RFC 3393 defines an IP Delay Variation Metric (IPDV). This particular metric only compares the delays experienced by packets of equal size, on the grounds that delay is naturally dependent on packet size, because of serialization delay.

Delay variation is an issue for real-time applications such as audio/video conferencing systems. They usually employ a jitter buffer to eliminate the effects of delay variation.

Delay variation is related to packet reordering. But note that the RFC 3393 IPDV of a network can be arbitrarily low, even zero, even though that network reorders packets, because the IPDV metric only compares delays of equal-sized packets.

Decomposition

Jitter is usually introduced in network nodes (routers), as an effect of queueing or contention for forwarding resources, especially on CPU-based router architectures. Some types of links can also introduce jitter, for example through collision avoidance (shared Ethernet) or link-level retransmission (802.11 wireless LANs).

Measurement

Contrary to one-way delay, one-way delay variation can be measured without requiring precisely synchronized clocks at the measurement endpoints. Many tools that measure one-way delay also provide delay variation measurements.
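The core of the RFC 3393 metric can be sketched very simply: delay variation is the difference between the one-way delays of pairs of (selected, here consecutive) packets. This is an illustration of the idea, not a complete implementation of the RFC; the delay values are made up:

```python
# Sketch of the RFC 3393 idea: delay variation as the difference between
# the one-way delays of consecutive packets. Values are illustrative.

def ipdv(one_way_delays_ms):
    """Return the list of delay differences D(i+1) - D(i)."""
    return [b - a for a, b in zip(one_way_delays_ms, one_way_delays_ms[1:])]

delays = [10.0, 12.5, 11.0, 11.0, 15.0]
print(ipdv(delays))  # [2.5, -1.5, 0.0, 4.0]
```

Note that clock offset between the two endpoints cancels out in the subtraction, which is why synchronized clocks are not required.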

References

The IETF IPPM (IP Performance Metrics) Working Group has formalized metrics for IPDV, and more recently started work on an "applicability statement" that explains how IPDV can be used in practice and what issues have to be considered.

  • RFC 3393, IP Packet Delay Variation Metric for IP Performance Metrics (IPPM), C. Demichelis, P. Chimento. November 2002.
  • draft-morton-ippm-delay-var-as-03.txt, Packet Delay Variation Applicability Statement, A. Morton, B. Claise, July 2007.

-- SimonLeinen - 28 Oct 2004 - 24 Jul 2007

Packet Loss

Packet loss is the probability that a packet is lost in transit from a source to a destination.

A One-way Packet Loss Metric for IPPM is defined in RFC 2680. RFC 3357 contains One-way Loss Pattern Sample Metrics.

Decomposition

There are two main reasons for packet loss:

Congestion

When the offered load exceeds the capacity of a part of the network, packets are buffered in queues. Since these buffers are also of limited capacity, severe congestion can lead to queue overflows, which lead to packet drops. In this context, severe congestion could mean that a moderate overload condition holds for an extended amount of time, but could also consist of the sudden arrival of a very large amount of traffic (burst).

Errors

Another reason for loss of packets is corruption, where parts of the packet are modified in-transit. When such corruptions happen on a link (due to noisy lines etc.), this is usually detected by a link-layer checksum at the receiving end, which then discards the packet.

Impact on end-to-end performance

Bulk data transfers usually require reliable transmission, so lost packets must be retransmitted. In addition, congestion-sensitive protocols such as TCP must assume that packet loss is due to congestion, and reduce their transmission rate accordingly (although recently there have been some proposals to allow TCP to identify non-congestion related losses and treat those differently).

For real-time applications such as conversational audio/video, it usually doesn't make much sense to retransmit lost packets, because the retransmitted copy would arrive too late (see delay variation). The result of packet loss is usually a degradation in sound or image quality. Some modern audio/video codecs provide a level of robustness to loss, so that the effect of occasional lost packets is benign. On the other hand, some of the most effective image compression methods are very sensitive to loss, in particular those that use relatively rare "anchor frames" and represent the intermediate frames by compressed differences to these anchor frames - when such an anchor frame is lost, many other frames cannot be reconstructed.

Measurement

Packet loss can be actively measured by sending a set of packets from a source to a destination and comparing the number of received packets against the number of packets sent.

Network elements such as routers also contain counters for events such as checksum errors or queue drops, which can be retrieved through protocols such as SNMP. When this kind of access is available, this can point to the location and cause of packet losses.
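The active-measurement approach described above reduces to a simple bookkeeping exercise: send numbered probes and count which sequence numbers arrive. A minimal sketch (names and values are illustrative):

```python
# Sketch of active loss measurement: compare the sequence numbers sent
# against those received. All names and values are illustrative.

def loss_rate(sent_seqnos, received_seqnos):
    lost = set(sent_seqnos) - set(received_seqnos)
    return len(lost) / len(set(sent_seqnos))

sent = range(100)
received = [n for n in sent if n not in (17, 42, 43)]  # 3 probes lost
print(loss_rate(sent, received))  # 0.03
```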

Reducing packet loss

Congestion-induced packet loss can be avoided by proper provisioning of link capacities. Depending on the probability of bursts (which is somewhat difficult to estimate, taking into account both link capacities in the network, and the traffic rates and patterns of a possibly large number of hosts at the edge of the network), buffers in network elements such as routers must also be sufficient. Note that large buffers can be detrimental to other aspects of network performance, in particular one-way delay (and thus round-trip time) and delay variation.

Quality-of-Service mechanisms such as DiffServ or IntServ can be used to protect some subset of traffic against loss, but this necessarily increases packet loss for the remaining traffic.

Lastly, Active Queue Management (AQM) and Explicit Congestion Notification (ECN) can be used to mitigate both packet loss and queueing delay (and thus one-way delay, round-trip time and delay variation).

References

  • A One-way Packet Loss Metric for IPPM, G. Almes, S. Kalidindi, M. Zekauskas, September 1999, RFC 2680
  • Improving Accuracy in End-to-End Packet Loss Measurement, J. Sommers, P. Barford, N. Duffield, A. Ron, August 2005, SIGCOMM'05 (PDF)
-- SimonLeinen - 01 Nov 2004

Packet Reordering

The Internet Protocol (IP) does not guarantee that packets are delivered in the order in which they were sent. This was a deliberate design choice that distinguishes IP from protocols such as, for instance, ATM and IEEE 802.3 Ethernet.

Decomposition

A network usually reorders packets because of some kind of parallelism: either a choice of alternative routes (Equal-Cost Multipath, ECMP), or internal parallelism inside switching elements such as routers. One particular kind of packet reordering concerns packets of different sizes. Because a larger packet takes longer to transfer over a serial link (or a limited-width backplane inside a router), larger packets may be "overtaken" by smaller packets that were sent subsequently. This is usually not a concern for high-speed bulk transfers - where the segments tend to be equal-sized (hopefully Path MTU-sized) - but it may pose problems for naive implementations of multimedia (audio/video) transport.

Impact on end-to-end performance

In principle, applications that use a transport protocol such as TCP or SCTP don't have to worry about packet reordering, because the transport protocol is responsible for reassembling the byte stream (message stream(s) in the case of SCTP) into the original ordering. However, reordering can have a severe performance impact on some implementations of TCP. Recent TCP implementations, in particular those that support Selective Acknowledgements (SACK), can exhibit robust performance even in the face of reordering in the network.

Real-time media applications such as audio/video conferencing tools often experience problems when run over networks that reorder packets. This is somewhat remarkable in that all of these applications have jitter buffers to eliminate the effects of delay variation on the real-time media streams. Obviously, the code that manages these jitter buffers is often not written in a way to accommodate reordered packets sensibly, although this could be done with moderate effort.

Measurement

Packet reordering is measured by sending a numbered sequence of packets, and comparing the received sequence number sequence with the original. There are many possible ways of quantifying reordering, some of which are described in the IPPM Working Group's documents (see the references below).

The measurements here can be done differently, depending on the measurement purpose:

a) measuring reordering for a particular application can be done by capturing the application's traffic (e.g. using the Wireshark/Ethereal tool), injecting the same traffic pattern via a traffic generator and calculating the reordering.

b) measuring the maximal reordering introduced by the network can be done by injecting a relatively small amount of traffic at line rate, shaped as a short burst of long packets immediately followed by a short burst of short packets. After capture and calculation at the other end of the path, the results will reflect the worst packet reordering which may occur on the particular path.

For more information please refer to the measurement page, available at the packet reordering measurement site. Note that although the tool is based on older versions of the reordering drafts, the metrics used are compatible with the definitions in the newer ones.

In particular Reorder Density and Reorder Buffer-Occupancy Density can be obtained from tcpdump by subjecting the dump files to RATtool, available at Reordering Analysis Tool Website
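One simple reordering count, similar in spirit to the IPPM definitions (though not any one metric from the drafts verbatim), flags a packet as reordered when its sequence number is smaller than one already received. A sketch:

```python
# Sketch: count "late" packets in an arrival sequence. A packet is
# counted as reordered if its sequence number is below the highest
# sequence number seen so far. Illustrative, not an exact IPPM metric.

def count_reordered(arrival_seqnos):
    reordered = 0
    highest = -1
    for seq in arrival_seqnos:
        if seq < highest:
            reordered += 1
        else:
            highest = seq
    return reordered

print(count_reordered([0, 1, 3, 2, 4, 6, 5, 7]))  # 2 (packets 2 and 5 arrive late)
```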

Reducing reordering

The probability of reordering can be reduced by avoiding parallelism in the network, or by using network nodes that take care to use parallel paths in such a way that packets belonging to the same end-to-end flow are kept on a single path. This can be ensured by using a hash on the destination address, or the (source address, destination address) pair to select from the set of available paths. For certain types of parallelism this is hard to achieve, for example when a simple striping mechanism is used to distribute packets from a very high-speed interface to multiple forwarding paths.
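The hash-based path selection described above can be sketched as follows. The hash function here (CRC-32 over the address pair) is purely illustrative; real routers use their own, often proprietary, hash schemes:

```python
# Sketch of hash-based load sharing: all packets of one flow map to the
# same member path, preserving per-flow ordering. The hash choice is
# illustrative, not what any particular router implements.

import zlib

def select_path(src_addr, dst_addr, num_paths):
    key = ("%s-%s" % (src_addr, dst_addr)).encode()
    return zlib.crc32(key) % num_paths

# Every packet of a given flow takes the same member link:
p1 = select_path("192.0.2.1", "203.0.113.9", 4)
p2 = select_path("192.0.2.1", "203.0.113.9", 4)
print(p1 == p2)  # True
```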

References

-- SimonLeinen - 2004-10-28 - 2014-12-29
-- MichalPrzybylski - 2005-07-19
-- BartoszBelter - 2006-03-28


Path MTU

The Path MTU is the Maximum Transmission Unit (MTU) supported by a network path. It is the minimum of the MTUs of the links (segments) that make up the path. Larger Path MTUs generally allow for more efficient data transfers, because source and destination hosts, as well as the switching devices (routers) along the network path, have to process fewer packets. However, it should be noted that modern high-speed network adapters have mechanisms such as LSO (Large Send Offload) and Interrupt Coalescence that diminish the influence of MTUs on performance. Furthermore, routers are typically dimensioned to sustain very high packet loads (so that they can resist denial-of-service attacks), so the packet rates caused by high-speed transfers are not normally an issue for today's high-speed networks.

The prevalent Path MTU on the Internet is now 1500 bytes, the Ethernet MTU. There are some initiatives to support larger MTUs (JumboMTU) in networks, in particular on research networks. But their usability is hampered by last-mile issues and lack of robustness of RFC 1191 Path MTU Discovery.

Path MTU Discovery Mechanisms

Traditional (RFC1191) Path MTU Discovery

RFC 1191 describes a method for a sender to detect the Path MTU to a given receiver. (RFC 1981 describes the equivalent for IPv6.) The method works as follows:

  • The sending host will send packets to off-subnet destinations with the "Don't Fragment" bit set in the IP header.
  • The sending host keeps a cache containing Path MTU estimates per destination host address. This cache is often implemented as an extension of the routing table.
  • The Path MTU estimate for a new destination address is initialized to the MTU of the outgoing interface over which the destination is reached according to the local routing table.
  • When the sending host receives an ICMP "Too Big" (or "Fragmentation Needed and Don't Fragment Bit Set") destination-unreachable message, it learns that the Path MTU to that destination is smaller than previously assumed, and updates the estimate accordingly.
  • Normally, an ICMP "Too Big" message contains the next-hop MTU, and the sending host will use that as the new Path MTU estimate. The estimate can still be wrong because a subsequent link on the path may have an even smaller MTU.
  • For destination addresses with a Path MTU estimate lower than the outgoing interface MTU, the sending host will occasionally attempt to raise the estimate, in case the path has changed to support a larger MTU.
  • When trying ("probing") a larger Path MTU, the sending host can use a list of "common" MTUs, such as the MTUs associated with popular link layers, perhaps combined with popular tunneling overheads. This list can also be used to guess a smaller MTU in case an ICMP "Too Big" message is received that doesn't include any information about the next-hop MTU (maybe from a very very old router).

This method is widely implemented, but is not robust in today's Internet because it relies on ICMP packets sent by routers along the path. Such packets are often suppressed either at the router that should generate them (to protect its resources) or on the way back to the source, because of firewalls and other packet filters or rate limitations. These problems are described in RFC 2923. When packets are lost due to MTU issues without any ICMP "Too Big" message, this is sometimes called an (MTU) black hole. Some operating systems have added heuristics to detect such black holes and work around them. Workarounds can include lowering the MTU estimate or disabling PMTUD for certain destinations.
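The sender-side bookkeeping in the list above can be sketched as follows. The plateau values are taken from RFC 1191's suggested table of common MTUs (abbreviated here); the function names are illustrative:

```python
# Sketch of classical PMTUD bookkeeping (RFC 1191): lower the estimate
# on an ICMP "Too Big", falling back to a plateau table when the message
# carries no next-hop MTU (very old routers). Illustrative only.

PLATEAUS = [65535, 32000, 17914, 8166, 4352, 2002, 1492, 1006, 508, 296, 68]

def update_pmtu(current_estimate, next_hop_mtu=0):
    if next_hop_mtu:                      # router reported the limiting MTU
        return min(current_estimate, next_hop_mtu)
    # no next-hop MTU in the message: guess the next plateau down
    for mtu in PLATEAUS:
        if mtu < current_estimate:
            return mtu
    return current_estimate

print(update_pmtu(1500, next_hop_mtu=1476))  # 1476
print(update_pmtu(1500))                     # 1492 (next plateau down)
```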

Packetization Layer Path MTU Discovery (PLPMTUD, RFC 4821)

An IETF Working Group (pmtud) was chartered to define a new mechanism for Path MTU Discovery to solve these issues. This process resulted in RFC 4821, Packetization Layer Path MTU Discovery ("PLPMTUD"), which was published in March 2007. This scheme requires cooperation from a network layer above IP, namely the layer that performs "packetization". This could be TCP, but could also be a layer above UDP, let's say an RPC or file transfer protocol. PLPMTUD does not require ICMP messages. The sending packetization layer starts with small packets, and probes progressively larger sizes. When there's an indication that a larger packet was successfully transmitted to the destination (presumably because some sort of ACK was received), the Path MTU estimate is raised accordingly.

When a large packet was lost, this might have been due to an MTU limitation, but it might also be due to other causes, such as congestion or a transmission error - or maybe it's just the ACK that was lost! PLPMTUD recommends that the first time this happens, the sending packetization layer should assume an MTU issue, and try smaller packets. An isolated incident need not be interpreted as an indication of congestion.
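The probing idea can be illustrated as a search over packet sizes. This sketch simplifies PLPMTUD to a plain binary search against a simulated path; RFC 4821 describes a more nuanced probing strategy, and the size bounds here are arbitrary assumptions:

```python
# Rough sketch of the PLPMTUD idea (RFC 4821): probe progressively,
# keeping the largest packet size that gets acknowledged. Simplified to
# a binary search; the real algorithm is more nuanced. The "network" is
# simulated by the probe_ok callback.

def plpmtud_search(probe_ok, low=512, high=9000):
    """Find the largest size for which probe_ok(size) is True."""
    while low < high:
        mid = (low + high + 1) // 2
        if probe_ok(mid):      # probe acknowledged: raise the estimate
            low = mid
        else:                  # probe lost: assume an MTU limit, go lower
            high = mid - 1
    return low

simulated_path_mtu = 1500
print(plpmtud_search(lambda size: size <= simulated_path_mtu))  # 1500
```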

An implementation of the new scheme for the Linux kernel was integrated into version 2.6.17. It is controlled by a "sysctl" value that can be observed and set through /proc/sys/net/ipv4/tcp_mtu_probing. Possible values are:

  • 0: Don't perform PLPMTUD
  • 1: Perform PLPMTUD only after detecting a "blackhole" in old-style PMTUD
  • 2: Always perform PLPMTUD, and use the value of tcp_base_mss as the initial MSS.

A user-space implementation over UDP is included in the VFER bulk transfer tool.

References

Implementations

  • for Linux 2.6 - integrated in mainstream kernel as of 2.6.17. However, it is disabled by default (see net.ipv4.tcp_mtu_probing sysctl)
  • for NetBSD

-- Hank Nussbacher - 2005-07-03 -- Simon Leinen - 2006-07-19 - 2018-07-02

 

Network Protocols

This section contains information about a few common Internet protocols, with a focus on performance questions.

For information about higher-level protocols, see under application protocols.

-- SimonLeinen - 2004-10-31 - 2015-04-26

TCP (Transmission Control Protocol)

The Transmission Control Protocol (TCP) is the prevalent transport protocol used on the Internet today. It uses window-based transmission to provide the service of a reliable byte stream, and adapts the rate of transfer to the state (of congestion) of the network and the receiver. Basic mechanisms include:

  • Segments that fit into IP packets, into which the byte-stream is split by the sender,
  • a checksum for each segment,
  • a Window, which bounds the amount of data "in flight" between the sender and the receiver,
  • Acknowledgements, by which the receiver tells the sender about segments that were received successfully,
  • a flow-control mechanism, which regulates data rate as a function of network and receiver conditions.

Originally specified in RFC 793 (September 1981), TCP was clarified, refined and extended in many documents. Perhaps most importantly, congestion control was introduced in "TCP Tahoe" in 1988, described in Van Jacobson's 1988 SIGCOMM article on "Congestion Avoidance and Control". It can be said that TCP's congestion control is what keeps the Internet working when links are overloaded. In today's Internet, the enhanced "Reno" variant of congestion control is probably the most widespread.

RFC 7323 (formerly RFC 1323) specifies a set of "TCP Extensions for High Performance", namely the Window Scaling Option, which provides for much larger windows than the original 64K, the Timestamp Option and the PAWS (Protection Against Wrapped Sequence numbers) mechanism. These extensions are supported by most contemporary TCP stacks, although they frequently must be activated explicitly (or implicitly by configuring large TCP windows).

Another widely implemented performance enhancement to TCP are selective acknowledgements (SACK, RFC 2018). In TCP as originally specified, the acknowledgements ("ACKs") sent by a receiver were always "cumulative", that is, they specified the last byte of the part of the stream that was completely received. This is advantageous with large TCP windows, in particular where chances are high that multiple segments within a single window are lost. RFC 2883 describes an extension to SACK which makes TCP more robust in the presence of packet reordering in the network.

In addition, RFC 3449 - TCP Performance Implications of Network Path Asymmetry provides excellent information since a vast majority of Internet connections are asymmetrical.

Further Reading

References

  • RFC 7414, A Roadmap for Transmission Control Protocol (TCP) Specification Documents, M. Duke, R. Braden, W. Eddy, E. Blanton, A. Zimmermann, February 2015
  • RFC 793, Transmission Control Protocol, J. Postel, September 1981
  • draft-ietf-tcpm-rfc793bis-10, Transmission Control Protocol Specification, Wesley M. Eddy, July 2018
  • RFC 7323, TCP Extensions for High Performance, D. Borman, B. Braden, V. Jacobson, B. Scheffenegger, September 2014 (obsoletes RFC 1323 from May 1993)
  • RFC 2018, TCP Selective Acknowledgement Options, M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, October 1996
  • RFC 5681, TCP Congestion Control, M. Allman, V. Paxson, E. Blanton, September 2009
  • RFC 3449, TCP Performance Implications of Network Path Asymmetry, H. Balakrishnan, V. Padmanabhan, G. Fairhurst, M. Sooriyabandara, December 2002

There is ample literature on TCP, in particular research literature on its performance and large-scale behaviour.

Juniper has a nice white paper, Supporting Differentiated Service Classes: TCP Congestion Control Mechanisms (PDF format) explaining TCP's congestion control as well as many of the enhancements proposed over the years.

-- Simon Leinen - 2004-10-27 - 2018-07-02 -- Ulrich Schmid - 2005-05-31

Window-Based Transmission

TCP is a sliding-window protocol. The receiver tells the sender the available buffer space at the receiver (TCP header field "window"). The total window size is the minimum of sender buffer size, advertised receiver window size and congestion window size.

The sender can transmit up to this amount of data before having to wait for a further buffer update from the receiver, and should not have more than this amount of data in transit in the network. The sender must buffer the sent data until it has been ACKed by the receiver, so that the data can be retransmitted if necessary. With each ACK, acknowledged segments leave the window, and new segments can be sent as long as they fit into the (possibly updated) window.

Due to TCP's flow control mechanism, the TCP window size can limit the maximum achievable throughput regardless of the bandwidth of the network path. Too small a TCP window can degrade network performance below expectations, and too large a window may cause similar problems during error recovery.

The TCP window size is the most important parameter for achieving maximum throughput across high-performance networks. To reach the maximum transfer rate, the TCP window should be no smaller than the bandwidth-delay product.

Window size (bytes) >= Bandwidth (bytes/sec) x Round-trip time (sec)

Example:

window size: 8192 bytes
round-trip time: 100 ms
maximum throughput: ≈ 0.66 Mbit/sec
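The rule above can be turned into a tiny calculator: the required window for a target rate, and the throughput ceiling imposed by a given window (function names are ours):

```python
# Bandwidth-delay product arithmetic: minimum window to fill a path,
# and the throughput ceiling imposed by a fixed window.

def required_window(bandwidth_bit_s, rtt_s):
    """Minimum TCP window (bytes) to fill the path: BDP = bandwidth x RTT."""
    return bandwidth_bit_s * rtt_s / 8

def max_throughput(window_bytes, rtt_s):
    """Throughput ceiling (bit/s): at most one window per round trip."""
    return window_bytes * 8 / rtt_s

print(required_window(1e9, 0.1))        # ~12.5 MB window for 1 Gbit/s at 100 ms
print(max_throughput(8192, 0.1) / 1e6)  # ~0.66 Mbit/s, the example above
```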

References


-- UlrichSchmid & SimonLeinen - 31 May-07 Jun 2005


Hamilton TCP

Hamilton TCP (H-TCP), developed at the Hamilton Institute, is an enhancement to the existing TCP congestion control algorithm, with a sound mathematical and empirical basis. The authors assume that their improvement should behave like regular TCP in standard LAN and WAN networks where RTT and link capacity are not extremely high, but in long-distance fast networks H-TCP is far more aggressive and more flexible in adjusting to the available bandwidth while avoiding packet losses. Therefore two modes are used for congestion control, and mode selection depends on the time since the last experienced congestion. In the regular TCP window control algorithm, the window is increased by adding a constant value α and decreased by multiplying by β; the corresponding default values are 1 and 0.5. In H-TCP the window estimation is a bit more complex, because both factors are not constants but are calculated during transmission.

The parameter values are estimated as follows. For each acknowledgment set:

HTCP_alpha.png

and then

HTCP_alpha2.png

On each congestion event set:

HTCP_beta.png

Where,

  • Δ_i is the time that elapsed since the last congestion experienced by i'th source,
  • Δ_L is the threshold for switching between fast and slow modes,
  • B_i(k) is the throughput of flow i immediately before the k'th congestion event,
  • RTT_max,i and RTT_min,i are maximum and minimum round trip time values for the i'th flow.

The α value depends on the time between experienced congestion events, while β depends on the RTT and achieved bandwidth values. The constant values, as well as the Δ_L value, can be modified in order to adjust the algorithm's behavior to user expectations, but the default values were estimated empirically and seem the most suitable in most cases. The figures below present the behavior of the congestion window, throughput and RTT values during tests performed by the H-TCP authors. In the first case the β value was fixed at 0.5; in the second case the adaptive approach (explained with the equations above) was used. The throughput in the second case is far more stable and closer to its maximum value.
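The parameter updates shown in the equation images above can be sketched in code. This follows the published description of H-TCP; Δ_L and the polynomial coefficients are the paper's defaults, treated here as assumptions, and the function names are ours:

```python
# Sketch of the H-TCP increase/decrease parameters, following the paper
# by Leith and Shorten. Delta_L and the coefficients are the paper's
# defaults, treated as assumptions here.

DELTA_L = 1.0  # seconds of "low-speed" (standard TCP) mode after a loss

def htcp_alpha(delta, beta):
    """Additive-increase factor, growing with time since last congestion."""
    if delta <= DELTA_L:
        alpha = 1.0                      # behave like standard TCP
    else:
        d = delta - DELTA_L
        alpha = 1.0 + 10.0 * d + (d / 2.0) ** 2
    return 2.0 * (1.0 - beta) * alpha    # scaled to match the chosen beta

def htcp_beta(b_prev, b_now, rtt_min, rtt_max):
    """Adaptive multiplicative-decrease factor in [0.5, 0.8]."""
    if b_prev > 0 and abs(b_now - b_prev) / b_prev > 0.2:
        return 0.5                       # throughput changed a lot: back off hard
    return min(0.8, max(0.5, rtt_min / rtt_max))

# Shortly after a loss, H-TCP increases like standard TCP (alpha = 1)...
print(htcp_alpha(0.5, beta=0.5))  # 1.0
# ...but becomes much more aggressive as congestion-free time grows.
print(htcp_alpha(10.5, beta=0.5))
```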

HTCP_figure1.jpg

HTCP_figure2.jpg

H-TCP congestion window and throughput behavior (source: "H-TCP: TCP for high-speed and long-distance networks", D. Leith, R. Shorten).

Summary

Another advantage of H-TCP is fairness and friendliness: the algorithm is not "greedy" and will not consume the whole link capacity, but is able to share the bandwidth fairly with other H-TCP transmissions or any other TCP-like transmissions. H-TCP congestion window control improves the dynamics of the sending rate and therefore provides better overall throughput than a regular TCP implementation. On the Hamilton Institute web page (http://www.hamilton.ie/net/htcp/) more information is available, including papers, test results and algorithm implementations for Linux kernels 2.4 and 2.6.

References

-- WojtekSronek - 13 Jul 2005 -- SimonLeinen - 30 Jan 2006 - 14 Apr 2008

TCP Westwood

Current Standard TCP implementations rely on packet loss as an indicator of network congestion. The problem is that TCP cannot distinguish congestion losses from losses caused by noisy links; as a consequence, Standard TCP reacts with a drastic reduction of the congestion window. In wireless connections, overlapping radio channels, signal attenuation and additional noise have a huge impact on such losses.

TCP Westwood (TCPW) is a small modification of the Standard TCP congestion control algorithm. When the sender perceives that congestion has appeared, it uses the estimated available bandwidth to set the congestion window and the slow-start threshold. TCP Westwood avoids a huge reduction of these values and ensures both faster recovery and effective congestion avoidance. It does not require any support from lower or higher layers and does not need any explicit congestion feedback from the network.

Bandwidth measurement in TCP Westwood relies on the simple relationship between the amount of data the sender has sent and the time at which acknowledgments are received. The bandwidth estimation process is somewhat similar to the RTT estimation in Standard TCP, with additional improvements. When ACKs are absent because no packets were sent, the estimated value decays to zero. The formulas of the bandwidth estimation process are shown in many documents about TCP Westwood; such documents are listed on the UCLA Computer Science Department website.

There are some issues that can potentially disrupt the bandwidth estimation process, such as delayed or cumulative ACKs indicating a wrong number of received segments. The source must therefore keep track of the number of duplicate ACKs, and should be able to detect delayed ACKs and act accordingly.

As mentioned above, the general idea is to use the bandwidth estimate to set the congestion window and the slow start threshold after a congestion episode. The algorithm for setting these values is:

if (n duplicate ACKs are received) {
   ssthresh = (BWE * RTTmin)/seg_size;
   if (cwin > ssthresh) cwin = ssthresh; /* congestion avoid. */
}

where

  • BWE is the estimated bandwidth value,
  • RTTmin is the smallest RTT value observed during the connection,
  • seg_size is the length of the TCP segment payload.
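Plugging illustrative numbers into the formula above (all values assumed for the example):

```python
BWE = 1_250_000      # estimated bandwidth, bytes/sec (assumed: 10 Mbit/s)
RTT_MIN = 0.1        # smallest observed RTT, seconds
SEG_SIZE = 1448      # TCP payload per segment, bytes

# ssthresh = (BWE * RTTmin) / seg_size, expressed in segments
ssthresh = int(BWE * RTT_MIN / SEG_SIZE)
print(ssthresh)      # 86 segments: the pipe holds about 125000 bytes
```

With these numbers, a congestion episode leaves ssthresh at roughly the bandwidth-delay product instead of an arbitrary halving, which is exactly the point of the Westwood modification.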

Additional benefits

Based on the results of testing TCP Westwood, its fairness is at least as good as that of the widely used Standard TCP, even if two flows have different round trip times. The main advantage of this TCP variant is that it does not starve other TCP variants, so TCP Westwood can also be said to be friendly.

References

-- WojtekSronek - 27 Jul 2005

TCP Selective Acknowledgements (SACK)

Selective Acknowledgements are a refinement of TCP's traditional "cumulative" acknowledgements.

SACKs allow a receiver to acknowledge non-consecutive data, so that the sender can retransmit only what is missing at the receiver's end. This is particularly helpful on paths with a large bandwidth-delay product (BDP).

TCP may experience poor performance when multiple packets are lost from one window of data. With the limited information available from cumulative acknowledgments, a TCP sender can only learn about a single lost packet per round trip time. An aggressive sender could choose to retransmit packets early, but such retransmitted segments may have already been successfully received.

A Selective Acknowledgment (SACK) mechanism, combined with a selective repeat retransmission policy, can help to overcome these limitations. The receiving TCP sends back SACK packets to the sender informing the sender of data that has been received. The sender can then retransmit only the missing data segments.

Multiple packet losses from a window of data can have a catastrophic effect on TCP throughput. TCP uses a cumulative acknowledgment scheme in which received segments that are not at the left edge of the receive window are not acknowledged. This forces the sender to either wait a roundtrip time to find out about each lost packet, or to unnecessarily retransmit segments which have been correctly received. With the cumulative acknowledgment scheme, multiple dropped segments generally cause TCP to lose its ACK-based clock, reducing overall throughput. Selective Acknowledgment (SACK) is a strategy which corrects this behavior in the face of multiple dropped segments. With selective acknowledgments, the data receiver can inform the sender about all segments that have arrived successfully, so the sender need retransmit only the segments that have actually been lost.

The selective acknowledgment extension uses two TCP options. The first is an enabling option, "SACK-permitted", which may be sent in a SYN segment to indicate that the SACK option can be used once the connection is established. The other is the SACK option itself, which may be sent over an established connection once permission has been given by SACK-permitted.
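To illustrate what SACK information conveys, here is a simplified Python sketch of how a receiver could derive SACK blocks from the out-of-order byte ranges it holds. This models only the semantics (which ranges get reported), not the on-the-wire option encoding:

```python
def sack_blocks(cum_ack, received):
    """Coalesce out-of-order byte ranges above the cumulative ACK point
    into SACK blocks (left edge, right edge).  'received' is a list of
    (start, end) ranges already held by the receiver."""
    above = sorted(r for r in received if r[0] > cum_ack)
    blocks = []
    for start, end in above:
        if blocks and start <= blocks[-1][1]:
            # contiguous or overlapping with the previous block: merge
            blocks[-1] = (blocks[-1][0], max(blocks[-1][1], end))
        else:
            blocks.append((start, end))
    return blocks

# Bytes 1000-1999 and 3000-4999 arrived; 2000-2999 is missing.
print(sack_blocks(cum_ack=2000,
                  received=[(1000, 2000), (3000, 4000), (4000, 5000)]))
# → [(3000, 5000)]: the sender need only retransmit 2000-2999
```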

Blackholing Issues

Enabling SACK globally used to be somewhat risky, because in some parts of the Internet, TCP SYN packets offering/requesting the SACK capability were filtered, causing connection attempts to fail. By now, it seems that the increased deployment of SACK has caused most of these filters to disappear.

Performance Issues

Sometimes it is not recommended to enable the SACK feature; for example, the Linux 2.4.x TCP SACK implementation suffers from significant performance degradation after a burst of packet loss. People at CERN observed that a burst of packet loss considerably affects TCP connections with a large bandwidth-delay product (several MBytes), because the connection does not recover as it should. After a burst of loss, the throughput measured in their testbed stayed close to zero for 70 seconds. This behavior is not compliant with TCP's RFC: a timeout should occur after a few RTTs, because one of the losses could not be repaired quickly enough, and the sender should go back into slow start.

For more information take a look at http://sravot.home.cern.ch/sravot/Networking/TCP_performance/tcp_bug.htm

Additionally, work done at Hamilton Institute found that SACK processing in the Linux kernel is inefficient even for later 2.6 kernels, where at 1Gbps networks performance for a long file transfer can lose about 100Mbps of potential. Most of these issues should have been fixed in Linux kernel 2.6.16.

Detailed Explanation

The following is closely based on a mail that Baruch Even sent to the pert-discuss mailing list on 25 Jan '07:

The Linux TCP SACK handling code has in the past been extremely inefficient: there were multiple passes over a linked list holding all the sent packets, and on a large-BDP link this list can span 20,000 packets. Multiple traversals of this list take longer than the arrival interval of subsequent packets, so soon after a loss with SACK the sender's incoming queue fills up and ACKs start getting dropped. There used to be an anti-DoS mechanism that would drop all packets until the queue had emptied; with a default of 1000 packets that took a long time and could easily drop all the outstanding ACKs, resulting in a great degradation of performance. This value is set with /proc/sys/net/core/netdev_max_backlog.
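On Linux, this queue limit can be inspected and changed via sysctl (the value shown for raising it is purely illustrative; changing it requires root privileges):

```shell
# Show the current input-queue limit (defaults to 1000 packets)
cat /proc/sys/net/core/netdev_max_backlog

# Raise the limit for hosts receiving at high rates (illustrative value)
sysctl -w net.core.netdev_max_backlog=2500
```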

This situation has been slightly improved in that though there is still a 1000 packet limit for the network queue, it acts as a normal buffer i.e. packets are accepted as soon as the queue dips below 1000 again.

Kernel 2.6.19 should be the preferred kernel now for high speed networks. It is believed it has the fixes for all former major issues (at least those fixes that were accepted to the kernel), but it should be noted that some other (more minor) bugs have appeared, and will need to be fixed in future releases.

Historical Note

Selective acknowledgement schemes were known long before they were added to TCP. Noel Chiappa mentioned that Xerox' PUP protocols had them in a posting to the tcp-ip mailing list in August 1986. Vint Cerf responded with a few notes about the thinking that led to the cumulative form of the original TCP acknowledgements.

References

-- UlrichSchmid & SimonLeinen - 02 Jun 2005 - 14 Jan 2007
-- BartoszBelter - 16 Dec 2005
-- BaruchEven - 05 Jan 2006


Automatic Tuning of TCP Buffers

Note: This mechanism is sometimes referred to as "Dynamic Right-Sizing" (DRS).

The issues mentioned under "Large TCP Windows" are arguments in favor of "buffer auto-tuning", a promising but relatively new approach to better TCP performance in operating systems. See the TCP auto-tuning zoo reference for a description of some approaches.

Microsoft introduced (receive-side) buffer auto-tuning in Windows Vista. This implementation is explained in a TechNet Magazine "Cable Guy" article.

FreeBSD introduced buffer auto-tuning as part of its 7.0 release.

Mac OS X introduced buffer auto-tuning in release 10.5.

Linux auto-tuning details

Some automatic buffer tuning is implemented in Linux 2.4 (sender-side), and Linux 2.6 implements it for both the send and receive directions.

In a post to the web100-discuss mailing list, John Heffner describes the Linux 2.6.16 (March 2006) Linux implementation as follows:

For the sender, we explored separating the send buffer and retransmit queue, but this has been put on the back burner. This is a cleaner approach, but is not necessary to achieve good performance. What is currently implemented in Linux is essentially what is described in Semke '98, but without the max-min fair sharing. When memory runs out, Linux implements something more like congestion control for reducing memory. It's not clear that this is well-behaved, and I'm not aware of any literature on this. However, it's rarely used in practice.

For the receiver, we took an approach similar to DRS, but not quite the same. RTT is measured with timestamps (when available), rather than using a bounding function. This allows it to track a rise in RTT (for example, due to path change or queuing). Also, a subtle but important difference is that receiving rate is measured by the amount of data consumed by the application, not data received by TCP.

Matt Mathis reports on the end2end-interest mailing list (26/07/06):

Linux 2.6.17 now has sender and receiver side autotuning and a 4 MB DEFAULT maximum window size. Yes, by default it negotiates a TCP window scale of 7.
4 MB is sufficient to support about 100 Mb/s on a 300 ms path or 1 Gb/s on a 30 ms path, assuming you have enough data and an extremely clean (loss-less) network.

References

  • TCP auto-tuning zoo, Web page by Tom Dunegan of Oak Ridge National Laboratory, PDF
  • A Comparison of TCP Automatic Tuning Techniques for Distributed Computing, E. Weigle, W. Feng, 2002, PDF
  • Automatic TCP Buffer Tuning, J. Semke, J. Mahdavi, M. Mathis, SIGCOMM 1998, PS
  • Dynamic Right-Sizing in TCP, M. Fisk, W. Feng, Proc. of the Los Alamos Computer Science Institute Symposium, October 2001, PDF
  • Dynamic Right-Sizing in FTP (drsFTP): Enhancing Grid Performance in User-Space, M.K. Gardner, Wu-chun Feng, M. Fisk, July 2002, PDF. This paper describes an implementation of buffer-tuning at application level for FTP, i.e. outside of the kernel.
  • Socket Buffer Auto-Sizing for High-Performance Data Transfers, R. Prasad, M. Jain, C. Dovrolis, PDF
  • The Cable Guy: TCP Receive Window Auto-Tuning, J. Davies, January 2007, Microsoft TechNet Magazine
  • What's New in FreeBSD 7.0, F. Biancuzzi, A. Oppermann et al., February 2008, ONLamp
  • How to disable the TCP autotuning diagnostic tool, Microsoft Support, Article ID: 967475, February 2009

-- SimonLeinen - 04 Apr 2006 - 21 Mar 2011
-- ChrisWelti - 03 Aug 2006

 

UDP (User Datagram Protocol)

The User Datagram Protocol (UDP) is a very simple layer over the host-to-host Internet Protocol (IP). It only adds 16-bit source and destination port numbers for multiplexing between different applications on the pair of hosts, and 16-bit length and checksum fields. It has been defined in RFC 768.
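The complete UDP header is just these four 16-bit fields, eight bytes in total. A minimal Python sketch of the layout (checksum computation over the IP pseudo-header is omitted):

```python
import struct

def udp_header(src_port, dst_port, payload, checksum=0):
    """Build the 8-byte UDP header per RFC 768: source port, destination
    port, length (header + payload) and checksum, all 16-bit big-endian.
    (Computing the real checksum over the pseudo-header is omitted.)"""
    length = 8 + len(payload)
    return struct.pack("!HHHH", src_port, dst_port, length, checksum)

hdr = udp_header(53000, 53, b"example DNS query")
print(len(hdr), struct.unpack("!HHHH", hdr))
# 8-byte header; the length field is 8 + 17 = 25
```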

UDP is used directly by protocols such as the Domain Name System (DNS) or the Network Time Protocol (NTP), which consist of isolated request/response transactions between hosts, where the negotiation and maintenance of TCP connections would be prohibitive.

There are other protocols layered on top of UDP, for example the Real-time Transport Protocol (RTP) used in real-time media applications. UDT, VFER, RBUDP, Tsunami, and Hurricane are examples of UDP-based bulk transport protocols.

References

  • RFC 768, User Datagram Protocol, J. Postel, August 1980

-- SimonLeinen - 31 Oct 2004 - 02 Apr 2006

Real-Time Transport Protocol (RTP)

RTP (RFC 3550) is a generic transport protocol for real-time media streams such as audio or video. RTP is typically run over UDP. RTP's services include timestamps and identification of media types. RTP includes RTCP (RTP Control Protocol), whose primary use is to provide feedback about the quality of transmission in the form of Receiver Reports.

RTP is used as a framing protocol for many real-time audio/video applications, including those based on the ITU-T H.323 protocol and the Session Initiation Protocol (SIP).
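The fixed part of the RTP header defined in RFC 3550 is twelve bytes (version, flags, payload type, sequence number, timestamp, SSRC). A simplified Python sketch of packing it, assuming no padding, no extension and zero CSRC entries:

```python
import struct

def rtp_header(payload_type, seq, timestamp, ssrc, marker=0):
    """Pack the fixed 12-byte RTP header (RFC 3550): version 2,
    no padding, no extension, zero CSRC entries."""
    byte0 = 2 << 6                                # version = 2, P = X = CC = 0
    byte1 = (marker << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1, seq, timestamp, ssrc)

hdr = rtp_header(payload_type=0, seq=1, timestamp=160, ssrc=0x1234)
print(len(hdr))   # 12
```

The sequence number and timestamp fields are what RTCP Receiver Reports use to compute loss and interarrival jitter statistics.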

References

  • RFC 3550, RTP: A Transport Protocol for Real-Time Applications, H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson. July 2003

-- SimonLeinen - 28 Oct 2004 - 24 Nov 2007

Application Protocols

-- TobyRodwell - 28 Feb 2005

TCP (Transmission Control Protocol)

The Transmission Control Protocol (TCP) is the prevalent transport protocol used on the Internet today. It uses window-based transmission to provide the service of a reliable byte stream, and adapts the rate of transfer to the state (of congestion) of the network and the receiver. Basic mechanisms include:

  • Segments that fit into IP packets, into which the byte-stream is split by the sender,
  • a checksum for each segment,
  • a Window, which bounds the amount of data "in flight" between the sender and the receiver,
  • Acknowledgements, by which the receiver tells the sender about segments that were received successfully,
  • a flow-control mechanism, which regulates data rate as a function of network and receiver conditions.

Originally specified in September 1981 RFC 793, TCP was clarified, refined and extended in many documents. Perhaps most importantly, congestion control was introduced in "TCP Tahoe" in 1988, described in Van Jacobson's 1988 SIGCOMM article on "Congestion Avoidance and Control". It can be said that TCP's congestion control is what keeps the Internet working when links are overloaded. In today's Internet, the enhanced "Reno" variant of congestion control is probably the most widespread.

RFC 7323 (formerly RFC 1323) specifies a set of "TCP Extensions for High Performance", namely the Window Scaling Option, which provides for much larger windows than the original 64K, the Timestamp Option and the PAWS (Protection Against Wrapped Sequence numbers) mechanism. These extensions are supported by most contemporary TCP stacks, although they frequently must be activated explicitly (or implicitly by configuring large TCP windows).

Another widely implemented performance enhancement to TCP are selective acknowledgements (SACK, RFC 2018). In TCP as originally specified, the acknowledgements ("ACKs") sent by a receiver were always "cumulative", that is, they specified the last byte of the part of the stream that was completely received. This is advantageous with large TCP windows, in particular where chances are high that multiple segments within a single window are lost. RFC 2883 describes an extension to SACK which makes TCP more robust in the presence of packet reordering in the network.

In addition, RFC 3449 - TCP Performance Implications of Network Path Asymmetry provides excellent information since a vast majority of Internet connections are asymmetrical.

Further Reading

References

  • RFC 7414, A Roadmap for Transmission Control Protocol (TCP) Specification Documents, M. Duke, R. Braden, W. Eddy, E. Blanton, A. Zimmermann, February 2015
  • RFC 793, Transmission Control Protocol, J. Postel, September 1981
  • draft-ietf-tcpm-rfc793bis-10, Transmission Control Protocol Specification, Wesley M. Eddy, July 2018
  • RFC 7323, TCP Extensions for High Performance, D. Borman, B. Braden, V. Jacobson, B. Scheffenegger, September 2014 (obsoletes RFC 1323 from May 1993)
  • RFC 2018, TCP Selective Acknowledgement Options, M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, October 1996
  • RFC 5681, TCP Congestion Control, M. Allman, V. Paxson, E. Blanton, September 2009
  • RFC 3449, TCP Performance Implications of Network Path Asymmetry, H. Balakrishnan, V. Padmanabhan, G. Fairhurst, M. Sooriyabandara, December 2002

There is ample literature on TCP, in particular research literature on its performance and large-scale behaviour.

Juniper has a nice white paper, Supporting Differentiated Service Classes: TCP Congestion Control Mechanisms (PDF format) explaining TCP's congestion control as well as many of the enhancements proposed over the years.

-- Simon Leinen - 2004-10-27 - 2018-07-02 -- Ulrich Schmid - 2005-05-31

Window-Based Transmission

TCP is a sliding-window protocol. The receiver tells the sender the available buffer space at the receiver (TCP header field "window"). The total window size is the minimum of sender buffer size, advertised receiver window size and congestion window size.

The sender can transmit up to this amount of data before having to wait for a further buffer update from the receiver, and should not have more than this amount of data in transit in the network. The sender must buffer sent data until it has been ACKed by the receiver, so that it can be retransmitted if necessary. With each ACK, acknowledged segments leave the window, and new segments may be sent as long as they fit within the (possibly updated) window.

Due to TCP's flow control mechanism, the TCP window size can limit the maximum theoretical throughput regardless of the bandwidth of the network path. Using too small a TCP window degrades throughput below what the path could sustain, while too large a window can cause problems during error recovery.

The TCP window size is the most important parameter for achieving maximum throughput across high-performance networks. To reach the maximum transfer rate, the TCP window should be no smaller than the bandwidth-delay product.

Window size (bytes) >= Bandwidth (bytes/sec) x Round-trip time (sec)

Example:

window size: 8192 bytes
round-trip time: 100ms
maximum throughput: < 0.62 Mbit/sec.
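The example above is a one-line calculation. (The 0.62 figure uses binary megabits, i.e. 2^20 bits; with decimal megabits the same bound comes out at about 0.66 Mbit/s.) A small Python sketch:

```python
def max_throughput_mbps(window_bytes, rtt_seconds):
    """Window-limited TCP throughput upper bound: window / RTT,
    converted to decimal megabits per second."""
    return window_bytes * 8 / rtt_seconds / 1e6

print(round(max_throughput_mbps(8192, 0.100), 2))   # ~0.66 Mbit/s, as in the example
print(round(max_throughput_mbps(65535, 0.100), 2))  # ~5.24 Mbit/s with a full 64K window
```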

References


-- UlrichSchmid & SimonLeinen - 31 May-07 Jun 2005

Large TCP Windows

In order to achieve high data rates with TCP over "long fat networks", i.e. network paths with a large bandwidth-delay product, TCP sinks (that is, hosts receiving data transported by TCP) must advertise a large TCP receive window (referred to as just 'the window', since there is not an equivalent advertised 'send window').

The window is a 16-bit value (bytes 15 and 16 of the TCP header), and so, in TCP as originally specified, it is limited to a value of 65535 (64K). The receive window sets an upper limit on the sustained throughput achievable over a TCP connection, since it represents the maximum amount of unacknowledged data (in bytes) that can be on the TCP path. Mathematically, achievable throughput can never exceed WINDOW_SIZE/RTT, so for a trans-Atlantic link with an RTT (Round Trip Time) of, say, 150ms, throughput is limited to a maximum of 3.4Mbps. With the emergence of "long fat networks", the limit of 64K bytes (on some systems even just 32K bytes!) was clearly insufficient, and so RFC 7323 laid down (amongst other things) a way of scaling the advertised window, such that the 16-bit window value can represent numbers larger than 64K.

RFC 7323 extensions

RFC 7323, TCP Extensions for High Performance, (and formerly RFC 1323) defines several mechanisms to enable high-speed transfers over LFNs: Window Scaling, TCP Timestamps, and Protection Against Wrapped Sequence numbers (PAWS).

The TCP window scaling option increases the maximum window size from 64KB to 1GByte, by shifting the window field left by up to 14 bits. The window scale option is used only during the TCP 3-way handshake (both sides must set the window scale option in their SYN segments if scaling is to be used in either direction).
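The effective window is simply the 16-bit header field shifted left by the negotiated scale factor, as this small sketch shows:

```python
def effective_window(window_field, scale):
    """Apply the RFC 7323 window scale: shift the 16-bit advertised
    window left by the negotiated factor (at most 14)."""
    assert 0 <= window_field <= 0xFFFF and 0 <= scale <= 14
    return window_field << scale

print(effective_window(0xFFFF, 0))    # 65535, the classic maximum
print(effective_window(0xFFFF, 14))   # 1073725440, just under 1 GByte
```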

It is important to use the TCP timestamps option with large TCP windows. With the timestamps option, each segment carries a timestamp; the receiver returns that timestamp in each ACK, which allows the sender to estimate the RTT. The timestamps option also solves the problem of wrapped sequence numbers that can occur with large windows (PAWS - Protection Against Wrapped Sequences).

(Auto) Configuration

In the past, most operating systems required manual tuning to use large TCP windows. The OS-specific tuning section contains information on how to do this for a set of operating systems.

Since around 2008-2010, many popular operating systems will use large windows and the necessary protocol options by default, thanks to TCP Buffer Auto-Tuning.

Can TCP Windows ever be too large?

There are several potential issues when TCP Windows are larger than necessary:

  1. When there are many active TCP connection endpoints (sockets) on a system - such as a popular Web or file server - then a large TCP window size will lead to high consumption of system (kernel) memory. This can have a number of negative consequences: The system may run out of buffer space so that no new connections can be opened, or the high occupation of kernel memory (which typically must reside in actual RAM and cannot be "paged out" to disk) can "starve" other processes of access to fast memory (cache and RAM)
  2. Large windows can cause large "bursts" of consecutive segments/packets. When there is a bottleneck in the path - because of a slower link or because of cross-traffic - these bursts will fill up buffers in the network device (router or switch) in front of that bottleneck. The larger these bursts, the higher are the risks that this buffer overflows and causes multiple segments to be dropped. So a large window can lead to "sawtooth" behavior and worse link utilisation than with a just-big-enough window where TCP could operate at a steady rate.

Several methods for automatic TCP buffer tuning have been developed to resolve these issues, some of which have been implemented in (at least) recent Linux versions.

References

-- SimonLeinen - 27 Oct 2004 - 27 Sep 2014


TCP Acknowledgements

In TCP's sliding-window scheme, the receiver acknowledges the data it receives, so that the sender can advance the window and send new data. As originally specified, TCP's acknowledgements ("ACKs") are cumulative: the receiver tells the sender how much consecutive data it has received. More recently, selective acknowledgements were introduced to allow more fine-grained acknowledgements of received data.

Delayed Acknowledgements and "ack every other"

RFC 813 first suggested a delayed acknowledgement (Delayed ACK) strategy, where a receiver doesn't always immediately acknowledge segments as it receives them. This recommendation was carried forward and specified in more detail in RFC 1122 and RFC 5681 (formerly known as RFC 2581). RFC 5681 mandates that an acknowledgement be sent for at least every other full-size segment, and that no more than 500ms expire before any segment is acknowledged.

The resulting behavior is that, for longer transfers, acknowledgements are only sent for every two segments received ("ack every other"). This is in order to reduce the amount of reverse flow traffic (and in particular the number of small packets). For transactional (request/response) traffic, the delayed acknowledgement strategy often makes it possible to "piggy-back" acknowledgements on response data segments.

A TCP segment which is only an acknowledgement, i.e. has no payload, is termed a "pure ACK".

Delayed Acknowledgments should be taken into account when doing RTT estimation. As an illustration, see this change note for the Linux kernel from Gavin McCullagh.

Critique on Delayed ACKs

John Nagle nicely explains problems with current implementations of delayed ACK in a comment to a thread on Hacker News:

Here's how to think about that. A delayed ACK is a bet. You're betting that there will be a reply, upon which an ACK can be piggybacked, before the fixed timer runs out. If the fixed timer runs out, and you have to send an ACK as a separate message, you lost the bet. Current TCP implementations will happily lose that bet forever without turning off the ACK delay. That's just wrong.

The right answer is to track wins and losses on delayed and non-delayed ACKs. Don't turn on ACK delay unless you're sending a lot of non-delayed ACKs closely followed by packets on which the ACK could have been piggybacked. Turn it off when a delayed ACK has to be sent.

Duplicate Acknowledgements

A duplicate acknowledgement (DUPACK) is one with the same acknowledgement number as its predecessor; it signifies that the TCP receiver has received a segment newer than the one it was expecting, i.e. that it has missed a segment. The missed segment might not be lost; it might just have been re-ordered. For this reason the TCP sender will not assume data loss on the first DUPACK but (by default) on the third DUPACK, when it will, as per RFC 5681, perform a "Fast Retransmit" by sending the segment again without waiting for a timeout. DUPACKs are never delayed; they are sent immediately when the TCP receiver detects an out-of-order segment.

"Quickack mode" in Linux

The TCP implementation in Linux has a special receive-side feature that temporarily disables delayed acknowledgements/ack-every-other when the receiver thinks that the sender is in slow-start mode. This speeds up the slow-start phase when RTT is high. This "quickack mode" can also be explicitly enabled or disabled with setsockopt() using the TCP_QUICKACK option. The Linux modification has been criticized because it makes Linux' slow-start more aggressive than other TCPs' (that follow the SHOULDs in RFC 5681), without sufficient validation of its effects.

"Stretch ACKs"

Techniques such as LRO and GRO (and to some level, Delayed ACKs as well as ACKs that are lost or delayed on the path) can cause ACKs to cover many more than the two segments suggested by the historic TCP standards. Although ACKs in TCP have always been defined as "cumulative" (with the exception of SACKs), some congestion control algorithms have trouble with ACKs that are "stretched" in this way. In January 2015, a patch set was submitted to the Linux kernel network development with the goal to improve the behavior of the Reno and CUBIC congestion control algorithms when faced with stretch ACKs.

References

  • TCP Congestion Control, RFC 5681, M. Allman, V. Paxson, E. Blanton, September 2009
  • RFC 813, Window and Acknowledgement Strategy in TCP, D. Clark, July 1982
  • RFC 1122, Requirements for Internet Hosts -- Communication Layers, R. Braden (Ed.), October 1989

-- TobyRodwell - 2005-04-05
-- SimonLeinen - 2007-01-07 - 2015-01-28

 

TCP Window Scaling Option

TCP as originally specified only allows for windows up to 65536 bytes, because the window-related protocol fields all have a width of 16 bits. This prevents the use of large TCP windows, which are necessary to achieve high transmission rates over paths with a large bandwidth*delay product.

The Window Scaling Option is one of several options defined in RFC 7323, "TCP Extensions for High Performance". It allows TCP to advertise and use a window larger than 65536 bytes. In the initial handshake, each TCP speaker announces a scaling factor, which is a power of two between 2^0 (no scaling) and 2^14, to allow for an effective window of up to 2^30 bytes (one Gigabyte). Window Scaling only comes into effect when both ends of the connection advertise the option (even if just with a scaling factor of 2^0).

As explained under the large TCP windows topic, Window Scaling is typically used in connection with the TCP Timestamp option and the PAWS (Protection Against Wrapped Sequence numbers), both defined in RFC 7323.

32K limitations on some systems when Window Scaling is not used

Some systems, notably Linux, limit the usable window in the absence of Window Scaling to 32 Kilobytes. The reasoning behind this is that some (other) systems used in the past were broken by larger windows, because they erroneously interpreted the 16-bit values related to TCP windows as signed, rather than unsigned, integers.

This limitation can have the effect that users (or applications) that request window sizes between 32K and 64K, but don't have Window Scaling enabled, do not actually benefit from the desired window sizes.

The artificial limitation was finally removed from the Linux kernel sources on March 21, 2006. The old behavior (window limited to 32KB when Window Scaling isn't used) can still be enabled through a sysctl tuning parameter, in case one really wants to interoperate with TCP implementations that are broken in the way described above.

Problems with window scaling

Middlebox issues

MiddleBoxes which do not understand window scaling may cause very poor performance, as described in WindowScalingProblems. Windows Vista limits the scaling factor for HTTP transactions to 2 to avoid some of the problems, see WindowsOSSpecific.

Diagnostic issues

The Window Scaling option has an impact on how the offered window in TCP ACKs should be interpreted. This can cause problems when one is faced with incomplete packet traces that lack the initial SYN/SYN+ACK handshake where the options are negotiated: The SEQ/ACK sequence may look incorrect because scaling cannot be taken into account. One effect of this is that tcpgraph or similar analysis tools (included in tools such as WireShark) may produce incoherent results.

References

  • RFC 7323, TCP Extensions for High Performance, D. Borman, B. Braden, V. Jacobson, B. Scheffenegger, September 2014 (obsoletes RFC 1323 from May 1993)
  • git note about the Linux kernel modification that got rid of the 32K limit, R. Jones and D.S. Miller, March 2006. Can also be retrieved as a patch.

-- SimonLeinen - 21 Mar 2006 - 27 September 2014
-- PekkaSavola - 07 Nov 2006
-- AlexGall - 31 Aug 2007 (added reference to scaling limitiation in Windows Vista)

TCP Flow Control

Note: This topic describes the Reno enhancement of classical "Van Jacobson" or Tahoe congestion control. There have been many suggestions for improving this mechanism - see the topic on high-speed TCP variants.

TCP flow control and window size adjustment is mainly based on two key mechanisms: Slow Start and Additive Increase/Multiplicative Decrease (AIMD), also known as Congestion Avoidance (RFC 793 and RFC 5681).

Slow Start

To prevent a starting TCP connection from flooding the network, a Slow Start mechanism was introduced in TCP. This mechanism effectively probes to find the available bandwidth.

In addition to the window advertised by the receiver, a Congestion Window (cwnd) value is used and the effective window size is the lesser of the two. The starting value of the cwnd window is set initially to a value that has been evolving over the years, the TCP Initial Window. After each acknowledgment, the cwnd window is increased by one MSS. By this algorithm, the data rate of the sender doubles each round-trip time (RTT) interval (actually, taking into account Delayed ACKs, rate increases by 50% every RTT). For a properly implemented version of TCP this increase continues until:

  • the advertised window size is reached
  • congestion (packet loss) is detected on the connection.
  • there is no traffic waiting to take advantage of an increased window (i.e. cwnd should only grow if it needs to)

When congestion is detected, the TCP flow-control mode is changed from Slow Start to Congestion Avoidance. Note that some TCP implementations maintain cwnd in units of bytes, while others use units of full-sized segments.
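The exponential growth during Slow Start can be sketched in a few lines of Python (cwnd counted in segments; Delayed-ACK effects and the other termination conditions above are ignored for simplicity):

```python
def slow_start_rtts(initial_cwnd, ssthresh):
    """Count round trips for cwnd (in segments) to grow from
    initial_cwnd to ssthresh, given that one cwnd increment per ACK
    makes cwnd double every RTT during Slow Start."""
    cwnd, rtts = initial_cwnd, 0
    while cwnd < ssthresh:
        cwnd *= 2           # +1 MSS per ACK => doubling per RTT
        rtts += 1
    return rtts

print(slow_start_rtts(initial_cwnd=2, ssthresh=64))   # 5 RTTs: 2→4→8→16→32→64
```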

Congestion Avoidance

Once congestion is detected (through timeout and/or duplicate ACKs), the data rate is reduced in order to let the network recover.

Slow Start uses an exponential increase in window size and thus also in data rate. Congestion Avoidance uses a linear growth function (additive increase). This is achieved by introducing - in addition to the cwnd window - a slow start threshold (ssthresh).

As long as cwnd is less than ssthresh, Slow Start applies. Once ssthresh is reached, cwnd is increased by at most one segment per RTT. The cwnd window continues to open with this linear rate until a congestion event is detected.

When congestion is detected, ssthresh is set to half the cwnd (or to be strictly accurate, half the "Flight Size". This distinction is important if the implementation lets cwnd grow beyond rwnd (the receiver's declared window)). cwnd is either set to 1 if congestion was signalled by a timeout, forcing the sender to enter Slow Start, or to ssthresh if congestion was signalled by duplicate ACKs and the Fast Recovery algorithm has terminated. In either case, once the sender enters Congestion Avoidance, its rate has been reduced to half the value at the time of congestion. This multiplicative decrease causes the cwnd to close exponentially with each detected loss event.

Fast Retransmit

In Fast Retransmit, the arrival of three duplicate ACKs is interpreted as packet loss, and retransmission starts before the retransmission timer (RTO) expires.

The missing segment will be retransmitted immediately without going through the normal retransmission queue processing. This improves performance by eliminating delays that would suspend effective data flow on the link.

Fast Recovery

Fast Recovery is used to react quickly to a single packet loss. In Fast Recovery, the receipt of 3 duplicate ACKs, while taken to mean the loss of a segment, does not result in a full Slow Start: evidently later segments got through, so congestion has not stopped everything. In Fast Recovery, ssthresh is set to half of the current send window size, the missing segment is retransmitted (Fast Retransmit), and cwnd is set to ssthresh plus three segments. Each additional duplicate ACK indicates that one segment has left the network at the receiver, so cwnd is increased by one segment to allow the transmission of another segment if the new cwnd permits. When an ACK is received for new data, cwnd is reset to ssthresh, and TCP enters Congestion Avoidance mode.
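The window arithmetic of Fast Retransmit and Fast Recovery can be sketched as below. This is a simplification for illustration (real stacks track much more state), and the function names are invented:

```python
# Sketch of Fast Retransmit / Fast Recovery window arithmetic, in segments.

def on_three_dup_acks(cwnd):
    """Enter Fast Recovery: halve the window, inflate by 3 segments."""
    ssthresh = max(cwnd // 2, 2)
    return ssthresh + 3, ssthresh        # (new cwnd, new ssthresh)

def on_additional_dup_ack(cwnd):
    return cwnd + 1                      # one more segment left the network

def on_new_ack(ssthresh):
    return ssthresh                      # deflate; enter Congestion Avoidance

cwnd, ssthresh = on_three_dup_acks(20)   # cwnd=13, ssthresh=10
cwnd = on_additional_dup_ack(cwnd)       # 14
cwnd = on_new_ack(ssthresh)              # back to 10
print(cwnd, ssthresh)                    # 10 10
```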

References

-- UlrichSchmid - 07 Jun 2005 -- SimonLeinen - 27 Jan 2006 - 23 Jun 2011

High-Speed TCP Variants

There have been numerous ideas for improving TCP over the years. Some of those ideas have been adopted by mainstream operating systems (after thorough review). Recently there has been an uptick in work towards improving TCP's behavior on LongFatNetworks. The classic congestion control algorithms have been shown to limit the efficiency of network resource utilization on such paths. The various new TCP implementations change how the size of the congestion window is determined; some of them require extra feedback from the network. Generally, all such congestion control protocols are divided as follows.

An orthogonal technique for improving TCP's performance is automatic buffer tuning.

Comparative Studies

All papers about individual TCP-improvement proposals contain comparisons against older TCP to quantify the improvement. There are several studies that compare the performance of the various new TCP variants, including

Other References

  • Gigabit TCP, G. Huston, The Internet Protocol Journal, Vol. 9, No. 2, June 2006. Contains useful descriptions of many modern TCP enhancements.
  • Faster, G. Huston, June 2005. This article from ISP Column looks at various approaches to TCP transfers at very high rates.
  • Congestion Control in the RFC Series, M. Welzl, W. Eddy, July 2006, Internet-Draft (work in progress)
  • ICCRG Wiki, IRTF (Internet Research Task Force) ICCRG (Internet Congestion Control Research Group), includes bibliography on congestion control.
  • RFC 6077: Open Research Issues in Internet Congestion Control, D. Papadimitriou (Ed.), M. Welzl, M. Scharf, B. Briscoe, February 2011

Related Work

  • PFLDnet (International Workshop on Protocols for Fast Long-Distance Networks): 2003, 2004, 2005, 2006, 2007, 2009, 2010. (The 2008 pages seem to have been removed from the Net.)

-- Simon Leinen - 2005-05-28–2016-09-18 -- Orla Mc Gann - 2005-10-10 -- Chris Welti - 2008-02-25

Revision 14 - 2006-04-02 - SimonLeinen

 

TCP Window Scaling Option

TCP as originally specified only allows for windows up to 65536 bytes, because the window-related protocol fields all have a width of 16 bits. This prevents the use of large TCP windows, which are necessary to achieve high transmission rates over paths with a large bandwidth*delay product.

The Window Scaling Option is one of several options defined in RFC 7323, "TCP Extensions for High Performance". It allows TCP to advertise and use a window larger than 65536 bytes. In the initial handshake, a TCP speaker announces a scaling factor, which is a power of two between 2^0 (no scaling) and 2^14, allowing an effective window of up to 2^30 bytes (one gigabyte). Window Scaling only comes into effect when both ends of the connection advertise the option (even if only with a scaling factor of 2^0).
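The arithmetic is simple: the 16-bit window field of each segment is left-shifted by the shift count agreed in the handshake. A small sketch (the function name is ours):

```python
# Window Scaling arithmetic: effective window = 16-bit field << shift count.
# RFC 7323 limits the shift count to at most 14.

def effective_window(window_field, shift_count):
    if not 0 <= shift_count <= 14:
        raise ValueError("RFC 7323 allows shift counts 0..14 only")
    return window_field << shift_count

print(effective_window(0xFFFF, 0))    # 65535 (no scaling)
print(effective_window(0xFFFF, 14))   # 1073725440, just under 2^30 bytes
```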

As explained under the large TCP windows topic, Window Scaling is typically used in connection with the TCP Timestamp option and the PAWS (Protection Against Wrapped Sequence numbers), both defined in RFC 7323.

32K limitations on some systems when Window Scaling is not used

Some systems, notably Linux, limit the usable window in the absence of Window Scaling to 32 Kilobytes. The reasoning behind this is that some (other) systems used in the past were broken by larger windows, because they erroneously interpreted the 16-bit values related to TCP windows as signed, rather than unsigned, integers.
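The signed-integer bug can be illustrated in a couple of lines (a hypothetical reconstruction of the broken behaviour, not code from any affected system):

```python
# Why windows above 32 KB broke buggy stacks: reinterpreting the unsigned
# 16-bit window field as a signed (two's-complement) value.

import struct

def as_signed_16(window):
    return struct.unpack('>h', struct.pack('>H', window))[0]

print(as_signed_16(30000))   # 30000  - fine
print(as_signed_16(40000))   # -25536 - a "negative" window on broken stacks
```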

This limitation can have the effect that users (or applications) that request window sizes between 32K and 64K, but don't have Window Scaling enabled, do not actually benefit from the desired window sizes.

The artificial limitation was finally removed from the Linux kernel sources on March 21, 2006. The old behavior (window limited to 32KB when Window Scaling isn't used) can still be enabled through a sysctl tuning parameter, in case one really wants to interoperate with TCP implementations that are broken in the way described above.

Problems with window scaling

Middlebox issues

MiddleBoxes which do not understand window scaling may cause very poor performance, as described in WindowScalingProblems. Windows Vista limits the scaling factor for HTTP transactions to 2 to avoid some of the problems, see WindowsOSSpecific.

Diagnostic issues

The Window Scaling option has an impact on how the offered window in TCP ACKs should be interpreted. This can cause problems when one is faced with incomplete packet traces that lack the initial SYN/SYN+ACK handshake where the options are negotiated: The SEQ/ACK sequence may look incorrect because scaling cannot be taken into account. One effect of this is that tcpgraph or similar analysis tools (included in tools such as WireShark) may produce incoherent results.

References

  • RFC 7323, TCP Extensions for High Performance, D. Borman, B. Braden, V. Jacobson, B. Scheffenegger, September 2014 (obsoletes RFC 1323 from May 1993)
  • git note about the Linux kernel modification that got rid of the 32K limit, R. Jones and D.S. Miller, March 2006. Can also be retrieved as a patch.

-- SimonLeinen - 21 Mar 2006 - 27 September 2014
-- PekkaSavola - 07 Nov 2006
-- AlexGall - 31 Aug 2007 (added reference to scaling limitiation in Windows Vista)


Explicit Control Protocol (XCP)

The eXplicit Control Protocol (XCP) is a new congestion control protocol developed by Dina Katabi from MIT Computer Science & Artificial Intelligence Lab. It extracts information about congestion from routers along the path between endpoints. It is more complicated to implement than other proposed Internet congestion control protocols.

XCP-capable routers make a fair per-flow bandwidth allocation without keeping per-flow congestion state. To request a desired throughput, the sender includes a congestion header (the XCP header) between the IP and transport headers. This enables the sender to learn about the bottleneck on the path from sender to receiver in a single round trip.

To increase the congestion window of a TCP connection, the sender requires feedback from the network informing it of the maximum available throughput along the path. The routers update this information in the congestion header as the packet moves from sender to receiver. The receiver's main task is to copy the network feedback into outgoing packets belonging to the same bidirectional flow.

The congestion header consists of four fields as follows:

  • Round-Trip Time (RTT): the sender's current round-trip time estimate for the flow.
  • Throughput: the throughput currently used by the sender.
  • Delta_Throughput: the amount by which the sender would like to change its throughput. This field is updated by the routers along the path; if one of the routers sets a negative value, the sender must slow down.
  • Reverse_Feedback: the value copied from the Delta_Throughput field and returned by the receiver to the sender; it contains the maximum feedback allocation from the network.

A router that implements XCP maintains two control algorithms, executed periodically. The first of them, implemented in the congestion controller, is responsible for specifying maximum use of the outbound port. The fairness controller is responsible for fair distribution of throughput to flows sharing the link.
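The movement of the header fields through the network can be sketched as follows. Field names follow the text above; the classes, function names, and the simple "clamp" router logic are invented for illustration and are not the actual XCP controller:

```python
# Illustrative sketch of the XCP congestion header on its round trip.

from dataclasses import dataclass

INFINITY = float('inf')

@dataclass
class CongestionHeader:
    rtt: float                # sender's current RTT estimate (seconds)
    throughput: float         # sender's current rate (bit/s)
    delta_throughput: float   # requested change, reduced by routers
    reverse_feedback: float = 0.0

def router_update(hdr, allocatable_delta):
    # Each router may only shrink the sender's request, never enlarge it.
    hdr.delta_throughput = min(hdr.delta_throughput, allocatable_delta)

def receiver_echo(hdr):
    # The receiver copies the surviving allocation back toward the sender.
    hdr.reverse_feedback = hdr.delta_throughput

hdr = CongestionHeader(rtt=0.1, throughput=10e6, delta_throughput=INFINITY)
router_update(hdr, 5e6)   # the bottleneck grants at most +5 Mbit/s
router_update(hdr, 8e6)   # a less loaded router changes nothing
receiver_echo(hdr)
print(hdr.reverse_feedback)   # 5000000.0
```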

Benefits of XCP

  • it achieves fairness and converges to fairness rapidly,
  • it achieves maximum link utilization (better bottleneck link utilization),
  • it learns about the bottleneck much more rapidly,
  • it is potentially applicable to any transport protocol,
  • it is more stable over long-distance connections with large RTTs.

Deployment issues and issues with XCP

  • it has yet to be tested in real network environments,
  • it needs to define solutions for tunneling protocols,
  • it has to be supported by the routers along the path,
  • it increases the workload on routers.

References

-- Wojtek Sronek -2005-07-08 -- Simon Leinen - 2006-04-01–2016-05-30

 

Source Quench

Source Quench is an ICMP-based mechanism by which network devices inform a data sender that packets cannot be forwarded because buffers are overloaded. When such a message is received by a TCP sender, the sender should decrease its send window towards the respective destination in order to limit outgoing traffic. ICMP Source Quench usage was specified in RFC 896 - Congestion Control in IP/TCP Internetworks. The current standard, RFC 1812 - Requirements for IP Version 4 Routers, says that routers should not originate this message, and therefore Source Quench should no longer be used.

Problems with Source Quench

There are several reasons why Source Quench has fallen out of favor as a congestion control mechanism over the years:

  1. Source Quench messages can be lost on their way to the sender (due to congestion on the return path, filtering/return routing issues etc.). A congestion control mechanism should be robust against these sorts of problems.
  2. A Source Quench message carries very little information per packet, namely only that some amount of congestion was sensed at the gateway (router) that sent it.
  3. Source Quench messages, like all ICMP messages, are expensive for a router to generate. This is bad because the congestion control mechanism could contribute additional congestion, if router processing resources become a bottleneck.
  4. Source Quench messages could be abused by a malevolent third party to slow down connections, causing denial of service.

In effect, ICMP Source Quench messages are almost never generated on the Internet today, and they would be ignored almost everywhere even if they were.

References

  • Congestion Control in IP/TCP Internetworks, RFC 896, J. Nagle, January 1984
  • Something a Host Could Do with Source Quench: The Source Quench Introduced Delay (SQuID), RFC 1016, W. Prue and J. Postel, July 1987
  • Requirements for IP Version 4 Routers, RFC 1812, F. Baker (Ed.), June 1995
  • IETF-discuss message with notes on the history of Source Quench, F. Baker, Feb. 2007 - in archive

-- WojtekSronek - 05 Jul 2005

Explicit Congestion Notification (ECN)

The Internet's end-to-end rate control schemes (notably the TransmissionControlProtocol (TCP)) traditionally have to rely on packet loss as the prime indicator of congestion (queueing delay is also used for congestion feedback, although mostly implicitly, except in newer mechanisms such as TCP FAST).

Both loss and delay are implicit signals of congestion. The alternative is to send explicit congestion signals.

ICMP Source Quench

Such a signal was indeed part of the original IP design, in the form of the "Source Quench" ICMP (Internet Control Message Protocol) message. A router experiencing congestion could send ICMP Source Quench messages towards a source, to tell that source to send more slowly. This method was deprecated by RFC 1812 ("Requirements for IP Version 4 Routers") in June 1995. The biggest problems with ICMP Source Quench are:

  • that the mechanism causes more traffic to be generated in a situation of congestion (although in the other direction)
  • that, when ICMP Source Quench messages are lost, it fails to slow down the sender.

The ECN Proposal

The new ECN mechanism consists of two components:

  • Two new "ECN" bits in the former TOS field of the IP header:
    • The "ECN-Capable Transport" (ECT) bit must only be set for packets controlled by ECN-aware transports
    • The "Congestion Experienced" (CE) bit can be set by a router if
      • the router has detected congestion on the outgoing link
      • and the ECT bit is set.

  • Transport-specific protocol extensions which communicate the ECN signal back from the receiver to the sender. For TransmissionControlProtocol, this takes the form of two new flags in the TCP header, ECN-Echo (ECE) and Congestion Window Reduced (CWR). A similar mechanism has been included in SCTP.

ECN works as follows. When a transport supports ECN, it sends IP packets with ECT (ECN-Capable Transport) set. Then, when there is congestion, a router will set the CE (Congestion Experienced) bit in some of these packets. The receiver notices this, and sends a signal back to the sender (by setting the ECE flag). The sender then reduces its sending rate, as if it had detected the loss of a single packet, and sets the CWR flag so as to inform the receiver of this action.
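The signal flow above can be condensed into a short sketch. Codepoint and flag names follow RFC 3168; the functions themselves are invented for illustration:

```python
# Minimal sketch of the ECN signalling round trip described in the text.

def sender_marks(ecn_capable):
    # An ECN-aware transport sends its packets marked ECT.
    return 'ECT' if ecn_capable else 'Not-ECT'

def router_forwards(codepoint, congested):
    # A congested router may mark, instead of drop, ECN-capable packets.
    if congested and codepoint == 'ECT':
        return 'CE'
    return codepoint

def receiver_flags(codepoint):
    # On seeing CE, the receiver sets ECE in its ACKs (until it sees CWR).
    return {'ECE': codepoint == 'CE'}

pkt = sender_marks(ecn_capable=True)
pkt = router_forwards(pkt, congested=True)
print(receiver_flags(pkt))   # {'ECE': True}
```

A non-ECN-capable packet would simply be dropped at the congested router instead of being marked.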

(Note that the two-bit ECN field in the IP header has been redefined in the current ECN RFC (RFC 3168), so that "ECT" and "CE" are no longer actual bits. But the old definition is somewhat easier to understand. If you want to know how these "conceptual" bits are encoded, please check out RFC 3168.)

Benefits of ECN

ECN provides two significant benefits:

  • ECN-aware transports can properly adapt their rates to congestion without requiring packet loss
  • Congestion feedback can be quicker with ECN, because detecting a dropped packet may require waiting for a retransmission timeout.

For more detailed information, see the Internet-Draft The Benefits of using Explicit Congestion Notification (ECN) mentioned in the References below.

Deployment Issues with ECN

ECN requires AQM, which isn't widely deployed

ECN requires routers to use an Active Queue Management (AQM) mechanism such as Random Early Detection (RED). In addition, routers have to be able to mark eligible (ECT) packets with the CE bit when the AQM mechanism notices congestion. RED is widely implemented on routers today, although it is rarely activated in actual networks.

ECN must be added to routers' forwarding paths

The capability to ECN-mark packets can be added to CPU- or Network-Processor-based routing platforms relatively easily - Cisco's CPU-based routers such as the 7200/7500 routers support this with newer software, for example, but if queueing/forwarding is performed by specialized hardware (ASICs), this function has to be designed into the hardware from the start. Therefore, most of today's high-speed routers cannot easily support ECN to my knowledge.

ECN "Blackholing"

Another issue is that attempts to use ECN can cause issues with certain "middlebox" devices such as firewalls or load balancers, which break connectivity when unexpected TCP flags (or, more rarely, unexpected IP TOS values) are encountered. The original ECN RFC (RFC 2481) didn't handle this gracefully, so activating ECN on hosts that implement this version caused much frustration because of "hanging" connections. RFC 3168 proposes a mechanism to deal with ECN-unfriendly networks, but that hasn't been widely implemented yet. In particular, the Linux ECN implementation doesn't seem to implement it as of November 2007 (Linux 2.6.23).

See Floyd's ECN Problems page for more.

How to Activate ECN

On Linux hosts

ECN for both IPv4 and IPv6 is controlled by a sysctl parameter net.ipv4.tcp_ecn. According to ip-sysctl.txt, it may have one of three values: 0 means "never do ECN", 1 means "actively try to negotiate ECN", 2 means "do ECN when asked for". To activate on a running system:

echo 1 > /proc/sys/net/ipv4/tcp_ecn

To make persistent, add the following line to /etc/sysctl.conf:

net.ipv4.tcp_ecn=1

On Mac OS X

In a message from 23 March 2015 to the iccrg mailing list, Stuart Cheshire explains how to activate ECN on Mac OS X. The following commands activate it on the running system:

sysctl -w net.inet.tcp.ecn_negotiate_in=1
sysctl -w net.inet.tcp.ecn_initiate_out=1

To make it persistent, add the following to /etc/sysctl.conf:

net.inet.tcp.ecn_initiate_out=1
net.inet.tcp.ecn_negotiate_in=1

Suggested Improvements

For situations where ECN is not explicit enough, RFC 7514 (Really Explicit Congestion Notification, RECN) suggests a protocol where senders can be told to back off in increasingly explicit terms.

References

-- Simon Leinen - 2005-01-07 - 2018-03-22

Rate Control Protocol (RCP)

Developed by Nandita Dukkipati and Nick McKeown at Stanford University, RCP aims to emulate processor sharing (PS) over a broad range of operating conditions. TCP's congestion control algorithm, and most of the proposed alternatives such as ExplicitControlProtocol, try to emulate processor sharing by giving each competing flow an equal share of a bottleneck link. They emulate PS well in a static scenario when all flows are long-lived, but in scenarios where flows are short-lived, arrive randomly, and have a finite amount of data to send (as is the case in today's Internet), they do not perform as well.

In RCP a router assigns a single rate to all flows that pass through it. The router does not keep flow-state nor does it do per-packet calculations. The flow rate is picked by routers based on the current queue occupancy and the aggregate input traffic rate.

The Algorithm

The basic RCP algorithm is as follows:

  1. Every router maintains a single fair-share rate, R(t), that it offers to all flows. It updates R(t) approximately once every RTT.
  2. Every packet header carries a rate field, Rp. When transmitted by the source, Rp = infinity. When a router receives a packet, if R(t) at the router is smaller than Rp, then Rp <- R(t); otherwise it's unchanged. The destination copies Rp into the acknowledgement packets, so as to notify the source. The packet header also carries an RTT field, RTTp; where RTTp is the source's current estimate of the RTT for the flow. When a router receives a packet it uses RTTp to update its moving average of the RTT of the flows passing through it, d0.
  3. The source transmits at rate Rp, which corresponds to the smallest offered rate along the path.
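Step 2 of the algorithm above reduces, for the rate field, to taking the minimum offered rate along the path. A tiny sketch (function name ours):

```python
# Sketch of the Rp clamping in RCP step 2: each router overwrites the
# packet's rate field Rp only if its own offered rate R(t) is smaller,
# so the source ends up with the bottleneck's offer.

def rcp_path_rate(router_rates, rp=float('inf')):
    for r_t in router_rates:
        rp = min(rp, r_t)
    return rp    # the destination echoes this value back to the source

print(rcp_path_rate([120e6, 45e6, 80e6]))   # 45000000.0, the bottleneck's offer
```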

Papers

Processor Sharing Flows in the Internet, N. Dukkipati and N. McKeown, Stanford University High Performance Networking Group Technical Report TR04-HPNG-061604, June 2004

-- OrlaMcGann - 11 Oct 2005

Revision 13 - 2006-03-29 - SimonLeinen

 

TCP (Transmission Control Protocol)

The Transmission Control Protocol (TCP) is the prevalent transport protocol used on the Internet today. It uses window-based transmission to provide the service of a reliable byte stream, and adapts the rate of transfer to the state (of congestion) of the network and the receiver. Basic mechanisms include:

  • Segments that fit into IP packets, into which the byte-stream is split by the sender,
  • a checksum for each segment,
  • a Window, which bounds the amount of data "in flight" between the sender and the receiver,
  • Acknowledgements, by which the receiver tells the sender about segments that were received successfully,
  • a flow-control mechanism, which regulates data rate as a function of network and receiver conditions.

Originally specified in September 1981 RFC 793, TCP was clarified, refined and extended in many documents. Perhaps most importantly, congestion control was introduced in "TCP Tahoe" in 1988, described in Van Jacobson's 1988 SIGCOMM article on "Congestion Avoidance and Control". It can be said that TCP's congestion control is what keeps the Internet working when links are overloaded. In today's Internet, the enhanced "Reno" variant of congestion control is probably the most widespread.

RFC 7323 (formerly RFC 1323) specifies a set of "TCP Extensions for High Performance", namely the Window Scaling Option, which provides for much larger windows than the original 64K, the Timestamp Option and the PAWS (Protection Against Wrapped Sequence numbers) mechanism. These extensions are supported by most contemporary TCP stacks, although they frequently must be activated explicitly (or implicitly by configuring large TCP windows).

Another widely implemented performance enhancement to TCP is selective acknowledgements (SACK, RFC 2018). In TCP as originally specified, the acknowledgements ("ACKs") sent by a receiver were always "cumulative", that is, they specified the last byte of the part of the stream that was completely received. With SACK, the receiver can additionally report non-contiguous blocks of data it has received, so the sender need only retransmit the segments that were actually lost. This is advantageous with large TCP windows, in particular when chances are high that multiple segments within a single window are lost. RFC 2883 describes an extension to SACK which makes TCP more robust in the presence of packet reordering in the network.

In addition, RFC 3449 - TCP Performance Implications of Network Path Asymmetry provides excellent information since a vast majority of Internet connections are asymmetrical.

Further Reading

References

  • RFC 7414, A Roadmap for Transmission Control Protocol (TCP) Specification Documents, M. Duke, R. Braden, W. Eddy, E. Blanton, A. Zimmermann, February 2015
  • RFC 793, Transmission Control Protocol, J. Postel, September 1981
  • draft-ietf-tcpm-rfc793bis-10, Transmission Control Protocol Specification, Wesley M. Eddy, July 2018
  • RFC 7323, TCP Extensions for High Performance, D. Borman, B. Braden, V. Jacobson, B. Scheffenegger, September 2014 (obsoletes RFC 1323 from May 1993)
  • RFC 2018, TCP Selective Acknowledgement Options, M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, October 1996
  • RFC 5681, TCP Congestion Control, M. Allman, V. Paxson, E. Blanton, September 2009
  • RFC 3449, TCP Performance Implications of Network Path Asymmetry, H. Balakrishnan, V. Padmanabhan, G. Fairhurst, M. Sooriyabandara, December 2002

There is ample literature on TCP, in particular research literature on its performance and large-scale behaviour.

Juniper has a nice white paper, Supporting Differentiated Service Classes: TCP Congestion Control Mechanisms (PDF format) explaining TCP's congestion control as well as many of the enhancements proposed over the years.

-- Simon Leinen - 2004-10-27 - 2018-07-02 -- Ulrich Schmid - 2005-05-31

Window-Based Transmission

TCP is a sliding-window protocol. The receiver tells the sender the available buffer space at the receiver (TCP header field "window"). The total window size is the minimum of sender buffer size, advertised receiver window size and congestion window size.

The sender can transmit up to this amount of data before having to wait for a further buffer update from the receiver, and should not have more than this amount of data in transit in the network. The sender must buffer the sent data until it has been ACKed by the receiver, so that the data can be retransmitted if necessary. With each ACK, the acknowledged segments leave the window, and new segments can be sent as long as they fit the (possibly updated) window.

Due to TCP's flow control mechanism, the TCP window size can limit the maximum achievable throughput regardless of the bandwidth of the network path. Too small a TCP window keeps throughput below what the path could deliver, while an unnecessarily large window can cause problems during error recovery.

The TCP window size is the most important parameter for achieving maximum throughput across high-performance networks. To reach the maximum transfer rate, the TCP window should be no smaller than the bandwidth-delay product.

Window size (bytes) >= Bandwidth (bytes/sec) x Round-trip time (sec)

Example:

window size: 8192 bytes
round-trip time: 100ms
maximum throughput: < 0.62 Mbit/sec.
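The example figure can be reproduced directly from the formula. A minimal sketch (the helper name is mine; it uses binary megabits, 1 Mbit = 2^20 bits, which matches the 0.62 Mbit/s figure above):

```python
def max_throughput_mbit(window_bytes: float, rtt_seconds: float) -> float:
    """Window-limited TCP throughput in Mbit/s (1 Mbit = 2**20 bits)."""
    return window_bytes * 8 / rtt_seconds / 2**20

# 8192-byte window over a 100 ms round-trip time
print(round(max_throughput_mbit(8192, 0.100), 2))  # 0.62
```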

References


-- UlrichSchmid & SimonLeinen - 31 May-07 Jun 2005

Large TCP Windows

In order to achieve high data rates with TCP over "long fat networks", i.e. network paths with a large bandwidth-delay product, TCP sinks (that is, hosts receiving data transported by TCP) must advertise a large TCP receive window (referred to as just 'the window', since there is not an equivalent advertised 'send window').

The window is a 16 bit value (bytes 15 and 16 in the TCP header) and so, in TCP as originally specified, it is limited to a value of 65535 (64K). The receive window sets an upper limit on the sustained throughput achievable over a TCP connection, since it represents the maximum amount of unacknowledged data (in bytes) there can be on the TCP path. Mathematically, achievable throughput can never be more than WINDOW_SIZE/RTT, so for a trans-Atlantic link with, say, an RTT (Round-Trip Time) of 150ms, the throughput is limited to a maximum of 3.4Mbps. With the emergence of "long fat networks", the limit of 64K bytes (on some systems even just 32K bytes!) was clearly insufficient, and so RFC 7323 laid down (amongst other things) a way of scaling the advertised window, such that the 16-bit window value can represent numbers larger than 64K.

RFC 7323 extensions

RFC 7323, TCP Extensions for High Performance, (and formerly RFC 1323) defines several mechanisms to enable high-speed transfers over LFNs: Window Scaling, TCP Timestamps, and Protection Against Wrapped Sequence numbers (PAWS).

The TCP window scaling option increases the maximum window size from 64KB to 1GB, by shifting the window field left by up to 14 bits. The window scale option is used only during the TCP 3-way handshake (both sides must set the window scale option in their SYN segments if scaling is to be used in either direction).

It is important to use the TCP timestamps option with large TCP windows. With the timestamps option, each segment contains a timestamp; the receiver returns that timestamp in each ACK, which allows the sender to estimate the RTT continuously. The timestamps are also the basis of PAWS (Protection Against Wrapped Sequences), which solves the problem of wrapped sequence numbers that can occur with large windows.

(Auto) Configuration

In the past, most operating systems required manual tuning to use large TCP windows. The OS-specific tuning section contains information on how to do this for a set of operating systems.

Since around 2008-2010, many popular operating systems will use large windows and the necessary protocol options by default, thanks to TCP Buffer Auto-Tuning.

Can TCP Windows ever be too large?

There are several potential issues when TCP Windows are larger than necessary:

  1. When there are many active TCP connection endpoints (sockets) on a system - such as a popular Web or file server - a large TCP window size leads to high consumption of system (kernel) memory. This can have a number of negative consequences: the system may run out of buffer space so that no new connections can be opened, or the high occupation of kernel memory (which typically must reside in actual RAM and cannot be "paged out" to disk) can "starve" other processes of access to fast memory (cache and RAM).
  2. Large windows can cause large "bursts" of consecutive segments/packets. When there is a bottleneck in the path - because of a slower link or because of cross-traffic - these bursts will fill up buffers in the network device (router or switch) in front of that bottleneck. The larger these bursts, the higher the risk that the buffer overflows and multiple segments are dropped. So a large window can lead to "sawtooth" behaviour and worse link utilisation than with a just-big-enough window, where TCP can operate at a steady rate.

Several methods for automatic TCP buffer tuning have been developed to resolve these issues, some of which have been implemented in (at least) recent Linux versions.

References

-- SimonLeinen - 27 Oct 2004 - 27 Sep 2014


Automatic Tuning of TCP Buffers

Note: This mechanism is sometimes referred to as "Dynamic Right-Sizing" (DRS).

The issues mentioned under "Large TCP Windows" are arguments in favor of "buffer auto-tuning", a promising but relatively new approach to better TCP performance in operating systems. See the TCP auto-tuning zoo reference for a description of some approaches.

Microsoft introduced (receive-side) buffer auto-tuning in Windows Vista. This implementation is explained in a TechNet Magazine "Cable Guy" article.

FreeBSD introduced buffer auto-tuning as part of its 7.0 release.

Mac OS X introduced buffer auto-tuning in release 10.5.

Linux auto-tuning details

Some automatic buffer tuning is implemented in Linux 2.4 (sender-side), and Linux 2.6 implements it for both the send and receive directions.

In a post to the web100-discuss mailing list, John Heffner describes the Linux 2.6.16 (March 2006) Linux implementation as follows:

For the sender, we explored separating the send buffer and retransmit queue, but this has been put on the back burner. This is a cleaner approach, but is not necessary to achieve good performance. What is currently implemented in Linux is essentially what is described in Semke '98, but without the max-min fair sharing. When memory runs out, Linux implements something more like congestion control for reducing memory. It's not clear that this is well-behaved, and I'm not aware of any literature on this. However, it's rarely used in practice.

For the receiver, we took an approach similar to DRS, but not quite the same. RTT is measured with timestamps (when available), rather than using a bounding function. This allows it to track a rise in RTT (for example, due to path change or queuing). Also, a subtle but important difference is that receiving rate is measured by the amount of data consumed by the application, not data received by TCP.

Matt Mathis reports on the end2end-interest mailing list (26/07/06):

Linux 2.6.17 now has sender and receiver side autotuning and a 4 MB DEFAULT maximum window size. Yes, by default it negotiates a TCP window scale of 7.
4 MB is sufficient to support about 100 Mb/s on a 300 ms path or 1 Gb/s on a 30 ms path, assuming you have enough data and an extremely clean (loss-less) network.
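The quoted figures check out against the window/RTT formula. A quick sketch (the helper name is mine; decimal Mbit/s):

```python
def window_limited_rate_mbps(window_bytes: int, rtt_s: float) -> float:
    """Maximum sustained rate (decimal Mbit/s) for a given window and RTT."""
    return window_bytes * 8 / rtt_s / 1e6

FOUR_MB = 4 * 2**20
print(window_limited_rate_mbps(FOUR_MB, 0.300))  # ~112 Mbit/s on a 300 ms path
print(window_limited_rate_mbps(FOUR_MB, 0.030))  # ~1119 Mbit/s on a 30 ms path
```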

References

  • TCP auto-tuning zoo, Web page by Tom Dunegan of Oak Ridge National Laboratory, PDF
  • A Comparison of TCP Automatic Tuning Techniques for Distributed Computing, E. Weigle, W. Feng, 2002, PDF
  • Automatic TCP Buffer Tuning, J. Semke, J. Mahdavi, M. Mathis, SIGCOMM 1998, PS
  • Dynamic Right-Sizing in TCP, M. Fisk, W. Feng, Proc. of the Los Alamos Computer Science Institute Symposium, October 2001, PDF
  • Dynamic Right-Sizing in FTP (drsFTP): Enhancing Grid Performance in User-Space, M.K. Gardner, Wu-chun Feng, M. Fisk, July 2002, PDF. This paper describes an implementation of buffer-tuning at application level for FTP, i.e. outside of the kernel.
  • Socket Buffer Auto-Sizing for High-Performance Data Transfers, R. Prasad, M. Jain, C. Dovrolis, PDF
  • The Cable Guy: TCP Receive Window Auto-Tuning, J. Davies, January 2007, Microsoft TechNet Magazine
  • What's New in FreeBSD 7.0, F. Biancuzzi, A. Oppermann et al., February 2008, ONLamp
  • How to disable the TCP autotuning diagnostic tool, Microsoft Support, Article ID: 967475, February 2009

-- SimonLeinen - 04 Apr 2006 - 21 Mar 2011
-- ChrisWelti - 03 Aug 2006

 

TCP Window Scaling Option

TCP as originally specified only allows for windows up to 65536 bytes, because the window-related protocol fields all have a width of 16 bits. This prevents the use of large TCP windows, which are necessary to achieve high transmission rates over paths with a large bandwidth*delay product.

The Window Scaling Option is one of several options defined in RFC 7323, "TCP Extensions for High Performance". It allows TCP to advertise and use a window larger than 65536 bytes. The way this works is that in the initial handshake, a TCP speaker announces a scaling factor, which is a power of two between 2^0 (no scaling) and 2^14, to allow for an effective window of up to 2^30 bytes (one gigabyte). Window Scaling only comes into effect when both ends of the connection advertise the option (even if just with a scaling factor of 2^0).

As explained under the large TCP windows topic, Window Scaling is typically used in connection with the TCP Timestamp option and the PAWS (Protection Against Wrapped Sequence numbers), both defined in RFC 7323.
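The shift count a receiver needs follows from the desired window size. A minimal sketch (the helper name is mine, not part of any TCP API):

```python
MAX_UNSCALED = 65535   # largest window expressible in the 16-bit header field
MAX_SHIFT = 14         # upper bound on the shift count allowed by RFC 7323

def window_scale_shift(window_bytes: int) -> int:
    """Smallest RFC 7323 shift count whose scaled 16-bit field
    can still represent the requested window."""
    shift = 0
    while window_bytes > MAX_UNSCALED << shift:
        shift += 1
    if shift > MAX_SHIFT:
        raise ValueError("window exceeds the 1 GB limit of window scaling")
    return shift

# A 1 Gbit/s x 30 ms path needs a ~3.75 MB window:
print(window_scale_shift(3_750_000))  # 6
```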

32K limitations on some systems when Window Scaling is not used

Some systems, notably Linux, limit the usable window in the absence of Window Scaling to 32 Kilobytes. The reasoning behind this is that some (other) systems used in the past were broken by larger windows, because they erroneously interpreted the 16-bit values related to TCP windows as signed, rather than unsigned, integers.

This limitation can have the effect that users (or applications) that request window sizes between 32K and 64K, but don't have Window Scaling enabled, do not actually benefit from the desired window sizes.

The artificial limitation was finally removed from the Linux kernel sources on March 21, 2006. The old behavior (window limited to 32KB when Window Scaling isn't used) can still be enabled through a sysctl tuning parameter, in case one really wants to interoperate with TCP implementations that are broken in the way described above.

Problems with window scaling

Middlebox issues

MiddleBoxes which do not understand window scaling may cause very poor performance, as described in WindowScalingProblems. Windows Vista limits the scaling factor for HTTP transactions to 2 to avoid some of the problems, see WindowsOSSpecific.

Diagnostic issues

The Window Scaling option has an impact on how the offered window in TCP ACKs should be interpreted. This can cause problems when one is faced with incomplete packet traces that lack the initial SYN/SYN+ACK handshake where the options are negotiated: The SEQ/ACK sequence may look incorrect because scaling cannot be taken into account. One effect of this is that tcpgraph or similar analysis tools (included in tools such as WireShark) may produce incoherent results.

References

  • RFC 7323, TCP Extensions for High Performance, D. Borman, B. Braden, V. Jacobson, R. Scheffenegger, September 2014 (obsoletes RFC 1323 from May 1993)
  • git note about the Linux kernel modification that got rid of the 32K limit, R. Jones and D.S. Miller, March 2006. Can also be retrieved as a patch.

-- SimonLeinen - 21 Mar 2006 - 27 September 2014
-- PekkaSavola - 07 Nov 2006
-- AlexGall - 31 Aug 2007 (added reference to scaling limitiation in Windows Vista)

TCP Flow Control

Note: This topic describes the Reno enhancement of classical "Van Jacobson" or Tahoe congestion control. There have been many suggestions for improving this mechanism - see the topic on high-speed TCP variants.

TCP flow control and window size adjustment is mainly based on two key mechanisms: Slow Start and Additive Increase/Multiplicative Decrease (AIMD), also known as Congestion Avoidance (RFC 793 and RFC 5681).

Slow Start

To prevent a starting TCP connection from flooding the network, a Slow Start mechanism was introduced in TCP. This mechanism effectively probes to find the available bandwidth.

In addition to the window advertised by the receiver, a Congestion Window (cwnd) value is used, and the effective window size is the lesser of the two. The initial value of cwnd, the TCP Initial Window, has evolved over the years. After each acknowledgment, the cwnd window is increased by one MSS. By this algorithm, the data rate of the sender doubles each round-trip time (RTT) interval (actually, taking into account Delayed ACKs, the rate increases by 50% every RTT). For a properly implemented version of TCP, this increase continues until:

  • the advertised window size is reached
  • congestion (packet loss) is detected on the connection.
  • there is no traffic waiting to take advantage of an increased window (i.e. cwnd should only grow if it needs to)

When congestion is detected, the TCP flow-control mode is changed from Slow Start to Congestion Avoidance. Note that some TCP implementations maintain cwnd in units of bytes, while others use units of full-sized segments.
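The exponential growth during Slow Start can be sketched in a few lines (a toy model with cwnd counted in segments, one increment per ACK and Delayed ACKs ignored; the function name is mine):

```python
def slow_start_rtts(initial_window: int, ssthresh: int) -> int:
    """Count the RTTs until cwnd (in segments) reaches ssthresh.
    Each of the cwnd ACKs received per RTT adds one segment,
    so cwnd doubles every round trip."""
    cwnd, rtts = initial_window, 0
    while cwnd < ssthresh:
        cwnd *= 2
        rtts += 1
    return rtts

# From a 2-segment initial window to an ssthresh of 64 segments:
print(slow_start_rtts(2, 64))  # 5
```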

Congestion Avoidance

Once congestion is detected (through timeout and/or duplicate ACKs), the data rate is reduced in order to let the network recover.

Slow Start uses an exponential increase in window size and thus also in data rate. Congestion Avoidance uses a linear growth function (additive increase). This is achieved by introducing - in addition to the cwnd window - a slow start threshold (ssthresh).

As long as cwnd is less than ssthresh, Slow Start applies. Once ssthresh is reached, cwnd is increased by at most one segment per RTT. The cwnd window continues to open with this linear rate until a congestion event is detected.

When congestion is detected, ssthresh is set to half the cwnd (or to be strictly accurate, half the "Flight Size". This distinction is important if the implementation lets cwnd grow beyond rwnd (the receiver's declared window)). cwnd is either set to 1 if congestion was signalled by a timeout, forcing the sender to enter Slow Start, or to ssthresh if congestion was signalled by duplicate ACKs and the Fast Recovery algorithm has terminated. In either case, once the sender enters Congestion Avoidance, its rate has been reduced to half the value at the time of congestion. This multiplicative decrease causes the cwnd to close exponentially with each detected loss event.

Fast Retransmit

In Fast Retransmit, the arrival of three duplicate ACKs is interpreted as packet loss, and retransmission starts before the retransmission timer (RTO) expires.

The missing segment will be retransmitted immediately without going through the normal retransmission queue processing. This improves performance by eliminating delays that would suspend effective data flow on the link.

Fast Recovery

Fast Recovery is used to react quickly to a single packet loss. In Fast Recovery, the receipt of 3 duplicate ACKs, while being taken to mean a loss of a segment, does not result in a full Slow Start. This is because later segments obviously got through, and hence congestion is not stopping everything. In Fast Recovery, ssthresh is set to half of the current send window size, the missing segment is retransmitted (Fast Retransmit), and cwnd is set to ssthresh plus three segments. Each additional duplicate ACK indicates that one segment has left the network at the receiver, and cwnd is increased by one segment to allow the transmission of another segment if allowed by the new cwnd. When an ACK is received for new data, cwnd is reset to ssthresh, and TCP enters congestion avoidance mode.
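The two reactions to a loss event described above can be condensed into a toy model (segment-counted; the function is a hypothetical illustration, not a real stack implementation):

```python
def reno_on_loss(cwnd: int, signal: str):
    """Return (new_cwnd, new_ssthresh) in segments after a loss event.
    'timeout' forces a return to Slow Start from cwnd = 1;
    'dupacks' (3 duplicate ACKs) halves the rate via Fast Recovery,
    with cwnd back at ssthresh once recovery completes."""
    ssthresh = max(cwnd // 2, 2)       # half the flight size
    if signal == "timeout":
        return 1, ssthresh
    if signal == "dupacks":
        return ssthresh, ssthresh
    raise ValueError("unknown congestion signal")

print(reno_on_loss(40, "dupacks"))   # (20, 20)
print(reno_on_loss(40, "timeout"))   # (1, 20)
```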

References

-- UlrichSchmid - 07 Jun 2005 -- SimonLeinen - 27 Jan 2006 - 23 Jun 2011

High-Speed TCP Variants

There have been numerous ideas for improving TCP over the years. Some of those ideas have been adopted by mainstream implementations (after thorough review). Recently there has been an uptick in work towards improving TCP's behavior with LongFatNetworks. It has been shown that the classic congestion control algorithms limit the efficiency of network resource utilization on such paths. The various new TCP variants differ mainly in how they adjust the size of the congestion window; some of them require extra feedback from the network. Broadly, such congestion control schemes can be divided into those that infer congestion from their own measurements (of loss or delay) and those that rely on explicit feedback from the network.

An orthogonal technique for improving TCP's performance is automatic buffer tuning.

Comparative Studies

All papers about individual TCP-improvement proposals contain comparisons against older TCP to quantify the improvement. In addition, several studies compare the performance of the various new TCP variants against each other.

Other References

  • Gigabit TCP, G. Huston, The Internet Protocol Journal, Vol. 9, No. 2, June 2006. Contains useful descriptions of many modern TCP enhancements.
  • Faster, G. Huston, June 2005. This article from ISP Column looks at various approaches to TCP transfers at very high rates.
  • Congestion Control in the RFC Series, M. Welzl, W. Eddy, July 2006, Internet-Draft (work in progress)
  • ICCRG Wiki, IRTF (Internet Research Task Force) ICCRG (Internet Congestion Control Research Group), includes bibliography on congestion control.
  • RFC 6077: Open Research Issues in Internet Congestion Control, D. Papadimitriou (Ed.), M. Welzl, M. Scharf, B. Briscoe, February 2011

Related Work

  • PFLDnet (International Workshop on Protocols for Fast Long-Distance Networks): 2003, 2004, 2005, 2006, 2007, 2009, 2010. (The 2008 pages seem to have been removed from the Net.)

-- Simon Leinen - 2005-05-28–2016-09-18 -- Orla Mc Gann - 2005-10-10 -- Chris Welti - 2008-02-25


RUDE/CRUDE

RUDE is a package of applications to generate and measure UDP traffic between two endpoints (hosts). The rude tool injects traffic into the network, which can be received and logged on the other side of the network with crude. The traffic pattern can be defined by the user. This tool is available under http://rude.sourceforge.net/

Example

The following rude script will cause a constant-rate packet stream to be sent to a destination for 4 seconds, starting one second (1000 milliseconds) from the time the script was read, stopping five seconds (5000 milliseconds) after the script was read. The script can be called using rude -s example.rude.

START NOW
1000 0030 ON 3002 10.1.1.1:10001 CONSTANT 200 250
5000 0030 OFF

Explanation of the values: 1000 and 5000 are relative timestamps (in milliseconds) for the actions. 0030 is the identification of a traffic flow. At 1000, a CONSTANT-rate flow is started (ON) with source port 3002, destination address 10.1.1.1 and destination port 10001. CONSTANT 200 250 specifies a fixed rate of 200 packets per second (pps) of 250-byte UDP datagrams.
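The offered load of this flow is easy to work out from those two parameters. A small sketch (the helper name is mine; header overhead is ignored):

```python
def udp_flow_rate_kbps(pps: int, payload_bytes: int) -> float:
    """UDP payload rate in kbit/s (ignoring IP/UDP header overhead)."""
    return pps * payload_bytes * 8 / 1000

# 200 packets/s of 250-byte datagrams, as in the CONSTANT 200 250 line:
print(udp_flow_rate_kbps(200, 250))  # 400.0
```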

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- SimonLeinen - 28 Sep 2008

TTCP

TTCP (Test TCP) is a utility for benchmarking UDP and TCP performance. A version for Unix and Windows is available at http://www.pcausa.com/Utilities/pcattcp.htm/

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005

NDT (Network Diagnostic Tool)

NDT can be used to check the TCP configuration of any host that can run Java applets. The client connects to a Web page containing a special applet. The Web server that serves the applet must run a kernel with the Web100 extensions for TCP measurement instrumentation. The applet performs TCP memory-to-memory tests between the client and the Web100-instrumented server, and then uses the measurements from the Web100 instrumentation to find out about the TCP configuration of the client. NDT also detects common configuration errors such as duplex mismatch.

Besides the applet, there is also a command-line client called web100clt that can be used without a browser. The client works on Linux without any need for Web100 extensions, and can be compiled on other Unix systems as well. It doesn't require a Web server on the server side, just the NDT server web100srv - which requires a Web100-enhanced kernel.

Applicability

Since NDT performs memory-to-memory tests, it avoids end-system bottlenecks such as file system or disk limitations. Therefore it is useful for estimating "pure" TCP throughput limitations for a system, better than, for instance, measuring the throughput of a large file retrieval from a public FTP or WWW server. In addition, NDT servers are supposed to be well-connected and lightly loaded.

When trying to tune a host for maximum throughput, it is a good idea to start testing it against an NDT server that is known to be relatively close to the host in question. Once the throughput with this server is satisfactory, one can try to further tune the host's TCP against a more distant NDT server. Several European NRENs as well as many sites in the U.S. operate NDT servers.

Example

Applet (tcp100bw.html)

This example shows the results (overview) of an NDT measurement between a client in Switzerland (Gigabit Ethernet connection) and an IPv6 NDT server in Ljubljana, Slovenia (Gigabit Ethernet connection).

Screen_shot_2011-07-25_at_18.11.44_.png

Command-line client

The following is an example run of the web100clt command-line client application. In this case, the client runs on a small Linux-based home router connected to a commercial broadband-over-cable Internet connection. The connection is marketed as "5000 kb/s downstream, 500 kb/s upstream", and it can be seen that this nominal bandwidth can in fact be achieved for an individual TCP connection.

$ ./web100clt -n ndt.switch.ch
Testing network path for configuration and performance problems  --  Using IPv6 address
Checking for Middleboxes . . . . . . . . . . . . . . . . . .  Done
checking for firewalls . . . . . . . . . . . . . . . . . . .  Done
running 10s outbound test (client to server) . . . . .  509.00 kb/s
running 10s inbound test (server to client) . . . . . . 5.07 Mb/s
Your host is connected to a Cable/DSL modem
Information [C2S]: Excessive packet queuing detected: 10.18% (local buffers)
Information [S2C]: Excessive packet queuing detected: 73.20% (local buffers)
Server 'ndt.switch.ch' is not behind a firewall. [Connection to the ephemeral port was successful]
Client is probably behind a firewall. [Connection to the ephemeral port failed]
Information: Network Middlebox is modifying MSS variable (changed to 1440)
Server IP addresses are preserved End-to-End
Client IP addresses are preserved End-to-End

Note the remark "-- Using IPv6 address". The example used the new version 3.3.12 of the NDT software, which includes support for both IPv4 and IPv6. Because both the client and the server support IPv6, the test is run over IPv6. To run an IPv4 test in this situation, one could have called the client with the -4 option, i.e. web100clt -4 -n ndt.switch.ch.

Public Servers

There are many public NDT server installations available on the Web. You should choose between the servers according to distance (in terms of RTT) and the throughput range you are interested in. The limits of your local connectivity can best be tested by connecting to a server with a fast(er) connection that is close by. If you want to tune your TCP parameters for "LFNs", use a server that is far away from you, but still reachable over a path without bottlenecks.

Development

Development of NDT is now hosted on Google Code, and as of July 2011 there is some activity. Notably, there is now documentation about the architecture as well as the detailed protocol of NDT. One of the stated goals is to allow outside parties to write compatible NDT clients without consulting the NDT source code.

References

-- SimonLeinen - 30 Jun 2005 - 22 Nov 2011


Active Measurement Tools

Active measurement injects traffic into a network to measure properties about that network.

  • ping - a simple RTT and loss measurement tool using ICMP ECHO messages
  • fping - ping variant supporting concurrent measurement to multiple destinations
  • SmokePing - nice graphical representation of RTT distribution and loss rate from periodic "pings"
  • OWAMP - a protocol and tool for one-way measurements

There are several permanent infrastructures performing active measurements:

  • HADES - the DFN-developed HADES Active Delay Evaluation System, deployed in GEANT/GEANT2 and DFN
  • RIPE TTM
  • QoSMetricsBoxes used in RENATER's Métrologie project

-- SimonLeinen - 29 Mar 2006

ping

Ping sends ICMP echo request messages to a remote host and waits for ICMP echo replies to determine the latency between those hosts. The output shows the Round Trip Time (RTT) between the host machine and the remote target. Ping is often used to determine whether a remote host is reachable. Unfortunately, it is quite common these days for ICMP traffic to be blocked by packet filters / firewalls, so a ping timing out does not necessarily mean that a host is unreachable.

(Another method for checking remote host availability is to telnet to a port that you know to be accessible, such as port 80 (HTTP) or 25 (SMTP). If the connection still times out, then the host is probably not reachable - often because of a filter/firewall silently blocking the traffic. When there is no service listening on that port, this usually results in a "Connection refused" message. If a connection is made to the remote host, then it can be ended by typing the escape character (usually Ctrl-]) and then quit.)
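This connect-based reachability check is easy to script. A minimal sketch (the function name is mine): a refused connection still proves the host is up, while only a timeout suggests a silently dropping filter.

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Crude reachability probe via a TCP connect to a well-known port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except ConnectionRefusedError:
        return True       # host answered with RST: reachable, port closed
    except OSError:
        return False      # timeout or network unreachable

print(tcp_reachable("localhost", 80))
```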

The -f flag may be specified to send ping packets as fast as they come back, or 100 times per second - whichever is more frequent. As this option can be very hard on the network, only a super-user (the root account on *NIX machines) is allowed to specify this flag in a ping command.

The -c flag specifies the number of Echo request messages sent to the remote host by ping. If this flag isn't used, then ping continues to send echo request messages until the user types CTRL-C. If the ping is cancelled after only a few messages have been sent, the RTT summary statistics displayed at the end of the ping output may be based on too few samples to be representative. To gain an accurate representation of the RTT, it is recommended to set a count of 100 pings. The MS Windows implementation of ping sends just 4 echo request messages by default.

Ping6 is the IPv6 implementation of the ping program. It works in the same way, but sends ICMPv6 Echo Request packets and waits for ICMPv6 Echo Reply packets to determine the RTT between two hosts. There are no discernible differences between the Unix implementations and the MS Windows implementation.

As with traceroute, Solaris integrates IPv4 and IPv6 functionality in a single ping binary, and provides options to select between them (-A <AF>).

See fping for a ping variant that supports concurrent measurements to multiple destinations and is easier to use in scripts.

Implementation-specific notes

-- TobyRodwell - 06 Apr 2005
-- MarcoMarletta - 15 Nov 2006 (Added the difference in size between Juniper and Cisco)

fping

fping is a ping-like program which uses the Internet Control Message Protocol (ICMP) echo request to determine if a host is up. fping is different from ping in that you can specify any number of hosts on the command line, or specify a file containing the list of hosts to ping. Instead of trying one host until it times out or replies, fping will send out a ping packet and move on to the next host in round-robin fashion. If a host replies, it is noted and removed from the list of hosts to check. If a host does not respond within a certain time limit and/or retry limit, it will be considered unreachable.

Unlike ping, fping is meant to be used in scripts and its output is easy to parse. It is often used as a probe for packet loss and round-trip time in SmokePing.

fping version 3

The original fping was written by Roland Schemers in 1992, and stopped being updated in 2002. In December 2011, David Schweikert decided to take up maintainership of fping, and increased the major release number to 3, mainly to reflect the change of maintainer. Changes from earlier versions include:

  • integration of all Debian patches, including IPv6 support (2.4b2-to-ipv6-16)
  • an optimized main loop that is claimed to bring performance close to the theoretical maximum
  • patches to improve SmokePing compatibility

The Web site location has changed to fping.org, and the source is now maintained on GitHub.

Since July 2013, there is also a Mailing List, which will be used to announce new releases.

Examples

Simple example of usage:

# fping -c 3 -s www.man.poznan.pl www.google.pl
www.google.pl     : [0], 96 bytes, 8.81 ms (8.81 avg, 0% loss)
www.man.poznan.pl : [0], 96 bytes, 37.7 ms (37.7 avg, 0% loss)
www.google.pl     : [1], 96 bytes, 8.80 ms (8.80 avg, 0% loss)
www.man.poznan.pl : [1], 96 bytes, 37.5 ms (37.6 avg, 0% loss)
www.google.pl     : [2], 96 bytes, 8.76 ms (8.79 avg, 0% loss)
www.man.poznan.pl : [2], 96 bytes, 37.5 ms (37.6 avg, 0% loss)

www.man.poznan.pl : xmt/rcv/%loss = 3/3/0%, min/avg/max = 37.5/37.6/37.7
www.google.pl     : xmt/rcv/%loss = 3/3/0%, min/avg/max = 8.76/8.79/8.81

       2 targets
       2 alive
       0 unreachable
       0 unknown addresses

       0 timeouts (waiting for response)
       6 ICMP Echos sent
       6 ICMP Echo Replies received
       0 other ICMP received

 8.76 ms (min round trip time)
 23.1 ms (avg round trip time)
 37.7 ms (max round trip time)
        2.039 sec (elapsed real time)
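As an illustration of the script-friendliness mentioned above, the per-target summary lines in this output can be picked apart with a short regular expression (a sketch based on the format shown here, not an official parser; names are mine):

```python
import re

# Matches lines like:
# www.google.pl     : xmt/rcv/%loss = 3/3/0%, min/avg/max = 8.76/8.79/8.81
SUMMARY = re.compile(
    r"^(?P<host>\S+)\s*:\s*xmt/rcv/%loss = (?P<xmt>\d+)/(?P<rcv>\d+)/(?P<loss>\d+)%"
    r", min/avg/max = (?P<min>[\d.]+)/(?P<avg>[\d.]+)/(?P<max>[\d.]+)"
)

def parse_summary(line: str):
    """Return (host, loss percent, average RTT in ms), or None if no match."""
    m = SUMMARY.match(line)
    if not m:
        return None
    d = m.groupdict()
    return d["host"], int(d["loss"]), float(d["avg"])

line = "www.google.pl     : xmt/rcv/%loss = 3/3/0%, min/avg/max = 8.76/8.79/8.81"
print(parse_summary(line))  # ('www.google.pl', 0, 8.79)
```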

IPv6 Support

Jeroen Massar has added IPv6 support to fping. This has been implemented as a compile-time variant, so that there are separate fping (for IPv4) and fping6 (for IPv6) binaries. The IPv6 patch has been partially integrated into the fping version on www.fping.com as of release "2.4b2_to-ipv6" (thus also integrated in fping 3.0). Unfortunately his modifications to the build routine seem to have been lost in the integration, so that the fping.com version only installs the IPv6 version as fping. Jeroen's original version doesn't have this problem, and can be downloaded from his IPv6 Web page.

ICMP Sequence Number handling

Older versions of fping used the Sequence Number field in ICMP ECHO requests in a peculiar way: they used a different sequence number for each destination host, but the same sequence number for all requests to a specific host. There have been reports of specific systems that suppress (or rate-limit) ICMP ECHO requests with repeated sequence numbers, which causes high loss rates to be reported by tools that use fping, such as SmokePing. Another issue is that fping could not distinguish a perfect link from one that drops every other packet and duplicates every other packet.

Newer fping versions such as 3.0 or 2.4.2b2_to (on Debian GNU/Linux) include a change to sequence number handling attributed to Stephan Fuhrmann. These versions increment sequence numbers for every probe sent, which should solve both of these problems.

References

-- BartoszBelter - 2005-07-14 - 2005-07-26
-- SimonLeinen - 2008-05-19 - 2013-07-26

OWAMP (One-Way Active Measurement Tool)

OWAMP is a command-line client application and a policy daemon used to determine one-way latencies between hosts. It is an implementation of the OWAMP protocol (see references), which was standardized by the IPPM WG in the IETF as RFC 4656.

With roundtrip-based measurements, it is hard to isolate the direction in which congestion is experienced. One-way measurements solve this problem and make the direction of congestion immediately apparent. Since traffic can be asymmetric at many sites that are primarily producers or consumers of data, this allows for more informative measurements. One-way measurements allow the user to better isolate the effects of specific parts of a network on the treatment of traffic.

The current OWAMP implementation (V3.1) supports IPv4 and IPv6.

Example using owping (OWAMP v3.1) on a Linux host:

welti@atitlan:~$ owping ezmp3
Approximately 13.0 seconds until results available

--- owping statistics from [2001:620:0:114:21b:78ff:fe30:2974]:52887 to [ezmp3]:41530 ---
SID:    feae18e9cd32e3ee5806d0d490df41bd
first:  2009-02-03T16:40:31.678
last:   2009-02-03T16:40:40.894
100 sent, 0 lost (0.000%), 0 duplicates
one-way delay min/median/max = 2.4/2.5/3.39 ms, (err=0.43 ms)
one-way jitter = 0 ms (P95-P50)
TTL not reported
no reordering


--- owping statistics from [ezmp3]:41531 to [2001:620:0:114:21b:78ff:fe30:2974]:42315 ---
SID:    fe302974cd32e3ee7ea1a5004fddc6ff
first:  2009-02-03T16:40:31.479
last:   2009-02-03T16:40:40.685
100 sent, 0 lost (0.000%), 0 duplicates
one-way delay min/median/max = 1.99/2.1/2.06 ms, (err=0.43 ms)
one-way jitter = 0 ms (P95-P50)
TTL not reported
no reordering

References

-- ChrisWelti - 03 Apr 2006 - 03 Feb 2009 -- SimonLeinen - 29 Mar 2006 - 01 Dec 2008

 

Active Measurement Boxes

-- FrancoisXavierAndreu & SimonLeinen - 06 Jun 2005 - 06 Jul 2005

RIPE TTM Boxes

Summary

The Test Traffic Measurements service has been offered by the RIPE NCC as a service since the year 2000. The system continuously records one-way delay and packet-loss measurements as well as router-level paths ("traceroutes") between a large set of probe machines ("test boxes"). The test boxes are hosted by many different organisations, including NRENs, commercial ISPs, universities and others, and usually maintained by RIPE NCC as part of the service. While the vast majority of the test boxes is in Europe, there are a couple of machines in other continents, including outside the RIPE NCC's service area. Every test box includes a precise time source - typically a GPS receiver - so that accurate one-way delay measurements can be provided. Measurement data are entered in a central database at RIPE NCC's premises every night. The database is based on CERN's ROOT system. Measurement results can be retrieved over the Web using various presentations, both pre-generated and "on-demand".

Applicability

RIPE TTM data is often useful for finding "historical" quality information about a network path of interest, provided that TTM test boxes are deployed near (in the topological sense, not just geographically) the ends of the path. For research network paths throughout Europe, the coverage of the RIPE TTM infrastructure is not complete, but quite comprehensive. When one suspects (non-temporary) congestion on a path covered by the RIPE TTM measurement infrastructure, the TTM graphs can easily confirm this, because such congestion will show up as increased delay and/or loss.

The RIPE TTM system can also be used by operators to precisely measure - but only after the fact - the impact of changes in the network. These changes can include disturbances such as network failures or misconfigurations, but also things such as link upgrades or routing improvements.

Notes

The TTM project has provided extremely high-quality and stable delay and loss measurements for a large set of network paths (IPv4 and recently also IPv6) throughout Europe. These paths cover an interesting mix of research and commercial networks. The Web interface to the collected measurements supports in-depth exploration quite well, such as looking at the delay/loss evolution of a specific path over both a wide range of intervals from the very short to the very long. On the other hand, it is hard to get at useful "overview" pictures. The RIPE NCC's DNS Server Monitoring (DNSMON) service provides such overviews for specific sets of locations.

The raw data collected by the RIPE TTM infrastructure is not generally available, in part because of restrictions on data disclosure. This is understandable in that RIPE NCC's main member/customer base consists of commercial ISPs, who presumably wouldn't allow full disclosure for competitive reasons. But it also means that in practice, only the RIPE NCC can develop new tools that make use of these data, and their resources for doing so are limited, in part due to lack of support from their ISP membership. On a positive note, the RIPE NCC has, on several occasions, given scientists access to the measurement infrastructure for research. The TTM team also offers to extract raw data (in ROOT format) from the TTM database manually on demand.

Another drawback of RIPE TTM is that the central TTM database is only updated once a day. This makes the system impossible to use for near-real-time diagnostic purposes. (For test box hosts, it is possible to access the data collected by one's hosted test boxes in near real time, but this provides only very limited functionality, because only information about "inbound" packets is available locally, and therefore one can neither correlate delays with path changes, nor can one compute loss figures without access to the sending host's data.) In contrast, RIPE NCC's Routing Information Service (RIS) infrastructure provides several ways to get at the collected data almost as it is collected. And because RIS' raw data is openly available in a well-defined and relatively easy-to-use format, it is possible for third parties to develop innovative and useful ways to look at that data.

Examples

The following screenshot shows an on-demand graph of the delays over a 24-hour period from one test box hosted by SWITCH (tt85 at ETH Zurich) to the other one (tt86 at the University of Geneva). The delay range has been narrowed so that the fine-grained distribution can be seen. Five packets are listed as "overflow" under "STATISTICS - Delay & Hops" because their one-way delays exceeded the range of the graph. This is an example of an almost completely uncongested path, showing no packet loss and the vast majority of delays very close to the "baseline" value.

ripe-ttm-sample-delay-on-demand.png

The following plot shows the impact of a routing change on a path to a test box in New Zealand (tt47). The hop count decreased by one, but the base one-way delay increased by about 20ms. The old and new routes can be retrieved from the Web interface too. Note also that the delay values are spread over a fairly large range, which indicates some congestion (and thus queueing) on the path. The loss rate over the entire day is 1.1%. An interesting question is how much of this is due to the routing change and how much (if any) to "normal" congestion. The lower-right graph shows "arrived/lost" packets over time and indicates that most packets were in fact lost during the path change - so although some congestion is visible in the delays, it is not severe enough to cause visible loss. On the other hand, given the total number of probe packets sent (about 2850 per day), congestion loss would probably go unnoticed until it reached a level of 0.1% or so.

day.tt85.tt47.gif

References

-- FrancoisXavierAndreu & SimonLeinen - 06 Jun 2005 - 06 Jul 2005

Revision 12 (2006-03-29) - SimonLeinen


TCP (Transmission Control Protocol)

The Transmission Control Protocol (TCP) is the prevalent transport protocol used on the Internet today. It uses window-based transmission to provide the service of a reliable byte stream, and adapts the rate of transfer to the state (of congestion) of the network and the receiver. Basic mechanisms include:

  • Segments that fit into IP packets, into which the byte-stream is split by the sender,
  • a checksum for each segment,
  • a Window, which bounds the amount of data "in flight" between the sender and the receiver,
  • Acknowledgements, by which the receiver tells the sender about segments that were received successfully,
  • a flow-control mechanism, which regulates data rate as a function of network and receiver conditions.

Originally specified in RFC 793 (September 1981), TCP was clarified, refined and extended in many documents. Perhaps most importantly, congestion control was introduced with "TCP Tahoe" in 1988, described in Van Jacobson's 1988 SIGCOMM article "Congestion Avoidance and Control". It can be said that TCP's congestion control is what keeps the Internet working when links are overloaded. In today's Internet, the enhanced "Reno" variant of congestion control is probably the most widespread.

RFC 7323 (formerly RFC 1323) specifies a set of "TCP Extensions for High Performance", namely the Window Scaling Option, which provides for much larger windows than the original 64K, the Timestamp Option and the PAWS (Protection Against Wrapped Sequence numbers) mechanism. These extensions are supported by most contemporary TCP stacks, although they frequently must be activated explicitly (or implicitly by configuring large TCP windows).

Another widely implemented performance enhancement to TCP is selective acknowledgements (SACK, RFC 2018). In TCP as originally specified, the acknowledgements ("ACKs") sent by a receiver were always "cumulative", that is, they specified the last byte of the part of the stream that was completely received. SACK allows a receiver to additionally acknowledge non-contiguous blocks of received data, which is advantageous with large TCP windows, in particular when chances are high that multiple segments within a single window are lost. RFC 2883 describes an extension to SACK which makes TCP more robust in the presence of packet reordering in the network.

In addition, RFC 3449, TCP Performance Implications of Network Path Asymmetry, provides useful guidance, since the vast majority of Internet connections are asymmetric.

Further Reading

References

  • RFC 7414, A Roadmap for Transmission Control Protocol (TCP) Specification Documents, M. Duke, R. Braden, W. Eddy, E. Blanton, A. Zimmermann, February 2015
  • RFC 793, Transmission Control Protocol, J. Postel, September 1981
  • draft-ietf-tcpm-rfc793bis-10, Transmission Control Protocol Specification, Wesley M. Eddy, July 2018
  • RFC 7323, TCP Extensions for High Performance, D. Borman, B. Braden, V. Jacobson, B. Scheffenegger, September 2014 (obsoletes RFC 1323 from May 1993)
  • RFC 2018, TCP Selective Acknowledgement Options, M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, October 1996
  • RFC 5681, TCP Congestion Control, M. Allman, V. Paxson, E. Blanton, September 2009
  • RFC 3449, TCP Performance Implications of Network Path Asymmetry, H. Balakrishnan, V. Padmanabhan, G. Fairhurst, M. Sooriyabandara, December 2002

There is ample literature on TCP, in particular research literature on its performance and large-scale behaviour.

Juniper has a nice white paper, Supporting Differentiated Service Classes: TCP Congestion Control Mechanisms (PDF format) explaining TCP's congestion control as well as many of the enhancements proposed over the years.

-- Simon Leinen - 2004-10-27 - 2018-07-02 -- Ulrich Schmid - 2005-05-31

Window-Based Transmission

TCP is a sliding-window protocol. The receiver tells the sender the available buffer space at the receiver (TCP header field "window"). The total window size is the minimum of sender buffer size, advertised receiver window size and congestion window size.

The sender may transmit up to this amount of data before having to wait for a further window update from the receiver, and should not have more than this amount of data in transit in the network. The sender must buffer the sent data until it has been ACKed by the receiver, so that it can be retransmitted if necessary. Each ACK moves the left edge of the window forward, and new segments may be sent as long as they fit within the (possibly updated) window.

Due to TCP's flow control mechanism, the TCP window size can limit the maximum theoretical throughput regardless of the bandwidth of the network path. Too small a TCP window degrades throughput below what the path could sustain, while an excessively large window can cause problems during error recovery.

The TCP window size is the most important parameter for achieving maximum throughput across high-performance networks. To reach the maximum transfer rate, the TCP window should be no smaller than the bandwidth-delay product.

Window size (bytes) >= Bandwidth (bytes/sec) x Round-trip time (sec)

Example:

window size: 8192 bytes
round-trip time: 100ms
maximum throughput: < 0.62 Mbit/sec.
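The arithmetic behind this example can be sketched in a few lines (the helper name is illustrative, not from any tool):

```python
# Sustained TCP throughput is bounded by window size divided by RTT.
def max_throughput_mbps(window_bytes, rtt_s):
    """Upper bound on throughput, in decimal megabits per second."""
    return window_bytes * 8 / rtt_s / 1e6

print(max_throughput_mbps(8192, 0.100))   # 0.65536 (about 0.62 Mbit/s
                                          # when counting 2**20-bit megabits)

# Conversely, the window needed to fill 1 Gbit/s at 30 ms RTT
# (the bandwidth-delay product) is 1e9 / 8 * 0.030 = 3.75 MB.
print(1e9 / 8 * 0.030 / 1e6, "MB")        # 3.75 MB
```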

References


-- UlrichSchmid & SimonLeinen - 31 May-07 Jun 2005

Large TCP Windows

In order to achieve high data rates with TCP over "long fat networks", i.e. network paths with a large bandwidth-delay product, TCP sinks (that is, hosts receiving data transported by TCP) must advertise a large TCP receive window (referred to as just 'the window', since there is not an equivalent advertised 'send window').

The window is a 16-bit value (bytes 15 and 16 in the TCP header), so in TCP as originally specified it is limited to 65535 bytes (64K). The receive window sets an upper limit on the sustained throughput achievable over a TCP connection, since it represents the maximum amount of unacknowledged data (in bytes) there can be on the TCP path. Mathematically, achievable throughput can never exceed WINDOW_SIZE/RTT, so for a trans-Atlantic link with an RTT (Round-Trip Time) of, say, 150ms, throughput is limited to a maximum of 3.4Mbps. With the emergence of "long fat networks", the limit of 64K bytes (on some systems even just 32K bytes!) was clearly insufficient, and so RFC 7323 laid down (amongst other things) a way of scaling the advertised window, such that the 16-bit window value can represent numbers larger than 64K.

RFC 7323 extensions

RFC 7323, TCP Extensions for High Performance, (and formerly RFC 1323) defines several mechanisms to enable high-speed transfers over LFNs: Window Scaling, TCP Timestamps, and Protection Against Wrapped Sequence numbers (PAWS).

The TCP window scaling option increases the maximum window size from 64KB to 1GByte, by shifting the window field left by up to 14 bits. The window scale option is exchanged only during the TCP 3-way handshake (both sides must set the window scale option in their SYN segments if scaling is to be used in either direction).
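The scaling arithmetic is a simple left shift; a minimal sketch (function name illustrative):

```python
# The advertised 16-bit window is shifted left by the negotiated scale
# factor (0..14), giving effective windows up to just under 2**30 bytes.
def effective_window(advertised, scale):
    assert 0 <= advertised <= 0xFFFF and 0 <= scale <= 14
    return advertised << scale

print(effective_window(0xFFFF, 0))    # 65535 - no scaling
print(effective_window(0xFFFF, 14))   # 1073725440 - just under 1 GByte
```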

It is important to use the TCP timestamps option with large TCP windows. With this option, each segment carries a timestamp, which the receiver returns in its ACKs; this allows the sender to estimate the RTT continuously. The timestamps also solve the problem of wrapped sequence numbers that can occur with large windows (PAWS - Protection Against Wrapped Sequences).

(Auto) Configuration

In the past, most operating systems required manual tuning to use large TCP windows. The OS-specific tuning section contains information on how to do this for a set of operating systems.

Since around 2008-2010, many popular operating systems will use large windows and the necessary protocol options by default, thanks to TCP Buffer Auto-Tuning.

Can TCP Windows ever be too large?

There are several potential issues when TCP Windows are larger than necessary:

  1. When there are many active TCP connection endpoints (sockets) on a system - such as a popular Web or file server - then a large TCP window size will lead to high consumption of system (kernel) memory. This can have a number of negative consequences: the system may run out of buffer space so that no new connections can be opened, or the high occupation of kernel memory (which typically must reside in actual RAM and cannot be "paged out" to disk) can "starve" other processes of access to fast memory (cache and RAM).
  2. Large windows can cause large "bursts" of consecutive segments/packets. When there is a bottleneck in the path - because of a slower link or because of cross-traffic - these bursts will fill up buffers in the network device (router or switch) in front of that bottleneck. The larger these bursts, the higher are the risks that this buffer overflows and causes multiple segments to be dropped. So a large window can lead to "sawtooth" behavior and worse link utilisation than with a just-big-enough window where TCP could operate at a steady rate.

Several methods for automatic TCP buffer tuning have been developed to resolve these issues, some of which have been implemented in (at least) recent Linux versions.

References

-- SimonLeinen - 27 Oct 2004 - 27 Sep 2014


Automatic Tuning of TCP Buffers

Note: This mechanism is sometimes referred to as "Dynamic Right-Sizing" (DRS).

The issues mentioned under "Large TCP Windows" are arguments in favor of "buffer auto-tuning", a promising but relatively new approach to better TCP performance in operating systems. See the TCP auto-tuning zoo reference for a description of some approaches.

Microsoft introduced (receive-side) buffer auto-tuning in Windows Vista. This implementation is explained in a TechNet Magazine "Cable Guy" article.

FreeBSD introduced buffer auto-tuning as part of its 7.0 release.

Mac OS X introduced buffer auto-tuning in release 10.5.

Linux auto-tuning details

Some automatic buffer tuning is implemented in Linux 2.4 (sender-side), and Linux 2.6 implements it for both the send and receive directions.

In a post to the web100-discuss mailing list, John Heffner describes the Linux 2.6.16 (March 2006) Linux implementation as follows:

For the sender, we explored separating the send buffer and retransmit queue, but this has been put on the back burner. This is a cleaner approach, but is not necessary to achieve good performance. What is currently implemented in Linux is essentially what is described in Semke '98, but without the max-min fair sharing. When memory runs out, Linux implements something more like congestion control for reducing memory. It's not clear that this is well-behaved, and I'm not aware of any literature on this. However, it's rarely used in practice.

For the receiver, we took an approach similar to DRS, but not quite the same. RTT is measured with timestamps (when available), rather than using a bounding function. This allows it to track a rise in RTT (for example, due to path change or queuing). Also, a subtle but important difference is that receiving rate is measured by the amount of data consumed by the application, not data received by TCP.

Matt Mathis reports on the end2end-interest mailing list (26/07/06):

Linux 2.6.17 now has sender and receiver side autotuning and a 4 MB DEFAULT maximum window size. Yes, by default it negotiates a TCP window scale of 7.
4 MB is sufficient to support about 100 Mb/s on a 300 ms path or 1 Gb/s on a 30 ms path, assuming you have enough data and an extremely clean (loss-less) network.

References

  • TCP auto-tuning zoo, Web page by Tom Dunegan of Oak Ridge National Laboratory, PDF
  • A Comparison of TCP Automatic Tuning Techniques for Distributed Computing, E. Weigle, W. Feng, 2002, PDF
  • Automatic TCP Buffer Tuning, J. Semke, J. Mahdavi, M. Mathis, SIGCOMM 1998, PS
  • Dynamic Right-Sizing in TCP, M. Fisk, W. Feng, Proc. of the Los Alamos Computer Science Institute Symposium, October 2001, PDF
  • Dynamic Right-Sizing in FTP (drsFTP): Enhancing Grid Performance in User-Space, M.K. Gardner, Wu-chun Feng, M. Fisk, July 2002, PDF. This paper describes an implementation of buffer-tuning at application level for FTP, i.e. outside of the kernel.
  • Socket Buffer Auto-Sizing for High-Performance Data Transfers, R. Prasad, M. Jain, C. Dovrolis, PDF
  • The Cable Guy: TCP Receive Window Auto-Tuning, J. Davies, January 2007, Microsoft TechNet Magazine
  • What's New in FreeBSD 7.0, F. Biancuzzi, A. Oppermann et al., February 2008, ONLamp
  • How to disable the TCP autotuning diagnostic tool, Microsoft Support, Article ID: 967475, February 2009

-- SimonLeinen - 04 Apr 2006 - 21 Mar 2011
-- ChrisWelti - 03 Aug 2006

 

TCP Window Scaling Option

TCP as originally specified only allows for windows up to 65536 bytes, because the window-related protocol fields all have a width of 16 bits. This prevents the use of large TCP windows, which are necessary to achieve high transmission rates over paths with a large bandwidth*delay product.

The Window Scaling Option is one of several options defined in RFC 7323, "TCP Extensions for High Performance". It allows TCP to advertise and use a window larger than 65536 bytes. In the initial handshake, a TCP speaker announces a scaling factor, a power of two between 2^0 (no scaling) and 2^14, which allows an effective window of up to 2^30 bytes (one Gigabyte). Window Scaling only comes into effect when both ends of the connection advertise the option (even if just with a scaling factor of 2^0).

As explained under the large TCP windows topic, Window Scaling is typically used in connection with the TCP Timestamp option and the PAWS (Protection Against Wrapped Sequence numbers), both defined in RFC 7323.

32K limitations on some systems when Window Scaling is not used

Some systems, notably Linux, limit the usable window in the absence of Window Scaling to 32 Kilobytes. The reasoning behind this is that some (other) systems used in the past were broken by larger windows, because they erroneously interpreted the 16-bit values related to TCP windows as signed, rather than unsigned, integers.

This limitation can have the effect that users (or applications) that request window sizes between 32K and 64K, but don't have Window Scaling enabled, do not actually benefit from the desired window sizes.

The artificial limitation was finally removed from the Linux kernel sources on March 21, 2006. The old behavior (window limited to 32KB when Window Scaling isn't used) can still be enabled through a sysctl tuning parameter, in case one really wants to interoperate with TCP implementations that are broken in the way described above.

Problems with window scaling

Middlebox issues

MiddleBoxes which do not understand window scaling may cause very poor performance, as described in WindowScalingProblems. Windows Vista limits the scaling factor for HTTP transactions to 2 to avoid some of the problems, see WindowsOSSpecific.

Diagnostic issues

The Window Scaling option affects how the offered window in TCP ACKs must be interpreted. This can cause problems when one is faced with incomplete packet traces that lack the initial SYN/SYN+ACK handshake where the options are negotiated: the SEQ/ACK sequence may look incorrect because scaling cannot be taken into account. One effect of this is that tcpgraph or similar analysis tools (such as those included in Wireshark) may produce incoherent results.

References

  • RFC 7323, TCP Extensions for High Performance, D. Borman, B. Braden, V. Jacobson, B. Scheffenegger, September 2014 (obsoletes RFC 1323 from May 1993)
  • git note about the Linux kernel modification that got rid of the 32K limit, R. Jones and D.S. Miller, March 2006. Can also be retrieved as a patch.

-- SimonLeinen - 21 Mar 2006 - 27 September 2014
-- PekkaSavola - 07 Nov 2006
-- AlexGall - 31 Aug 2007 (added reference to scaling limitiation in Windows Vista)

TCP Flow Control

Note: This topic describes the Reno enhancement of classical "Van Jacobson" or Tahoe congestion control. There have been many suggestions for improving this mechanism - see the topic on high-speed TCP variants.

TCP flow control and window size adjustment is mainly based on two key mechanisms: Slow Start and Additive Increase/Multiplicative Decrease (AIMD), also known as Congestion Avoidance (RFC 793 and RFC 5681).

Slow Start

To prevent a starting TCP connection from flooding the network, a Slow Start mechanism was introduced in TCP. This mechanism effectively probes for the available bandwidth.

In addition to the window advertised by the receiver, a Congestion Window (cwnd) value is used, and the effective window size is the lesser of the two. The initial value of cwnd, the TCP Initial Window, has evolved over the years. After each acknowledgment, the cwnd window is increased by one MSS. With this algorithm, the data rate of the sender doubles each round-trip time (RTT) interval (actually, taking into account Delayed ACKs, the rate increases by 50% every RTT). For a properly implemented version of TCP this increase continues until:

  • the advertised window size is reached
  • congestion (packet loss) is detected on the connection.
  • there is no traffic waiting to take advantage of an increased window (i.e. cwnd should only grow if it needs to)

When congestion is detected, the TCP flow-control mode is changed from Slow Start to Congestion Avoidance. Note that some TCP implementations maintain cwnd in units of bytes, while others use units of full-sized segments.
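The doubling-per-RTT behaviour described above can be sketched as a toy simulation (cwnd in MSS units, delayed ACKs ignored; the parameter values are illustrative):

```python
# Slow Start sketch: each RTT's worth of ACKs grows cwnd by one MSS per
# ACK, i.e. cwnd doubles per RTT, until ssthresh or the advertised
# receiver window (rwnd) is reached.
def slow_start(cwnd=1, ssthresh=64, rwnd=256):
    history = []
    while cwnd < ssthresh and cwnd < rwnd:
        history.append(cwnd)
        cwnd = min(cwnd * 2, ssthresh, rwnd)   # one RTT of ACKs
    history.append(cwnd)                       # Congestion Avoidance next
    return history

print(slow_start())   # [1, 2, 4, 8, 16, 32, 64]
```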

Congestion Avoidance

Once congestion is detected (through timeout and/or duplicate ACKs), the data rate is reduced in order to let the network recover.

Slow Start uses an exponential increase in window size and thus also in data rate. Congestion Avoidance uses a linear growth function (additive increase). This is achieved by introducing - in addition to the cwnd window - a slow start threshold (ssthresh).

As long as cwnd is less than ssthresh, Slow Start applies. Once ssthresh is reached, cwnd is increased by at most one segment per RTT. The cwnd window continues to open with this linear rate until a congestion event is detected.

When congestion is detected, ssthresh is set to half the cwnd (or to be strictly accurate, half the "Flight Size". This distinction is important if the implementation lets cwnd grow beyond rwnd (the receiver's declared window)). cwnd is either set to 1 if congestion was signalled by a timeout, forcing the sender to enter Slow Start, or to ssthresh if congestion was signalled by duplicate ACKs and the Fast Recovery algorithm has terminated. In either case, once the sender enters Congestion Avoidance, its rate has been reduced to half the value at the time of congestion. This multiplicative decrease causes the cwnd to close exponentially with each detected loss event.
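The resulting additive-increase/multiplicative-decrease cycle can be sketched as a toy simulation (cwnd in segments; loss timing and values are illustrative, not measured):

```python
# AIMD sketch: additive increase of one segment per RTT during
# Congestion Avoidance, halving of the window on each loss event
# (duplicate-ACK case, i.e. no fall back to Slow Start).
def aimd(cwnd, rtts, loss_at):
    trace = []
    for t in range(rtts):
        if t in loss_at:
            cwnd = max(cwnd // 2, 1)   # multiplicative decrease
        else:
            cwnd += 1                  # additive increase, one MSS/RTT
        trace.append(cwnd)
    return trace

print(aimd(10, 8, {3, 6}))   # [11, 12, 13, 6, 7, 8, 4, 5] - "sawtooth"
```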

Fast Retransmit

In Fast Retransmit, the arrival of three duplicate ACKs is interpreted as packet loss, and retransmission starts before the retransmission timer (RTO) expires.

The missing segment will be retransmitted immediately without going through the normal retransmission queue processing. This improves performance by eliminating delays that would suspend effective data flow on the link.

Fast Recovery

Fast Recovery is used to react quickly to a single packet loss. In Fast Recovery, the receipt of 3 duplicate ACKs, while being taken to mean a loss of a segment, does not result in a full Slow Start. This is because obviously later segments got through, and hence congestion is not stopping everything. In Fast Recovery, ssthresh is set to half of the current send window size, the missing segment is retransmitted (Fast Retransmit) and cwnd is set to ssthresh plus three segments. Each additional duplicate ACK indicates that one segment has left the network at the receiver, and cwnd is increased by one segment to allow the transmission of another segment if the new cwnd permits. When an ACK is received for new data, cwnd is reset to ssthresh, and TCP enters Congestion Avoidance mode.
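A toy sketch of the Reno fast-recovery arithmetic just described (the function and its parameters are illustrative, in units of segments):

```python
# Fast Recovery sketch: on 3 duplicate ACKs, ssthresh is set to half
# the flight size, cwnd to ssthresh + 3; each further duplicate ACK
# inflates cwnd by one segment; the ACK for new data deflates cwnd
# back to ssthresh and Congestion Avoidance resumes.
def fast_recovery(flight_size, extra_dup_acks):
    ssthresh = flight_size // 2
    cwnd = ssthresh + 3            # after the three initial dup ACKs
    for _ in range(extra_dup_acks):
        cwnd += 1                  # each dup ACK: a segment left the net
    cwnd = ssthresh                # new ACK arrives: exit recovery
    return cwnd, ssthresh

print(fast_recovery(20, 5))   # (10, 10): rate halved, no Slow Start
```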

References

-- UlrichSchmid - 07 Jun 2005 -- SimonLeinen - 27 Jan 2006 - 23 Jun 2011

High-Speed TCP Variants

There have been numerous ideas for improving TCP over the years. Some of those ideas have been adopted by mainstream operating systems (after thorough review). Recently there has been an uptick in work towards improving TCP's behavior over LongFatNetworks. It has been shown that the classic congestion control algorithms limit efficiency in network resource utilization. The various new TCP implementations differ mainly in how they adjust the size of the congestion window; some of them require extra feedback from the network.

An orthogonal technique for improving TCP's performance is automatic buffer tuning.

Comparative Studies

All papers about individual TCP-improvement proposals contain comparisons against older TCP to quantify the improvement. There are several studies that compare the performance of the various new TCP variants, including

Other References

  • Gigabit TCP, G. Huston, The Internet Protocol Journal, Vol. 9, No. 2, June 2006. Contains useful descriptions of many modern TCP enhancements.
  • Faster, G. Huston, June 2005. This article from ISP Column looks at various approaches to TCP transfers at very high rates.
  • Congestion Control in the RFC Series, M. Welzl, W. Eddy, July 2006, Internet-Draft (work in progress)
  • ICCRG Wiki, IRTF (Internet Research Task Force) ICCRG (Internet Congestion Control Research Group), includes bibliography on congestion control.
  • RFC 6077: Open Research Issues in Internet Congestion Control, D. Papadimitriou (Ed.), M. Welzl, M. Scharf, B. Briscoe, February 2011

Related Work

  • PFLDnet (International Workshop on Protocols for Fast Long-Distance Networks): 2003, 2004, 2005, 2006, 2007, 2009, 2010. (The 2008 pages seem to have been removed from the Net.)

-- Simon Leinen - 2005-05-28–2016-09-18 -- Orla Mc Gann - 2005-10-10 -- Chris Welti - 2008-02-25

Revision 11 (2006-03-29) - SimonLeinen

 

Integrated Services (IntServ)

IntServ was an attempt by the IETF to add Quality of Service differentiation to the Internet architecture. Components of the Integrated Services architecture include

  • A set of predefined service classes with different parameters, in particular Controlled Load and Guaranteed Service
  • A ReSerVation Protocol (RSVP) for setting up specific service parameters for a flow
  • Mappings to different lower layers such as ATM, Ethernet, or low-speed links.

Concerns with IntServ include its scaling properties when many flow reservations are active in the "core" parts of the network, the difficulties of implementing the necessary signaling and packet treatment functions in high-speed routers, and the lack of policy control and accounting/billing infrastructure to make this worthwhile for operators. While IntServ never became widely implemented beyond intra-enterprise environments, RSVP has found new uses as a signaling protocol for Multi-Protocol Label Switching (MPLS).

As an alternative to IntServ, the IETF later developed the Differentiated Services architecture, which provides simple building blocks that can be composed into similarly granular services.

References

Integrated Services Architecture

  • RFC 1633, Integrated Services in the Internet Architecture: an Overview, R. Braden, D. Clark, S. Shenker, 1994

RSVP

  • RFC 2205, Resource ReSerVation Protocol (RSVP) -- Version 1 Functional Specification, R. Braden, Ed., L. Zhang, S. Berson, S. Herzog, S. Jamin, September 1997
  • RFC 2208, Resource ReSerVation Protocol (RSVP) -- Version 1 Applicability Statement Some Guidelines on Deployment, A. Mankin, Ed., F. Baker, B. Braden, S. Bradner, M. O'Dell, A. Romanow, A. Weinrib, L. Zhang, September 1997
  • RFC 2209, Resource ReSerVation Protocol (RSVP) -- Version 1 Message Processing Rules, R. Braden, L. Zhang, September 1997
  • RFC 2210, The Use of RSVP with IETF Integrated Services, J. Wroclawski, September 1997
  • RFC 2211, Specification of the Controlled-Load Network Element Service, J. Wroclawski, September 1997
  • RFC 2212, Specification of Guaranteed Quality of Service, S. Shenker, C. Partridge, R. Guerin, September 1997

Lower-Layer Mappings

  • RFC 2382, A Framework for Integrated Services and RSVP over ATM, E. Crawley, Ed., L. Berger, S. Berson, F. Baker, M. Borden, J. Krawczyk, August 1998
  • RFC 2689, Providing Integrated Services over Low-bitrate Links, C. Bormann, September 1999
  • RFC 2816, A Framework for Integrated Services Over Shared and Switched IEEE 802 LAN Technologies, A. Ghanwani, J. Pace, V. Srinivasan, A. Smith, M. Seaman, May 2000

-- SimonLeinen - 28 Feb 2006

Sizing of Network Buffers

Where temporary congestion cannot be avoided, some buffering in network nodes is required (in routers and other packet-forwarding devices such as Ethernet or MPLS switches) to queue incoming packets until they can be transmitted. The appropriate sizing of these buffers has been a subject of discussion for a long time.

Traditional wisdom recommends that a network node should be able to buffer an end-to-end round-trip time's worth of line-rate traffic, in order to be able to accommodate bursts of TCP traffic. This recommendation is often followed in "core" IP networks. For example, FPC (Flexible PIC Concentrators) on Juniper's M- and T-Series routers contain buffer memory for 200ms (M-series) or 100ms (T-series) at the supported interface bandwidth (cf. Juniper M-Series Datasheet and a posting from 8 May, 2005 by Hannes Gredler to the juniper-nsp mailing list.) These ideas also influenced RFC 3819, Advice for Internet Subnetwork Designers.

Recent research results suggest that much smaller buffers are sufficient when there is a high degree of multiplexing of TCP streams.

This work is highly relevant, because overly large buffers not only require more (expensive high-speed) memory, but bring about a risk of high delays that affect perceived quality of service; see BufferBloat.
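
As a rough illustration of the two sizing rules discussed above, the sketch below computes the classic "one RTT at line rate" buffer and the small-buffer variant B = RTT x C / sqrt(n) for n multiplexed TCP flows from the research cited here. The 10 Gbit/s rate and 200 ms RTT are example values, not taken from any particular router.

```python
import math

def rule_of_thumb_buffer(rtt_s, line_rate_bps):
    """Classic rule: one round-trip time's worth of line-rate traffic, in bytes."""
    return rtt_s * line_rate_bps / 8

def small_buffer(rtt_s, line_rate_bps, n_flows):
    """Small-buffer result: divide the classic rule by sqrt(n) for n TCP flows."""
    return rule_of_thumb_buffer(rtt_s, line_rate_bps) / math.sqrt(n_flows)

rtt = 0.2    # 200 ms, as in the M-series example above
rate = 10e9  # 10 Gbit/s link (example value)
print("classic rule: %.0f MB" % (rule_of_thumb_buffer(rtt, rate) / 1e6))
print("10000 flows:  %.1f MB" % (small_buffer(rtt, rate, 10000) / 1e6))
```

With 10000 multiplexed flows, the required buffer shrinks by a factor of 100, from hundreds of megabytes to a few megabytes, which is the core of the "much smaller buffers" argument.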

References

ACM SIGCOMM Computer Communications Review

The October 2006 edition has a short summary paper on router buffer sizing. If you read one article, read this!

The July 2005 edition (Volume 35, Issue 3) has a special feature about sizing router buffers, consisting of these articles:
Making router buffers much smaller
Nick McKeown, Damon Wischik
Part I: buffer sizes for core routers
Nick McKeown, Damon Wischik
Part II: control theory for buffer sizing
Gaurav Raina, Don Towsley, Damon Wischik
Part III: routers with very small buffers
Mihaela Enachescu, Yashar Ganjali, Ashish Goel, Nick McKeown, Tim Roughgarden

Sizing Router Buffers (copy)
Guido Appenzeller, Isaac Keslassy, Nick McKeown, SIGCOMM'04, in: ACM Computer Communications Review 34(4), pp. 281--292

The Effect of Router Buffer Size on HighSpeed TCP Performance
Dhiman Barman, Georgios Smaragdakis and Ibrahim Matta. In Proceedings of IEEE Globecom 2004. (PowerPoint presentation)

Link Buffer Sizing: a New Look at the Old Problem
Sergey Gorinsky, A. Kantawala, and J. Turner, ISCC-05, June 2005
Another version was published as Technical Report WUCSE-2004-82, Department of Computer Science and Engineering, Washington University in St. Louis, December 2004.

Effect of Large Buffers on TCP Queueing Behavior
Jinsheng Sun, Moshe Zukerman, King-Tim Ko, Guanrong Chen and Sammy Chan, IEEE INFOCOM 2004

High Performance TCP in ANSNET
C. Villamizar and C. Song., in: ACM Computer Communications Review, 24(5), pp.45--60, 1994

RFC 3819, Advice for Internet Subnetwork Designers
P. Karn, Ed., C. Bormann, G. Fairhurst, D. Grossman, R. Ludwig, J. Mahdavi, G. Montenegro, J. Touch, L. Wood. July 2004

-- SimonLeinen - 2005-01-07 - 2013-04-03

OS-specific Tuning: Cisco IOS

  • Cisco IOS specific Tuning
  • TCP MSS Adjustment The TCP MSS Adjustment feature enables configuration of the maximum segment size (MSS) for transient packets that traverse a router, specifically TCP segments with the SYN bit set, when Point-to-Point Protocol over Ethernet (PPPoE) is used in the network. PPPoE reduces the effective Ethernet maximum transmission unit (MTU) to 1492 bytes, and if the effective MTU on the hosts (PCs) is not changed accordingly, the router between the host and the server can stall the TCP sessions. The ip tcp adjust-mss command rewrites the MSS value in SYN packets on the intermediate router to avoid this.
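
The arithmetic behind the usual adjust-mss value can be sketched as follows; the helper function is illustrative, not Cisco code:

```python
def adjusted_mss(path_mtu, ip_header=20, tcp_header=20):
    """MSS that fits the path MTU: MTU minus base IPv4 and TCP headers."""
    return path_mtu - ip_header - tcp_header

# PPPoE adds 8 bytes of overhead to Ethernet, leaving a 1492-byte IP MTU,
# so 1452 is the commonly recommended value for 'ip tcp adjust-mss'.
print(adjusted_mss(1492))   # 1452
```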

-- HankNussbacher - 06 Oct 2005


Support for Large Frames/Packets ("Jumbo MTU")

On current research networks (and most other parts of the Internet), end nodes (hosts) are restricted to a 1500-byte IP Maximum Transmission Unit (MTU). Because a larger MTU would save effort on the part of both hosts and network elements - fewer packets per volume of data means less work - many people have been advocating efforts to increase this limit, in particular on research network initiatives such as Internet2 and GÉANT.

Impact of Large Packets

Improved Host Performance for Bulk Transfers

It has been argued that TCP is most efficient when the payload portions of TCP segments match integer multiples of the underlying virtual memory system's page size - this permits "page flipping" techniques in zero-copy TCP implementations. A 9000 byte MTU would "naturally" lead to 8960-byte segments, which doesn't correspond to any common page size. However, a TCP implementation should be smart enough to adapt segment size to page sizes where this actually matters, for example by using 8192-byte segments even though 8960-byte segments are permitted. In addition, Large Send Offload (LSO), which is becoming increasingly common with high-speed network adapters, removes the direct correspondence between TCP segments and driver transfer units, making this issue moot.

On the other hand, Large Send Offload (LSO) and Interrupt Coalescence remove most of the host-performance motivations for large MTUs: the segmentation and reassembly function between (large) application data units and (smaller) network packets is mostly moved to the network interface controller.

Lower Packet Rate in the Backbone

Another benefit of large frames is that their use reduces the number of packets that have to be processed by routers, switches, and other devices in the network. However, most high-speed networks are not limited by per-packet processing costs, and packet processing capability is often dimensioned for the worst case, i.e. the network should continue to work even when confronted with a flood of small packets. On the other hand, per-packet processing overhead may be an issue for devices such as firewalls, which have to do significant processing of the headers (but possibly not the contents) of each packet.

Reduced Framing Overhead

Another advantage of large packets is that they reduce the overhead for framing (headers) in relation to payload capacity. A typical TCP segment over IPv4 carries 40 bytes of IP and TCP header, plus link-dependent framing (e.g. 14 bytes over Ethernet). This represents about 3% of overhead with the customary 1500-byte MTU, whereas an MTU of 9000 bytes reduces this overhead to 0.5%.
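
The two effects above (packet rate and framing overhead) are easy to quantify. The sketch below counts 40 bytes of IP+TCP headers plus 14 bytes of Ethernet framing; depending on what exactly is counted, the percentages come out slightly differently from the approximate figures in the text.

```python
def framing_overhead(mtu, l3l4_headers=40, link_framing=14):
    """Header bytes as a fraction of the on-the-wire frame size."""
    total = mtu + link_framing
    payload = mtu - l3l4_headers
    return (total - payload) / total

def packets_per_second(line_rate_bps, mtu):
    """Packet rate needed to fill a link with maximum-sized packets."""
    return line_rate_bps / 8 / mtu

for mtu in (1500, 9000):
    print("MTU %5d: %.1f%% overhead, %.0f pps at 10 Gbit/s"
          % (mtu, 100 * framing_overhead(mtu), packets_per_second(10e9, mtu)))
```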

Network Support for Large MTUs

Research Networks

The GÉANT/GÉANT2 and Abilene backbones now both support a 9000-byte IP MTU. The current access MTUs of Abilene connectors can be found on the Abilene Connector Technology Support page. The access MTUs of the NRENs to GÉANT are listed in the GÉANT Monthly Report (not publicly available).

Commercial Internet Service Providers

Many commercial ISPs support larger-than-1500-byte MTUs in their backbones (4470 bytes is a typical value) and on certain types of access interfaces, e.g. Packet-over-SONET (POS). But when Ethernet is used as an access interface, as is more and more frequently the case, the access MTU is usually set to the 1500 bytes corresponding to the Ethernet standard frame size limit. In addition, many inter-provider connections are over shared Ethernet networks at public exchange points, which also imply a 1500-byte limit, with very few exceptions - MAN LAN is a rare example of an Ethernet-based access point that explicitly supports the use of larger frames.

Possible Issues

Path MTU Discovery issues

Moving to larger MTUs may exhibit problems with the traditional Path MTU Discovery mechanism - see that topic for more information about this.

Inconsistent MTUs within a subnet

There are other deployment considerations that make the introduction of large MTUs tricky, in particular the requirement that all hosts on a logical IP subnet must use identical MTUs; this makes gradual introduction hard for large bridged (switched) campus or data center networks, and also for most Internet Exchange Points.

When different MTUs are used on a link (logical subnet), this can often go unnoticed for a long time. Packets smaller than the minimum MTU will always pass through the link, and the end with the smaller MTU configured will always fragment larger packets towards the other side. The only packets that will be affected are packets from the larger-MTU side to the smaller-MTU side that are larger than the smaller MTU. Those will typically be sent unfragmented, and dropped at the receiving end (the one with the smaller MTU). However, it may happen that such packets rarely occur in normal situations, and the misconfiguration isn't detected immediately.
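
The behavior described above can be captured in a toy model (not a real IP stack) that shows why the misconfiguration stays hidden until large packets flow from the larger-MTU side to the smaller-MTU side:

```python
def forward(packet_len, sender_mtu, receiver_mtu, df=True):
    """Toy model of a link with mismatched MTUs.

    The sender fragments packets larger than its own configured MTU
    (unless DF is set); the receiver silently drops unfragmented
    packets larger than its MTU."""
    if packet_len <= sender_mtu:
        # Sent as a single frame; dropped if it exceeds the receiver's MTU.
        return "delivered" if packet_len <= receiver_mtu else "dropped"
    if df:
        return "dropped"        # too big to send with DF set
    return "fragmented"         # sender fragments toward the other side

# Small packets always pass; only large packets from the larger-MTU
# side toward the smaller-MTU side are silently lost.
print(forward(1400, 9000, 1500))            # delivered
print(forward(4000, 9000, 1500))            # dropped
print(forward(4000, 1500, 9000, df=False))  # fragmented
```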

Some routing protocols such as OSPF and IS-IS detect MTU mismatches and will refuse to build adjacencies in this case. This helps diagnose such misconfigurations. Other protocols, in particular BGP, will appear to work as long as only small amounts of data are sent (e.g. during initial handshake and option negotiation), but get stuck when larger amounts of data (i.e. the initial route advertisements) must be sent in the bigger-to-smaller-MTU direction.

Problems with Large MTUs on end systems network interface cards (NICs)

On some NICs it is possible to configure jumbo frames (for example MTU=9000), and the NIC appears to work fine when its functionality is checked with pings (jumbo-sized ICMP packets), although the NIC vendor states that there is no jumbo frame support. In such cases, the NIC drops packets when it receives jumbo frames at typical production data rates.

Therefore the NIC vendor's documentation should be consulted before activating large MTUs on the host interface. Then tests at high data rates should be performed with large MTUs, and the host's interface statistics should be checked for input packet drops.
Typical commands on Unix systems: 'ifconfig <ifname>' or 'ethtool -S <ifname>'


References

Vendor-specific

-- SimonLeinen - 16 Mar 2005 - 02 Sep 2009
-- HankNussbacher - 18 Jul 2005 (added Phil Dykstra paper)
An evil middlebox is a transparent device that sits in the middle of an end-to-end connection and disturbs the normal end-to-end traffic in some way. Because these devices, which usually work on layer 2, cannot be seen, it is difficult to debug issues that involve them. Examples are HTTP proxies and gateway proxies (all protocols). Normally, these devices are installed for security reasons, to filter out "bad" traffic. Bad traffic may be viruses, trojans, evil JavaScript, or anything that is not known to the device. Sometimes so-called rate shapers are also installed as middleboxes; while these do not change the contents of the traffic, they do drop packets according to rules known only to themselves. Bugs in such middleboxes can have fatal consequences for "legitimate" Internet traffic, which may lead to performance issues or, even worse, connection failures.

Middleboxes come in all shapes and flavors. The most popular are firewalls:

Examples of experienced performance issues

Two examples in the beginning of 2005 in SWITCH:

  • HttpProxy: very slow response from a webserver only for a specific circle of people

  • GatewayProxy: tcp transfers get stalled as soon as a packet is lost on the local segment from the middlebox to the end host.

A Cisco IOS Firewall in August 2006 in Funet:

  • WindowScalingProblems: when window scaling was enabled, TCP performance was bad (10-20 KBytes/sec). Some older versions of PIX could also be affected by window scaling issues.

DNS Based global load balancing problems

Juniper SRX3600 mistreats fragmented IPv6 packets

This firewall (up to at least version 11.4R3.7) performs fragment reassembly in order to apply certain checks to the entire datagram, for example in "DNS ALG" mode. It then tries to forward the reassembled packet instead of the initial fragments, which triggers ICMP "packet too big" messages if the full datagram is larger than the MTU of the next link. This will lead to a permanent failure on this path, because the (correct) fragmentation at the sender is annihilated by the erroneous reassembly at the firewall.

The same issue has also been found with some models of the Fortigate firewall.

-- ChrisWelti - 01 Mar 2005
-- PekkaSavola - 10 Oct 2006

-- PekkaSavola - 07 Nov 2006

-- AlexGall - 2012-10-31

Symptom: Accessing a specific web site, which contains JavaScript, is very slow (around 30 seconds for one page)

Analysis Summary: HTTP traffic is split into two legs: between the web server and a transparent HTTP proxy at the customer site, and between the HTTP proxy and the end hosts. The transparent HTTP proxy fakes the endpoints: to the HTTP web server it pretends to be the customer accessing it, and to the customer the HTTP proxy appears to be the web server (faked IP addresses). Accordingly, there are two TCP connections to be considered here. The proxy receives an HTTP request from the customer to the web server. It then forwards this request to the web server and WAITS until it has received the whole reply (this is essential, as it needs to analyze the whole reply to decide whether it is bad or not). If the content of that HTTP reply is dynamic, the length is not known. With HTTP/1.1, a TCP session is not built for every object but remains intact until a timeout has occurred. This means the proxy has to wait until the TCP session gets torn down to be sure there is no more content coming. Only when it has received the whole reply will it forward that reply to the customer who asked for it. Of course the customer will suffer from a major delay.

-- ChrisWelti - 03 Mar 2005

Revision 10 - 2006-03-28 - SimonLeinen


Netperf

A client/server network performance benchmark (debian package: netperf).

Example

coming soon

Flent (previously netperf-wrapper)

For his measurement work on fq_codel, Toke Høiland-Jørgensen developed Python wrappers for Netperf and added some tests, notably the RRUL (Realtime Response Under Load) test, which runs bulk TCP transfers in parallel with UDP and ICMP response-time measurements. This can be used to expose Bufferbloat on a path.

These tools used to be known as "netperf-wrappers", but were renamed to Flent (the Flexible Network Tester) in early 2015.

References

-- FrancoisXavierAndreu & SimonMuyal - 2005-06-06
-- SimonLeinen - 2013-03-30 - 2015-06-01

RUDE/CRUDE

RUDE is a package of applications to generate and measure UDP traffic between two endpoints (hosts). The rude tool generates traffic to the network, which can be received and logged on the other side of the network with crude. The traffic pattern can be defined by the user. This tool is available at http://rude.sourceforge.net/

Example

The following rude script will cause a constant-rate packet stream to be sent to a destination for 4 seconds, starting one second (1000 milliseconds) from the time the script was read, stopping five seconds (5000 milliseconds) after the script was read. The script can be called using rude -s example.rude.

START NOW
1000 0030 ON 3002 10.1.1.1:10001 CONSTANT 200 250
5000 0030 OFF

Explanation of the values: 1000 and 5000 are relative timestamps (in milliseconds) for the actions. 0030 is the identification of a traffic flow. At 1000, a CONSTANT-rate flow is started (ON) with source port 3002, destination address 10.1.1.1 and destination port 10001. CONSTANT 200 250 specifies a fixed rate of 200 packets per second (pps) of 250-byte UDP datagrams.
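
rude itself is a C program with its own script language, but the CONSTANT flow above can be approximated functionally in a few lines of Python; this is a rough sketch, not a replacement for rude's precise timing, and the destination and rate are just the example values from the script:

```python
import socket
import time

def constant_rate_udp(dest, pps, size, duration):
    """Send fixed-size UDP datagrams at a constant packet rate,
    roughly what 'CONSTANT 200 250' asks rude to do."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"\x00" * size          # fixed-size datagram contents
    interval = 1.0 / pps              # inter-packet gap in seconds
    deadline = time.time() + duration
    sent = 0
    while time.time() < deadline:
        sock.sendto(payload, dest)
        sent += 1
        time.sleep(interval)
    sock.close()
    return sent

# e.g. constant_rate_udp(("10.1.1.1", 10001), pps=200, size=250, duration=4)
```

Note that sleep-based pacing drifts at high rates; rude uses more careful scheduling for that reason.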

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- SimonLeinen - 28 Sep 2008

TTCP

TTCP (Test TCP) is a utility for benchmarking UDP and TCP performance. Versions for Unix and Windows are available at http://www.pcausa.com/Utilities/pcattcp.htm/

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005


Web100 Linux Kernel Extensions

The Web100 project was run by PSC (Pittsburgh Supercomputing Center), NCAR and NCSA. It was funded by the US National Science Foundation (NSF) between 2000 and 2003, although development and maintenance work extended well beyond the period of NSF funding. Its thrust was to close the "wizard gap" between what performance should be possible on modern research networks and what most users of these networks actually experience. The project focused on instrumentation for TCP to measure performance and find possible bottlenecks that limit TCP throughput. In addition, it included some work on "auto-tuning" of TCP buffer settings.

Most implementation work was done for Linux, and most of the auto-tuning code is now actually included in the mainline Linux kernel code (as of 2.6.17). The TCP kernel instrumentation is available as a patch from http://www.web100.org/, and usually tracks the latest "official" Linux kernel release pretty closely.

An interesting application of Web100 is NDT, which can be used from any Java-enabled browser to detect bottlenecks in TCP configurations and network paths, as well as duplex mismatches using active TCP tests against a Web100-enabled server.

In September 2010, the NSF agreed to fund a follow-on project called Web10G.

TCP Kernel Information Set (KIS)

A central component of Web100 is a set of "instruments" that permits the monitoring of many statistics about TCP connections (sockets) in the kernel. In the Linux implementation, these instruments are accessible through the proc filesystem.
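
As a minimal sketch of how a tool might locate these per-connection instruments, the snippet below lists connection directories under /proc/web100 (the mount point conventionally used by the Web100 patch; the exact layout may differ between patch versions, and the path only exists on a Web100-enabled kernel):

```python
import os

WEB100_PROC = "/proc/web100"   # path used by the Web100 kernel patch (assumption)

def list_connections(root=WEB100_PROC):
    """Return per-connection directory names (connection IDs), or an
    empty list if no Web100-patched kernel is running."""
    if not os.path.isdir(root):
        return []
    return [d for d in os.listdir(root) if d.isdigit()]

print(list_connections())
```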

TCP Extended Statistics MIB (TCP-ESTATS-MIB, RFC 4898)

The TCP-ESTATS-MIB (RFC 4898) includes a similar set of instruments, for access through SNMP. It has been implemented by Microsoft for the Vista operating system and later versions of Windows. In the Windows Server 2008 SDK, a tool called TcpAnalyzer.exe can be used to look at statistics of open TCP connections. IBM is also said to have an implementation of this MIB.

"Userland" Tools

Besides the kernel extension, Web100 comprises a small set of user-level tools which provide access to the TCP KIS. These tools include

  1. a libweb100 library written in C
  2. the command-line tools readvar, deltavar, writevar, and readall
  3. a set of GTK+-based GUI (graphical user interface) tools under the gutil command.

gutil

When started, gutil shows a small main panel with an entry field for specifying a TCP connection, and several graphical buttons for starting different tools on a connection once one has been selected.

gutil: main panel

The TCP connection can be chosen either by explicitly specifying its endpoints, or by selecting from a list of connections (using double-click):

gutil: TCP connection selection dialog

Once a connection of interest has been selected, a number of actions are possible. The "List" action provides a list of all kernel instruments for the connection. The list is updated every second, and "delta" values are displayed for those variables that have changed.

gutil: variable listing

Another action is "Display", which provides a graphical display of a KIS variable. The following screenshot shows the display of the DataBytesIn variable of an SSH connection.

gutil: graphical variable display

Related Work

Microsoft's Windows Software Development Kit for Server 2008 and .NET Version 3.5 contains a tool called TcpAnalyzer.exe, which is similar to gutil and uses Microsoft's RFC 4898 implementation.

The SIFTR module for FreeBSD can be used for similar applications, namely to understand what is happening inside TCP at fine granularity. Sun's DTrace would be an alternative on systems that support it, provided they offer suitable probes for the relevant actions and events within TCP. Both SIFTR and DTrace have very different user interfaces to Web100.

References

-- SimonLeinen - 27 Feb 2006 - 25 Mar 2011
-- ChrisWelti - 12 Jan 2010

 

NDT (Network Diagnostic Tool)

NDT can be used to check the TCP configuration of any host that can run Java applets. The client connects to a Web page containing a special applet. The Web server that serves the applet must run a kernel with the Web100 extensions for TCP measurement instrumentation. The applet performs TCP memory-to-memory tests between the client and the Web100-instrumented server, and then uses the measurements from the Web100 instrumentation to find out about the TCP configuration of the client. NDT also detects common configuration errors such as duplex mismatch.

Besides the applet, there is also a command-line client called web100clt that can be used without a browser. The client works on Linux without any need for Web100 extensions, and can be compiled on other Unix systems as well. It doesn't require a Web server on the server side, just the NDT server web100srv - which requires a Web100-enhanced kernel.

Applicability

Since NDT performs memory-to-memory tests, it avoids end-system bottlenecks such as file system or disk limitations. Therefore it is useful for estimating "pure" TCP throughput limitations for a system, better than, for instance, measuring the throughput of a large file retrieval from a public FTP or WWW server. In addition, NDT servers are supposed to be well-connected and lightly loaded.

When trying to tune a host for maximum throughput, it is a good idea to start testing it against an NDT server that is known to be relatively close to the host in question. Once the throughput with this server is satisfactory, one can try to further tune the host's TCP against a more distant NDT server. Several European NRENs as well as many sites in the U.S. operate NDT servers.

Example

Applet (tcp100bw.html)

This example shows the results (overview) of an NDT measurement between a client in Switzerland (Gigabit Ethernet connection) and an IPv6 NDT server in Ljubljana, Slovenia (Gigabit Ethernet connection).

Screen_shot_2011-07-25_at_18.11.44_.png

Command-line client

The following is an example run of the web100clt command-line client application. In this case, the client runs on a small Linux-based home router connected to a commercial broadband-over-cable Internet connection. The connection is marketed as "5000 kb/s downstream, 500 kb/s upstream", and it can be seen that this nominal bandwidth can in fact be achieved for an individual TCP connection.

$ ./web100clt -n ndt.switch.ch
Testing network path for configuration and performance problems  --  Using IPv6 address
Checking for Middleboxes . . . . . . . . . . . . . . . . . .  Done
checking for firewalls . . . . . . . . . . . . . . . . . . .  Done
running 10s outbound test (client to server) . . . . .  509.00 kb/s
running 10s inbound test (server to client) . . . . . . 5.07 Mb/s
Your host is connected to a Cable/DSL modem
Information [C2S]: Excessive packet queuing detected: 10.18% (local buffers)
Information [S2C]: Excessive packet queuing detected: 73.20% (local buffers)
Server 'ndt.switch.ch' is not behind a firewall. [Connection to the ephemeral port was successful]
Client is probably behind a firewall. [Connection to the ephemeral port failed]
Information: Network Middlebox is modifying MSS variable (changed to 1440)
Server IP addresses are preserved End-to-End
Client IP addresses are preserved End-to-End

Note the remark "-- Using IPv6 address". The example used the new version 3.3.12 of the NDT software, which includes support for both IPv4 and IPv6. Because both the client and the server support IPv6, the test is run over IPv6. To run an IPv4 test in this situation, one could have called the client with the -4 option, i.e. web100clt -4 -n ndt.switch.ch.

Public Servers

There are many public NDT server installations available on the Web. You should choose between the servers according to distance (in terms of RTT) and the throughput range you are interested in. The limits of your local connectivity can best be tested by connecting to a server with a fast(er) connection that is close by. If you want to tune your TCP parameters for "LFNs", use a server that is far away from you, but still reachable over a path without bottlenecks.

Development

Development of NDT is now hosted on Google Code, and as of July 2011 there is some activity. Notably, there is now documentation about the architecture as well as the detailed protocol of NDT. One of the stated goals is to allow outside parties to write compatible NDT clients without consulting the NDT source code.

References

-- SimonLeinen - 30 Jun 2005 - 22 Nov 2011

Active Measurement Boxes

-- FrancoisXavierAndreu & SimonLeinen - 06 Jun 2005 - 06 Jul 2005


libtrace

libtrace is a library for trace processing. It supports multiple input methods, including device capture, raw and gz-compressed traces, and sockets; and multiple input formats, including pcap and DAG.

References

-- SimonLeinen - 04 Mar 2006

Netdude

Netdude (Network Dump data Displayer and Editor) is a framework for inspection, analysis and manipulation of tcpdump (libpcap) trace files.

References

-- SimonLeinen - 04 Mar 2006

jnettop

Captures traffic coming across the host it is running on and displays streams sorted by the bandwidth they use. The result is a nice listing of communication on the network, grouped by stream, which shows transported bytes and consumed bandwidth.

References

-- FrancoisXavierAndreu - 06 Jun 2005 -- SimonLeinen - 04 Mar 2006


Host and Application Measurement Tools

Network performance is only one part of a distributed system's performance. An equally important part is the performance of the actual application and the hardware it runs on.

  • Web100 - fine-grained instrumentation of TCP implementation, currently available for the Linux kernel.
  • SIFTR - TCP instrumentation similar to Web100, available for *BSD
  • DTrace - Dynamic tracing facility available in Solaris and *BSD
  • NetLogger - a framework for instrumentation and log collection from various components of networked applications

-- TobyRodwell - 22 Mar 2006
-- SimonLeinen - 28 Mar 2006

NetLogger

NetLogger (copyright Lawrence Berkeley National Laboratory) is both a set of tools and a methodology for analysing the performance of a distributed system. The methodology (below) can be implemented separately from the LBL-developed tools. For full information on NetLogger see its website, http://dsd.lbl.gov/NetLogger/

Methodology

From the NetLogger website

The NetLogger methodology is really quite simple. It consists of the following:

  1. All components must be instrumented to produce monitoring events. These components include application software, middleware, operating system, and networks. The more components that are instrumented the better.
  2. All monitoring events must use a common format and common set of attributes. Monitoring events must also all contain a precision timestamp which is in a single timezone (GMT) and globally synchronized via a clock synchronization method such as NTP.
  3. Log all of the following events: Entering and exiting any program or software component, and begin/end of all IO (disk and network).
  4. Collect all log data in a central location
  5. Use event correlation and visualization tools to analyze the monitoring event logs.
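
The common-format, timestamped events of steps 1-3 can be illustrated with a tiny event formatter; the name=value rendering below is only in the spirit of the NetLogger message format, not the exact specification (field names here are illustrative):

```python
import datetime
import socket

def nl_event(event, **fields):
    """Format a monitoring event as name=value pairs with a precision
    GMT timestamp, in the spirit of the NetLogger message format."""
    ts = datetime.datetime.now(datetime.timezone.utc).strftime(
        "%Y-%m-%dT%H:%M:%S.%fZ")
    parts = ["DATE=%s" % ts,
             "HOST=%s" % socket.gethostname(),
             "EVNT=%s" % event]
    parts += ["%s=%s" % (k, v) for k, v in sorted(fields.items())]
    return " ".join(parts)

# Log entering/exiting a component and begin/end of an I/O operation:
print(nl_event("transfer.start", size=1048576))
print(nl_event("transfer.end", size=1048576))
```

In a real deployment these lines would be shipped to the central collection point of step 4 rather than printed.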

Toolkit

From the NetLogger website

The NetLogger Toolkit includes a number of separate components which are designed to help you do distributed debugging and performance analysis. You can use any or all of these components, depending on your needs.

These include:

  • NetLogger message format and data model: A simple, common message format for all monitoring events which includes high-precision timestamps
  • NetLogger client API library: C/C++, Java, PERL, and Python calls that you add to your existing source code to generate monitoring events. The destination and logging level of NetLogger messages are all easily controlled using an environment variable. These libraries are designed to be as lightweight as possible, and to never block or adversely affect application performance.
  • NetLogger visualization tool (nlv): a powerful, customizable X-Windows tool for viewing and analysis of event logs based on time correlated and/or object correlated events.
  • NetLogger host/network monitoring tools: a collection of NetLogger-instrumented host monitoring tools, including tools to interoperate with Ganglia and MonALisa.
  • NetLogger storage and retrieval tools, including a daemon that collects NetLogger events from several places at a single, central host; a forwarding daemon to forward all NetLogger files in a specified directory to a given location; and a NetLogger event archive based on mySQL.

References

-- TobyRodwell - 22 Mar 2006

Web100 Linux Kernel Extensions

The Web100 project was run by PSC (Pittsburgh Supercomputing Center), the NCAA and NCSA. It was funded by the US National Science Foundation (NSF) between 2000 and 2003, although development and maintenance work extended well beyond the period of NSF funding. Its thrust was to close the "wizard gap" between what performance should be possible on modern research networks and what most users of these networks actually experience. The project focused on instrumentation for TCP to measure performance and find possible bottlenecks that limit TCP throughput. In addition, it included some work on "auto-tuning" of TCP buffer settings.

Most implementation work was done for Linux, and most of the auto-tuning code is now actually included in the mainline Linux kernel code (as of 2.6.17). The TCP kernel instrumentation is available as a patch from http://www.web100.org/, and usually tracks the latest "official" Linux kernel release pretty closely.

An interesting application of Web100 is NDT, which can be used from any Java-enabled browser to detect bottlenecks in TCP configurations and network paths, as well as duplex mismatches using active TCP tests against a Web100-enabled server.

In September 2010, the NSF agreed to fund a follow-on project called Web10G.

TCP Kernel Information Set (KIS)

A central component of Web100 is a set of "instruments" that permits the monitoring of many statistics about TCP connections (sockets) in the kernel. In the Linux implementation, these instruments are accessible through the proc filesystem.

TCP Extended Statistics MIB (TCP-ESTATS-MIB, RFC 4898)

The TCP-ESTATS-MIB (RFC 4898) includes a similar set of instruments, for access through SNMP. It has been implemented by Microsoft for the Vista operating system and later versions of Windows. In the Windows Server 2008 SDK, a tool called TcpAnalyzer.exe can be used to look at statistics of open TCP connections. IBM is also said to have an implementation of this MIB.

"Userland" Tools

Besides the kernel extension, Web100 comprises a small set of user-level tools which provide access to the TCP KIS. These tools include

  1. a libweb100 library written in C
  2. the command-line tools readvar, deltavar, writevar, and readall
  3. a set of GTK+-based GUI (graphical user interface) tools under the gutil command.

gutil

When started, gutil shows a small main panel with an entry field for specifying a TCP connection, and several graphical buttons for starting different tools on a connection once one has been selected.

gutil: main panel

The TCP connection can be chosen either by explicitly specifying its endpoints, or by selecting from a list of connections (using double-click):

gutil: TCP connection selection dialog

Once a connection of interest has been selected, a number of actions are possible. The "List" action provides a list of all kernel instruments for the connection. The list is updated every second, and "delta" values are displayed for those variables that have changed.

gutil: variable listing

Another action is "Display", which provides a graphical display of a KIS variable. The following screenshot shows the display of the DataBytesIn variable of an SSH connection.

gutil: graphical variable display

Related Work

Microsoft's Windows Software Development Kit for Server 2008 and .NET Version 3.5 contains a tool called TcpAnalyzer.exe, which is similar to gutil and uses Microsoft's RFC 4898 implementation.

The SIFTR module for FreeBSD can be used for similar applications, namely to understand what is happening inside TCP at fine granularity. Sun's DTrace would be an alternative on systems that support it, provided they offer suitable probes for the relevant actions and events within TCP. Both SIFTR and DTrace have very different user interfaces from Web100's.

References

-- SimonLeinen - 27 Feb 2006 - 25 Mar 2011
-- ChrisWelti - 12 Jan 2010

 

NREN Tools and Statistics

A list of all NRENs and other networks who offer statistics/info for their networks and/or have tools available for PERT staff to use.

Outside Europe:

For information on other NRENs' monitoring tools please see the traffic monitoring URLs section of the TERENA Compendium

A list of network weathermaps is maintained in the JRA1 Wiki.

-- TobyRodwell - 14 Apr 2005 -- SimonLeinen - 20 Sep 2005 - 05 Aug 2011 -- FrancoisXavierAndreu - 05 Apr 2006 -- LuchesarIliev - 27 Apr 2006

GÉANT Tools

PERT Staff can access a range of tools for checking the status of the GÉANT network by logging into https://tools.geant.net/portal/ (if a member of the PERT does not have an account for this site they should contact DANTE operations operations@dante.org.uk).

The following tools are available:

GÉANT Usage Map: For each country currently connected to GÉANT, this map shows the level of usage of the access link connecting the country's national research network to GÉANT.

Network Weathermap: Map showing the GÉANT routers and the circuits interconnecting them. The circuits are coloured according to their utilisation. LOGIN REQUIRED

GÉANT Looking Glass: this allows users to execute basic operational commands on GÉANT routers, including ping, traceroute and show route

IPv4 and IPv6 Beacon: Beacon nodes are located in the majority of GÉANT PoPs. They send and receive traffic on a specific multicast group, and from this are able to infer the performance of multicast traffic throughout GÉANT.

HADES Measure Points: These network performance Measurement Points (MPs) run DFN's HADES IPPM software (OWD, OWDV, packet loss) and also iperf/BWCTL. PERT engineers can log in to the devices and run AB tests (they should use common sense so as not to overload low capacity or already heavily loaded paths).

BWCTL Measurement Points: NRENs can run BWCTL/iperf tests to, from and between GEANT2's BWCTL servers. See GeantToolsDanteBwctl for more information.

-- TobyRodwell - 05 Apr 2005 -- SimonLeinen - 20 Sep 2005 - 05 Aug 2011

Performance-Related Tools at SWITCH

  • IPv4 Looking Glass -- allows running IPv4 ping, traceroute, and several BGP monitoring commands on external border routers
  • IPv6 Looking Glass -- like the above, but for IPv6
  • Network Diagnostic Tester (NDT) -- run TCP throughput tests to a Web100-instrumented server, and detect configuration and cabling issues on your client. Java 1.4 or newer required
  • BWCTL & OWAMP testboxes -- run bwctl (iperf) and one-way delay tests (owamp) to our PMPs
  • SmokePing -- permanent RTT ("ping") measurements to select places on the Internet, with good visualization
  • RIPE TTM -- tt85 (at SWITCH's PoP at ETH Zurich) and tt86 (at the University of Geneva) show one-way delays, delay variations and loss from/to SWITCH from many other networks.
  • IEPM-BW -- regular throughput measurements from SLAC to many hosts, including one at SWITCH
  • Multicast beacons -- IPv4/IPv6 multicast reachability matrices

-- AlexGall - 15 Apr 2005

Revision 92006-03-21 - SimonLeinen

 

Network Performance Metrics

There are many metrics that are commonly used to characterize the performance of networks and parts of networks. We present the most important of these, explain what influences them, how they can be measured, how they influence end-to-end performance, and what can be done to improve them.

A framework for network performance metrics has been defined by the IETF's IP Performance Metrics (IPPM) Working Group in RFC 2330. The group also developed definitions for several specific performance metrics; those are referenced from the respective sub-topics.

References

-- SimonLeinen - 31 Oct 2004

One-way Delay (OWD)

One-way delay is the time it takes for a packet to reach its destination. It is considered a property of network links or paths. RFC 2679 contains the definition of the one-way delay metric of the IETF's IPPM (IP Performance Metrics) working group.

Decomposition

One-way delay along a network path can be decomposed into per-hop one-way delays, and these in turn into per-link and per-node delay components.

Per-link delay components: propagation delay and serialization delay

The link-specific component of one-way delay consists of two sub-components:

Propagation Delay is the time it takes for signals to move from the sending to the receiving end of the link. On simple links, this is the product of the link's physical length and the characteristic propagation speed. The velocities of propagation (VoP) of copper and optical fibre are similar (copper is slightly faster), at approximately 2/3 of the speed of light in a vacuum.

Serialization delay is the time it takes for a packet to be serialized into link transmission units (typically bits). It is the packet size (in bits) divided by the link's capacity (in bits per second).
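As an illustration, these two per-link components can be computed directly from their definitions. The link length, capacity, and packet size below are invented example figures:

```python
# Illustrative sketch: per-link delay components for an assumed 100 km fibre
# link carrying a 1500-byte packet at 1 Gbit/s.

SPEED_OF_LIGHT = 299_792_458.0   # m/s in vacuum
VOP = 2.0 / 3.0                  # ~2/3 of c, for both fibre and copper

def propagation_delay(length_m: float) -> float:
    """Time for the signal to traverse the link, in seconds."""
    return length_m / (VOP * SPEED_OF_LIGHT)

def serialization_delay(packet_bytes: int, capacity_bps: float) -> float:
    """Time to clock the packet's bits onto the link, in seconds."""
    return packet_bytes * 8 / capacity_bps

prop = propagation_delay(100_000)       # 100 km of fibre
ser = serialization_delay(1500, 1e9)    # 1500 bytes at 1 Gbit/s
print(f"propagation: {prop*1e3:.3f} ms, serialization: {ser*1e6:.1f} us")
```

Note that on such a link the propagation delay (about half a millisecond) dominates the serialization delay (a few microseconds); the balance shifts on slow links with large packets.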

In addition to the propagation and serialization delays, some types of links may introduce additional delays, for example to avoid collisions on a shared link, or when the link-layer transparently retransmits packets that were damaged during transmission.

Per-node delay components: forwarding delay, queueing delay

Within a network node such as a router, a packet experiences different kinds of delay between its arrival on one link and its departure on another (or the same) link:

Forwarding delay is the time it takes for the node to read forwarding-relevant information (typically the destination address and other headers) from the packet, compute the "forwarding decision" based on routing tables and other information, and to actually forward the packet towards the destination, which involves copying the packet to a different interface inside the node, rewriting parts of it (such as the IP TTL and any media-specific headers) and possibly other processing such as fragmentation, accounting, or checking access control lists.

Depending on router architecture, forwarding can compete for resources with other activities of the router. In this case, packets can be held up until the router's forwarding resource is available, which can take many milliseconds on a router with CPU-based forwarding. Routers with dedicated hardware for forwarding don't have this problem, although there may be delays when a packet arrives as the hardware forwarding table is being reprogrammed due to a routing change.

Queueing delay is the time a packet spends inside the node waiting for the output link to become available. Queueing delay depends on the amount of competing traffic towards the output link, and on the priorities of the packet itself and of the competing traffic. The amount of queueing that a given packet may encounter is hard to predict, but it is bounded by the buffer size available for queueing.

There can be causes for queueing other than contention on the outgoing link, for example contention on the node's backplane interconnect.

Impact on end-to-end performance

When studying end-to-end performance, it is usually more interesting to look at the following metrics that are derived from one-way delay:

  • Round-trip time (RTT) is the time from node A to B and back to A. It is the sum of the one-way delays from A to B and from B to A, plus the response time in B.
  • Delay Variation represents the variation in one-way delay. It is important for real-time applications. People often call this "jitter".

Measurement

One-way delays from a node A to a node B can be measured by sending timestamped packets from A, and recording the reception times at B. The difficulty is that A and B need clocks that are synchronized to each other. This is typically achieved by having clocks synchronized to a standard reference time such as UTC (Universal Time Coordinated) using techniques such as Global Positioning System (GPS)-derived time signals or the Network Time Protocol (NTP).

There are several infrastructures that continuously measure one-way delays and packet loss: The HADES Boxes in DFN and GÉANT2; RIPE TTM Boxes between various research and commercial networks, and RENATER's QoSMetrics boxes.

OWAMP and RUDE/CRUDE are examples of tools that can be used to measure one-way delay.

Improving delay

Shortest-(Physical)-Path Routing and Proper Provisioning

On high-speed wide-area network paths, delay is usually dominated by propagation times. Therefore, physical routing of network links plays an important role, as well as the topology of the network and the selection of routing metrics. Ensuring minimal delays is simply a matter of

  • using an internal routing protocol with "shortest-path routing" (such as OSPF or IS-IS) and a metric proportional to per-link delay
  • provisioning the network so that these shortest paths aren't congested.

This could be paraphrased as "use proper provisioning instead of traffic engineering".

The node contributions to delay can be addressed by:

  • using nodes with fast forwarding
  • avoiding queueing by provisioning links to accommodate typical traffic bursts
  • reducing the number of forwarding nodes

References

-- SimonLeinen - 31 Oct 2004 - 25 Jan 2007

Round-Trip Time (RTT)

Round-trip time (RTT) is the total time for a packet sent by a node A to reach its destination B, and for a response sent back by B to reach A.

Decomposition

The round-trip time is the sum of the one-way delays from A to B and from B to A, and of the time it takes B to send the response.

Impact on end-to-end performance

For window-based transport protocols such as TCP, the round-trip time influences the achievable throughput at a given window size, because there can only be a window's worth of unacknowledged data in the network, and the RTT is the lower bound for a packet to be acknowledged.

For interactive applications such as conversational audio/video, instrument control, or interactive games, the RTT represents a lower bound on response time, and thus impacts responsiveness directly.

Measurement

The round-trip time is often measured with tools such as ping (Packet InterNet Groper) or one of its cousins such as fping, which send ICMP Echo requests to a destination, and measure the time until the corresponding ICMP Echo Response messages arrive.

However, note that while the round-trip time reported by ping is a relatively precise measurement, some network devices process ICMP in a different way (and often at lower priority) than regular traffic, so the measured values may not correspond to the delays experienced by ordinary packets.

-- MichalPrzybylski - 19 Jul 2005

Improving the round-trip time

The network components of RTT are the one-way delays in both directions (which can use different paths), so see the OneWayDelay topic on how those can be improved. The speed of response generation can be improved through upgrades of the responding host's processing power, or by optimizing the responding program.

-- SimonLeinen - 31 Oct 2004


Bandwidth-Delay Product (BDP)

The BDP of a path is the product of the (bottleneck) bandwidth and the delay of the path. Its dimension is "information", because bandwidth (here) expresses information per time, and delay is a time (duration). Typically, one uses bytes as a unit, and it is often useful to think of BDP as the "memory capacity" of a path, i.e. the amount of data that fits entirely into the path (between two end-systems).

BDP is an important parameter for the performance of window-based protocols such as TCP. Network paths with a large bandwidth-delay product are called Long Fat Networks or "LFNs".
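The definition above translates into a one-line calculation. The path figures in the sketch below (a 1 Gbit/s bottleneck with a 150 ms round-trip time, roughly a trans-Atlantic path) are assumed for illustration:

```python
# Sketch: bandwidth-delay product, the "memory capacity" of a network path.
def bdp_bytes(bottleneck_bps: float, delay_s: float) -> float:
    """Bandwidth x delay, converted from bits to bytes."""
    return bottleneck_bps * delay_s / 8

# Assumed example: 1 Gbit/s bottleneck, 150 ms RTT.
print(bdp_bytes(1e9, 0.150))   # about 18.75 MB of data "in flight"
```

For a window-based protocol such as TCP, this is also the window size needed to keep such a path fully utilised.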

References

-- SimonLeinen - 14 Apr 2005

"Long Fat Networks" (LFNs)

Long Fat Networks (LFNs, pronounced like "elephants") are networks with a high bandwidth-delay product.

One of the issues with this type of network is that it can be challenging to achieve high throughput for individual data transfers with window-based transport protocols such as TCP. LFNs are thus a main focus of research on high-speed improvements for TCP.

-- SimonLeinen - 27 Oct 2004 - 17 Jun 2005

 

Delay Variation ("Jitter")

Delay variation or "jitter" is a metric that describes the level of disturbance of packet arrival times with respect to an "ideal" pattern, typically the pattern in which the packets were sent. Such disturbances can be caused by competing traffic (i.e. queueing), or by contention on processing resources in the network.

RFC 3393 defines an IP Delay Variation Metric (IPDV). This particular metric only compares the delays experienced by packets of equal size, on the grounds that delay is naturally dependent on packet size, because of serialization delay.

Delay variation is an issue for real-time applications such as audio/video conferencing systems. They usually employ a jitter buffer to eliminate the effects of delay variation.

Delay variation is related to packet reordering. But note that the RFC 3393 IPDV of a network can be arbitrarily low, even zero, even though that network reorders packets, because the IPDV metric only compares delays of equal-sized packets.

Decomposition

Jitter is usually introduced in network nodes (routers), as an effect of queueing or contention for forwarding resources, especially on CPU-based router architectures. Some types of links can also introduce jitter, for example through collision avoidance (shared Ethernet) or link-level retransmission (802.11 wireless LANs).

Measurement

Contrary to one-way delay, one-way delay variation can be measured without requiring precisely synchronized clocks at the measurement endpoints. Many tools that measure one-way delay also provide delay variation measurements.
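A small sketch can show why synchronized clocks are unnecessary: a constant offset between the sender's and receiver's clocks biases every measured one-way delay by the same amount, so the offset cancels when consecutive delays are subtracted (as in RFC 3393-style IPDV). All timestamps below are invented:

```python
# Sketch: delay variation measurement is immune to a constant clock offset.
send_times = [0.000, 0.020, 0.040, 0.060]   # sender clock (s)
true_owd   = [0.050, 0.052, 0.049, 0.055]   # actual one-way delays (s)
CLOCK_OFFSET = 12.34                        # receiver clock runs ahead (unknown)

# Receiver timestamps are biased by the unknown offset.
recv_times = [s + d + CLOCK_OFFSET for s, d in zip(send_times, true_owd)]

# Measured one-way delays are all wrong by the same constant...
measured = [r - s for r, s in zip(recv_times, send_times)]

# ...so the delay *differences* (IPDV samples) are still correct.
ipdv = [measured[i + 1] - measured[i] for i in range(len(measured) - 1)]
print([round(x, 3) for x in ipdv])
```

The same cancellation does not help with clock *drift* (a changing offset), which is why long-running delay-variation measurements still need reasonably stable clocks.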

References

The IETF IPPM (IP Performance Metrics) Working Group has formalized metrics for IPDV, and more recently started work on an "applicability statement" that explains how IPDV can be used in practice and what issues have to be considered.

  • RFC 3393, IP Packet Delay Variation Metric for IP Performance Metrics (IPPM), C. Demichelis, P. Chimento. November 2002.
  • draft-morton-ippm-delay-var-as-03.txt, Packet Delay Variation Applicability Statement, A. Morton, B. Claise, July 2007.

-- SimonLeinen - 28 Oct 2004 - 24 Jul 2007

Packet Loss

Packet loss is the probability of a packet to be lost in transit from a source to a destination.

A One-way Packet Loss Metric for IPPM is defined in RFC 2680. RFC 3357 contains One-way Loss Pattern Sample Metrics.

Decomposition

There are two main reasons for packet loss:

Congestion

When the offered load exceeds the capacity of a part of the network, packets are buffered in queues. Since these buffers are also of limited capacity, severe congestion can lead to queue overflows, which lead to packet drops. In this context, severe congestion could mean that a moderate overload condition holds for an extended amount of time, but could also consist of the sudden arrival of a very large amount of traffic (burst).

Errors

Another reason for loss of packets is corruption, where parts of the packet are modified in-transit. When such corruptions happen on a link (due to noisy lines etc.), this is usually detected by a link-layer checksum at the receiving end, which then discards the packet.

Impact on end-to-end performance

Bulk data transfers usually require reliable transmission, so lost packets must be retransmitted. In addition, congestion-sensitive protocols such as TCP must assume that packet loss is due to congestion, and reduce their transmission rate accordingly (although recently there have been some proposals to allow TCP to identify non-congestion related losses and treat those differently).

For real-time applications such as conversational audio/video, it usually doesn't make much sense to retransmit lost packets, because the retransmitted copy would arrive too late (see delay variation). The result of packet loss is usually a degradation in sound or image quality. Some modern audio/video codecs provide a level of robustness to loss, so that the effect of occasional lost packets is benign. On the other hand, some of the most effective image compression methods are very sensitive to loss, in particular those that use relatively rare "anchor frames" and represent the intermediate frames as compressed differences to these anchor frames; when such an anchor frame is lost, many other frames cannot be reconstructed.

Measurement

Packet loss can be actively measured by sending a set of packets from a source to a destination and comparing the number of received packets against the number of packets sent.
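A minimal sketch of this active approach, with an invented set of received probe sequence numbers:

```python
# Sketch: active loss measurement. Send `sent` numbered probe packets,
# record which sequence numbers arrive, and compute the loss ratio.

def loss_ratio(sent: int, received_seqnos: set) -> float:
    """Fraction of probes that never arrived (duplicates counted once)."""
    arrived = {s for s in received_seqnos if 0 <= s < sent}
    return (sent - len(arrived)) / sent

# Invented capture: probes 3 and 6 out of 10 were lost in transit.
received = {0, 1, 2, 4, 5, 7, 8, 9}
print(loss_ratio(10, received))   # 0.2, i.e. 20% loss
```

Real measurement tools send far more probes over longer intervals, since loss on well-provisioned paths is rare and bursty; a 10-packet sample is only for illustration.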

Network elements such as routers also contain counters for events such as checksum errors or queue drops, which can be retrieved through protocols such as SNMP. When this kind of access is available, this can point to the location and cause of packet losses.

Reducing packet loss

Congestion-induced packet loss can be avoided by proper provisioning of link capacities. Depending on the probability of bursts (which is somewhat difficult to estimate, taking into account both link capacities in the network, and the traffic rates and patterns of a possibly large number of hosts at the edge of the network), buffers in network elements such as routers must also be sufficient. Note that large buffers can be detrimental to other aspects of network performance, in particular one-way delay (and thus round-trip time) and delay variation.

Quality-of-Service mechanisms such as DiffServ or IntServ can be used to protect some subset of traffic against loss, but this necessarily increases packet loss for the remaining traffic.

Lastly, Active Queue Management (AQM) and Explicit Congestion Notification (ECN) can be used to mitigate both packet loss and queueing delay (and thus one-way delay, round-trip time and delay variation).

References

  • A One-way Packet Loss Metric for IPPM, G. Almes, S. Kalidindi, M. Zekauskas, September 1999, RFC 2680
  • Improving Accuracy in End-to-end Packet Loss Measurement, J. Sommers, P. Barford, N. Duffield, A. Ron, August 2005, SIGCOMM'05 (PDF)
-- SimonLeinen - 01 Nov 2004

Packet Reordering

The Internet Protocol (IP) does not guarantee that packets are delivered in the order in which they were sent. This was a deliberate design choice that distinguishes IP from protocols such as, for instance, ATM and IEEE 802.3 Ethernet.

Decomposition

A network may reorder packets, usually because of some kind of parallelism: either a choice of alternative routes (Equal-Cost Multipath, ECMP), or internal parallelism inside switching elements such as routers. One particular kind of packet reordering concerns packets of different sizes. Because a larger packet takes longer to transfer over a serial link (or a limited-width backplane inside a router), larger packets may be "overtaken" by smaller packets that were sent subsequently. This is usually not a concern for high-speed bulk transfers, where the segments tend to be equal-sized (hopefully Path MTU-sized), but may pose problems for naive implementations of multi-media (audio/video) transport.

Impact on end-to-end performance

In principle, applications that use a transport protocol such as TCP or SCTP don't have to worry about packet reordering, because the transport protocol is responsible for reassembling the byte stream (message stream(s) in the case of SCTP) into the original ordering. However, reordering can have a severe performance impact on some implementations of TCP. Recent TCP implementations, in particular those that support Selective Acknowledgements (SACK), can exhibit robust performance even in the face of reordering in the network.

Real-time media applications such as audio/video conferencing tools often experience problems when run over networks that reorder packets. This is somewhat remarkable in that all of these applications have jitter buffers to eliminate the effects of delay variation on the real-time media streams. Obviously, the code that manages these jitter buffers is often not written in a way to accommodate reordered packets sensibly, although this could be done with moderate effort.

Measurement

Packet reordering is measured by sending a numbered sequence of packets, and comparing the received sequence number sequence with the original. There are many possible ways of quantifying reordering, some of which are described in the IPPM Working Group's documents (see the references below).
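A much-simplified sketch of such a comparison, loosely following the "reordered packet" idea from the IPPM documents rather than any one formal metric:

```python
# Sketch: count reordered packets in a received sequence-number stream.
# A packet counts as reordered if its sequence number is smaller than the
# highest sequence number already seen (a simplification of the IPPM notion).

def count_reordered(arrivals: list) -> int:
    reordered = 0
    max_seen = -1
    for seq in arrivals:
        if seq < max_seen:
            reordered += 1     # arrived after a later-numbered packet
        else:
            max_seen = seq
    return reordered

# Invented arrival order: packets 2 and 5 were overtaken in the network.
print(count_reordered([0, 1, 3, 2, 4, 6, 5, 7]))   # 2
```

The formal metrics (Reorder Density, Reorder Buffer-Occupancy Density, etc.) additionally quantify *how far* packets were displaced, which matters for sizing receiver buffers.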

The measurements here can be done differently, depending on the measurement purpose:

a) Reordering for a particular application can be measured by capturing the application traffic (e.g. using the Wireshark/Ethereal tool), injecting the same traffic pattern via a traffic generator, and calculating the reordering.

b) The maximal reordering introduced by the network can be measured by injecting a relatively small amount of traffic at line rate, shaped as short bursts of long packets immediately followed by short bursts of short packets. After capture and calculation at the other end of the path, the results will reflect the worst packet reordering that may occur on the particular path.

For more information please refer to the measurement page, available at the packet reordering - measurement site. Note that although the tool is based on older versions of the reordering drafts, the metrics used are compatible with the definitions in the newer ones.

In particular, Reorder Density and Reorder Buffer-Occupancy Density can be obtained from tcpdump captures by subjecting the dump files to RATtool, available at the Reordering Analysis Tool website.

Improving reordering

The probability of reordering can be reduced by avoiding parallelism in the network, or by using network nodes that take care to use parallel paths in such a way that packets belonging to the same end-to-end flow are kept on a single path. This can be ensured by using a hash on the destination address, or the (source address, destination address) pair to select from the set of available paths. For certain types of parallelism this is hard to achieve, for example when a simple striping mechanism is used to distribute packets from a very high-speed interface to multiple forwarding paths.
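A toy sketch of such hash-based selection; the hash function and the choice of header fields are illustrative, not any particular vendor's algorithm:

```python
# Sketch: hash-based selection among parallel paths. Because the path index
# depends only on flow-identifying fields, all packets of one flow follow
# the same path, and per-flow ordering is preserved.

import zlib

def select_path(src_ip: str, dst_ip: str, n_paths: int) -> int:
    """Map a flow (here just the address pair) to one of n_paths."""
    key = f"{src_ip}->{dst_ip}".encode()
    return zlib.crc32(key) % n_paths

# Every packet of the same flow maps to the same path index:
p1 = select_path("192.0.2.1", "198.51.100.7", 4)
p2 = select_path("192.0.2.1", "198.51.100.7", 4)
print(p1 == p2)   # True: the flow stays on one path
```

Real routers typically hash the full five-tuple (addresses, protocol, ports) so that different flows between the same hosts can still use different paths.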

References

-- SimonLeinen - 2004-10-28 - 2014-12-29
-- MichalPrzybylski - 2005-07-19
-- BartoszBelter - 2006-03-28


TCP (Transmission Control Protocol)

The Transmission Control Protocol (TCP) is the prevalent transport protocol used on the Internet today. It uses window-based transmission to provide the service of a reliable byte stream, and adapts the rate of transfer to the state (of congestion) of the network and the receiver. Basic mechanisms include:

  • Segments that fit into IP packets, into which the byte-stream is split by the sender,
  • a checksum for each segment,
  • a Window, which bounds the amount of data "in flight" between the sender and the receiver,
  • Acknowledgements, by which the receiver tells the sender about segments that were received successfully,
  • a flow-control mechanism, which regulates data rate as a function of network and receiver conditions.

Originally specified in September 1981 RFC 793, TCP was clarified, refined and extended in many documents. Perhaps most importantly, congestion control was introduced in "TCP Tahoe" in 1988, described in Van Jacobson's 1988 SIGCOMM article on "Congestion Avoidance and Control". It can be said that TCP's congestion control is what keeps the Internet working when links are overloaded. In today's Internet, the enhanced "Reno" variant of congestion control is probably the most widespread.

RFC 7323 (formerly RFC 1323) specifies a set of "TCP Extensions for High Performance", namely the Window Scaling Option, which provides for much larger windows than the original 64K, the Timestamp Option and the PAWS (Protection Against Wrapped Sequence numbers) mechanism. These extensions are supported by most contemporary TCP stacks, although they frequently must be activated explicitly (or implicitly by configuring large TCP windows).

Another widely implemented performance enhancement to TCP is selective acknowledgements (SACK, RFC 2018). In TCP as originally specified, the acknowledgements ("ACKs") sent by a receiver were always "cumulative", that is, they specified the last byte of the part of the stream that was completely received. With SACK, the receiver can additionally report non-contiguous blocks of data that arrived successfully, so the sender need only retransmit what was actually lost. This is advantageous with large TCP windows, in particular where chances are high that multiple segments within a single window are lost. RFC 2883 describes an extension to SACK which makes TCP more robust in the presence of packet reordering in the network.

In addition, RFC 3449 - TCP Performance Implications of Network Path Asymmetry provides excellent information since a vast majority of Internet connections are asymmetrical.

Further Reading

References

  • RFC 7414, A Roadmap for Transmission Control Protocol (TCP) Specification Documents, M. Duke, R. Braden, W. Eddy, E. Blanton, A. Zimmermann, February 2015
  • RFC 793, Transmission Control Protocol, J. Postel, September 1981
  • draft-ietf-tcpm-rfc793bis-10, Transmission Control Protocol Specification, Wesley M. Eddy, July 2018
  • RFC 7323, TCP Extensions for High Performance, D. Borman, B. Braden, V. Jacobson, R. Scheffenegger, September 2014 (obsoletes RFC 1323 from May 1993)
  • RFC 2018, TCP Selective Acknowledgement Options, M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, October 1996
  • RFC 5681, TCP Congestion Control, M. Allman, V. Paxson, E. Blanton, September 2009
  • RFC 3449, TCP Performance Implications of Network Path Asymmetry, H. Balakrishnan, V. Padmanabhan, G. Fairhurst, M. Sooriyabandara, December 2002

There is ample literature on TCP, in particular research literature on its performance and large-scale behaviour.

Juniper has a nice white paper, Supporting Differentiated Service Classes: TCP Congestion Control Mechanisms (PDF format) explaining TCP's congestion control as well as many of the enhancements proposed over the years.

-- Simon Leinen - 2004-10-27 - 2018-07-02 -- Ulrich Schmid - 2005-05-31

Window-Based Transmission

TCP is a sliding-window protocol. The receiver tells the sender the available buffer space at the receiver (TCP header field "window"). The total window size is the minimum of sender buffer size, advertised receiver window size and congestion window size.

The sender may transmit up to this amount of data before having to wait for a further buffer update from the receiver, and should not have more than this amount of data in transit in the network. The sender must buffer the sent data until it has been acknowledged (ACKed) by the receiver, so that the data can be retransmitted if necessary. With each ACK, the acknowledged data leaves the window, and a new segment can be sent if it fits within the (possibly updated) window.

Due to TCP's flow control mechanism, the TCP window size can limit the maximum theoretical throughput regardless of the bandwidth of the network path. Too small a TCP window can degrade network performance below expectations, and too large a window can cause similar problems during error recovery.

The TCP window size is the most important parameter for achieving maximum throughput across high-performance networks. To reach the maximum transfer rate, the TCP window should be no smaller than the bandwidth-delay product.

Window size (bytes) >= Bandwidth (bytes/sec) x Round-trip time (sec)

Example:

window size: 8192 bytes
round-trip time: 100ms
maximum throughput: < 0.62 Mbit/sec.
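The example figures can be checked with a short calculation (the ~0.62 figure above corresponds to "binary" megabits of 2^20 bits):

```python
# Checking the worked example: TCP throughput is bounded by window/RTT,
# because at most one window of data can be in flight per round trip.

def max_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    return window_bytes * 8 / rtt_s

bps = max_throughput_bps(8192, 0.100)   # 8 KB window, 100 ms RTT
print(bps)            # 655360.0 bit/s
print(bps / 2**20)    # 0.625 "binary" Mbit/s, the ~0.62 figure above
```

Inverting the same formula gives the window needed for a target rate: sustaining 1 Gbit/s over the same 100 ms path would require a window of roughly 12.5 MB.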

References


-- UlrichSchmid & SimonLeinen - 31 May-07 Jun 2005

Large TCP Windows

In order to achieve high data rates with TCP over "long fat networks", i.e. network paths with a large bandwidth-delay product, TCP sinks (that is, hosts receiving data transported by TCP) must advertise a large TCP receive window (referred to as just 'the window', since there is not an equivalent advertised 'send window').

The window is a 16-bit value (bytes 15 and 16 in the TCP header) and so, in TCP as originally specified, it is limited to a value of 65535 (64K). The receive window sets an upper limit on the sustained throughput achievable over a TCP connection, since it represents the maximum amount of unacknowledged data (in bytes) there can be on the TCP path. Mathematically, achievable throughput can never be more than WINDOW_SIZE/RTT, so for a trans-Atlantic link with, say, an RTT (Round-Trip Time) of 150ms, the throughput is limited to a maximum of 3.4Mbps. With the emergence of "long fat networks", the limit of 64K bytes (on some systems even just 32K bytes!) was clearly insufficient, and so RFC 7323 laid down (amongst other things) a way of scaling the advertised window, such that the 16-bit window value can represent numbers larger than 64K.

RFC 7323 extensions

RFC 7323, TCP Extensions for High Performance, (and formerly RFC 1323) defines several mechanisms to enable high-speed transfers over LFNs: Window Scaling, TCP Timestamps, and Protection Against Wrapped Sequence numbers (PAWS).

The TCP window scaling option increases the maximum window size from 64KB to 1GByte, by shifting the window field left by up to 14 bits. The window scale option is used only during the TCP 3-way handshake (both sides must set the window scale option in their SYN segments if scaling is to be used in either direction).
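A small sketch of the arithmetic involved; the field values are examples, and real stacks negotiate the shift count in the SYN segments rather than computing it per packet:

```python
# Sketch: effective TCP window under RFC 7323 window scaling. The 16-bit
# advertised window is shifted left by the negotiated scale factor.

def effective_window(advertised: int, shift: int) -> int:
    """Advertised 16-bit window scaled by 2**shift (shift is 0..14)."""
    assert 0 <= advertised <= 0xFFFF and 0 <= shift <= 14
    return advertised << shift

print(effective_window(65535, 0))    # 65535: classic TCP, no scaling
print(effective_window(65535, 14))   # 1073725440, just under 1 GB
```

Note the trade-off: a large shift coarsens the window granularity, since the receiver can then only advertise windows in multiples of 2^shift bytes.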

It is important to use the TCP timestamps option with large TCP windows. With this option, each segment carries a timestamp; the receiver returns that timestamp in each ACK, which allows the sender to estimate the RTT. The timestamps also solve the problem of wrapped sequence numbers that can occur with large windows (PAWS, Protection Against Wrapped Sequences).

(Auto) Configuration

In the past, most operating systems required manual tuning to use large TCP windows. The OS-specific tuning section contains information on how to do this for a set of operating systems.

Since around 2008-2010, many popular operating systems will use large windows and the necessary protocol options by default, thanks to TCP Buffer Auto-Tuning.

Can TCP Windows ever be too large?

There are several potential issues when TCP Windows are larger than necessary:

  1. When there are many active TCP connection endpoints (sockets) on a system - such as a popular Web or file server - a large TCP window size will lead to high consumption of system (kernel) memory. This can have a number of negative consequences: the system may run out of buffer space so that no new connections can be opened, or the high occupancy of kernel memory (which typically must reside in actual RAM and cannot be "paged out" to disk) can "starve" other processes of access to fast memory (cache and RAM).
  2. Large windows can cause large "bursts" of consecutive segments/packets. When there is a bottleneck in the path - because of a slower link or because of cross-traffic - these bursts will fill up the buffers in the network device (router or switch) in front of that bottleneck. The larger these bursts, the higher the risk that the buffer overflows, causing multiple segments to be dropped. So a large window can lead to "sawtooth" behavior and worse link utilisation than with a just-big-enough window, where TCP could operate at a steady rate.

Several methods for automatic TCP buffer tuning have been developed to resolve these issues, some of which have been implemented in (at least) recent Linux versions.

References

-- SimonLeinen - 27 Oct 2004 - 27 Sep 2014


TCP Window Scaling Option

TCP as originally specified only allows for windows up to 65536 bytes, because the window-related protocol fields all have a width of 16 bits. This prevents the use of large TCP windows, which are necessary to achieve high transmission rates over paths with a large bandwidth*delay product.

The Window Scaling Option is one of several options defined in RFC 7323, "TCP Extensions for High Performance". It allows TCP to advertise and use a window larger than 65536 bytes. The way this works is that in the initial handshake, a TCP speaker announces a scaling factor, which is a power of two between 2^0 (no scaling) and 2^14, to allow for an effective window of up to 2^30 bytes (one Gigabyte). Window Scaling only comes into effect when both ends of the connection advertise the option (even if just with a scaling factor of 2^0).
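The scaling arithmetic itself is just a left shift of the 16-bit window field by the announced factor; a sketch:

```python
# Effective window = advertised 16-bit window field * 2**shift,
# where shift (0..14) is the scale factor announced in the SYN.

def effective_window(window_field, shift):
    """Window in bytes, given the raw header field and the scale factor."""
    assert 0 <= window_field <= 0xFFFF and 0 <= shift <= 14
    return window_field << shift

# Maximum effective window: 65535 * 2**14, just under 1 GByte:
print(effective_window(0xFFFF, 14))  # 1073725440 bytes
# With shift 0, the classic unscaled semantics are preserved:
print(effective_window(0xFFFF, 0))   # 65535 bytes
```

Note that the shift is fixed for the lifetime of the connection; only the 16-bit field varies afterwards, which is why missing the handshake makes later ACKs ambiguous (see "Diagnostic issues" below).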

As explained under the large TCP windows topic, Window Scaling is typically used in conjunction with the TCP Timestamps option and PAWS (Protection Against Wrapped Sequence numbers), both defined in RFC 7323.

32K limitations on some systems when Window Scaling is not used

Some systems, notably Linux, limit the usable window in the absence of Window Scaling to 32 Kilobytes. The reasoning behind this is that some (other) systems used in the past were broken by larger windows, because they erroneously interpreted the 16-bit values related to TCP windows as signed, rather than unsigned, integers.

This limitation can have the effect that users (or applications) that request window sizes between 32K and 64K, but don't have Window Scaling enabled, do not actually benefit from the desired window sizes.

The artificial limitation was finally removed from the Linux kernel sources on March 21, 2006. The old behavior (window limited to 32KB when Window Scaling isn't used) can still be enabled through a sysctl tuning parameter, in case one really wants to interoperate with TCP implementations that are broken in the way described above.

Problems with window scaling

Middlebox issues

MiddleBoxes which do not understand window scaling may cause very poor performance, as described in WindowScalingProblems. Windows Vista limits the scaling factor for HTTP transactions to 2 to avoid some of the problems, see WindowsOSSpecific.

Diagnostic issues

The Window Scaling option has an impact on how the offered window in TCP ACKs should be interpreted. This can cause problems when one is faced with incomplete packet traces that lack the initial SYN/SYN+ACK handshake where the options are negotiated: The SEQ/ACK sequence may look incorrect because scaling cannot be taken into account. One effect of this is that tcpgraph or similar analysis tools (included in tools such as WireShark) may produce incoherent results.

References

  • RFC 7323, TCP Extensions for High Performance, D. Borman, B. Braden, V. Jacobson, R. Scheffenegger (Ed.), September 2014 (obsoletes RFC 1323 from May 1993)
  • git note about the Linux kernel modification that got rid of the 32K limit, R. Jones and D.S. Miller, March 2006. Can also be retrieved as a patch.

-- SimonLeinen - 21 Mar 2006 - 27 September 2014
-- PekkaSavola - 07 Nov 2006
-- AlexGall - 31 Aug 2007 (added reference to scaling limitation in Windows Vista)

 

TCP Flow Control

Note: This topic describes the Reno enhancement of classical "Van Jacobson" or Tahoe congestion control. There have been many suggestions for improving this mechanism - see the topic on high-speed TCP variants.

TCP flow control and window size adjustment are mainly based on two key mechanisms: Slow Start and Additive Increase/Multiplicative Decrease (AIMD), also known as Congestion Avoidance. (RFC 793 and RFC 5681)

Slow Start

To prevent a starting TCP connection from flooding the network, a Slow Start mechanism was introduced in TCP. This mechanism effectively probes the path to find the available bandwidth.

In addition to the window advertised by the receiver, a Congestion Window (cwnd) value is used and the effective window size is the lesser of the two. The starting value of the cwnd window is set initially to a value that has been evolving over the years, the TCP Initial Window. After each acknowledgment, the cwnd window is increased by one MSS. By this algorithm, the data rate of the sender doubles each round-trip time (RTT) interval (actually, taking into account Delayed ACKs, rate increases by 50% every RTT). For a properly implemented version of TCP this increase continues until:

  • the advertised window size is reached
  • congestion (packet loss) is detected on the connection.
  • there is no traffic waiting to take advantage of an increased window (i.e. cwnd should only grow if it needs to)

When congestion is detected, the TCP flow-control mode is changed from Slow Start to Congestion Avoidance. Note that some TCP implementations maintain cwnd in units of bytes, while others use units of full-sized segments.
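The doubling-per-RTT behavior can be sketched as follows (byte-counting cwnd, every full-sized segment ACKed, no delayed ACKs; the MSS value is an assumption for illustration):

```python
# Slow Start: cwnd grows by one MSS per ACK, which doubles cwnd each
# RTT when every segment is acknowledged individually.

MSS = 1460  # bytes; a typical Ethernet-derived segment size (assumption)

def slow_start_rtt(cwnd):
    """cwnd after one RTT of Slow Start with per-segment ACKs."""
    acks = cwnd // MSS          # one ACK per full-sized segment sent
    return cwnd + acks * MSS    # one MSS added per ACK -> doubling

cwnd = 10 * MSS                 # e.g. an initial window of 10 segments
for _ in range(3):
    cwnd = slow_start_rtt(cwnd)
print(cwnd // MSS)              # 80 segments: doubled three times from 10
```

With Delayed ACKs (one ACK per two segments), the same loop would grow cwnd by only 50% per RTT, as noted above.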

Congestion Avoidance

Once congestion is detected (through timeout and/or duplicate ACKs), the data rate is reduced in order to let the network recover.

Slow Start uses an exponential increase in window size and thus also in data rate. Congestion Avoidance uses a linear growth function (additive increase). This is achieved by introducing - in addition to the cwnd window - a slow start threshold (ssthresh).

As long as cwnd is less than ssthresh, Slow Start applies. Once ssthresh is reached, cwnd is increased by at most one segment per RTT. The cwnd window continues to open with this linear rate until a congestion event is detected.

When congestion is detected, ssthresh is set to half the cwnd (or to be strictly accurate, half the "Flight Size". This distinction is important if the implementation lets cwnd grow beyond rwnd (the receiver's declared window)). cwnd is either set to 1 if congestion was signalled by a timeout, forcing the sender to enter Slow Start, or to ssthresh if congestion was signalled by duplicate ACKs and the Fast Recovery algorithm has terminated. In either case, once the sender enters Congestion Avoidance, its rate has been reduced to half the value at the time of congestion. This multiplicative decrease causes the cwnd to close exponentially with each detected loss event.
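The window adjustments described above can be sketched as follows (Reno-style, cwnd in bytes; the RFC 5681 lower bound of 2*SMSS on ssthresh is included):

```python
# Reaction to a congestion event in Reno-style TCP:
#   ssthresh <- max(FlightSize / 2, 2 * SMSS)
#   cwnd     <- 1 MSS on a timeout (re-enter Slow Start),
#               ssthresh after Fast Recovery (duplicate ACKs).

def on_congestion(flight_size, mss, timeout):
    """Return (ssthresh, cwnd) after a congestion event, in bytes."""
    ssthresh = max(flight_size // 2, 2 * mss)  # RFC 5681 lower bound
    cwnd = mss if timeout else ssthresh        # timeout forces Slow Start
    return ssthresh, cwnd

print(on_congestion(64_000, 1_460, timeout=True))   # (32000, 1460)
print(on_congestion(64_000, 1_460, timeout=False))  # (32000, 32000)
```

Either way, repeated loss events halve the operating point each time, which produces the familiar sawtooth rate pattern.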

Fast Retransmit

In Fast Retransmit, the arrival of three duplicate ACKs is interpreted as packet loss, and retransmission starts before the retransmission timer (RTO) expires.

The missing segment will be retransmitted immediately without going through the normal retransmission queue processing. This improves performance by eliminating delays that would suspend effective data flow on the link.

Fast Recovery

Fast Recovery is used to react quickly to a single packet loss. In Fast Recovery, the receipt of 3 duplicate ACKs, while being taken to mean the loss of a segment, does not result in a full Slow Start. This is because later segments obviously got through, and hence congestion is not stopping everything. In Fast Recovery, ssthresh is set to half of the current send window size, the missing segment is retransmitted (Fast Retransmit), and cwnd is set to ssthresh plus three segments. Each additional duplicate ACK indicates that one segment has left the network at the receiver, and cwnd is increased by one segment to allow the transmission of another segment if permitted by the new cwnd. When an ACK is received for new data, cwnd is reset to ssthresh, and TCP enters Congestion Avoidance mode.
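A sketch of the Fast Recovery bookkeeping described above (function and variable names are illustrative, not from any particular stack):

```python
# Fast Recovery (Reno): on the third duplicate ACK, halve ssthresh and
# temporarily inflate cwnd by the three segments known to have left the
# network; deflate back to ssthresh when new data is ACKed.

def fast_recovery_entry(flight_size, mss):
    """(ssthresh, cwnd) on receipt of the third duplicate ACK, in bytes."""
    ssthresh = max(flight_size // 2, 2 * mss)
    cwnd = ssthresh + 3 * mss   # inflate by the three departed segments
    return ssthresh, cwnd

mss = 1_460
ssthresh, cwnd = fast_recovery_entry(80_000, mss)
cwnd += mss                      # one more duplicate ACK arrives
cwnd = ssthresh                  # ACK for new data: deflate, leave recovery
print(ssthresh, cwnd)            # 40000 40000
```

After this, the sender is in Congestion Avoidance at half its pre-loss rate, as the Congestion Avoidance section describes.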

References

-- UlrichSchmid - 07 Jun 2005 -- SimonLeinen - 27 Jan 2006 - 23 Jun 2011

High-Speed TCP Variants

There have been numerous ideas for improving TCP over the years. Some of those ideas have been adopted by mainstream operating systems (after thorough review). Recently there has been an uptick in work towards improving TCP's behavior with LongFatNetworks. The classic congestion control algorithms have been shown to limit the efficiency of network resource utilization on such paths. The various new TCP implementations differ mainly in how they define the size of the congestion window; some of them require extra feedback from the network. Generally, such congestion control protocols can be divided into several classes.

An orthogonal technique for improving TCP's performance is automatic buffer tuning.

Comparative Studies

All papers about individual TCP-improvement proposals contain comparisons against older TCP to quantify the improvement. There are several studies that compare the performance of the various new TCP variants, including

Other References

  • Gigabit TCP, G. Huston, The Internet Protocol Journal, Vol. 9, No. 2, June 2006. Contains useful descriptions of many modern TCP enhancements.
  • Faster, G. Huston, June 2005. This article from ISP Column looks at various approaches to TCP transfers at very high rates.
  • Congestion Control in the RFC Series, M. Welzl, W. Eddy, July 2006, Internet-Draft (work in progress)
  • ICCRG Wiki, IRTF (Internet Research Task Force) ICCRG (Internet Congestion Control Research Group), includes bibliography on congestion control.
  • RFC 6077: Open Research Issues in Internet Congestion Control, D. Papadimitriou (Ed.), M. Welzl, M. Scharf, B. Briscoe, February 2011

Related Work

  • PFLDnet (International Workshop on Protocols for Fast Long-Distance Networks): 2003, 2004, 2005, 2006, 2007, 2009, 2010. (The 2008 pages seem to have been removed from the Net.)

-- Simon Leinen - 2005-05-28–2016-09-18 -- Orla Mc Gann - 2005-10-10 -- Chris Welti - 2008-02-25

eXplicit Control Protocol (XCP)

eXplicit Control Protocol (XCP) is a congestion control protocol developed by Dina Katabi of the MIT Computer Science & Artificial Intelligence Laboratory. It extracts information about congestion from the routers along the path between the endpoints. It is more complicated to implement than other proposed Internet congestion control protocols.

XCP-capable routers make a fair per-flow bandwidth allocation without maintaining any per-flow congestion state; that state is carried in the packets themselves. To request a desired throughput, the sender adds a congestion header (the XCP header) between the IP and transport headers. This enables the sender to learn about the bottleneck on the path from sender to receiver within a single round trip.

To increase the congestion window size of a TCP connection, the sender requires feedback from the network about the maximum available throughput along the path. The routers update this information in the congestion header as the packet travels from the sender to the receiver. The receiver's main task is to copy the network feedback into outgoing packets belonging to the same bidirectional flow.

The congestion header consists of four fields as follows:

  • Round-Trip Time (RTT): The current round-trip time for the flow.
  • Throughput: The throughput used currently by the sender.
  • Delta_Throughput: The value by which the sender would like to change its throughput. This field is updated by the routers along the path; if one of the routers sets a negative value, the sender must slow down.
  • Reverse_Feedback: The value copied from the Delta_Throughput field and returned by the receiver to the sender; it contains the maximum feedback allocation from the network.
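The four-field header and a router's clamping of Delta_Throughput could be modelled as follows (the field widths and packing format are illustrative assumptions, not the actual XCP wire format):

```python
import struct

# Hypothetical fixed-width encoding of the four XCP header fields.
# "!IiiI": rtt (u32, e.g. microseconds), throughput (i32),
# delta_throughput (i32, may be negative), reverse_feedback (u32).
FMT = "!IiiI"

def pack_header(rtt_us, throughput, delta, reverse_feedback):
    return struct.pack(FMT, rtt_us, throughput, delta, reverse_feedback)

def router_update(header, available):
    """A router clamps Delta_Throughput to what it can spare."""
    rtt, thr, delta, rev = struct.unpack(FMT, header)
    return pack_header(rtt, thr, min(delta, available), rev)

hdr = pack_header(150_000, 1_000_000, 500_000, 0)  # sender asks for +500 kb/s
hdr = router_update(hdr, 200_000)                  # bottleneck grants less
print(struct.unpack(FMT, hdr)[2])                  # 200000
```

The receiver would then copy the surviving Delta_Throughput into the Reverse_Feedback field of packets flowing back to the sender.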

An XCP-capable router runs two control algorithms, executed periodically. The first, implemented in the congestion controller, is responsible for determining the maximum utilization of the outbound port. The fairness controller is responsible for distributing that throughput fairly among the flows sharing the link.

Benefits of XCP

  • it achieves fairness and rapidly converges to fairness,
  • it achieves maximum link utilization (better bottleneck link utilization),
  • it learns about the bottleneck much more rapidly,
  • it is potentially applicable to any transport protocol,
  • it is more stable on long-distance connections with large RTTs.

Deployment issues and issues with XCP

  • it has to be tested in real network environments,
  • solutions for tunneling protocols still need to be specified,
  • it has to be supported by all routers along the path,
  • the workload for the routers is higher.

References

-- WojtekSronek - 08 Jul 2005

Revision 8, 2006-03-21 - SimonLeinen


QoSMetrics Boxes

This measurement infrastructure was built as part of RENATER's Métrologie efforts. The infrastructure performs continuous delay measurement between a mesh of measurement points on the RENATER backbone. The measurements are sent to a central server every minute, and are used to produce a delay matrix and individual delay histories (IPv4 and IPv6 measurements, soon for each CoS).

Please note that all results (tables and graphs) presented on this page are generated by RENATER's scripts from the QoSMetrics MySQL database.

Screendump from RENATER site (http://pasillo.renater.fr/metrologie/get_qosmetrics_results.php)

new_Renater_IPPM_results_screendump.jpg

Example of asymmetric path:

(after the problem was resolved, only one of the paths had its hop count decreased)

Graphs:

Paris_Toulouse_delay Toulouse_Paris_delay
Paris_Toulouse_jitter Toulouse_Paris_jitter
Paris_Toulouse_hop_number Toulouse_Paris_hop_number
Paris_Toulouse_pktsLoss Toulouse_Paris_pktsLoss

Traceroute results:

traceroute to 193.51.182.xx (193.51.182.xx), 30 hops max, 38 byte packets
1 209.renater.fr (195.98.238.209) 0.491 ms 0.440 ms 0.492 ms
2 gw1-renater.renater.fr (193.49.159.249) 0.223 ms 0.219 ms 0.252 ms
3 nri-c-g3-0-50.cssi.renater.fr (193.51.182.6) 0.726 ms 0.850 ms 0.988 ms
4 nri-b-g14-0-0-101.cssi.renater.fr (193.51.187.18) 11.478 ms 11.336 ms 11.102 ms
5 orleans-pos2-0.cssi.renater.fr (193.51.179.66) 50.084 ms 196.498 ms 92.930 ms
6 poitiers-pos4-0.cssi.renater.fr (193.51.180.29) 11.471 ms 11.459 ms 11.354 ms
7 bordeaux-pos1-0.cssi.renater.fr (193.51.179.254) 11.729 ms 11.590 ms 11.482 ms
8 toulouse-pos1-0.cssi.renater.fr (193.51.180.14) 17.471 ms 17.463 ms 17.101 ms
9 xx.renater.fr (193.51.182.xx) 17.598 ms 17.555 ms 17.600 ms

[root@CICT root]# traceroute 195.98.238.xx
traceroute to 195.98.238.xx (195.98.238.xx), 30 hops max, 38 byte packets
1 toulouse-g3-0-20.cssi.renater.fr (193.51.182.202) 0.200 ms 0.189 ms 0.111 ms
2 bordeaux-pos2-0.cssi.renater.fr (193.51.180.13) 16.850 ms 16.836 ms 16.850 ms
3 poitiers-pos1-0.cssi.renater.fr (193.51.179.253) 16.728 ms 16.710 ms 16.725 ms
4 nri-a-pos5-0.cssi.renater.fr (193.51.179.17) 22.969 ms 22.956 ms 22.972 ms
5 nri-c-g3-0-0-101.cssi.renater.fr (193.51.187.21) 22.972 ms 22.961 ms 22.844 ms
6 gip-nri-c.cssi.renater.fr (193.51.182.5) 17.603 ms 17.582 ms 17.616 ms
7 250.renater.fr (193.49.159.250) 17.836 ms 17.718 ms 17.719 ms
8 xx.renater.fr (195.98.238.xx) 17.606 ms 17.707 ms 17.226 ms

-- FrancoisXavierAndreu & SimonLeinen - 06 Jun 2005 - 06 Jul 2005 - 07 Apr 2006

Passive Measurement Tools

Network Usage:

  • SNMP-based tools retrieve network information such as state and utilization of links, router CPU loads, etc.: MRTG, Cricket
  • Netflow-based tools use flow-based accounting information from routers for traffic analysis, detection of routing and security problems, denial-of-service attacks etc.

Traffic Capture and Analysis Tools

-- FrancoisXavierAndreu - 06 Jun 2005 -- SimonLeinen - 05 Jan 2006

SNMP-based tools

The Simple Network Management Protocol (SNMP) is widely used for device monitoring over IP, especially for monitoring infrastructure components of such networks, e.g. routers and switches. There are many tools that use SNMP to retrieve network information such as the state and utilization of links, routers' CPU load, etc., and generate various types of visualizations, other reports, and alarms.

  • MRTG (Multi-Router Traffic Grapher) is a widely-used open-source tool that polls devices every five minutes using SNMP, and plots the results over day, week, month, and year timescales.
  • Cricket serves the same purpose as MRTG, but is configured in a different way - targets are organized in a tree, which allows inheritance of target specifications. Cricket has been using RRDtool from the start, although MRTG has picked that up (as an option) as well.
  • RRDtool is a round-robin database used to store periodic measurements such as SNMP readings. Recent values are stored at finer time granularity than older values, and RRDtool incrementally "consolidates" values as they age, and uses an efficient constant-size representation. RRDtool is agnostic to SNMP, but it is used by Cricket, MRTG (optionally), and many other SNMP tools.
  • Synagon is a Python-based tool that performs SNMP measurements (Juniper firewall counters) at much shorter timescales (seconds) and graphs them in real time. This can be used to look at momentary link utilizations, within the limits of how often devices actually update their SNMP values.
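The basic computation behind all of these tools, turning two successive readings of a monotonically increasing SNMP octet counter into an average rate while allowing for counter wrap, can be sketched as follows (a simplified illustration; real tools must also handle device reboots and 64-bit counters):

```python
# Average rate from two readings of an SNMP octet counter (e.g.
# ifInOctets), taken poll_interval seconds apart. A 32-bit counter may
# wrap between polls; modular subtraction absorbs a single wrap.

def rate_bps(prev, curr, poll_interval, counter_bits=32):
    """Average rate in bits per second between two counter readings."""
    delta = (curr - prev) % (2 ** counter_bits)  # octets, wrap-safe
    return delta * 8 / poll_interval

# A normal 5-minute MRTG-style poll:
print(rate_bps(1_000_000, 1_300_000, 300))       # 8000.0 bit/s
# The counter wrapped between the two polls:
print(rate_bps(4_294_000_000, 1_000_000, 300))
```

Note that on fast links a 32-bit counter can wrap more than once within a 5-minute interval, which is undetectable with this method; that is why high-speed interfaces are polled via 64-bit counters or at shorter intervals.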

-- FrancoisXavierAndreu - 06 Jun 2005 -- SimonLeinen - 05 Jan 2006 - 16 Jul 2007


Synagon SNMP Graphing Tool

Overview

Synagon is a Python-based program that collects and graphs firewall counter values from Juniper routers. The tool first requires the user to select a router from a list. It then retrieves all the available firewall filters/counters from that router, and the user can select any of these to graph. It was developed by DANTE for internal use, but could be made available on request (on a case-by-case basis), in an entirely unsupported manner. Please contact operations@dante.org.uk for more details.

[The information below is taken from DANTE's internal documentation.]

Installation and Setup procedures

Synagon comes as a single compressed archive that can be copied into a directory and then invoked like any other standard Python script.

It requires the following python libraries to operate:

1) The Python SNMP framework (available at http://pysnmp.sourceforge.net)

The SNMP framework is required for speaking SNMP to the routers. For guidance on how to install the package, please see the section below.

You also need to create a network.py file that describes the network topology where the tool will act upon. The configuration files should be located in the same directory as the tool itself. Please look at the section below for more information on the format of the file.

  • Example Synagon screen:
    Example Synagon screen

How to use Synagon

The tool can be invoked by double clicking on the file name (under windows) or by typing the following on the command shell:

python synagon.py

Prior to that, a network.py configuration file must have been created in the directory from which the tool is invoked. It describes the location of the routers and the SNMP community to use when contacting them.

Immediately after, a window with a single menu entry will appear. By clicking on the menu, a list of all routers described in the network.py will be given as submenus.

Under each router, a submenu shows the list of firewall filters for that router (depending on whether you have previously collected the list), together with a ‘collect’ submenu. The collect submenu initiates the collection of the firewall filters from the router and then updates the list of firewall filters in the submenu where the collect button is located. The list of firewall filters of each router is also stored on the local disk, so the next time the tool is invoked the firewall list is taken from that file instead of interrogating the router again. When the firewall filter list is retrieved, a submenu with all counters and policers for each filter is displayed. A policer is marked with [p] and a counter with [c] just after its name.

When a counter or policer of a firewall filter of a router is selected, SNMP polling for that entity begins. After a few seconds, a window appears that plots the requested values.

The graph shows the rate of bytes dropped since monitoring had started for a particular counter or the rate of packets dropped since monitor