Difference: BasicGuide (1 vs. 6)

Revision 6 - 2017-01-31 - KurtBaumann


DS3.3.3a - Basic Guide to Network Performance

Revision 5 - 2006-11-17 - PekkaSavola


Operating-System Specific Configuration Hints

This topic points to configuration hints for specific operating systems.

Note that the main perspective of these hints is how to achieve good performance in a big bandwidth-delay-product environment, typically with a single stream. See the ServerScaling topic for some guidance on tuning servers for many concurrent connections.

-- SimonLeinen - 27 Oct 2004 - 02 Nov 2008
-- PekkaSavola - 02 Oct 2008

OS-Specific Configuration Hints: BSD Variants

Performance tuning

Earlier BSD versions use very modest buffer sizes and do not enable Window Scaling by default. See the references below for how to change these settings.

In contrast, FreeBSD 7.0 introduced TCP buffer auto-tuning and should therefore provide good TCP performance out of the box, even over LFNs. This release also implements large-send offload (LSO) and large-receive offload (LRO) for some Ethernet adapters. The FreeBSD 7.0 release announcement highlights the following among its technological advances:

10Gbps network optimization: With optimized device drivers from all major 10gbps network vendors, FreeBSD 7.0 has seen extensive optimization of the network stack for high performance workloads, including auto-scaling socket buffers, TCP Segment Offload (TSO), Large Receive Offload (LRO), direct network stack dispatch, and load balancing of TCP/IP workloads over multiple CPUs on supporting 10gbps cards or when multiple network interfaces are in use simultaneously. Full vendor support is available from Chelsio, Intel, Myricom, and Neterion.

Recent TCP Work

The FreeBSD Foundation has awarded a grant for the project "Improvements to the FreeBSD TCP Stack" to Lawrence Stewart at Swinburne University. Goals for this project include support for Appropriate Byte Counting (ABC, RFC 3465), merging SIFTR into the FreeBSD codebase, and improving the implementation of the reassembly queue. Information is available at http://caia.swin.edu.au/urp/newtcp/. Other improvements by this group include support for modular congestion control, implementations of CUBIC, H-TCP and TCP Vegas, the SIFTR TCP instrumentation tool, and a testing framework including improvements to iperf for better buffer size control.

The CUBIC implementation for FreeBSD was announced on the ICCRG mailing list in September 2009. Currently available as patches for the 7.0 and 8.0 kernels, it is planned to be merged "into the mainline FreeBSD source tree in the not too distant future".

In February 2010, a set of Software for FreeBSD TCP R&D was announced by the Swinburne group: This includes modular TCP congestion control, Hamilton TCP (H-TCP), a newer "Hamilton Delay" Congestion Control Algorithm v0.1, Vegas Congestion Control Algorithm v0.1, as well as a kernel helper/hook framework ("Khelp") and a module (ERTT) to improve RTT estimation.

Another release in August 2010 added a "CAIA-Hamilton Delay" congestion control algorithm as well as revised versions of the other components.

QoS tools

On BSD systems, ALTQ implements several queueing/scheduling algorithms for network interfaces, as well as some other QoS mechanisms.

To use ALTQ on a FreeBSD 5.x or 6.x box, the following steps are necessary (a configuration sketch follows the list):

  1. build a kernel with ALTQ
    • options ALTQ and some others beginning with ALTQ_ should be added to the kernel config. Please refer to the ALTQ(4) man page.
  2. define QoS settings in /etc/pf.conf
  3. use pfctl to apply those settings
    • Set pf_enable to YES in /etc/rc.conf (and set the other pf-related variables according to your needs) so that the QoS settings are applied every time the host boots.
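The sketch below illustrates these steps. The interface name (em0), scheduler (CBQ), bandwidth figures and queue names are only examples and need to be adapted to the local setup; see the ALTQ(4) and pf.conf(5) man pages for the full syntax.

# kernel configuration: enable ALTQ and the CBQ scheduler
options ALTQ
options ALTQ_CBQ

# /etc/pf.conf: a simple two-queue setup on em0, giving SSH traffic its own queue
altq on em0 cbq bandwidth 100Mb queue { std, ssh }
queue std bandwidth 90% cbq(default)
queue ssh bandwidth 10% cbq(borrow)
pass out on em0 proto tcp to port 22 queue ssh

# load the rules and enable pf
pfctl -f /etc/pf.conf
pfctl -e

# /etc/rc.conf: apply the settings at every boot
pf_enable="YES"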

References

-- SimonLeinen - 27 Jan 2005 - 18 Aug 2010
-- PekkaSavola - 17 Nov 2006

Linux-Specific Network Performance Tuning Hints

Linux has its own implementation of the TCP/IP Stack. With recent kernel versions, the TCP/IP implementation contains many useful performance features. Parameters can be controlled via the /proc interface, or using the sysctl mechanism. Note that although some of these parameters have ipv4 in their names, they apply equally to TCP over IPv6.

A typical configuration for high TCP throughput over paths with a high bandwidth*delay product would include settings such as those shown in the sections below in /etc/sysctl.conf. A description of each parameter can be found in the section Linux IP Parameters.

Basic tuning

TCP Socket Buffer Tuning

See the EndSystemTcpBufferSizing topic for general information about sizing TCP buffers.

Since the 2.6.17 kernel, buffers have sensible, automatically calculated values for most uses. Unless the path has a very high RTT or loss, or very high performance is required (200+ Mbit/s), buffer settings may not need to be tuned at all.

Nonetheless, the following values may be used:

net/core/rmem_max=16777216
net/core/wmem_max=16777216
net/ipv4/tcp_rmem="8192 87380 16777216"
net/ipv4/tcp_wmem="8192 65536 16777216"
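These settings can be placed in /etc/sysctl.conf (and loaded with sysctl -p), or applied and checked at runtime as sketched below; sysctl accepts both the slash and the dot notation.

# apply at runtime
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="8192 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="8192 65536 16777216"

# verify the current values
cat /proc/sys/net/ipv4/tcp_rmem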

With kernels older than 2.4.27 or 2.6.7, receive-side autotuning may not be implemented, and the default (middle) value should be increased (at the cost of higher default memory consumption):

net/ipv4/tcp_rmem="8192 16777216 16777216"

NOTE: If you have a server with hundreds of connections, you might not want to use a large default value for TCP buffers, as memory may quickly run out.

There is a subtle but important implementation detail in the socket buffer management of Linux. When setting either the send or receive buffer size via the SO_SNDBUF and SO_RCVBUF socket options with setsockopt(2), the value passed in the system call is doubled by the kernel to accommodate buffer management overhead. Reading the value back with getsockopt(2) returns this modified value, but the effective buffer available to TCP payload is still the original value.

The limits net/core/rmem_max and net/core/wmem_max apply to the value passed to setsockopt(2).

In contrast, the maximum values in net/ipv4/tcp_rmem and net/ipv4/tcp_wmem apply to the total buffer size, including the factor of 2 for the buffer management overhead. As a consequence, those values must be chosen twice as large as required by a particular BandwidthDelayProduct. Also note that net/core/rmem_max and net/core/wmem_max do not apply to the TCP autotuning mechanism.

Interface queue lengths

InterfaceQueueLength describes how to adjust interface transmit and receive queue lengths. This tuning is typically needed with GE or 10GE transfers.

Host/adapter architecture implications

When going for 300 Mbit/s performance, it is worth verifying that the host architecture (e.g., the PCI bus) is fast enough. PCI Express is usually fast enough to no longer be the bottleneck in 1 Gb/s and even 10 Gb/s applications.

For the older PCI/PCI-X buses, when going for 2+ Gbit/s performance, the Maximum Memory Read Byte Count (MMRBC) usually needs to be increased using setpci.

Many network adapters support features such as checksum offload. In some cases, however, these may even decrease performance. In particular, TCP Segmentation Offload (TSO) may need to be disabled, with:

ethtool -K eth0 tso off
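Before and after such a change, it can be worth checking which offload features the driver currently has enabled; lowercase -k queries the settings, uppercase -K changes them (the interface name is an example):

# show current offload settings for eth0
ethtool -k eth0

# re-enable TSO later if disabling it did not help
ethtool -K eth0 tso on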

Advanced tuning

Sharing congestion information across connections/hosts

2.4-series kernels have a TCP/IP weakness in that the maximum window size used for a route is based on the experience of previous connections: if you have loss at any point (or a badly behaving end host on the same route), your future TCP connections over that route are limited as well. You may then have to flush the route cache to restore performance.

net.ipv4.route.flush=1

2.6 kernels also remember some performance characteristics across connections. In benchmarks and other tests, this might not be desirable.

# don't cache ssthresh from previous connection
net.ipv4.tcp_no_metrics_save=1

Other TCP performance variables

If there is packet reordering in the network, reordering may too easily be interpreted as packet loss. Increasing the tcp_reordering parameter might help in that case:

net/ipv4/tcp_reordering=20   # (default=3)

Several variables already have good default values, but it may make sense to check that these defaults haven't been changed:

net/ipv4/tcp_timestamps=1
net/ipv4/tcp_window_scaling=1
net/ipv4/tcp_sack=1
net/ipv4/tcp_moderate_rcvbuf=1

TCP Congestion Control algorithms

Linux 2.6.13 introduced pluggable congestion control modules, which allow you to select one of the high-speed TCP congestion control variants, e.g. CUBIC:

net/ipv4/tcp_congestion_control = cubic

Alternative values include highspeed (HS-TCP), scalable (Scalable TCP), htcp (Hamilton TCP), bic (BIC), reno ("Reno" TCP), and westwood (TCP Westwood).

Note that on Linux 2.6.19 and later, CUBIC is already used as the default algorithm.
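To see which algorithms the running kernel offers and to switch between them, something like the following can be used; non-default algorithms are usually built as modules and loaded on demand:

# list algorithms currently available for selection
sysctl net.ipv4.tcp_available_congestion_control

# load an additional module, e.g. Hamilton TCP
modprobe tcp_htcp

# select it as the system-wide default
sysctl -w net.ipv4.tcp_congestion_control=htcp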

Web100 kernel tuning

If you are using a web100 kernel, the following parameters seem to improve networking performance even further:

# web100 tuning
# turn off using txqueuelen as part of congestion window computation
net/ipv4/WAD_IFQ = 1

QoS tools

Modern Linux kernels have flexible traffic shaping built in.

See the Linux traffic shaping example for an illustration of how these mechanisms can be used to solve a real performance problem.

References

See also the reference to Congestion Control in Linux in the FlowControl topic.

-- SimonLeinen - 27 Oct 2004 - 23 Jul 2009
-- ChrisWelti - 27 Jan 2005
-- PekkaSavola - 17 Nov 2006
-- AlexGall - 28 Nov 2016


Interface queue lengths

txqueuelen

The txqueuelen parameter of an interface in the Linux kernel limits the number of packets in the transmission queue of the interface's device driver. In 2.6-series kernels and e.g. RHEL3 (2.4.21), the default is 1000; in earlier kernels, the default is 100. A value of 100 is often too low to support line-rate transfers over Gigabit Ethernet interfaces, and in some cases even 1000 is too low.

For Gigabit Ethernet interfaces, it is suggested to use a txqueuelen of at least 1000 (values of up to 8000 have been used successfully to further improve performance), e.g.,

ifconfig eth0 txqueuelen 1000
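On systems where the iproute2 tools are preferred over ifconfig, the same setting can be made and verified with ip link (interface name and value are examples):

# set the transmit queue length
ip link set dev eth0 txqueuelen 8000

# verify - the current value is shown as "qlen"
ip link show dev eth0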

If a host has low performance or slow links, too large a txqueuelen may disturb interactive performance.

netdev_max_backlog

The sysctl net.core.netdev_max_backlog defines the queue size for received packets. In recent kernels (such as 2.6.18), the default is 1000; in older ones, it is 300. If the interface receives packets (e.g., in a burst) faster than the kernel can process them, this queue could overflow. A value on the order of thousands should be reasonable for GE, tens of thousands for 10GE.

For example,

net/core/netdev_max_backlog=2500
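Whether the backlog queue actually overflows can be checked in /proc/net/softnet_stat: on recent kernels, the second hexadecimal column of each CPU's row counts packets that were dropped because the backlog was full.

# one row per CPU; a growing second column indicates backlog overflows
cat /proc/net/softnet_stat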

References

-- SimonLeinen - 06 Jan 2005
-- PekkaSavola - 17 Nov 2006

 

OS-Specific Configuration Hints: Mac OS X

As Mac OS X is mainly a BSD derivative, you can use similar mechanisms to tune the TCP stack - see under BsdOSSpecific.

TCP Socket Buffer Tuning

See the EndSystemTcpBufferSizing topic for general information about sizing TCP buffers.

For testing temporary improvements, you can use sysctl directly in a terminal window (you have to be root to do this):

sysctl -w kern.ipc.maxsockbuf=8388608
sysctl -w net.inet.tcp.rfc1323=1
sysctl -w net.inet.tcp.sendspace=1048576
sysctl -w net.inet.tcp.recvspace=1048576
sysctl -w kern.maxfiles=65536
sysctl -w net.inet.udp.recvspace=147456
sysctl -w net.inet.udp.maxdgram=57344
sysctl -w net.local.stream.recvspace=65535
sysctl -w net.local.stream.sendspace=65535

For permanent changes that persist across reboots, insert the appropriate settings into /etc/sysctl.conf. If this file does not exist, you must create it. For the settings above, just add the following lines to sysctl.conf:

kern.ipc.maxsockbuf=8388608
net.inet.tcp.rfc1323=1
net.inet.tcp.sendspace=1048576
net.inet.tcp.recvspace=1048576
kern.maxfiles=65536
net.inet.udp.recvspace=147456
net.inet.udp.maxdgram=57344
net.local.stream.recvspace=65535
net.local.stream.sendspace=65535

Note: this only works for Mac OS X 10.3 or later! For earlier versions, you need to use /etc/rc, where you can enter complete sysctl commands.

Users who are unfamiliar with terminal windows can also use the GUI tool "TinkerTool System" with its Network Tuning option to set the TCP buffers.

TinkerTool System is available from:

-- ChrisWelti - 30 Jun 2005

OS-Specific Configuration Hints: Solaris (Sun Microsystems)

Planned Features

Pluggable Congestion Control for TCP and SCTP

A proposed OpenSolaris project foresees the implementation of pluggable congestion control for both TCP and SCTP, together with HS-TCP and several other congestion control algorithms for OpenSolaris. This includes implementations of the HighSpeed, CUBIC, Westwood+, and Vegas congestion control algorithms, as well as ipadm subcommands and socket options to get and set congestion control parameters.

On 15 December 2009, Artem Kachitchkine posted an initial draft of a work-in-progress design specification for this feature on the OpenSolaris networking-discuss forum. According to this proposal, the API for setting the congestion control mechanism for a specific TCP socket will be compatible with Linux: there will be a TCP_CONGESTION socket option to set and retrieve a socket's congestion control algorithm, as well as a TCP_INFO socket option to retrieve various kinds of information about the current congestion control state.

The entire pluggable congestion control mechanism will be implemented for SCTP in addition to TCP. For example, there will also be an SCTP_CONGESTION socket option. Note that congestion control in SCTP is somewhat trickier than in TCP, because a single SCTP socket can have multiple underlying paths through SCTP's "multi-homing" feature. Congestion control state must be kept separately for each path (address pair). This also means that there is no direct SCTP equivalent to TCP_INFO. The current proposal adds a subset of TCP_INFO's information to the result of the existing SCTP_GET_PEER_ADDR_INFO option for getsockopt().

The internal structure of the code will be somewhat different to what is in the Linux kernel. In particular, the general TCP code will only make calls to the algorithm-specific congestion control modules, not vice versa. The proposed Solaris mechanism also contains ipadm properties that can be used to set the default congestion control algorithm either globally or for a specific zone. The proposal also suggests "observability" features; for example, pfiles output should include the congestion algorithm used for a socket, and there are new kstat statistics that count certain congestion-control events.

Useful Features

TCP Multidata Transmit (MDT, aka LSO)

Solaris 10, and Solaris 9 with patches, supports TCP Multidata Transmit (MDT), which is Sun's name for (software-only) Large Send Offload (LSO). In Solaris 10, this is enabled by default, but in Solaris 9 (with the required patches for MDT support), the kernel and driver have to be reconfigured to be able to use MDT. See the following pointers for more information from docs.sun.com:

Solaris 10 "FireEngine"

The TCP/IP stack in Solaris 10 has been largely rewritten from previous versions, mostly to improve performance. In particular, it supports Interrupt Coalescence, integrates TCP and IP more closely in the kernel, and provides multiprocessing enhancements to distribute work more efficiently over multiple processors. Ongoing work includes UDP/IP integration for better performance of UDP applications, and a new driver architecture that can make use of flow classification capabilities in modern network adapters.

Solaris 10: New Network Device Driver Architecture

Solaris 10 introduces GLDv3 (project "Nemo"), a new driver architecture that generally improves performance, and adds support for several performance features. Some, but not all, Ethernet device drivers were ported over to the new architecture and benefit from those improvements. Notably, the bge driver was ported early, and the new "Neptune" adapters ("multithreaded" dual-port 10GE and four-port GigE with on-board connection de-multiplexing hardware) used it from the start.

Darren Reed has posted a small C program that lists the active acceleration features for a given interface. Here's some sample output:

$ sudo ./ifcapability
lo0 inet
bge0 inet +HCKSUM(version=1 +full +ipv4hdr) +ZEROCOPY(version=1 flags=0x1) +POLL
lo0 inet6
bge0 inet6 +HCKSUM(version=1 +full +ipv4hdr) +ZEROCOPY(version=1 flags=0x1)

Displaying and setting link parameters with dladm

Another OpenSolaris project called Brussels unifies many aspects of network driver configuration under the dladm command. For example, link MTUs (for "Jumbo Frames") can be configured using

dladm set-linkprop -p mtu=9000 web1

The command can also be used to look at current physical parameters of interfaces:

$ sudo dladm show-phys
LINK         MEDIA                STATE      SPEED DUPLEX   DEVICE
bge0         Ethernet             up         1000 full      bge0
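Individual link properties, such as the MTU set in the earlier example, can be checked with the show-linkprop subcommand (link name as above):

dladm show-linkprop -p mtu web1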

Note that Brussels is still being integrated into Solaris. Driver support was added since SXCE (Solaris Express Community Edition) build 83 for some types of adapters. Eventually this should be integrated into regular Solaris releases.

Setting TCP buffers

# To increase the maximum tcp window
# Rule-of-thumb: max_buf = 2 x cwnd_max (congestion window)
ndd -set /dev/tcp tcp_max_buf 4194304
ndd -set /dev/tcp tcp_cwnd_max 2097152

# To increase the DEFAULT tcp window size
ndd -set /dev/tcp tcp_xmit_hiwat 65536
ndd -set /dev/tcp tcp_recv_hiwat 65536
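The current values can be read back with ndd, which is useful for verifying a change; note that ndd settings do not survive a reboot and are typically re-applied from a boot-time script:

# read current settings
ndd -get /dev/tcp tcp_max_buf
ndd -get /dev/tcp tcp_xmit_hiwat
ndd -get /dev/tcp tcp_recv_hiwat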

Pitfall when using asymmetric send and receive buffers

The documented default behaviour (tunable TCP parameter tcp_wscale_always = 0) of Solaris is to include the TCP window scaling option in an initial SYN packet when either the send or the receive buffer is larger than 64KiB. From the tcp(7P) man page:

          For all applications, use ndd(1M) to modify the  confi-
          guration      parameter      tcp_wscale_always.      If
          tcp_wscale_always is set to 1, the window scale  option
          will  always be set when connecting to a remote system.
          If tcp_wscale_always is 0, the window scale option will
          be set only if the user has requested a send or receive
          window  larger  than  64K.   The   default   value   of
          tcp_wscale_always is 0.

However, Solaris 8, 9 and 10 do not take the send window into account. This results in an unexpected behaviour for a bulk transfer from node A to node B when the bandwidth-delay product is larger than 64KiB and

  • A's receive buffer (tcp_recv_hiwat) < 64KiB
  • A's send buffer (tcp_xmit_hiwat) > 64KiB
  • B's receive buffer > 64KiB

A will not advertise the window scaling option and, in accordance with RFC 1323, B will then not do so either. As a consequence, throughput will be limited by a window of 64KiB.

As a workaround, the window scaling option can be forcibly advertised by setting

# ndd -set /dev/tcp tcp_wscale_always 1

A bug report has been filed with Sun Microsystems.

References

-- ChrisWelti - 11 Oct 2005, added section for setting default and maximum TCP buffers
-- AlexGall - 26 Aug 2005, added section on pitfall with asymmetric buffers
-- SimonLeinen - 27 Jan 2005 - 16 Dec 2009

Windows-Specific Host Tuning

Next Generation TCP/IP Stack in Windows Vista / Windows Server 2008 / Windows 7

According to http://www.microsoft.com/technet/community/columns/cableguy/cg0905.mspx, the new version of Windows, Vista, features a redesigned TCP/IP stack. Besides unifying IPv4/IPv6 dual-stack support, this new stack also claims much-improved performance for high-speed and asymmetric networks, as well as several auto-tuning features.

The Microsoft Windows Vista Operating System enables the TCP Window Scaling option by default (previous Windows OSes had this option disabled). This causes problems with various middleboxes, see WindowScalingProblems. As a consequence, the scaling factor is limited to 2 for HTTP traffic in Vista.

Another new feature of Vista is Compound TCP, a high-performance TCP variant that uses delay information to adapt its transmission rate (in addition to loss information). On Windows Server 2008 it is enabled by default, but it is disabled in Vista. You can enable it in Vista by running the following command as an Administrator on the command line.

netsh interface tcp set global congestionprovider=ctcp
(see CompoundTCP for more information).
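The currently active global TCP parameters, including the congestion control provider, can be displayed with:

netsh interface tcp show global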

Although the new TCP/IP stack has integrated auto-tuning for receive buffers, the TCP send buffer still seems to be limited to 16KB by default. This means you will not be able to get good upload rates on connections with higher RTTs. However, using the registry hack described further below, you can manually change the default value to something higher.

Performance tuning for earlier Windows versions

Enabling TCP Window Scaling

The references below detail the various "whys" and "hows" of tuning Windows network performance. Of particular note, however, is the setting of scalable TCP receive windows (see LargeTcpWindows).

It appears that, by default, not only does Windows not support RFC 1323 scalable windows, but the required key is not even present in the Windows registry. The key (Tcp1323Opts) can be added in one of two places, depending on the particular system:

[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\VxD\MSTCP] (Windows'95)
[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters] (Windows 2000/XP)

Value Name: Tcp1323Opts
Data Type: REG_DWORD (DWORD Value)
Value Data: 0, 1, 2 or 3

  • 0 = disable RFC 1323 options
  • 1 = window scale enabled only
  • 2 = time stamps enabled only
  • 3 = both options enabled

This can also be adjusted using the DrTCP GUI tool (see link below).
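On Windows 2000/XP the value can also be created from the command line with reg add, as sketched below; a reboot is typically required for the change to take effect:

reg add "HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters" /v Tcp1323Opts /t REG_DWORD /d 3 /f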

TCP Buffer Sizes

See the EndSystemTcpBufferSizing topic for general information about sizing TCP buffers.

Buffers for TCP Send Windows

Inquiry at Microsoft (thanks to Larry Dunn) has revealed that the default send window is 8KB and that there is no official support for configuring a system-wide default. However, the current Winsock implementation uses the following undocumented registry key for this purpose

[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\AFD\Parameters]

Value Name: DefaultSendWindow
Data Type: REG_DWORD (DWORD Value)
Value Data: The window size in bytes. The maximum value is unknown.

According to Microsoft, this parameter may not be supported in future Winsock releases. However, this parameter is confirmed to be working with Windows XP, Windows 2000, Windows Vista and even the Windows 7 Beta 1.

This value needs to be adjusted manually (e.g., DrTCP can't do it), or the application needs to set the send buffer itself (e.g., iperf with the '-w' option). It may be difficult to detect this as a bottleneck, because the host will advertise large windows but will only have about 8.5KB of data in flight.

Buffers for TCP Receive Windows

(From this TechNet article.)

Assuming Window Scaling is enabled, the following parameter can be set to between 1 and 1073741823 bytes:

[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters]

Value Name: TcpWindowSize
Data Type: REG_DWORD (DWORD Value)
Value Data: The window size in bytes. 1 to 65535 (or 1 to 1073741823 if Window scaling set).

DrTCP can also be used to adjust this parameter.

Windows XP SP2/SP3 simultaneous connection limit

Windows XP SP2 and SP3 limit the number of simultaneous TCP connection attempts (the so-called half-open state) to 10 by default. Exactly how this is enforced is not clear, but the limit may affect some kinds of applications; in particular, file-sharing applications such as BitTorrent might suffer from this behaviour. You can check whether your performance is affected by looking at your event logs:

EventID 4226: TCP/IP has reached the security limit imposed on the number of concurrent TCP connect attempts

A patcher has been developed to adjust this limit.

References

-- TobyRodwell - 28 Feb 2005 (initial version)
-- SimonLeinen - 06 Apr 2005 (removed dead pointers; added new ones; descriptions)
-- AlexGall - 17 Jun 2005 (added send window information)
-- HankNussbacher - 19 Jun 2005 (added new MS webcast for 2003 Server)
-- SimonLeinen - 12 Sep 2005 (added information and pointer about new TCP/IP in Vista)
-- HankNussbacher - 22 Oct 2006 (added Cisco IOS firewall upgrade)
-- AlexGall - 31 Aug 2007 (replaced IOS issue with a link to WindowScalingProblems, added info for scaling limitation in Vista)
-- SimonLeinen - 04 Nov 2007 (information on how to enable Compound TCP on Vista)
-- PekkaSavola - 05 Jun 2008 (major reorganization to improve the clarity wrt send buffer adjustment, added a link to DrTCP, added XP SP3 connection limiting discussion)
-- ChrisWelti - 02 Feb 2009 (added new information regarding the send window in Windows Vista / Server 2008 / 7)

Revision 4 - 2006-07-06 - SimonLeinen


Packet Capture and Analysis Tools

Tools to detect protocol problems via the analysis of packets and to support troubleshooting:

  • tcpdump: Packet capture tool based on the libpcap library. http://www.tcpdump.org/
  • snoop: tcpdump equivalent in Sun's Solaris operating system
  • Wireshark (formerly called Ethereal): Extended tcpdump with a user-friendly GUI. Its plugin architecture allowed many protocol-specific "dissectors" to be written. Also includes some analysis tools useful for performance debugging.
  • libtrace: A library for trace processing; works with libpcap (tcpdump) and other formats.
  • Netdude: The Network Dump data Displayer and Editor is a framework for inspection, analysis and manipulation of tcpdump trace files.
  • jnettop: Captures traffic coming across the host it is running on and displays streams sorted by the bandwidth they use.
  • tcptrace: a statistics/analysis tool for TCP sessions using tcpdump capture files
  • capturing packets with Endace DAG cards (dagsnap, dagconvert, ...)

General Hints for Taking Packet Traces

Capture enough data - you can always throw away stuff later. Note that tcpdump's default capture length is small (96 bytes), so use -s 0 (or something like -s 1540) if you are interested in payloads. Wireshark and snoop capture entire packets by default. Seemingly unrelated traffic can impact performance. For example, Web pages from foo.example.com may load slowly because of the images from adserver.example.net. In situations of high background traffic, it may however be necessary to filter out unrelated traffic.

It can be extremely useful to collect packet traces from multiple points in the network

  • On the endpoints of the communication
  • Near "suspicious" intermediate points such as firewalls

Synchronized clocks (e.g. by NTP) are very useful for matching traces.

Address-to-name resolution can slow down the display and cause additional traffic, which confuses the trace. With tcpdump, consider using -n or tracing to a file (-w file).

Request remote packet traces

When a packet trace from a remote site is required, this often means having to ask someone at that site to provide it. When requesting such a trace, consider making this as easy as possible for the person having to do it. Try to use a packet tracing tool that is already available - tcpdump for most BSD or Linux systems, snoop for Solaris machines. Windows doesn't seem to come bundled with a packet capturing program, but you can direct the user to Wireshark, which is reasonably easy to install and use under Windows. Try to give clear instructions on how to call the packet capture program. It is usually best to ask the user to capture to a file, and then send you the capture file, e.g. as an e-mail attachment.

References

  • Presentation slides from a short talk about packet capturing techniques given at the network performance section of the January 2006 GEANT2 technical workshop

-- FrancoisXavierAndreu - 06 Jun 2005
-- SimonLeinen - 05 Jan 2006-09 Apr 2006
-- PekkaSavola - 26 Oct 2006

tcpdump

Tcpdump is one of the earliest diagnostic tools for TCP/IP; it was written by Van Jacobson, Craig Leres, and Steven McCanne. It can be used to capture and decode packets in real time, or to capture packets to files (in "libpcap" format, see below) and analyze (decode) them later.

There are now more elaborate and, in some ways, more user-friendly packet capturing programs, such as Wireshark (formerly called Ethereal), but tcpdump is widely available and widely used, so it is very useful to know how to use it.

Tcpdump/libpcap is still actively being maintained, although not by its original authors.

libpcap

Libpcap is a library that is used by tcpdump, and also names a file format for packet traces. This file format - usually used in files with the extension .pcap - is widely supported by packet capture and analysis tools.

Selected options

Some useful options to tcpdump include:

  • -s snaplen capture snaplen bytes of each frame. By default, tcpdump captures only the first 68 bytes, which is sufficient to capture IP/UDP/TCP/ICMP headers, but usually not payload or higher-level protocols. If you are interested in more than just headers, use -s 0 to capture packets without truncation.
  • -r filename read from a previously created capture file (see -w)
  • -w filename dump to a file instead of analyzing on-the-fly
  • -i interface capture on an interface other than the default (first "up" non-loopback interface). Under Linux, -i any can be used to capture on all interfaces, albeit with some restrictions.
  • -c count stop the capture after count packets
  • -n don't resolve addresses, port numbers, etc. to symbolic names - this avoids additional DNS traffic when analyzing a live capture.
  • -v verbose output
  • -vv more verbose output
  • -vvv even more verbose output

Also, a pcap filter expression (BPF syntax) can be appended to the command so as to filter the captured packets. An expression is made up of one or more of "type", "direction" and "protocol"; a complete command line combining options and a filter is shown after the examples below.

  • type Can be host, net or port. host is presumed unless otherwise specified
  • dir Can be src, dst, src or dst, or src and dst
  • proto (for protocol) Common types are ether, ip, tcp, udp, arp ... If none is specified then all protocols for which the value is valid are considered.
Example expressions:
  • dst host <address>
  • src host <address>
  • udp dst port <number>
  • host <host> and not port ftp and not port ftp-data
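Putting options and a filter expression together, a typical capture invocation might look as follows; interface, address and file name are placeholders:

# capture full packets of one host's HTTP traffic to a file, without name resolution
tcpdump -n -i eth0 -s 0 -w trace.pcap 'host 192.0.2.1 and tcp port 80'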

Usage examples

Capture a single (-c 1) udp packet to file test.pcap:

: root@diotima[tmp]; tcpdump -c 1 -w test.pcap udp
tcpdump: listening on bge0, link-type EN10MB (Ethernet), capture size 96 bytes
1 packets captured
3 packets received by filter
0 packets dropped by kernel

This produces a binary file containing the captured packet as well as a small file header and a timestamp:

: root@diotima[tmp]; ls -l test.pcap
-rw-r--r-- 1 root root 114 2006-04-09 18:57 test.pcap
: root@diotima[tmp]; file test.pcap
test.pcap: tcpdump capture file (big-endian) - version 2.4 (Ethernet, capture length 96)

Analyze the contents of the previously created capture file:

: root@diotima[tmp]; tcpdump -r test.pcap
reading from file test.pcap, link-type EN10MB (Ethernet)
18:57:28.732789 2001:630:241:204:211:43ff:fee1:9fe0.32832 > ff3e::beac.10000: UDP, length: 12

Display the same capture file in verbose mode:

: root@diotima[tmp]; tcpdump -v -r test.pcap
reading from file test.pcap, link-type EN10MB (Ethernet)
18:57:28.732789 2001:630:241:204:211:43ff:fee1:9fe0.32832 > ff3e::beac.10000: [udp sum ok] UDP, length: 12 (len 20, hlim 118)

More examples with some advanced tcpdump use cases.

References

-- SimonLeinen - 2006-03-04 - 2016-03-17


Wireshark®

Wireshark is a packet capture/analysis tool, similar to tcpdump but much more elaborate. It has a graphical user interface (GUI) which allows "drilling down" into the header structure of captured packets. In addition, it has a "plugin" architecture that allows decoders ("dissectors" in Wireshark terminology) to be written with relative ease. This and the general user-friendliness of the tool have resulted in Wireshark supporting an abundance of network protocols - presumably writing Wireshark dissectors is often part of the work of developing/implementing new protocols. Lastly, Wireshark includes some nice graphical analysis/statistics tools, covering much of the functionality of tcptrace and xplot.

One of the main attractions of Wireshark is that it works nicely under Microsoft Windows, although it requires a third-party library to implement the equivalent of the libpcap packet capture library.

Wireshark used to be called Ethereal™, but was renamed in June 2006, when its principal maintainer changed employers. Version 1.8 adds support for capturing on multiple interfaces in parallel, simplified management of decryption keys (for 802.11 WLANs and IPsec/ISAKMP), and geolocation for IPv6 addresses. Some TCP events (fast retransmits and TCP Window updates) are no longer flagged as warnings/errors. The default format for saving capture files is changing from pcap to pcap-ng. Wireshark 1.10, announced in June 2013, adds many features including Windows 8 support and response-time analysis for HTTP requests.

Usage examples

The following screenshot shows Ethereal 0.10.14 under Linux/Gnome when called as ethereal -r test.pcap, reading the single-packet example trace generated in the tcpdump example. The "data" portion of the UDP part of the packet has been selected by clicking on it in the middle pane, and the corresponding bytes are highlighted in the lower pane.

ethereal -r test.pcap screendump

tshark

The package includes a command-line tool called tshark, which can be used in a similar (but not quite compatible) way to tcpdump. Through complex command-line options, it can give access to some of the more advanced decoding functionality of Wireshark. Because it generates text, it can be used as part of analysis scripts.
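For example, assuming a capture file test.pcap, tshark can apply Wireshark display filters and print selected fields as plain text (recent releases use -Y for display filters, older ones -R):

# show only TCP retransmissions from the capture
tshark -r test.pcap -Y "tcp.analysis.retransmission"

# print selected fields for further processing in scripts
tshark -r test.pcap -T fields -e frame.time_relative -e ip.src -e ip.dst -e tcp.len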

Scripting

Wireshark can be extended using scripting languages. The Lua language has been supported for several years. Version 1.4.0 (released in September 2010) added preliminary support for Python as an extension language.

CloudShark

In a cute and possibly even useful application of tshark, QA Cafe (an IP testing solutions vendor) has put up "Wireshark as a Service" under www.cloudshark.org. This tool lets users upload packet dumps without registration, and provides the familiar Wireshark interface over the Web. Uploads are limited to 512 Kilobytes, and there are no guarantees about confidentiality of the data, so it should not be used on privacy-sensitive data.

References

-- SimonLeinen - 2006-03-04 - 2013-06-09

 

Solaris snoop

Sun's Solaris operating system includes a utility called snoop for taking and analyzing packet traces. While snoop is similar in spirit to tcpdump, its options, filter syntax, and stored file formats are all slightly different. The file format is described in the Informational RFC 1761, and supported by some other tools, notably Wireshark.

Selected options

Useful options in connection with snoop include:

  • -i filename read from capture file. Corresponds to tcpdump's -r option.
  • -o filename dump to file. Corresponds to tcpdump's -w option.
  • -d device capture on device rather than on the default (first non-loopback) interface. Corresponds to tcpdump's -i option.
  • -c count stop capture after capturing count packets (same as for tcpdump)
  • -r do not resolve addresses to names. Corresponds to tcpdump's -n option. This is useful for suppressing DNS traffic when looking at a live trace.
  • -V verbose summary mode - between (default) summary mode and verbose mode
  • -v (full) verbose mode

Usage examples

Capture a single (-c 1) udp packet to file test.snoop:

: root@diotima[tmp]; snoop -o test.snoop -c 1 udp
Using device /dev/bge0 (promiscuous mode)
1 1 packets captured

This produces a binary file containing the captured packet as well as a small file header and a timestamp:

: root@diotima[tmp]; ls -l test.snoop
-rw-r--r-- 1 root root 120 2006-04-09 18:27 test.snoop
: root@diotima[tmp]; file test.snoop
test.snoop:     Snoop capture file - version 2

Analyze the contents of the previously created capture file:

: root@diotima[tmp]; snoop -i test.snoop
  1   0.00000 2001:690:1fff:1661::3 -> ff1e::1:f00d:beac UDP D=10000 S=32820 LEN=20

Display the same capture file in "verbose summary" mode:

: root@diotima[tmp]; snoop -V -i test.snoop
________________________________
  1   0.00000 2001:690:1fff:1661::3 -> ff1e::1:f00d:beac ETHER Type=86DD (IPv6), size = 74 bytes
  1   0.00000 2001:690:1fff:1661::3 -> ff1e::1:f00d:beac IPv6  S=2001:690:1fff:1661::3 D=ff1e::1:f00d:beac LEN=20 HOPS=116 CLASS=0x0 FLOW=0x0
  1   0.00000 2001:690:1fff:1661::3 -> ff1e::1:f00d:beac UDP D=10000 S=32820 LEN=20

Finally, in full verbose mode:

: root@diotima[tmp]; snoop -v -i test.snoop
ETHER:  ----- Ether Header -----
ETHER:
ETHER:  Packet 1 arrived at 18:27:16.81233
ETHER:  Packet size = 74 bytes
ETHER:  Destination = 33:33:f0:d:be:ac, (multicast)
ETHER:  Source      = 0:a:f3:32:56:0,
ETHER:  Ethertype = 86DD (IPv6)
ETHER:
IPv6:   ----- IPv6 Header -----
IPv6:
IPv6:   Version = 6
IPv6:   Traffic Class = 0
IPv6:   Flow label = 0x0
IPv6:   Payload length = 20
IPv6:   Next Header = 17 (UDP)
IPv6:   Hop Limit = 116
IPv6:   Source address = 2001:690:1fff:1661::3
IPv6:   Destination address = ff1e::1:f00d:beac
IPv6:
UDP:  ----- UDP Header -----
UDP:
UDP:  Source port = 32820
UDP:  Destination port = 10000
UDP:  Length = 20
UDP:  Checksum = 0635
UDP:

Note: the captured packet is an IPv6 multicast packet from a "beacon" system. Such traffic forms the majority of the background load on our office LAN at the moment.

References

  • RFC 1761, Snoop Version 2 Packet Capture File Format, B. Callaghan, R. Gilligan, February 1995
  • snoop(1) manual page for Solaris 10, April 2004, docs.sun.com

-- SimonLeinen - 09 Apr 2006

Web100 Linux Kernel Extensions

The Web100 project was run by PSC (Pittsburgh Supercomputing Center), NCAR and NCSA. It was funded by the US National Science Foundation (NSF) between 2000 and 2003, although development and maintenance work extended well beyond the period of NSF funding. Its thrust was to close the "wizard gap" between what performance should be possible on modern research networks and what most users of these networks actually experience. The project focused on instrumentation for TCP to measure performance and find possible bottlenecks that limit TCP throughput. In addition, it included some work on "auto-tuning" of TCP buffer settings.

Most implementation work was done for Linux, and most of the auto-tuning code is now actually included in the mainline Linux kernel code (as of 2.6.17). The TCP kernel instrumentation is available as a patch from http://www.web100.org/, and usually tracks the latest "official" Linux kernel release pretty closely.

An interesting application of Web100 is NDT, which can be used from any Java-enabled browser to detect bottlenecks in TCP configurations and network paths, as well as duplex mismatches using active TCP tests against a Web100-enabled server.

In September 2010, the NSF agreed to fund a follow-on project called Web10G.

TCP Kernel Information Set (KIS)

A central component of Web100 is a set of "instruments" that permits the monitoring of many statistics about TCP connections (sockets) in the kernel. In the Linux implementation, these instruments are accessible through the proc filesystem.

TCP Extended Statistics MIB (TCP-ESTATS-MIB, RFC 4898)

The TCP-ESTATS-MIB (RFC 4898) includes a similar set of instruments, for access through SNMP. It has been implemented by Microsoft for the Vista operating system and later versions of Windows. In the Windows Server 2008 SDK, a tool called TcpAnalyzer.exe can be used to look at statistics of open TCP connections. IBM is also said to have an implementation of this MIB.

"Userland" Tools

Besides the kernel extension, Web100 comprises a small set of user-level tools which provide access to the TCP KIS. These tools include

  1. a libweb100 library written in C
  2. the command-line tools readvar, deltavar, writevar, and readall
  3. a set of GTK+-based GUI (graphical user interface) tools under the gutil command.

gutil

When started, gutil shows a small main panel with an entry field for specifying a TCP connection, and several graphical buttons for starting different tools on a connection once one has been selected.

gutil: main panel

The TCP connection can be chosen either by explicitly specifying its endpoints, or by selecting from a list of connections (using double-click):

gutil: TCP connection selection dialog

Once a connection of interest has been selected, a number of actions are possible. The "List" action provides a list of all kernel instruments for the connection. The list is updated every second, and "delta" values are displayed for those variables that have changed.

gutil: variable listing

Another action is "Display", which provides a graphical display of a KIS variable. The following screenshot shows the display of the DataBytesIn variable of an SSH connection.

gutil: graphical variable display

Related Work

Microsoft's Windows Software Development Kit for Server 2008 and .NET Version 3.5 contains a tool called TcpAnalyzer.exe, which is similar to gutil and uses Microsoft's RFC 4898 implementation.

The SIFTR module for FreeBSD can be used for similar applications, namely to understand what is happening inside TCP at fine granularity. Sun's DTrace would be an alternative on systems that support it, provided they offer suitable probes for the relevant actions and events within TCP. Both SIFTR and DTrace have very different user interfaces to Web100.

References

-- SimonLeinen - 27 Feb 2006 - 25 Mar 2011
-- ChrisWelti - 12 Jan 2010

Revision 3 - 2006-04-09 - SimonLeinen


General Background and Context

Motivation and Purpose

The GN2 Performance Enhancement and Response Team (PERT) is a virtual team committed to helping academic users get efficient network performance for their end-systems. There is an emerging need for this kind of support, due to the proliferation of systems which are dependent on Long Fat pipe Networks (LFNs), and whose requirements exceed the scope of the original TCP specification (TCP being the most common of the network transport protocols used on top of the Internet Protocol (IP)).

The PERT offers support in two ways. First, they respond to requests to investigate quality of service (QoS) issues submitted by the PERT Primary Customers. The PERT Primary Customers are the European NRENs, peer academic networks and certain international academic research projects (the full list of PERT Primary Customers is given in ‘The PERT Policy’ (GN2-05-18)). The purpose of specifying a relatively small number of Primary Customers is to ensure that the majority of potential users of the PERT (the PERT’s ‘End Customers’) first contact their NREN, who will be able to help with minor problems, or those issues that are in fact outside the scope of the PERT (such as network hardware failure).

The second way in which the PERT helps its clients is by producing reference documentation that end users and network administrators can use for self-help. This documentation explains the concepts of network performance, highlights the most common causes of quality of service (QoS) problems, and offers general guidelines on how to configure systems to optimise performance. It also provides pointers to other resources for troubleshooting issues, such as the NREN network statistics and monitoring tools that will help end users and network administrators to determine quickly for themselves whether current problems are likely to be the result of changes in network conditions. This information is all held in the PERT Knowledgebase, an online resource that is accessible to all (see GEANT2 web site, under ‘Users’ -> ‘PERT’ for a link).

In order to showcase the work of the PERT in this area, a snapshot has been taken of the content of the PERT Knowledgebase and this has been used to create two self-contained guides to network quality of service. The ‘User Guide’, as its name suggests, is targeted at end-users, and in particular those end-users who have demanding requirements (typically those who depend on LFNs), and the network administrators who support them. The User Guide is at Annex A of this document. The second guide (Annex B) is called the ‘Best Practice Guide’. This document is aimed more at campus network administrators, since it concentrates on issues which cannot be controlled by re-configuring or upgrading end-systems. In fact the Best Practice Guide is really an addendum to the User Guide, inasmuch as its target audience (network administrators) is a subset of the User Guide’s target audience.

Conclusions and Future Work

Whilst the online PERT Knowledgebase, which will always have the most up to date content, is expected to be the main reference source for PERT users seeking specific advice on a given subject, the two guides presented here will provide a useful and easy to read introduction to the main issues surrounding network performance issues today.

A significant amount of effort was put into building up the PERT Knowledgebase, prior to producing these two guides, and the effort is planned to continue, albeit at a slower, steadier pace, over the rest of GEANT2. The PERT Knowledge base will be kept under regular review, and as and when it has changed significantly from its present state the guides will be re-published.

-- SimonLeinen - 09 Apr 2006

Glossary of Abbreviations and Acronyms

ACK - TCP acknowledgement packet
AIMD - Additive Increase/Multiplicative Decrease
AQM - Active Queue Management
ASIC - Application-Specific Integrated Circuit
ATM - Asynchronous Transfer Mode
BDP - Bandwidth-delay product
CE - Congestion Experienced
CPU - Central Processing Unit
CSMA/CD - Carrier Sense Multiple Access/Collision Detection
Cwnd - Congestion Window
CWR - Congestion Window Reduced
DFN - Deutsches Forschungsnetz
ECE - ECN-Echo
ECMP - Equal Cost Multipath
ECN - Explicit Congestion Notification
ECT - ECN-Capable Transport
EGEE - Enabling Grids for E-Science in Europe
FDDI - Fiber Distributed Data Interface
FTP - File Transfer Protocol
HTTP - HyperText Transfer Protocol
ICMP - Internet Control Message Protocol
IEEE - Institute of Electrical and Electronics Engineers
IETF - Internet Engineering Task Force
IPDV - IP Delay Variation Metric
IP - Internet Protocol
IPPM - IP Performance Metrics, an IETF Working Group
LFN - Long Fat Network, a network with a large BDP
LSO - Large Send Offload
MDT - Multidata Transmit
MSS - maximum segment size
MTU - Maximum Transmission Unit
NIC - Network Interface Card
NIST - National Institute of Standards and Technology
NREN - National Research and Education Network
OS - Operating System
OWAMP - One-Way Active Measurement Protocol
OWD - One-Way Delay
PAWS - Protection Against Wrapped Sequence-numbers
PDU - Protocol Data Unit
PERT - Performance Enhancement & Response Team
PMTUD - Path MTU Discovery
POS - Packet Over Sonet
RCP - Rate Control Protocol
RCP - Remote Copy
RED - Random Early Detection, a form of active queue management
RENATER - Réseau National de télécommunications pour la Technologie l'Enseignement et la Recherche
RFC - Request For Comment - Technical and organisational notes about the Internet
RIPE NCC - RIPE Network Control Center
RIPE - Réseaux IP Européens
RTCP - RTP Control Protocol
RTP - Real-time Transport Protocol
RTT - Round-trip time
SACK - Selective ACKnowledgement
SCP - Secure Copy
SCTP - Stream Control Transmission Protocol
SMDS - Switched Multimegabit Data Service
SNMP - Simple Network Management Protocol
SSH - Secure Shell
Ssthresh - Slow Start Threshold
SYN - TCP synchronisation packet
TCP - Transmission Control Protocol
TOE - TCP Offload Engine
TSO - TCP Segmentation Offload
TTL - Time To Live
TTM - Test Traffic Measurements, a service provided by RIPE NCC
UDP - User Datagram Protocol
VOP - Velocity of Propagation
WIDE - Widely Integrated Distributed Environment
WLAN - Wireless Local Area Network
XCP - eXplicit Control Protocol

-- SimonLeinen - 09 Apr 2006

Abstract

This paper is a guide to network performance issues. We present the fundamentals of network performance, and then look at ways of investigating and remedying problems on end-systems. We finish with several case studies which demonstrate the effects of the issues previously examined.

-- SimonLeinen - 09 Apr 2006

 

Notes About Performance

This part of the PERT Knowledgebase contains general observations and hints about the performance of networked applications.

The mapping between network performance metrics and user-perceived performance is complex and often non-intuitive.

-- SimonLeinen - 13 Dec 2004 - 07 Apr 2006

User-Perceived Performance

User-perceived performance of network applications is made up of a number of mainly qualitative metrics, some of which are in conflict with each other. For some applications, a single metric will outweigh the others, such as responsiveness for video services or throughput for bulk transfer applications. More commonly, a combination of factors determines the experience of network performance for the end-user.

Responsiveness

One of the most important user experiences in networking applications is the perception of responsiveness. If end-users feel that an application is slow, it is often the case that it is slow to respond to them, rather than being directly related to network speed. This is a particular issue for real-time applications such as audio/video conferencing systems and must be prioritised in applications such as remote medical services and off-campus teaching facilities. It can be difficult to quantitatively define an acceptable figure for response times as the requirements may vary from application to application.

However, some applications have relatively well-defined "physiological" bounds beyond which the responsiveness feeling vanishes. For example, for voice conversations, a (round-trip) delay of 150ms is practically unnoticeable, but an only slightly larger delay is typically felt as very intrusive.

Throughput/Capacity/"Bandwidth"

Throughput per se is not directly perceived by the user, although a lack of throughput will increase waiting times and reduce the impression of responsiveness. However, "bandwidth" is widely used as a "marketing" metric to differentiate "fast" connections from "slow" ones, and many applications display throughput during long data transfers. Therefore users often have specific performance expectations in terms of "bandwidth", and are disappointed when the actual throughput figures they see is significantly lower than the advertised capacity of their network connection.

Reliability

Reliability is often the most important performance criterion for a user: The application must be available when the user requires it. Note that this doesn't necessarily mean that it is always available, although there are some applications - such as the public Web presence of a global corporation - that have "24x7" availability requirements.

The impression of reliability will also be heavily influenced by what happens (or is expected to happen) in cases of unavailability: Does the user have a possibility to build a workaround? In case of provider problems: Does the user have someone competent to call - or can they even be sure that the provider will notice the problem themselves, and fix it in due time? How is the user informed during the outage, in particular concerning the estimated time to repair?

Another aspect of reliability is the predictability of performance. It can be profoundly disturbing to a user to see large performance variations over time, even if the varying performance is still within the required performance range - who can guarantee that variations won't increase beyond the tolerable during some other time when the application is needed? E.g., a 10 Mb/s throughput that remains rock-stable over time can feel more reliable than throughput figures that vary between 200 and 600 Mb/s.

-- SimonLeinen - 07 Apr 2006
-- AlessandraScicchitano - 10 Feb 2012

Responsiveness

One of the most important user experiences in networking applications is the perception of responsiveness. If end-users feel that an application is slow, it is often the case that it is slow to respond to them, rather than being directly related to network speed. This is a particular issue for real-time applications such as audio/video conferencing systems and must be prioritised in applications such as remote medical services and off-campus teaching facilities. It can be difficult to quantitatively define an acceptable figure for response times as the requirements may vary from application to application.

However, some applications have relatively well-defined "physiological" bounds beyond which the responsiveness feeling vanishes. For example, for voice conversations, a (round-trip) delay of 150ms is practically unnoticeable, but an only slightly larger delay is typically felt as very intrusive.

-- SimonLeinen - 07 Apr 2006 - 23 Aug 2007

Revision 2 - 2006-04-09 - SimonLeinen


TCP Acknowledgements

In TCP's sliding-window scheme, the receiver acknowledges the data it receives, so that the sender can advance the window and send new data. As originally specified, TCP's acknowledgements ("ACKs") are cumulative: the receiver tells the sender how much consecutive data it has received. More recently, selective acknowledgements were introduced to allow more fine-grained acknowledgements of received data.

Delayed Acknowledgements and "ack every other"

RFC 813 first suggested a delayed acknowledgement (Delayed ACK) strategy, where a receiver doesn't always immediately acknowledge segments as it receives them. This recommendation was carried forth and specified in more detail in RFC 1122 and RFC 5681 (formerly known as RFC 2581). RFC 5681 mandates that an acknowledgement be sent for at least every other full-size segment, and that no more than 500ms expire before any segment is acknowledged.

The resulting behavior is that, for longer transfers, acknowledgements are only sent for every two segments received ("ack every other"). This is in order to reduce the amount of reverse flow traffic (and in particular the number of small packets). For transactional (request/response) traffic, the delayed acknowledgement strategy often makes it possible to "piggy-back" acknowledgements on response data segments.

A TCP segment which is only an acknowledgement i.e. has no payload, is termed a pure ack.

Delayed Acknowledgments should be taken into account when doing RTT estimation. As an illustration, see this change note for the Linux kernel from Gavin McCullagh.

Critique on Delayed ACKs

John Nagle nicely explains problems with current implementations of delayed ACK in a comment to a thread on Hacker News:

Here's how to think about that. A delayed ACK is a bet. You're betting that there will be a reply, upon which an ACK can be piggybacked, before the fixed timer runs out. If the fixed timer runs out, and you have to send an ACK as a separate message, you lost the bet. Current TCP implementations will happily lose that bet forever without turning off the ACK delay. That's just wrong.

The right answer is to track wins and losses on delayed and non-delayed ACKs. Don't turn on ACK delay unless you're sending a lot of non-delayed ACKs closely followed by packets on which the ACK could have been piggybacked. Turn it off when a delayed ACK has to be sent.

Duplicate Acknowledgements

A duplicate acknowledgement (DUPACK) is one with the same acknowledgement number as its predecessor - it signifies that the TCP receiver has received a segment newer than the one it was expecting, i.e. it has missed a segment. The missed segment might not be lost; it might just be reordered. For this reason the TCP sender will not assume data loss on the first DUPACK but (by default) on the third DUPACK, when it will, as per RFC 5681, perform a "Fast Retransmit" by sending the segment again without waiting for a timeout. DUPACKs are never delayed; they are sent immediately when the TCP receiver detects an out-of-order segment.

"Quickack mode" in Linux

The TCP implementation in Linux has a special receive-side feature that temporarily disables delayed acknowledgements/ack-every-other when the receiver thinks that the sender is in slow-start mode. This speeds up the slow-start phase when RTT is high. This "quickack mode" can also be explicitly enabled or disabled with setsockopt() using the TCP_QUICKACK option. The Linux modification has been criticized because it makes Linux' slow-start more aggressive than other TCPs' (that follow the SHOULDs in RFC 5681), without sufficient validation of its effects.

"Stretch ACKs"

Techniques such as LRO and GRO (and to some level, Delayed ACKs as well as ACKs that are lost or delayed on the path) can cause ACKs to cover many more than the two segments suggested by the historic TCP standards. Although ACKs in TCP have always been defined as "cumulative" (with the exception of SACKs), some congestion control algorithms have trouble with ACKs that are "stretched" in this way. In January 2015, a patch set was submitted to the Linux kernel network development with the goal to improve the behavior of the Reno and CUBIC congestion control algorithms when faced with stretch ACKs.

References

  • RFC 5681, TCP Congestion Control, M. Allman, V. Paxson, E. Blanton, September 2009
  • RFC 813, Window and Acknowledgement Strategy in TCP, D. Clark, July 1982
  • RFC 1122, Requirements for Internet Hosts -- Communication Layers, R. Braden (Ed.), October 1989

-- TobyRodwell - 2005-04-05
-- SimonLeinen - 2007-01-07 - 2015-01-28

TCP Window Scaling Option

TCP as originally specified only allows for windows up to 65536 bytes, because the window-related protocol fields all have a width of 16 bits. This prevents the use of large TCP windows, which are necessary to achieve high transmission rates over paths with a large bandwidth*delay product.

The Window Scaling Option is one of several options defined in RFC 7323, "TCP Extensions for High Performance". It allows TCP to advertise and use a window larger than 65536 bytes. The way this works is that, in the initial handshake, a TCP speaker announces a scaling factor, which is a power of two between 2^0 (no scaling) and 2^14, to allow for an effective window of up to 2^30 bytes (one gigabyte). Window Scaling only comes into effect when both ends of the connection advertise the option (even if only with a scaling factor of 2^0).
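
To see how the scale factor relates to the desired window: the advertised 16-bit window is multiplied by 2^shift, so an endpoint picks the smallest shift whose scaled maximum covers its receive buffer. A small illustrative sketch (the function name and the 1 Gb/s / 100 ms example are assumptions for illustration):

# Illustrative: pick the smallest window-scale shift that can represent a buffer.
def window_scale_shift(recv_buffer_bytes):
    shift = 0
    while (0xFFFF << shift) < recv_buffer_bytes and shift < 14:
        shift += 1
    return shift

# Example: a 100 ms path at 1 Gb/s has a bandwidth*delay product of ~12.5 MB.
bdp = int(1e9 / 8 * 0.1)            # bytes
shift = window_scale_shift(bdp)     # -> 8, i.e. the window is advertised in 256-byte units
print(shift, 0xFFFF << shift)       # maximum representable window with that shift (~16 MB)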

As explained under the large TCP windows topic, Window Scaling is typically used in conjunction with the TCP Timestamp option and the PAWS (Protection Against Wrapped Sequence numbers) mechanism, both also defined in RFC 7323.

32K limitations on some systems when Window Scaling is not used

Some systems, notably Linux, limit the usable window in the absence of Window Scaling to 32 Kilobytes. The reasoning behind this is that some (other) systems used in the past were broken by larger windows, because they erroneously interpreted the 16-bit values related to TCP windows as signed, rather than unsigned, integers.

This limitation can have the effect that users (or applications) that request window sizes between 32K and 64K, but don't have Window Scaling enabled, do not actually benefit from the desired window sizes.

The artificial limitation was finally removed from the Linux kernel sources on March 21, 2006. The old behavior (window limited to 32KB when Window Scaling isn't used) can still be enabled through a sysctl tuning parameter, in case one really wants to interoperate with TCP implementations that are broken in the way described above.
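
On a Linux host the relevant knobs can be inspected with sysctl. The first parameter below is the standard Window Scaling switch; the second is, as far as we recall from that kernel change, the parameter that restores the old 32K clamp (its name may differ between kernel versions, so treat this as an assumption):

# sysctl net.ipv4.tcp_window_scaling
net.ipv4.tcp_window_scaling = 1
# sysctl -w net.ipv4.tcp_workaround_signed_windows=1    (re-enables the old 32K clamp)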

Problems with window scaling

Middlebox issues

MiddleBoxes which do not understand window scaling may cause very poor performance, as described in WindowScalingProblems. Windows Vista limits the scaling factor for HTTP transactions to 2 to avoid some of the problems, see WindowsOSSpecific.

Diagnostic issues

The Window Scaling option has an impact on how the offered window in TCP ACKs should be interpreted. This can cause problems when one is faced with incomplete packet traces that lack the initial SYN/SYN+ACK handshake where the options are negotiated: The SEQ/ACK sequence may look incorrect because scaling cannot be taken into account. One effect of this is that tcpgraph or similar analysis tools (included in tools such as WireShark) may produce incoherent results.

References

  • RFC 7323, TCP Extensions for High Performance, D. Borman, B. Braden, V. Jacobson, R. Scheffenegger, September 2014 (obsoletes RFC 1323 from May 1993)
  • git note about the Linux kernel modification that got rid of the 32K limit, R. Jones and D.S. Miller, March 2006. Can also be retrieved as a patch.

-- SimonLeinen - 21 Mar 2006 - 27 September 2014
-- PekkaSavola - 07 Nov 2006
-- AlexGall - 31 Aug 2007 (added reference to scaling limitation in Windows Vista)

TCP Flow Control

Note: This topic describes the Reno enhancement of classical "Van Jacobson" or Tahoe congestion control. There have been many suggestions for improving this mechanism - see the topic on high-speed TCP variants.

TCP flow control and window size adjustment are mainly based on two key mechanisms: Slow Start and Additive Increase/Multiplicative Decrease (AIMD), also known as Congestion Avoidance (RFC 793 and RFC 5681).

Slow Start

To prevent a newly started TCP connection from flooding the network, a Slow Start mechanism was introduced in TCP. This mechanism effectively probes to find the available bandwidth.

In addition to the window advertised by the receiver, a Congestion Window (cwnd) value is used and the effective window size is the lesser of the two. The starting value of the cwnd window is set initially to a value that has been evolving over the years, the TCP Initial Window. After each acknowledgment, the cwnd window is increased by one MSS. By this algorithm, the data rate of the sender doubles each round-trip time (RTT) interval (actually, taking into account Delayed ACKs, rate increases by 50% every RTT). For a properly implemented version of TCP this increase continues until:

  • the advertised window size is reached
  • congestion (packet loss) is detected on the connection.
  • there is no traffic waiting to take advantage of an increased window (i.e. cwnd should only grow if it needs to)

When congestion is detected, the TCP flow-control mode is changed from Slow Start to Congestion Avoidance. Note that some TCP implementations maintain cwnd in units of bytes, while others use units of full-sized segments.
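
The exponential growth is easy to see in a toy model: every acknowledged segment adds one MSS to cwnd, so (with one ACK per segment) cwnd doubles each RTT. A minimal sketch, counting cwnd in segments; the Initial Window of 10 segments and the ssthresh value are assumptions chosen for illustration only:

# Toy model of Slow Start: cwnd counted in segments, one ACK per delivered segment.
cwnd, ssthresh = 10, 64            # example Initial Window and slow-start threshold
for rtt in range(6):
    print(f"RTT {rtt}: cwnd = {cwnd} segments")
    acks = cwnd                    # every segment of the current window gets ACKed
    cwnd += acks                   # +1 MSS per ACK  ->  cwnd doubles each RTT
    if cwnd >= ssthresh:
        break                      # switch to Congestion Avoidance (linear growth)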

Congestion Avoidance

Once congestion is detected (through timeout and/or duplicate ACKs), the data rate is reduced in order to let the network recover.

Slow Start uses an exponential increase in window size and thus also in data rate. Congestion Avoidance uses a linear growth function (additive increase). This is achieved by introducing - in addition to the cwnd window - a slow start threshold (ssthresh).

As long as cwnd is less than ssthresh, Slow Start applies. Once ssthresh is reached, cwnd is increased by at most one segment per RTT. The cwnd window continues to open with this linear rate until a congestion event is detected.

When congestion is detected, ssthresh is set to half the cwnd (or to be strictly accurate, half the "Flight Size". This distinction is important if the implementation lets cwnd grow beyond rwnd (the receiver's declared window)). cwnd is either set to 1 if congestion was signalled by a timeout, forcing the sender to enter Slow Start, or to ssthresh if congestion was signalled by duplicate ACKs and the Fast Recovery algorithm has terminated. In either case, once the sender enters Congestion Avoidance, its rate has been reduced to half the value at the time of congestion. This multiplicative decrease causes the cwnd to close exponentially with each detected loss event.

Fast Retransmit

In Fast Retransmit, the arrival of three duplicate ACKs is interpreted as packet loss, and retransmission starts before the retransmission timer (RTO) expires.

The missing segment will be retransmitted immediately without going through the normal retransmission queue processing. This improves performance by eliminating delays that would suspend effective data flow on the link.

Fast Recovery

Fast Recovery is used to react quickly to a single packet loss. In Fast Recovery, the receipt of 3 duplicate ACKs, while being taken to mean the loss of a segment, does not result in a full Slow Start. This is because obviously later segments got through, and hence congestion is not stopping everything. In Fast Recovery, ssthresh is set to half of the current send window size, the missing segment is retransmitted (Fast Retransmit) and cwnd is set to ssthresh plus three segments. Each additional duplicate ACK indicates that one segment has left the network at the receiver, and cwnd is increased by one segment to allow the transmission of another segment if permitted by the new cwnd. When an ACK is received for new data, cwnd is reset to ssthresh, and TCP enters Congestion Avoidance mode.
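
The different reactions to a timeout versus three duplicate ACKs can be condensed into a short sketch. This is our simplification of the behaviour described above (cwnd, ssthresh and flight_size are counted in segments, and "state" stands for any per-connection object carrying those attributes):

# Simplified Reno loss reactions, units: segments (illustrative only).
def on_retransmission_timeout(state):
    state.ssthresh = max(state.flight_size // 2, 2)
    state.cwnd = 1                        # re-enter Slow Start

def on_third_duplicate_ack(state):
    state.ssthresh = max(state.flight_size // 2, 2)
    state.cwnd = state.ssthresh + 3       # Fast Retransmit, enter Fast Recovery

def on_additional_duplicate_ack(state):
    state.cwnd += 1                       # one more segment has left the network

def on_new_ack_after_recovery(state):
    state.cwnd = state.ssthresh           # deflate; continue in Congestion Avoidance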

References

-- UlrichSchmid - 07 Jun 2005 -- SimonLeinen - 27 Jan 2006 - 23 Jun 2011


High-Speed TCP Variants

There have been numerous ideas for improving TCP over the years. Some of those ideas have been adopted by mainstream implementations (after thorough review). Recently there has been an uptick in work towards improving TCP's behavior with LongFatNetworks. It has been shown that the current congestion control algorithms limit the efficiency of network resource utilization. The various new TCP implementations differ in how they adjust the size of the congestion window; some of them require extra feedback from the network. Generally, such congestion control protocols can be classified by the congestion signal they rely on.

An orthogonal technique for improving TCP's performance is automatic buffer tuning.

Comparative Studies

All papers about individual TCP-improvement proposals contain comparisons against older TCP to quantify the improvement. There are several studies that compare the performance of the various new TCP variants, including

Other References

  • Gigabit TCP, G. Huston, The Internet Protocol Journal, Vol. 9, No. 2, June 2006. Contains useful descriptions of many modern TCP enhancements.
  • Faster, G. Huston, June 2005. This article from ISP Column looks at various approaches to TCP transfers at very high rates.
  • Congestion Control in the RFC Series, M. Welzl, W. Eddy, July 2006, Internet-Draft (work in progress)
  • ICCRG Wiki, IRTF (Internet Research Task Force) ICCRG (Internet Congestion Control Research Group), includes bibliography on congestion control.
  • RFC 6077: Open Research Issues in Internet Congestion Control, D. Papadimitriou (Ed.), M. Welzl, M. Scharf, B. Briscoe, February 2011

Related Work

  • PFLDnet (International Workshop on Protocols for Fast Long-Distance Networks): 2003, 2004, 2005, 2006, 2007, 2009, 2010. (The 2008 pages seem to have been removed from the Net.)

-- Simon Leinen - 2005-05-28–2016-09-18 -- Orla Mc Gann - 2005-10-10 -- Chris Welti - 2008-02-25

HighSpeed TCP

HighSpeed TCP is a modification to the TCP congestion control mechanism, proposed by Sally Floyd from ICIR (The ICSI Center for Internet Research). The main problem of a TCP connection is that it takes a long time to make a full recovery from packet loss on high-bandwidth, long-distance connections. Like other high-speed proposals, HighSpeed TCP modifies the congestion control mechanism for use with TCP connections with large congestion windows. For the current Standard TCP with a steady-state packet loss rate p, the average congestion window is 1.2/sqrt(p) segments. This places a serious constraint on achievable throughput in realistic environments.

Achieving an average TCP throughput of B bps requires a loss event at most every BR/(12D) round-trip times, and an average congestion window of W = BR/(8D) segments. For a round-trip time R = 0.1 seconds and a packet size D of 1500 bytes, a throughput of 10 Gbps would require an average congestion window of about 83,000 segments and a packet drop rate of about 2·10^-10, i.e. at most one congestion event every 5,000,000,000 packets, or equivalently at most one congestion event every 1 2/3 hours. This is an unrealistic constraint, and it has a huge impact on TCP connections with larger congestion windows. So what we need is to achieve high per-connection throughput without requiring unrealistically low packet loss rates.
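
The numbers quoted above follow directly from the standard response function w = 1.2/sqrt(p); a few lines of arithmetic reproduce them (units as in the text):

# Reproducing the Standard-TCP numbers quoted above.
B = 10e9            # target throughput in bit/s
R = 0.1             # round-trip time in seconds
D = 1500            # packet size in bytes

W = B * R / (8 * D)                  # average congestion window, in segments (~83,333)
p = (1.2 / W) ** 2                   # steady-state loss rate from w = 1.2/sqrt(p) (~2e-10)
packets_between_losses = 1 / p       # ~5e9 packets between congestion events
time_between_losses = packets_between_losses / (W / R)   # seconds

print(round(W), p, time_between_losses / 3600)           # ~1.6-1.7 hours between losses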

HighSpeed TCP proposes changing TCP response function w = 1.2/sqrt(p) to achieve high throughput with more realistic requirements for the steady-state packet drop rate. A modified HighSpeed TCP function uses three parameters: Low_Window, High_Window, and High_P. The HighSpeed response function uses the same response function as Standard TCP when the current congestion window is at most Low_Window, and uses the HighSpeed response function when the current congestion window is greater than Low_Window. The value of the average congestion window w greater than Low_Window is

HSTCP_w.png

where,

  • Low_P is the packet drop rate corresponding to Low_Window, using Standard TCP response function,
  • S is a constant as follows

HSTCP_S.png

The window needed to sustain 10 Gbps throughput using a HighSpeed TCP connection is 83,000 segments, so High_Window could be set to this value. The packet drop rate needed in the HighSpeed response function to achieve an average congestion window of 83,000 segments is 10^-7. For information on how to set the remaining parameters, see Sally Floyd's document HighSpeed TCP for Large Congestion Windows (RFC 3649). With appropriately calibrated values, larger congestion windows can be sustained without requiring unrealistically long loss-free periods between congestion events, and thus greater throughput can be achieved.

HighSpeed TCP modifies the increase parameter a(w) and the decrease parameter b(w) of AIMD (Additive Increase and Multiplicative Decrease). For Standard TCP, a(w) = 1 and b(w) = 1/2, regardless of the value of w. HighSpeed TCP uses the same values of a(w) and b(w) for w lower than Low_Window. The parameters a(w) and b(w) for HighSpeed TCP are specified below.

HSTCP_aw.png

HSTCP_bw.png

High_Decrease = 0.1 means a decrease of 10% of the congestion window when a congestion event occurs.

For example, for w = 83,000, the parameters a(w) and b(w) for Standard TCP and HSTCP are given in the table below.

        Standard TCP   HSTCP
a(w)    1              72
b(w)    0.5            0.1

Summary

Based on the results of HighSpeed TCP and Standard TCP connections sharing bandwidth, the following conclusions can be drawn. HSTCP flows share the available bandwidth fairly among themselves. HSTCP may take longer to converge than Standard TCP. The great disadvantage is that a HighSpeed TCP flow starves a Standard TCP flow even at relatively low/moderate bandwidth. The author of the HSTCP specification argues that there are not many TCP connections effectively operating in this regime today, with large congestion windows, and that therefore the benefits of the HighSpeed response function outweigh the unfairness that would be experienced by Standard TCP in this regime. Another benefit of HSTCP is that it is easy to deploy and needs no router support.

Implementations

HS-TCP is included in Linux as a selectable option in the modular TCP congestion control framework. A few problems with the implementation in Linux 2.6.16 were found by D.X. Wei (see references). These problems were fixed as of Linux 2.6.19.2.

A proposed OpenSolaris project foresees the implementation of HS-TCP and several other congestion control algorithms for OpenSolaris.

References

-- WojtekSronek - 13 Jul 2005
-- HankNussbacher - 06 Oct 2005
-- SimonLeinen - 03 Dec 2006 - 17 Nov 2009

Hamilton TCP

Hamilton TCP (H-TCP), developed by the Hamilton Institute, is an enhancement to the existing TCP congestion control protocol, with a good mathematical and empirical basis. The authors make the assumption that their improvement should behave like regular TCP in standard LAN and WAN networks where RTT and link capacity are not extremely high. But in long-distance fast networks, H-TCP is far more aggressive and more flexible in adjusting to the available bandwidth while avoiding packet losses. Therefore two modes are used for congestion control, and mode selection depends on the time since the last experienced congestion. In the regular TCP window control algorithm, the window is increased by adding a constant value α and decreased by multiplying by β; by default these values are 1 and 0.5 respectively. In the H-TCP case the window adjustment is a bit more complex, because both factors are not constants but are calculated during transmission.

The parameter values are estimated as follows. For each acknowledgment set:

HTCP_alpha.png

and then

HTCP_alpha2.png

On each congestion event set:

HTCP_beta.png

Where,

  • Δ_i is the time that elapsed since the last congestion experienced by i'th source,
  • Δ_L is the threshold for switching between fast and slow modes,
  • B_i(k) is the throughput of flow i immediately before the k'th congestion event,
  • RTT_max,i and RTT_min,i are maximum and minimum round trip time values for the i'th flow.

The α value depends on the time between experienced congestion events, while β depends on the RTT and achieved bandwidth values. The constants, as well as the Δ_L value, can be modified in order to adjust the algorithm's behavior to user expectations, but the default values were determined empirically and seem to be the most suitable in most cases. The figures below present the behavior of the congestion window, throughput and RTT values during tests performed by the H-TCP authors. In the first case the β value was fixed at 0.5; in the second case the adaptive approach (explained with the equations above) was used. The throughput in the second case is far more stable and approaches its maximum value.

HTCP_figure1.jpg

HTCP_figure2.jpg

H-TCP congestion window and throughput behavior (source: "H-TCP: TCP for high-speed and long-distance networks", D. Leith, R. Shorten).
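
Since the equations above are given as images, here is a hedged transcription of the published rules as we read Leith and Shorten's paper (consult the original for the authoritative definitions; the constants and exact scaling here are assumptions based on that reading): α stays at 1 while the time Δ since the last congestion event is below the threshold Δ_L, then grows quadratically, and β is derived from the ratio of minimum to maximum RTT, bounded away from extreme values.

# Hedged sketch of the H-TCP increase/decrease parameters, after Leith & Shorten.
DELTA_L = 1.0    # seconds: threshold between "low-speed" and "high-speed" mode

def htcp_alpha(delta, beta):
    """Additive-increase factor, given the time since the last congestion event."""
    if delta <= DELTA_L:
        alpha = 1.0                                   # behave like standard TCP
    else:
        t = delta - DELTA_L
        alpha = 1.0 + 10.0 * t + (t / 2.0) ** 2       # quadratic ramp-up
    return 2.0 * (1.0 - beta) * alpha                 # scaling used in the paper

def htcp_beta(rtt_min, rtt_max, throughput_prev, throughput_now):
    """Multiplicative-decrease factor applied on a congestion event."""
    if abs(throughput_now - throughput_prev) / throughput_prev > 0.2:
        return 0.5                                    # throughput changed a lot: back off hard
    return min(max(rtt_min / rtt_max, 0.5), 0.8)      # adaptive backoff, bounded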

Summary

Another advantage of H-TCP is fairness and friendliness, which means that the algorithm is not "greedy" and will not consume the whole link capacity, but is able to share the bandwidth fairly with other H-TCP transmissions or with any other TCP-like transmissions. H-TCP congestion window control improves the dynamics of the sending rate and therefore provides better overall throughput than the regular TCP implementation. On the Hamilton Institute web page (http://www.hamilton.ie/net/htcp/) more information is available, including papers, test results and algorithm implementations for Linux kernels 2.4 and 2.6.

References

-- WojtekSronek - 13 Jul 2005 -- SimonLeinen - 30 Jan 2006 - 14 Apr 2008

TCP Westwood

The current Standard TCP implementation relies on packet loss as an indicator of network congestion. The problem is that TCP cannot distinguish congestion loss from loss caused by noisy links; as a consequence, Standard TCP reacts with a drastic reduction of the congestion window. In wireless connections, overlapping radio channels, signal attenuation and additional noise have a huge impact on such losses.

TCP Westwood (TCPW) is a small modification of the Standard TCP congestion control algorithm. When the sender perceives that congestion has appeared, it uses the estimated available bandwidth to set the congestion window and the slow start threshold. TCP Westwood avoids a huge reduction of these values and ensures both faster recovery and effective congestion avoidance. It does not require any support from lower or higher layers and does not need any explicit congestion feedback from the network.

Bandwidth measurement in TCP Westwood relies on a simple relationship between the amount of data sent by the sender and the time at which acknowledgements are received. The bandwidth estimation process is somewhat similar to the one used for RTT estimation in Standard TCP, with additional improvements. When there is an absence of ACKs because no packets were sent, the estimated value goes towards zero. The formulas of the bandwidth estimation process are shown in many documents about TCP Westwood; such documents are listed on the UCLA Computer Science Department website.

There are some issues that can potentially disrupt the bandwidth estimation process, such as delayed or cumulative ACKs, or ACKs indicating out-of-order reception of segments. So the source must keep track of the number of duplicate ACKs and should be able to detect delayed ACKs and act accordingly.

As said before, the general idea is to use the bandwidth estimate to set the congestion window and the slow start threshold after a congestion episode. Below is the algorithm for setting these values when duplicate ACKs are received (the companion rule for timeouts is sketched further below).

if (n duplicate ACKs are received) {
   ssthresh = (BWE * RTTmin)/seg_size;
   if (cwin > ssthresh) cwin = ssthresh; /* congestion avoid. */
}

where

  • BWE is the estimated bandwidth,
  • RTTmin is the smallest RTT value observed during the connection's lifetime,
  • seg_size is the length of the TCP segment payload.
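
For completeness, here is the companion rule applied after a coarse retransmission timeout, written in the same notation as the snippet above. This is paraphrased from the TCP Westwood papers and should be treated as an illustrative sketch rather than the authoritative pseudo-code:

if (retransmission timeout expires) {
   ssthresh = (BWE * RTTmin)/seg_size;
   if (ssthresh < 2) ssthresh = 2;   /* keep a sane minimum threshold */
   cwin = 1;                         /* re-enter slow start, with a bandwidth-based ssthresh */
}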

Additional benefits

Based on the results of testing TCP Westwood, its fairness is at least as good as that of the widely used Standard TCP, even if two flows have different round-trip times. The main advantage of this TCP variant is that it does not starve the other variants of TCP, so we can claim that TCP Westwood is also friendly.

References

-- WojtekSronek - 27 Jul 2005

TCP Selective Acknowledgements (SACK)

Selective Acknowledgements are a refinement of TCP's traditional "cumulative" acknowledgements.

SACKs allow a receiver to acknowledge non-consecutive data, so that the sender can retransmit only what is missing at the receiver's end. This is particularly helpful on paths with a large bandwidth-delay product (BDP).

TCP may experience poor performance when multiple packets are lost from one window of data. With the limited information available from cumulative acknowledgments, a TCP sender can only learn about a single lost packet per round trip time. An aggressive sender could choose to retransmit packets early, but such retransmitted segments may have already been successfully received.

A Selective Acknowledgment (SACK) mechanism, combined with a selective repeat retransmission policy, can help to overcome these limitations. The receiving TCP sends back SACK packets to the sender informing the sender of data that has been received. The sender can then retransmit only the missing data segments.

Multiple packet losses from a window of data can have a catastrophic effect on TCP throughput. TCP uses a cumulative acknowledgment scheme in which received segments that are not at the left edge of the receive window are not acknowledged. This forces the sender to either wait a roundtrip time to find out about each lost packet, or to unnecessarily retransmit segments which have been correctly received. With the cumulative acknowledgment scheme, multiple dropped segments generally cause TCP to lose its ACK-based clock, reducing overall throughput. Selective Acknowledgment (SACK) is a strategy which corrects this behavior in the face of multiple dropped segments. With selective acknowledgments, the data receiver can inform the sender about all segments that have arrived successfully, so the sender need retransmit only the segments that have actually been lost.

The selective acknowledgment extension uses two TCP options. The first is an enabling option, "SACK-permitted", which may be sent in a SYN segment to indicate that the SACK option can be used once the connection is established. The other is the SACK option itself, which may be sent over an established connection once permission has been given by SACK-permitted.

Blackholing Issues

Enabling SACK globally used to be somewhat risky, because in some parts of the Internet, TCP SYN packets offering/requesting the SACK capability were filtered, causing connection attempts to fail. By now, it seems that the increased deployment of SACK has caused most of these filters to disappear.

Performance Issues

Sometimes it is not recommended to enable the SACK feature; e.g. the Linux 2.4.x TCP SACK implementation suffers from significant performance degradation in the case of a burst of packet loss. People from CERN observed that a burst of packet loss considerably affects TCP connections with a large bandwidth-delay product (several MBytes), because the TCP connection doesn't recover as it should. After a burst of loss, the throughput measured in their testbed was close to zero for 70 seconds. This behavior is not compliant with TCP's RFCs: a timeout should occur after a few RTTs, because one of the losses couldn't be repaired quickly enough, and the sender should go back into slow start.

For more information take a look at http://sravot.home.cern.ch/sravot/Networking/TCP_performance/tcp_bug.htm

Additionally, work done at the Hamilton Institute found that SACK processing in the Linux kernel is inefficient even for later 2.6 kernels, where on 1 Gb/s networks a long file transfer can lose about 100 Mb/s of potential performance. Most of these issues should have been fixed in Linux kernel 2.6.16.

Detailed Explanation

The following is closely based on a mail that Baruch Even sent to the pert-discuss mailing list on 25 Jan '07:

The Linux TCP SACK handling code has in the past been extremely inefficient: there were multiple passes over a linked list which holds all the sent packets. On a large-BDP link this linked list can span 20,000 packets, which meant that multiple traversals of the list took longer than it takes for another packet to arrive. Pretty quickly after a loss with SACK, the sender's incoming queue fills up and ACKs start getting dropped. There used to be an anti-DoS mechanism that would drop all packets until the queue had emptied; with a default of 1000 packets that took a long time and could easily drop all the outstanding ACKs, resulting in a great degradation of performance. This value is set with /proc/sys/net/core/netdev_max_backlog.

This situation has been slightly improved in that though there is still a 1000 packet limit for the network queue, it acts as a normal buffer i.e. packets are accepted as soon as the queue dips below 1000 again.
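
The queue length in question is the ordinary net.core.netdev_max_backlog sysctl; on hosts serving high-BDP transfers it is commonly raised. The larger value below is only an example, not a recommendation from the original text:

# sysctl net.core.netdev_max_backlog
net.core.netdev_max_backlog = 1000
# sysctl -w net.core.netdev_max_backlog=2500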

Kernel 2.6.19 should be the preferred kernel now for high speed networks. It is believed it has the fixes for all former major issues (at least those fixes that were accepted to the kernel), but it should be noted that some other (more minor) bugs have appeared, and will need to be fixed in future releases.

Historical Note

Selective acknowledgement schemes were known long before they were added to TCP. Noel Chiappa mentioned that Xerox' PUP protocols had them in a posting to the tcp-ip mailing list in August 1986. Vint Cerf responded with a few notes about the thinking that led to the cumulative form of the original TCP acknowledgements.

References

-- UlrichSchmid & SimonLeinen - 02 Jun 2005 - 14 Jan 2007
-- BartoszBelter - 16 Dec 2005
-- BaruchEven - 05 Jan 2006

Automatic Tuning of TCP Buffers

Note: This mechanism is sometimes referred to as "Dynamic Right-Sizing" (DRS).

The issues mentioned under "Large TCP Windows" are arguments in favor of "buffer auto-tuning", a promising but relatively new approach to better TCP performance in operating systems. See the TCP auto-tuning zoo reference for a description of some approaches.

Microsoft introduced (receive-side) buffer auto-tuning in Windows Vista. This implementation is explained in a TechNet Magazine "Cable Guy" article.

FreeBSD introduced buffer auto-tuning as part of its 7.0 release.

Mac OS X introduced buffer auto-tuning in release 10.5.

Linux auto-tuning details

Some automatic buffer tuning is implemented in Linux 2.4 (sender-side), and Linux 2.6 implements it for both the send and receive directions.

In a post to the web100-discuss mailing list, John Heffner describes the Linux 2.6.16 (March 2006) Linux implementation as follows:

For the sender, we explored separating the send buffer and retransmit queue, but this has been put on the back burner. This is a cleaner approach, but is not necessary to achieve good performance. What is currently implemented in Linux is essentially what is described in Semke '98, but without the max-min fair sharing. When memory runs out, Linux implements something more like congestion control for reducing memory. It's not clear that this is well-behaved, and I'm not aware of any literature on this. However, it's rarely used in practice.

For the receiver, we took an approach similar to DRS, but not quite the same. RTT is measured with timestamps (when available), rather than using a bounding function. This allows it to track a rise in RTT (for example, due to path change or queuing). Also, a subtle but important difference is that receiving rate is measured by the amount of data consumed by the application, not data received by TCP.

Matt Mathis reports on the end2end-interest mailing list (26/07/06):

Linux 2.6.17 now has sender and receiver side autotuning and a 4 MB DEFAULT maximum window size. Yes, by default it negotiates a TCP window scale of 7.
4 MB is sufficient to support about 100 Mb/s on a 300 ms path or 1 Gb/s on a 30 ms path, assuming you have enough data and an extremely clean (loss-less) network.
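
On Linux, the auto-tuning behaviour and its limits are visible in a few sysctls (min, default and max buffer sizes in bytes). The sysctl names are standard; the values shown are merely an example of what a host tuned for roughly 4 MB windows might report:

# sysctl net.ipv4.tcp_moderate_rcvbuf net.ipv4.tcp_rmem net.ipv4.tcp_wmem
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 16384 4194304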

References

  • TCP auto-tuning zoo, Web page by Tom Dunegan of Oak Ridge National Laboratory, PDF
  • A Comparison of TCP Automatic Tuning Techniques for Distributed Computing, E. Weigle, W. Feng, 2002, PDF
  • Automatic TCP Buffer Tuning, J. Semke, J. Mahdavi, M. Mathis, SIGCOMM 1998, PS
  • Dynamic Right-Sizing in TCP, M. Fisk, W. Feng, Proc. of the Los Alamos Computer Science Institute Symposium, October 2001, PDF
  • Dynamic Right-Sizing in FTP (drsFTP): Enhancing Grid Performance in User-Space, M.K. Gardner, Wu-chun Feng, M. Fisk, July 2002, PDF. This paper describes an implementation of buffer-tuning at application level for FTP, i.e. outside of the kernel.
  • Socket Buffer Auto-Sizing for High-Performance Data Transfers, R. Prasad, M. Jain, C. Dovrolis, PDF
  • The Cable Guy: TCP Receive Window Auto-Tuning, J. Davies, January 2007, Microsoft TechNet Magazine
  • What's New in FreeBSD 7.0, F. Biancuzzi, A. Oppermann et al., February 2008, ONLamp
  • How to disable the TCP autotuning diagnostic tool, Microsoft Support, Article ID: 967475, February 2009

-- SimonLeinen - 04 Apr 2006 - 21 Mar 2011
-- ChrisWelti - 03 Aug 2006

Application Protocols

-- TobyRodwell - 28 Feb 2005

File Transfer

A common problem for many scientific applications is the replication of - often large - data sets (files) from one system to another. (For the generalized problem of transferring data sets from a source to multiple destinations, see DataDissemination.) Typically this requires reliable transfer (protection against transmission errors) such as provided by TCP, typically access control based on some sort of authentication, and sometimes confidentiality against eavesdroppers, which can be provided by encryption. There are many protocols that can be used for file transfer, some of which are outlined here.

  • FTP, the File Transfer Protocol, was one of the earliest protocols used on the ARPAnet and the Internet, and predates both TCP and IP. It supports simple file operations over a variety of operating systems and file abstractions, and has both a text and a binary mode. FTP uses separate TCP connections for control and data transfer.
  • HTTP, the Hypertext Transfer Protocol, is the basic protocol used by the World Wide Web. It is quite efficient for transferring files, but is typically used to transfer from a server to a client only.
  • RCP, the Berkeley Remote Copy Protocol, is a convenient protocol for transferring files between Unix systems, but lacks real security beyond address-based authentication and clear-text passwords. Therefore it has mostly fallen out of use.
  • SCP is a file-transfer application of the SSH protocol. It provides various modern methods of authentication and encryption, but its current implementations have some performance limitations over "long fat networks" that are addressed under the SSH topic.
  • BitTorrent is an example of a peer-to-peer file-sharing protocol. It employs local control mechanisms to optimize the global problem of replicating a large file to many recipients, by allowing peers to share partial copies as they receive them.
  • VFER is a tool for high-performance data transfer developed at Internet2. It is layered on UDP and implements its own delay-based rate control algorithm in user-space, which is designed to be "TCP friendly". Its security is based on SSH.
  • UDT is another UDP-based bulk transfer protocol, optimized for high-capacity (1 Gb/s and above) wide-area network paths. It has been used in the winning entry at the Supercomputing'06 Bandwidth Challenge.

Several high-performance file transfer protocols are used in the Grid community. The "comparative evaluation..." paper in the references compares FOBS, RBUDP, UDT, and bbFTP. Other protocols include GridFTP, Tsunami and FDT. The eVLBI community uses file transfer tools from the Mark5 software suite: File2Net and Net2File. The ESnet "Fasterdata" knowledge base has a very nice section on Data Transfer Tools, providing both general background information and information about several specific tools. Another useful document is Harry Mangalam's How to transfer large amounts of data via network—a nicely written general introduction to the problem of moving data with many usage examples of specialized tools, including performance numbers and tuning hints.

Network File Systems

Another possibility of exchanging files over the network involves networked file systems, which make remote files transparently accessible in a local system's normal file namespace. Examples for such file systems are:

  • NFS, the Network File System, was initially developed by Sun and is widely utilized on Unix-like systems. Very recently, NFS 4.1 added support for pNFS (parallel NFS), where data access can be striped over multiple data servers.
  • AFS, the Andrew File System from CMU, evolved into DFS (Distributed File System)
  • SMB (Server Message Block) or CIFS (Common Internet File System) is the standard protocol for connecting to "network shares" (remote file systems) in the Windows world.
  • GPFS (General Parallel File System) is a high-performance scalable network file system by IBM.
  • Lustre is an open-source file systems for high-performance clusters, distributed by Sun.

References

-- TobyRodwell - 2005-02-28
-- SimonLeinen - 2005-06-26 - 2015-04-22

-- TobyRodwell - 28 Feb 2005

Secure Shell (SSH)

SSH is a widely used protocol for remote terminal access with secure authentication and data encryption. It is also used for file transfers, using tools such as scp (Secure Copy), sftp (Secure FTP), or rsync-over-ssh.

Performance Issues With SSH

Application Layer Window Limitation

When users use SSH to transfer large files, they often think that performance is limited by the processing power required for encryption and decryption. While this can indeed be an issue in a LAN context, the bottleneck over "long fat networks" (LFNs) is most likely a window limitation. Even when TCP parameters have been tuned to allow sufficiently large TCP Windows, the most common SSH implementation (OpenSSH) has a hardwired window size at the application level. Until OpenSSH 4.7, the limit was ~64K but since then, the limit was increased 16-fold (see below) and window increase logic was made more aggressive.

This limitation is replaced with a more advanced logic in a modification of the OpenSSH software provided by the Pittsburgh Supercomputing Center (see below).

The performance difference is substantial, especially as RTT grows. In a test setup with 45 ms RTT, two Linux systems with 8 MB read/write buffers could achieve 1.5 MB/s with regular OpenSSH (3.9p1 + 4.3p1). Switching to OpenSSH 5.1p1 + HPN-SSH patches on both ends allowed up to 55-70 MB/s (no encryption) or 35/50 MB/s (aes128-cbc/ctr encryption), with the stable rate somewhat lower, the bottleneck being the CPU on one end. By just upgrading the receiver (client) side, transfers could still reach 50 MB/s (with encryption).

Crypto overhead

When the window-size limitation is removed, encryption/decryption performance may become the bottleneck again. So it is useful to choose a "cipher" (encryption/decryption method) that performs well, while still being regarded as sufficiently secure to protect the data in question. Here is a table that displays the performance of several ciphers supported by OpenSSH in a reference setting:

cipher throughput
3des-cbc 2.8MB/s
arcfour 24.4MB/s
aes192-cbc 13.3MB/s
aes256-cbc 11.7MB/s
aes128-ctr 12.7MB/s
aes192-ctr 11.7MB/s
aes256-ctr 11.3MB/s
blowfish-cbc 16.3MB/s
cast128-cbc 7.9MB/s
rijndael-cbc@lysator.liu.se 12.2MB/s

The High Performance Enabled SSH/SCP (HPN-SSH) version also supports an option to the scp program that allows use of the "none" cipher, when confidentiality protection of the transferred data is not required. The program also supports a cipher-switch option whereby password authentication is encrypted but the transferred data is not.
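
For example, a faster cipher from the table above can be selected per transfer with scp's -c option, and with the HPN-SSH patches the data stream can be left unencrypted. The host, path and especially the HPN option names below are taken from the HPN-SSH documentation as we recall it and may differ between patch versions:

$ scp -c aes128-ctr bigfile user@remote.example.org:/data/
$ scp -oNoneEnabled=yes -oNoneSwitch=yes bigfile user@remote.example.org:/data/   (HPN-SSH only)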

References

Basics

SSH Performance

-- ChrisWelti - 03 Apr 2006
-- SimonLeinen - 12 Feb 2005 - 26 Apr 2010
-- PekkaSavola - 01 Oct 2008

BitTorrent

BitTorrent is an example of a peer-to-peer file-sharing protocol. It employs local control mechanisms to optimize the global problem of replicating a large file to many recipients, by allowing peers to share partial copies as they receive them. It was developed by Bram Cohen, and is widely used to distribute large files over the Internet. Because much of this copying is of audio and video data and happens without the consent of the rights owners of that material, BitTorrent has become a focus of attention of media interest groups such as the Motion Picture Association of America (MPAA). But BitTorrent is also used to distribute large software archives under "Free Software" or similar legal-redistribution regimes.

References

-- SimonLeinen - 26 Jun 2005

 

Troubleshooting Procedures

PERT Troubleshooting Procedures, as used in the earlier centralised PERT in GN2 (2004-2008), are laid down in GN2 Deliverable DS3.5.2 (see references section), which also contains the User Guide for using the (now-defunct) PERT Ticket System (PTS).

Standard tests

  • Ping - simple RTT/loss measurement using ICMP ECHO requests
  • Fping - ping variant with concurrent measurement of multiple destinations

End-user tests

The DSL Reports "tweak tools" page (http://www.dslreports.com/tweaks) is able to run a simple end-user test, checking such parameters as TCP options, receive window size and data transfer rates. It is all done through the user's web browser, making it a simple test for end users to perform.

Linux Tools and Checks

ip route show cache

Linux is able to apply specific conditions (MSS, ssthresh) to specific routes. Some parameters (such as MSS and ssthresh) can be set manually (with ip route add|replace ...) whilst others are changed automatically by Linux in response to what it learns from TCP (such parameters include estimated rtt, cwnd and re-ordering). The learned information is stored in the route cache and thus can be shown with ip route show cache. Note that this learning behaviour can actually limit TCP performance - if the last transfer was poor, then the starting TCP parameters will be pessimistic. For this reason some tools, e.g. bwctl, always flush the route cache before starting a test.
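
Typical commands look as follows (root privileges are needed for the flush and replace operations; the addresses are purely illustrative, and the metric names are those understood by iproute2):

# ip route show cache
# ip route flush cache                  (discard the learned per-destination metrics)
# ip route replace 192.0.2.0/24 via 10.0.0.1 dev eth0 ssthresh 100 cwnd 20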

References

  • PERT Troubleshooting Procedures - 2nd Edition, B. Belter, A. Harding, T. Rodwell, GN2 deliverable D3.5.2, August 2006 (PDF)

-- SimonLeinen - 18 Sep 2005 - 17 Aug 2010
-- BartoszBelter - 28 Mar 2006

Measurement Tools

Traceroute-like Tools: traceroute, MTR, PingPlotter, lft, tracepath, traceproto

There is a large and growing number of path-measurement tools derived from the well-known traceroute tool. Those tools all attempt to find the route a packet will take from the source (typically where the tool is running) to a given destination, and to find out some hop-wise performance parameters along the way.

Bandwidth Measurement Tools: pchar, Iperf, bwctl, nuttcp, Netperf, RUDE/CRUDE, ttcp, NDT, DSL Reports

Use: Evaluate the Bandwidth between two points in the network.

Active Measurement Boxes

Use: information about Delay, Jitter and packets loss between two probes.

Passive Measurement Tools

These tools perform their measurement by looking at existing traffic. They can be further classified into

References

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- HankNussbacher - 01 Jun 2005
-- SimonLeinen - 2005-07-15 - 2014-12-27
-- AlessandraScicchitano - 2013-01-30

traceroute-like Tools

Traceroute is used to determine the route a packet takes through the Internet to reach its destination; i.e. the sequence of gateways or "hops" it passes through. Since its inception, traceroute has been widely used for network diagnostics as well as for research in the widest sense.

The basic idea is to send out "probe" packets with artificially small TTL (time-to-live) values, eliciting ICMP "time exceeded" messages from routers in the network, and reconstructing the path from these messages. This is described in more detail under the VanJacobsonTraceroute topic. This original traceroute implementation was followed by many attempts at improving on the idea in various directions: More useful output (often graphically enhanced, sometimes trying to map the route geographically), more detailed measurements along the path, faster operation for many targets ("topology discovery"), and more robustness in the face of packet filters, using different types of suitable probe packets.

Traceroute Variants

Close cousins

Variants with more filter-friendly probe packets

Extensions

GUI (Graphical User Interface) traceroute variants

  • 3d Traceroute plots per-hop RTTs over time in 3D. Works on Windows, free demo available
  • LoriotPro graphical traceroute plugin. Several graphical representations.
  • Path Analyzer Pro, graphical traceroute tool for Windows and MacOS X. Include "synopsis" feature.
  • PingPlotter combines traceroute, ping and whois to collect data for Windows platform.
  • VisualRoute performs traceroutes and provides various graphical representations and analysis. Runs on Windows, Web demo available.

More detailed measurements along the path

Other

  • Multicast Traceroute (mtrace) uses IGMP protocol extensions to allow "traceroute" functionality for multicast
  • Paris Traceroute claims to produce better results in the presence of "multipath" routing.
  • Scamper performs bulk measurements to large numbers of destinations, under configurable traffic limits.
  • tracefilter tries to detect where along a path packets are filtered.
  • Tracebox tries to detect middleboxes along the path
  • traceiface tries to find "both ends" of each link traversed using expanding-ring search.
  • Net::Traceroute::PurePerl is an implementation of traceroute in Perl that lends itself to experimentation.
  • traceflow is a proposed protocol to trace the path for traffic of a specific 3/5-tuple and collect diagnostic information

Another list of traceroute variants, with a focus on (geo)graphical display capabilities, can be found on the JRA1 Wiki.

Traceroute Servers

There are many TracerouteServers on the Internet that allow running traceroute from other parts of the network.

Traceroute Data Collection and Representation

Researchers from the network measurement community have created large collections of traceroute results to help understand the Internet's topology, i.e. the structure of its connectivity. Some of these collections are available for other researchers. Scamper is an example of a tool that can be used to efficiently obtain traceroute results towards a large number of destinations.

The IETF IPPM WG is standardizing an XML-based format to store traceroute results.

-- FrancoisXavierAndreu - SimonMuyal - 06 Jun 2005
-- SimonLeinen - 2005-05-06 - 2015-09-11


LFT (Layer Four Traceroute)

lft is a variant of traceroute that by default uses TCP and port 80 in order to get through packet-filter-based firewalls.

Example

142:/home/andreu# lft -d 80 -m 1 -M 3 -a 5 -c 20 -t 1000 -H 30 -s 53 www.cisco.com     
Tracing _________________________________.
TTL  LFT trace to www.cisco.com (198.133.219.25):80/tcp
1   129.renater.fr (193.49.159.129) 0.5ms
2   gw1-renater.renater.fr (193.49.159.249) 0.4ms
3   nri-a-g13-0-50.cssi.renater.fr (193.51.182.6) 1.0ms
4   193.51.185.1 0.6ms
5   PO11-0.pascr1.Paris.opentransit.net (193.251.241.97) 7.0ms
6   level3-1.GW.opentransit.net (193.251.240.214) 0.8ms
7   ae-0-17.mp1.Paris1.Level3.net (212.73.240.97) 1.1ms
8   so-1-0-0.bbr2.London2.Level3.net (212.187.128.42) 10.6ms
9   as-0-0.bbr1.NewYork1.Level3.net (4.68.128.106) 72.1ms
10   as-0-0.bbr1.SanJose1.Level3.net (64.159.1.133) 158.7ms
11   ge-7-0.ipcolo1.SanJose1.Level3.net (4.68.123.9) 159.2ms
12   p1-0.cisco.bbnplanet.net (4.0.26.14) 159.4ms
13   sjck-dmzbb-gw1.cisco.com (128.107.239.9) 159.0ms
14   sjck-dmzdc-gw2.cisco.com (128.107.224.77) 159.1ms
15   [target] www.cisco.com (198.133.219.25):80 159.2ms

References

  • Home page: http://pwhois.org/lft/
  • Debian package: lft - display the route packets take to a network host/socket

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- SimonLeinen - 21 May 2006

traceproto

traceproto is another traceroute variant which allows different protocols and ports to be used. "It currently supports tcp, udp, and icmp traces with the possibility of others in the future." It comes with a wrapper script called HopWatcher, which can be used to quickly detect when a path has changed.

Example

142:/home/andreu# traceproto www.cisco.com
traceproto: trace to www.cisco.com (198.133.219.25), port 80
ttl  1:  ICMP Time Exceeded from 129.renater.fr (193.49.159.129)
        6.7040 ms       0.28100 ms      0.28600 ms
ttl  2:  ICMP Time Exceeded from gw1-renater.renater.fr (193.49.159.249)
        0.16900 ms      6.0140 ms       0.25500 ms
ttl  3:  ICMP Time Exceeded from nri-a-g13-0-50.cssi.renater.fr (193.51.182.6)
        6.8280 ms       0.58200 ms      0.52100 ms
ttl  4:  ICMP Time Exceeded from 193.51.185.1 (193.51.185.1)
        6.6400 ms       7.4230 ms       6.7690 ms
ttl  5:  ICMP Time Exceeded from PO11-0.pascr1.Paris.opentransit.net (193.251.241.97)
        0.58100 ms      0.64100 ms      0.54700 ms
ttl  6:  ICMP Time Exceeded from level3-1.GW.opentransit.net (193.251.240.214)
        6.9390 ms       0.62200 ms      6.8990 ms
ttl  7:  ICMP Time Exceeded from ae-0-17.mp1.Paris1.Level3.net (212.73.240.97)
        7.0790 ms       7.0250 ms       0.79400 ms
ttl  8:  ICMP Time Exceeded from so-1-0-0.bbr2.London2.Level3.net (212.187.128.42)
        10.362 ms       10.100 ms       16.384 ms
ttl  9:  ICMP Time Exceeded from as-0-0.bbr1.NewYork1.Level3.net (4.68.128.106)
        109.93 ms       78.367 ms       80.352 ms
ttl  10:  ICMP Time Exceeded from as-0-0.bbr1.SanJose1.Level3.net (64.159.1.133)
        156.61 ms       179.35 ms
          ICMP Time Exceeded from ae-0-0.bbr2.SanJose1.Level3.net (64.159.1.130)
        148.04 ms
ttl  11:  ICMP Time Exceeded from ge-7-0.ipcolo1.SanJose1.Level3.net (4.68.123.9)
        153.59 ms
         ICMP Time Exceeded from ge-11-0.ipcolo1.SanJose1.Level3.net (4.68.123.41)
        142.50 ms
         ICMP Time Exceeded from ge-7-1.ipcolo1.SanJose1.Level3.net (4.68.123.73)
        133.66 ms
ttl  12:  ICMP Time Exceeded from p1-0.cisco.bbnplanet.net (4.0.26.14)
        150.13 ms       191.24 ms       156.89 ms
ttl  13:  ICMP Time Exceeded from sjck-dmzbb-gw1.cisco.com (128.107.239.9)
        141.47 ms       147.98 ms       158.12 ms
ttl  14:  ICMP Time Exceeded from sjck-dmzdc-gw2.cisco.com (128.107.224.77)
        188.85 ms       148.17 ms       152.99 ms
ttl  15:no response     no response     

hop :  min   /  ave   /  max   :  # packets  :  # lost
      -------------------------------------------------------
  1 : 0.28100 / 2.4237 / 6.7040 :   3 packets :   0 lost
  2 : 0.16900 / 2.1460 / 6.0140 :   3 packets :   0 lost
  3 : 0.52100 / 2.6437 / 6.8280 :   3 packets :   0 lost
  4 : 6.6400 / 6.9440 / 7.4230 :   3 packets :   0 lost
  5 : 0.54700 / 0.58967 / 0.64100 :   3 packets :   0 lost
  6 : 0.62200 / 4.8200 / 6.9390 :   3 packets :   0 lost
  7 : 0.79400 / 4.9660 / 7.0790 :   3 packets :   0 lost
  8 : 10.100 / 12.282 / 16.384 :   3 packets :   0 lost
  9 : 78.367 / 89.550 / 109.93 :   3 packets :   0 lost
 10 : 148.04 / 161.33 / 179.35 :   3 packets :   0 lost
 11 : 133.66 / 143.25 / 153.59 :   3 packets :   0 lost
 12 : 150.13 / 166.09 / 191.24 :   3 packets :   0 lost
 13 : 141.47 / 149.19 / 158.12 :   3 packets :   0 lost
 14 : 148.17 / 163.34 / 188.85 :   3 packets :   0 lost
 15 : 0.0000 / 0.0000 / 0.0000 :   0 packets :   2 lost
     ------------------------Total--------------------------
total 0.0000 / 60.540 / 191.24 :  42 packets :   2 lost

References

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005

MTR (Matt's TraceRoute)

mtr (Matt's TraceRoute) combines the functionality of the traceroute and ping programs in a single network diagnostic tool.

Example

mtr.png

References

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005

-- SimonLeinen - 06 May 2005 - 26 Feb 2006


pathping

pathping is similar to mtr, but for Windows systems. It is included in at least some modern versions of Windows.

References

  • "pathping", Windows XP Professional Product Documentation,
http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/pathping.mspx

-- SimonLeinen - 26 Feb 2006

PingPlotter

PingPlotter by Nessoft LLC combines traceroute, ping and whois to collect data for the Windows platform.

Example

A nice example of the tool can be found in the "Screenshot" section of its Web page. This contains not only the screenshot (included below by reference) but also a description of what you can read from it.

nessoftgraph.gif

References

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- SimonLeinen - 26 Feb 2006

Traceroute Mesh Server

Traceroute Mesh combines traceroutes from many sources at once and builds a large map of the interconnections from the Internet to the specific IP. Very cool!

References

Example

Partial view of traceroute map created:

partial.png

-- HankNussbacher - 10 Jul 2005

tracepath

tracepath and tracepath6 trace the path to a network host, discovering MTU and asymmetry along this path. As described below, their applicability for path asymmetry measurements is quite limited, but the tools can still measure MTU rather reliably.

Methodology and Caveats

A path is considered asymmetric if the number of hops to a router is different from how much TTL was decremented while the ICMP error message was forwarded back from the router. The latter depends on knowing what was the original TTL the router used to send the ICMP error. The tools guess TTL values 64, 128 and 255. Obviously, a path might be asymmetric even if the forward and return paths were equally long, so the tool just catches one case of path asymmetry.

A major operational issue with this approach is that at least Juniper's M/T-series routers decrement TTL for ICMP errors they originate (e.g., the first hop router returns ICMP error with TTL=254 instead of TTL=255) as if they were forwarding the packet. This shows as path asymmetry.

Path MTU is measured by sending UDP packets with the DF bit set. The initial packet size is the MTU of the host's outgoing link, possibly reduced by a cached Path MTU Discovery result for the destination address. If a link MTU along the path is lower than the probe size, the ICMP error reports the new path MTU, which is then used in subsequent probes.

As explained in tracepath(8), if MTU changes along the path, then the route will probably erroneously be declared as asymmetric.

Examples

IPv4 Example

This example shows a path from a host with 9000-byte "jumbo" MTU support to a host on a traditional 1500-byte Ethernet.

: leinen@cemp1[leinen]; tracepath diotima
 1:  cemp1.switch.ch (130.59.35.130)                        0.203ms pmtu 9000
 1:  swiCE2-G5-2.switch.ch (130.59.35.129)                  1.024ms
 2:  swiLS2-10GE-1-3.switch.ch (130.59.37.2)                1.959ms
 3:  swiEZ2-10GE-1-1.switch.ch (130.59.36.206)              5.287ms
 4:  swiCS3-P1.switch.ch (130.59.36.221)                    5.456ms
 5:  swiCS3-P1.switch.ch (130.59.36.221)                  asymm  4   5.467ms pmtu 1500
 6:  swiLM1-V610.switch.ch (130.59.15.230)                  4.864ms
 7:  swiLM1-V610.switch.ch (130.59.15.230)                asymm  6   5.209ms !H
     Resume: pmtu 1500

The router (interface) swiCS3-P1.switch.ch occurs twice; on the first line (hop 4), it returns an ICMP TTL Exceeded error, on the next (hop 5) it returns an ICMP "fragmentation needed and DF bit set" error. Unfortunately this causes tracepath to miss the "real" hop 5, and also to erroneously assume that the route is asymmetric at that point. One could consider this a bug, as tracepath could distinguish these different ICMP errors, and refrain from incrementing TTL when it reduces MTU (in response to the "fragmentation needed..." error).

When one retries the tracepath, the discovered Path MTU for the destination has been cached by the host, and you get a different result:

: leinen@cemp1[leinen]; tracepath diotima
 1:  cemp1.switch.ch (130.59.35.130)                        0.211ms pmtu 1500
 1:  swiCE2-G5-2.switch.ch (130.59.35.129)                  0.384ms
 2:  swiLS2-10GE-1-3.switch.ch (130.59.37.2)                1.214ms
 3:  swiEZ2-10GE-1-1.switch.ch (130.59.36.206)              4.620ms
 4:  swiCS3-P1.switch.ch (130.59.36.221)                    4.623ms
 5:  swiNM1-G1-0-25.switch.ch (130.59.15.237)               5.861ms
 6:  swiLM1-V610.switch.ch (130.59.15.230)                  4.845ms
 7:  swiLM1-V610.switch.ch (130.59.15.230)                asymm  6   5.226ms !H
     Resume: pmtu 1500

Note that hop 5 now shows up correctly and without an "asymm" warning. There is still an "asymm" warning at the end of the path, because a filter on the last-hop router swiLM1-V610.switch.ch prevents the UDP probes from reaching the final destination.

IPv6 Example

Here is the same path for IPv6, using tracepath6. Because of more relaxed UDP filters, the final destination is actually reached in this case:

: leinen@cemp1[leinen]; tracepath6 diotima
 1?: [LOCALHOST]                      pmtu 9000
 1:  swiCE2-G5-2.switch.ch                      1.654ms
 2:  swiLS2-10GE-1-3.switch.ch                  2.235ms
 3:  swiEZ2-10GE-1-1.switch.ch                  5.616ms
 4:  swiCS3-P1.switch.ch                        5.793ms
 5:  swiCS3-P1.switch.ch                      asymm  4   5.872ms pmtu 1500
 5:  swiNM1-G1-0-25.switch.ch                   5. 47ms
 6:  swiLM1-V610.switch.ch                      5. 79ms
 7:  diotima.switch.ch                          4.766ms reached
     Resume: pmtu 1500 hops 7 back 7

Again, once the Path MTU has been cached, tracepath6 starts out with that MTU, and will discover the correct path:

: leinen@cemp1[leinen]; tracepath6 diotima
 1?: [LOCALHOST]                      pmtu 1500
 1:  swiCE2-G5-2.switch.ch                      0.703ms
 2:  swiLS2-10GE-1-3.switch.ch                  8.786ms
 3:  swiEZ2-10GE-1-1.switch.ch                  4.904ms
 4:  swiCS3-P1.switch.ch                        4.979ms
 5:  swiNM1-G1-0-25.switch.ch                   4.989ms
 6:  swiLM1-V610.switch.ch                      6.578ms
 7:  diotima.switch.ch                          5.191ms reached
     Resume: pmtu 1500 hops 7 back 7

References

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- SimonLeinen - 26 Feb 2006
-- PekkaSavola - 31 Aug 2006

Bandwidth Measurement Tools

(Click on blue headings to link through to detailed tool descriptions and examples)

Iperf

Iperf is a tool to measure maximum TCP bandwidth, allowing the tuning of various parameters and UDP characteristics. Iperf reports bandwidth, delay jitter, datagram loss. http://sourceforge.net/projects/iperf/
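
Typical usage is to run a server on one host and a client on the other, optionally requesting a larger socket buffer and a longer test. The host name is a placeholder, and the options follow the classic iperf 2 command line:

server$ iperf -s
client$ iperf -c server.example.org -w 4M -t 30 -i 5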

BWCTL

"BWCTL is a command line client application and a scheduling and policy daemon that wraps Iperf"

Home page: http://e2epi.internet2.edu/bwctl/

Example: http://e2epi.internet2.edu/pipes/pmp/pmp-switch.htm

nuttcp

nuttcp is a measurement tool very similar to iperf. It can be found at http://www.nuttcp.net/nuttcp/Welcome%20Page.html

NDT

Web100-based TCP tester that can be used from a Java applet.

Pchar

Hop-by-hop capacity measurements along a network path.

Netperf

A client/server network performance benchmark.

Home page: http://www.netperf.org/netperf/NetperfPage.html

Thrulay

A tool that performs TCP throughput tests and RTT measurements at the same time.

RUDE/CRUDE

RUDE is a package of applications to generate and measure UDP traffic between two points. rude generates traffic into the network, which can be received and logged on the other side of the network with crude. The traffic pattern can be defined by the user. The tool is available at http://rude.sourceforge.net/

NEPIM

nepim stands for network pipemeter, a tool for measuring available bandwidth between hosts. nepim is also useful for generating network traffic for testing purposes.

nepim operates in client/server mode, is able to handle multiple parallel traffic streams, reports periodic partial statistics during the test, and supports IPv6.

The tool is available at http://www.nongnu.org/nepim/

TTCP

TTCP (Test TCP) is a utility for benchmarking UDP and TCP performance. A utility for Unix and Windows is available at http://www.pcausa.com/Utilities/pcattcp.htm

DSL Reports

DSL Reports doesn't require any installation. It checks the speed of your connection via a Java applet. One can choose from over 300 sites worldwide from which to run the tests: http://www.dslreports.com/stest

* Sample report produced by DSL Reports:
dslreport.png

Other online bandwidth measurement sites:

* http://myspeed.visualware.com/

* http://www.toast.net/performance/

* http://www.beelinebandwidthtest.com/

-- FrancoisXavierAndreu, SimonMuyal & SimonLeinen - 06-30 Jun 2005

-- HankNussbacher - 07 Jul 2005 & 15 Oct 2005 (DSL and online Reports section)

pchar

pchar characterizes the bandwidth, latency and loss on network links. (See the notes below the example for information on how pchar works.)

(debian package : pchar)

Example:

pchar to 193.51.180.221 (193.51.180.221) using UDP/IPv4
Using raw socket input
Packet size increments from 32 to 1500 by 32
46 test(s) per repetition
32 repetition(s) per hop
 0: 193.51.183.185 (netflow-nri-a.cssi.renater.fr)
    Partial loss:      0 / 1472 (0%)
    Partial char:      rtt = 0.124246 ms, (b = 0.000206 ms/B), r2 = 0.997632
                       stddev rtt = 0.001224, stddev b = 0.000002
    Partial queueing:  avg = 0.000158 ms (765 bytes)
    Hop char:          rtt = 0.124246 ms, bw = 38783.892367 Kbps
    Hop queueing:      avg = 0.000158 ms (765 bytes)
 1: 193.51.183.186 (nri-a-g13-1-50.cssi.renater.fr)
    Partial loss:      0 / 1472 (0%)
    Partial char:      rtt = 1.087330 ms, (b = 0.000423 ms/B), r2 = 0.991169
                       stddev rtt = 0.004864, stddev b = 0.000006
    Partial queueing:  avg = 0.005093 ms (23535 bytes)
    Hop char:          rtt = 0.963084 ms, bw = 36913.554996 Kbps
    Hop queueing:      avg = 0.004935 ms (22770 bytes)
 2: 193.51.179.122 (nri-n3-a2-0-110.cssi.renater.fr)
    Partial loss:      5 / 1472 (0%)
    Partial char:      rtt = 697.145142 ms, (b = 0.032136 ms/B), r2 = 0.999991
                       stddev rtt = 0.011554, stddev b = 0.000014
    Partial queueing:  avg = 0.009681 ms (23679 bytes)
    Hop char:          rtt = 696.057813 ms, bw = 252.261443 Kbps
    Hop queueing:      avg = 0.004589 ms (144 bytes)
 3: 193.51.180.221 (caledonie-S1-0.cssi.renater.fr)
    Path length:       3 hops
    Path char:         rtt = 697.145142 ms r2 = 0.999991
    Path bottleneck:   252.261443 Kbps
    Path pipe:         21982 bytes
    Path queueing:     average = 0.009681 ms (23679 bytes)
    Start time:        Mon Jun  6 11:38:54 2005
    End time:          Mon Jun  6 12:15:28 2005
If you do not have access to a Unix system, you can run pchar remotely via: http://noc.greatplains.net/measurement/pchar.php

The README text below was written by Bruce A Mah and is taken from http://www.kitchenlab.org/www/bmah/Software/pchar/README, where there is information on how to obtain and install pchar.

PCHAR:  A TOOL FOR MEASURING NETWORK PATH CHARACTERISTICS
Bruce A. Mah
<bmah@kitchenlab.org>
$Id: PcharTool.txt,v 1.3 2005/08/09 16:59:27 TobyRodwell Exp www-data $
---------------------------------------------------------

INTRODUCTION
------------

pchar is a reimplementation of the pathchar utility, written by Van
Jacobson.  Both programs attempt to characterize the bandwidth,
latency, and loss of links along an end-to-end path through the
Internet.  pchar works in both IPv4 and IPv6 networks.

As of pchar-1.5, this program is no longer under active development,
and no further releases are planned.

...

A FEW NOTES ON PCHAR'S OPERATION
--------------------------------

pchar sends probe packets into the network of varying sizes and
analyzes ICMP messages produced by intermediate routers, or by the
target host.  By measuring the response time for packets of different
sizes, pchar can estimate the bandwidth and fixed round-trip delay
along the path.  pchar varies the TTL of the outgoing packets to get
responses from different intermediate routers.  It can use UDP or ICMP
packets as probes; either or both might be useful in different
situations.

At each hop, pchar sends a number of packets (controlled by the -R flag)
of varying sizes (controlled by the -i and -m flags).  pchar determines
the minimum response times for each packet size, in an attempt to
isolate jitter caused by network queueing.  It performs a simple
linear regression fit to the resulting minimum response times.  This
fit yields the partial path bandwidth and round-trip time estimates.

To yield the per-hop estimates, pchar computes the differences in the
linear regression parameter estimates for two adjacent partial-path
datasets.  (Earlier versions of pchar differenced the minima for the
datasets, then computed a linear regressions.)  The -a flag selects
between one of (currently) two different algorithms for performing the
linear regression, either a least squares fit or a nonparametric
method based on Kendall's test statistic.

Using the -b option causes pchar to send small packet bursts,
consisting of a string of back-to-back ICMP ECHO_REPLY packets
followed by the actual probe.  This can be useful in probing switched
networks.

CAVEATS
-------

Router implementations may very well forward a packet faster than they
can return an ICMP error message in response to a packet.  Because of
this fact, it's possible to see faster response times from longer
partial paths; the result is a seemingly non-sensical, negative
estimate of per-hop round-trip time.

Transient fluctuations in the network may also cause some odd results.

If all else fails, writing statistics to a file will give all of the
raw data that pchar used for its analysis.

Some types of networks are intrinsically difficult for pchar to
measure.  Two notable examples are switched networks (with multiple
queues at Layer 2) or striped networks.  We are currently
investigating methods for trying to measure these networks.

pchar needs superuser access due to its use of raw sockets.

...

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005

-- HankNussbacher - 10 Jul 2005 (Great Plains server)

-- TobyRodwell - 09 Aug 2005 (added Bruce A. Mah's README text)

Iperf

Iperf is a tool to measure TCP throughput and available bandwidth, allowing the tuning of various parameters and UDP characteristics. Iperf reports bandwidth, delay variation, and datagram loss.

The popular Iperf 2 releases were developed by NLANR/DAST (http://dast.nlanr.net/Projects/Iperf/) and maintained at http://sourceforge.net/projects/iperf. As of April 2014, the last released version was 2.0.5, from July 2010.

A script that automates starting and then stopping iperf servers is available here. It can be invoked from a remote machine (say, a NOC workstation) to simplify starting, and more importantly stopping, an iperf server.

Iperf 3

ESnet and the Lawrence Berkeley National Laboratory have developed a from-scratch reimplementation of Iperf called Iperf 3. It has a Github repository. It is not compatible with iperf 2, and has additional interesting features such as a zero-copy TCP mode (-Z flag), JSON output (-J), and reporting of TCP retransmission counts and CPU utilization (-V). It also supports SCTP in addition to UDP and TCP. Since December 2013, various public releases have been made on http://stats.es.net/software/.
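
As a brief sketch of these options (the server name is purely illustrative), a test exercising the zero-copy, verbose and JSON output features mentioned above might be run as follows:

iperf3 -s                                 # on the server
iperf3 -c server.example.org -Z -V -J     # on the client: zero-copy test, verbose output, JSON results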

Usage Examples

TCP Throughput Test

The following shows a TCP throughput test, which is iperf's default action. The following options are given:

  • -s - server mode. In iperf, the server will receive the test data stream.
  • -c server - client mode. The name (or IP address) of the server should be given. The client will transmit the test stream.
  • -i interval - display interval. Without this option, iperf will run the test silently, and only write a summary after the test has finished. With -i, the program will report intermediate results at given intervals (in seconds).
  • -w windowsize - select a non-default TCP window size. To achieve high rates over paths with a large bandwidth-delay product, it is often necessary to select a larger TCP window size than the (operating system) default.
  • -l buffer length - specify the length of the send or receive buffer. In UDP, this sets the packet size. In TCP, this sets the send/receive buffer length (possibly using system defaults). Setting this may be important, especially if the operating system's default send buffer is too small (e.g. in Windows XP).

NOTE: the -c and -s arguments must be given first; otherwise some configuration options are ignored.

The -i 1 option was given to obtain intermediate reports every second, in addition to the final report at the end of the ten-second test. The TCP buffer size was set to 2 Megabytes (4 Megabytes effective, see below) in order to permit close to line-rate transfers. The systems haven't been fully tuned; otherwise up to 7 Gb/s of TCP throughput should be possible. Normal background traffic on the 10 Gb/s backbone is on the order of 30-100 Mb/s. Note that in iperf, by default it is the client that transmits to the server.

Server Side:

welti@ezmp3:~$ iperf -s -w 2M -i 1
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 4.00 MByte (WARNING: requested 2.00 MByte)
------------------------------------------------------------
[  4] local 130.59.35.106 port 5001 connected with 130.59.35.82 port 41143
[  4]  0.0- 1.0 sec    405 MBytes  3.40 Gbits/sec
[  4]  1.0- 2.0 sec    424 MBytes  3.56 Gbits/sec
[  4]  2.0- 3.0 sec    425 MBytes  3.56 Gbits/sec
[  4]  3.0- 4.0 sec    422 MBytes  3.54 Gbits/sec
[  4]  4.0- 5.0 sec    424 MBytes  3.56 Gbits/sec
[  4]  5.0- 6.0 sec    422 MBytes  3.54 Gbits/sec
[  4]  6.0- 7.0 sec    424 MBytes  3.56 Gbits/sec
[  4]  7.0- 8.0 sec    423 MBytes  3.55 Gbits/sec
[  4]  8.0- 9.0 sec    424 MBytes  3.56 Gbits/sec
[  4]  9.0-10.0 sec    413 MBytes  3.47 Gbits/sec
[  4]  0.0-10.0 sec  4.11 GBytes  3.53 Gbits/sec

Client Side:

welti@mamp1:~$ iperf -c ezmp3 -w 2M -i 1
------------------------------------------------------------
Client connecting to ezmp3, TCP port 5001
TCP window size: 4.00 MByte (WARNING: requested 2.00 MByte)
------------------------------------------------------------
[  3] local 130.59.35.82 port 41143 connected with 130.59.35.106 port 5001
[  3]  0.0- 1.0 sec    405 MBytes  3.40 Gbits/sec
[  3]  1.0- 2.0 sec    424 MBytes  3.56 Gbits/sec
[  3]  2.0- 3.0 sec    425 MBytes  3.56 Gbits/sec
[  3]  3.0- 4.0 sec    422 MBytes  3.54 Gbits/sec
[  3]  4.0- 5.0 sec    424 MBytes  3.56 Gbits/sec
[  3]  5.0- 6.0 sec    422 MBytes  3.54 Gbits/sec
[  3]  6.0- 7.0 sec    424 MBytes  3.56 Gbits/sec
[  3]  7.0- 8.0 sec    423 MBytes  3.55 Gbits/sec
[  3]  8.0- 9.0 sec    424 MBytes  3.56 Gbits/sec
[  3]  0.0-10.0 sec  4.11 GBytes  3.53 Gbits/sec

UDP Test

In the following example, we send a 300 Mb/s UDP test stream. No packets were lost along the path, although one arrived out-of-order. Another interesting result is jitter, which is displayed as 27 or 28 microseconds (apparently there is some rounding error or other imprecision that prevents the client and server from agreeing on the value). According to the documentation, "Jitter is the smoothed mean of differences between consecutive transit times."

Server Side

: leinen@mamp1[leinen]; iperf -s -u
------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size: 64.0 KByte (default)
------------------------------------------------------------
[  3] local 130.59.35.82 port 5001 connected with 130.59.35.106 port 38750
[  3]  0.0-10.0 sec    359 MBytes    302 Mbits/sec  0.028 ms    0/256410 (0%)
[  3]  0.0-10.0 sec  1 datagrams received out-of-order

Client Side

: leinen@ezmp3[leinen]; iperf -c mamp1-eth0 -u -b 300M
------------------------------------------------------------
Client connecting to mamp1-eth0, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size: 64.0 KByte (default)
------------------------------------------------------------
[  3] local 130.59.35.106 port 38750 connected with 130.59.35.82 port 5001
[  3]  0.0-10.0 sec    359 MBytes    302 Mbits/sec
[  3] Sent 256411 datagrams
[  3] Server Report:
[  3]  0.0-10.0 sec    359 MBytes    302 Mbits/sec  0.027 ms    0/256410 (0%)
[  3]  0.0-10.0 sec  1 datagrams received out-of-order

As you would expect, during a UDP test traffic is only sent from the client to the server; see here for an example with tcpdump.

Problem isolation procedures using iperf

TCP Throughput measurements

Typically, end users report throughput problems as they experience them with their applications, such as unexpectedly slow file transfer times. Some users may already report TCP throughput results as measured with iperf. In any case, network administrators should validate the throughput problem. It is recommended to do this with memory-to-memory iperf measurements in TCP mode between the end systems. The window size of the TCP measurement should follow the bandwidth*delay product rule, and should therefore be set to at least the measured round-trip time multiplied by the path's bottleneck speed. If the actual bottleneck is not known (because of a lack of knowledge of the end-to-end path), it should be assumed that the bottleneck is the slower of the two end systems' network interface cards.

For instance, if one system is connected with Gigabit Ethernet but the other one with Fast Ethernet, and the measured round-trip time is 150 ms, then the window size should be set to at least 100 Mbit/s * 0.150 s / 8 = 1875000 bytes; setting the TCP window to 2 MBytes would be a good choice.
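
The same calculation can be scripted; the following minimal shell sketch (with illustrative values, not part of any tool described here) computes the window size in bytes from the bottleneck rate in Mbit/s and the RTT in milliseconds:

RATE_MBPS=100   # bottleneck speed, e.g. Fast Ethernet
RTT_MS=150      # measured round-trip time
echo $(( RATE_MBPS * 1000000 / 8 * RTT_MS / 1000 ))   # prints 1875000 (bytes)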

In theory, TCP throughput can reach, but not exceed, the available bandwidth on an end-to-end path. Knowledge of that network metric is therefore important for distinguishing between issues with the end systems' TCP stacks and network-related problems.

Available bandwidth measurements

Iperf can be used in UDP mode to measure the available bandwidth. Only short measurements, in the range of 10 seconds, should be done so as not to disturb other production flows. The goal of UDP measurements is to find the maximum UDP sending rate that results in almost no packet loss on the end-to-end path; a good practical packet-loss threshold is 1%. UDP data transfers that result in higher packet loss are likely to disturb TCP production flows and should therefore be avoided. A practical procedure for finding the available bandwidth is to start with UDP data transfers of 10 s duration, with interim result reports at one-second intervals. The data rate to start with should be slightly below the reported TCP throughput. If the measured packet loss is below the threshold, a new measurement with a slightly increased data rate can be started. This procedure of short UDP data transfers with increasing data rates is repeated until the packet-loss threshold is exceeded. Depending on the required accuracy, further tests can be run, starting from the highest data rate that kept packet loss below the threshold and using smaller rate increments. In the end, the maximum data rate that caused packet loss below the threshold can be taken as a good estimate of the available bandwidth on the end-to-end path.
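
A rough sketch of this stepping procedure as a shell loop is shown below (host name, starting rate and step size are illustrative; the loss percentage has to be read from iperf's server report after each run):

# Step the UDP sending rate up in 50 Mbit/s increments; inspect the reported
# loss after each 10-second run and stop increasing once it exceeds about 1%.
for RATE in 300M 350M 400M 450M 500M; do
    echo "=== UDP test at $RATE ==="
    iperf -c server.example.org -u -b $RATE -t 10 -i 1
done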

By comparing the reported application throughput with the measured TCP throughput and the measured available bandwidth, it is possible to distinguish between application problems, TCP stack problems, and network issues. Note however that the differing nature of UDP and TCP flows means that their measurements should not be directly compared. Iperf sends UDP datagrams at a constant, steady rate, whereas TCP tends to send packet trains. This means that TCP is likely to suffer from congestion effects at a lower data rate than UDP.

In case of unexpectedly low available-bandwidth measurements on the end-to-end path, network administrators are interested in where the bandwidth bottleneck is. The best way to find it is to retrieve passively measured link utilisations and provisioned capacities for all links along the path. However, if the path crosses multiple administrative domains this is often not possible, because of restrictions on getting those values from other domains. Therefore, it is common practice to use measurement workstations along the end-to-end path, and thus separate the end-to-end path into segments on which available-bandwidth measurements are done. This way it is possible to identify the segment on which the bottleneck occurs and to concentrate on it during further troubleshooting.

Other iperf use cases

Besides the capability of measuring TCP throughput and available bandwidth, in UDP mode iperf can report on packet reordering and delay jitter.

Other use cases for measurements using iperf are IPv6 bandwidth measurements and IP multicast performance measurements. More information on iperf's features, its source code, and binaries for various UNIX and Microsoft Windows operating systems can be found on the Iperf Web site.

Caveats and Known Issues

Impact on other traffic

As Iperf sends real, full data streams, it can reduce the available bandwidth on a given path. In TCP mode, the effect on co-existing production flows should be negligible, assuming the number of production flows is much greater than the number of test flows, which is normally a valid assumption on paths through a WAN. However, in UDP mode iperf has the potential to disturb production traffic, and in particular TCP streams, if the sender's data rate exceeds the available bandwidth on a path. Therefore, one should take particular care whenever running iperf tests in UDP mode.

TCP buffer allocation

On Linux systems, if you request a specific TCP buffer size with the "-w" option, the kernel will always try to allocate twice as many bytes as you specified.

Example: when you request a 2 MB window size, you'll get 4 MB:

welti@mamp1:~$ iperf -c ezmp3 -w 2M -i 1
------------------------------------------------------------
Client connecting to ezmp3, TCP port 5001
TCP window size: 4.00 MByte (WARNING: requested 2.00 MByte)    <<<<<<
------------------------------------------------------------

Counter overflow

Some versions seem to suffer from a 32-bit integer overflow which will lead to wrong results.

e.g.:

[ 14]  0.0-10.0 sec  953315416 Bytes  762652333 bits/sec
[ 14] 10.0-20.0 sec  1173758936 Bytes  939007149 bits/sec
[ 14] 20.0-30.0 sec  1173783552 Bytes  939026842 bits/sec
[ 14] 30.0-40.0 sec  1173769072 Bytes  939015258 bits/sec
[ 14] 40.0-50.0 sec  1173783552 Bytes  939026842 bits/sec
[ 14] 50.0-60.0 sec  1173751696 Bytes  939001357 bits/sec
[ 14]  0.0-60.0 sec  2531115008 Bytes  337294201 bits/sec
[ 14] MSS size 1448 bytes (MTU 1500 bytes, ethernet)

As you can see, the summary for 0-60 seconds doesn't match the average one would expect. This is because the total number of bytes is not correct, as a result of a counter wrap.

If you're experiencing this kind of effect, upgrade to the latest version of iperf, which should have this bug fixed.

UDP buffer sizing

Setting the UDP buffer size (the -w parameter) is also required for high-speed UDP transmissions. Otherwise the UDP receive buffer will typically overflow, and this will look like packet loss (although looking at tcpdumps or counters reveals that all the data arrived). This will show up in the UDP statistics (for example, on Linux with 'netstat -s' under Udp: ... receive packet errors). See more information at: http://www.29west.com/docs/THPM/udp-buffer-sizing.html
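
On Linux, for example, the relevant statistics and the kernel limit on socket receive buffers can be checked, and the limit raised, roughly as follows (the value is illustrative):

netstat -s | grep -A 8 '^Udp:'        # look for "receive packet errors" in the Udp section
sysctl net.core.rmem_max              # current limit on socket receive buffers (bytes)
sysctl -w net.core.rmem_max=8388608   # raise the limit, then re-run iperf with a larger -w on the receiver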

Control of measurements

There are two typical deployment scenarios, which differ in the kind of access the operator has to the sender and receiver instances. A measurement between well-located measurement workstations within an administrative domain, e.g. a campus network, allows network administrators full control over the server and client configurations (including test schedules), and allows them to retrieve full measurement results. Measurements on paths that extend beyond the administrative domain's borders require access to, or collaboration with the administrators of, the far-end systems. Iperf implements two features that simplify its use in this scenario, such that the operator does not need to have an interactive login account on the far-end system:

  • The server instance may run as a daemon (option -D) listening on a configurable transport protocol port, and
  • It is possible to run bi-directional tests, either one after the other (option -r) or simultaneously (option -d); see the sketch below.
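
A minimal sketch of this mode of operation (host name and port are illustrative; iperf 2 syntax is assumed):

iperf -s -D -p 5002                              # on the far-end system: run the server as a daemon
iperf -c far-end.example.org -p 5002 -r -w 2M    # from the local system: both directions, one after the other
iperf -c far-end.example.org -p 5002 -d -w 2M    # ... or both directions simultaneously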

Screen

Another method of running iperf on a *NIX device is to use 'screen'. Screen is a utility that lets you keep a session running even once you have logged out. It is described more fully here in its man pages, but a simple sequence applicable to iperf would be as follows:

[user@host]$screen -d -m iperf -s -p 5002 
This starts iperf -s -p 5002 as a 'detached' session

[user@host]$screen -ls
There is a screen on:
        25278..mp1      (Detached)
1 Socket in /tmp/uscreens/S-admin.
'screen -ls' shows open sessions.

'screen -r' reconnects to a running session. When in that session, keying 'CTRL+a', then 'd' detaches the screen. You can, if you wish, log out, log back in again, and re-attach. To end the iperf session (and the screen) just hit 'CTRL+c' whilst attached.

Note that BWCTL offers additional control and resource limitation features that make it more suitable for use over administrative domains.

Related Work

Public iperf servers

There used to be some public iperf servers available, but at present none are known. Similar services are provided by BWCTL (see below) and by public NDT servers.

BWCTL

BWCTL (BandWidth test ConTroLler) is a wrapper around iperf that provides scheduling and remote control of measurements.

Instrumented iperf (iperf100)

The iperf code provided by NLANR/DAST was instrumented in order to provide more information to the user. Iperf100 displays various web100 variables at the end of a transfer.

Patches are available at http://www.csm.ornl.gov/~dunigan/netperf/download/

The instrumented iperf requires a machine running a kernel.org linux-2.X.XX kernel with the latest web100 patches applied (http://www.web100.org)

Jperf

Jperf is a Java-based graphical front-end to iperf. It is now included as part of the iperf project on SourceForge. This was once available as a separate project called xjperf, but that seems to have been given up in favor of iperf/SourceForge integration.

iPerf for Android

An Android version of iperf appeared on Google Play (formerly the Android Market) in 2010.

nuttcp

Similar to iperf, but with an additional control connection that makes it somewhat more versatile.

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- HankNussbacher - 10 Jul 2005 (Great Plains server)
-- AnnHarding & OrlaMcGann - Aug 2005 (DS3.3.2 content)
-- SimonLeinen - 08 Feb 2006 (OpenSS7 variant, BWCTL pointer)
-- BartoszBelter - 28 Mar 2006 (iperf100)
-- ChrisWelti - 11 Apr 2006 (examples, 32-bit overflows, buffer allocation)
-- SimonLeinen - 01 Jun 2006 (integrated DS3.3.2 text from Ann and Orla)
-- SimonLeinen - 17 Sep 2006 (tracked iperf100 pointer)
-- PekkaSavola - 26 Mar 2008 (added warning about -c/s having to be first, a common gotcha)
-- PekkaSavola - 05 Jun 2008 (added discussion of '-l' parameter and its significance)
-- PekkaSavola - 30 Apr 2009 (added discussion of UDP (receive) buffer sizing significance)
-- SimonLeinen - 23 Feb 2012 (removed installation instructions, obsolete public iperf servers)
-- SimonLeinen - 22 May 2012 (added notes about Iperf 3, Jperf and iPerf for Android)
-- SimonLeinen - 01 Feb 2013 (added pointer to Android app; cross-reference to Nuttcp)
-- SimonLeinen - 06 April 2014 (updated Iperf 3 section: now on Github and with releases)
-- SimonLeinen - 05 May 2014 (updated Iperf 3 section: documented more new features)

BWCTL

BWCTL is a command line client application and a scheduling and policy daemon that wraps throughput measurement tools such as iperf, thrulay, and nuttcp (versions prior to 1.3 only support iperf).

More Information: For configuration, common problems, examples etc
Description of bwctld.limits parameters

A typical BWCTL result looks like one of the two printouts below. The first printout shows a test run from the local host to 193.136.3.155 ('-c' stands for 'collector'). The second printout shows a test run to the local host from 193.136.3.155 ('-s' stands for 'sender').

[user@ws4 user]# bwctl -c 193.136.3.155
bwctl: 17 seconds until test results available
RECEIVER START
3339497760.433479: iperf -B 193.136.3.155 -P 1 -s -f b -m -p 5001 -t 10
------------------------------------------------------------
Server listening on TCP port 5001
Binding to local address 193.136.3.155
TCP window size: 65536 Byte (default)
------------------------------------------------------------
[ 14] local 193.136.3.155 port 5001 connected with 62.40.108.82 port 35360
[ 14]  0.0-10.0 sec  16965632 Bytes  13547803 bits/sec
[ 14] MSS size 1448 bytes (MTU 1500 bytes, ethernet)
RECEIVER END


[user@ws4 user]# bwctl -s 193.136.3.155
bwctl: 17 seconds until test results available
RECEIVER START
3339497801.610690: iperf -B 62.40.108.82 -P 1 -s -f b -m -p 5004 -t 10
------------------------------------------------------------
Server listening on TCP port 5004
Binding to local address 62.40.108.82
TCP window size: 87380 Byte (default)
------------------------------------------------------------
[ 16] local 62.40.108.82 port 5004 connected with 193.136.3.155 port 5004
[ ID] Interval       Transfer     Bandwidth
[ 16]  0.0-10.0 sec  8298496 Bytes  6622281 bits/sec
[ 16] MSS size 1448 bytes (MTU 1500 bytes, ethernet)
[ 16] Read lengths occurring in more than 5% of reads:
[ 16]  1448 bytes read  5708 times (99.8%)
RECEIVER END

The read lengths are shown when the test is for received traffic. In his e-mail to TobyRodwell on 28 March 2005 Stanislav Shalunov of Internet2 writes:

"The values are produced by Iperf and reproduced by BWCTL.

An Iperf server gives you several most frequent values that the read() system call returned during a TCP test, along with the percentages of all read() calls for which the given read() lengths accounted.

This can let you judge the efficiency of interrupt coalescence implementations, kernel scheduling strategies and other such esoteric applesauce. Basically, reads shorter than 8kB can put quite a bit of load on the CPU (each read() call involves several microseconds of context switching). If the target rate is beyond a few hundred megabits per second, one would need to pay attention to read() lengths."

References (Pointers)

-- TobyRodwell - 28 Oct 2005
-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- SimonLeinen - 22 Apr 2008 - 07 Dec 2008

Netperf

A client/server network performance benchmark (debian package: netperf).

Example

coming soon
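
In the meantime, a minimal invocation might look like the following sketch (the host name is illustrative; it assumes netserver is already running on the remote side):

netperf -H remote.example.org -t TCP_STREAM -l 10   # 10-second TCP bulk-transfer test towards the remote netserver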

Flent (previously netperf-wrapper)

For his measurement work on fq_codel, Toke Høiland-Jørgensen developed Python wrappers for Netperf and added some tests, notably the RRUL (Realtime Response Under Load) test, which runs bulk TCP transfers in parallel with UDP and ICMP response-time measurements. This can be used to expose Bufferbloat on a path.

These tools used to be known as "netperf-wrappers", but were renamed to Flent (the Flexible Network Tester) in early 2015.

References

-- FrancoisXavierAndreu & SimonMuyal - 2005-06-06
-- SimonLeinen - 2013-03-30 - 2015-06-01

RUDE/CRUDE

RUDE is a package of applications to generate and measure UDP traffic between two endpoints (hosts). The rude tool generates traffic to the network, which can be received and logged on the other side of the network with crude. The traffic pattern can be defined by the user. This tool is available under http://rude.sourceforge.net/

Example

The following rude script will cause a constant-rate packet stream to be sent to a destination for 4 seconds, starting one second (1000 milliseconds) from the time the script was read, stopping five seconds (5000 milliseconds) after the script was read. The script can be called using rude -s example.rude.

START NOW
1000 0030 ON 3002 10.1.1.1:10001 CONSTANT 200 250
5000 0030 OFF

Explanation of the values: 1000 and 5000 are relative timestamps (in milliseconds) for the actions. 0030 is the identification of a traffic flow. At 1000, a CONSTANT-rate flow is started (ON) with source port 3002, destination address 10.1.1.1 and destination port 10001. CONSTANT 200 250 specifies a fixed rate of 200 packets per second (pps) of 250-byte UDP datagrams.

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- SimonLeinen - 28 Sep 2008

TTCP

TTCP (Test TCP) is a utility for benchmarking UDP and TCP performance. A utility for Unix and Windows is available at http://www.pcausa.com/Utilities/pcattcp.htm/

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005


NDT (Network Diagnostic Tool)

NDT can be used to check the TCP configuration of any host that can run Java applets. The client connects to a Web page containing a special applet. The Web server that serves the applet must run a kernel with the Web100 extensions for TCP measurement instrumentation. The applet performs TCP memory-to-memory tests between the client and the Web100-instrumented server, and then uses the measurements from the Web100 instrumentation to find out about the TCP configuration of the client. NDT also detects common configuration errors such as duplex mismatch.

Besides the applet, there is also a command-line client called web100clt that can be used without a browser. The client works on Linux without any need for Web100 extensions, and can be compiled on other Unix systems as well. It doesn't require a Web server on the server side, just the NDT server web100srv - which requires a Web100-enhanced kernel.

Applicability

Since NDT performs memory-to-memory tests, it avoids end-system bottlenecks such as file system or disk limitations. Therefore it is useful for estimating "pure" TCP throughput limitations for a system, better than, for instance, measuring the throughput of a large file retrieval from a public FTP or WWW server. In addition, NDT servers are supposed to be well-connected and lightly loaded.

When trying to tune a host for maximum throughput, it is a good idea to start testing it against an NDT server that is known to be relatively close to the host in question. Once the throughput with this server is satisfactory, one can try to further tune the host's TCP against a more distant NDT server. Several European NRENs as well as many sites in the U.S. operate NDT servers.

Example

Applet (tcp100bw.html)

This example shows the results (overview) of an NDT measurement between a client in Switzerland (Gigabit Ethernet connection) and an IPv6 NDT server in Ljubljana, Slovenia (Gigabit Ethernet connection).

Screen_shot_2011-07-25_at_18.11.44_.png

Command-line client

The following is an example run of the web100clt command-line client application. In this case, the client runs on a small Linux-based home router connected to a commercial broadband-over-cable Internet connection. The connection is marketed as "5000 kb/s downstream, 500 kb/s upstream", and it can be seen that this nominal bandwidth can in fact be achieved for an individual TCP connection.

$ ./web100clt -n ndt.switch.ch
Testing network path for configuration and performance problems  --  Using IPv6 address
Checking for Middleboxes . . . . . . . . . . . . . . . . . .  Done
checking for firewalls . . . . . . . . . . . . . . . . . . .  Done
running 10s outbound test (client to server) . . . . .  509.00 kb/s
running 10s inbound test (server to client) . . . . . . 5.07 Mb/s
Your host is connected to a Cable/DSL modem
Information [C2S]: Excessive packet queuing detected: 10.18% (local buffers)
Information [S2C]: Excessive packet queuing detected: 73.20% (local buffers)
Server 'ndt.switch.ch' is not behind a firewall. [Connection to the ephemeral port was successful]
Client is probably behind a firewall. [Connection to the ephemeral port failed]
Information: Network Middlebox is modifying MSS variable (changed to 1440)
Server IP addresses are preserved End-to-End
Client IP addresses are preserved End-to-End

Note the remark "-- Using IPv6 address". The example used the new version 3.3.12 of the NDT software, which includes support for both IPv4 and IPv6. Because both the client and the server support IPv6, the test is run over IPv6. To run an IPv4 test in this situation, one could have called the client with the -4 option, i.e. web100clt -4 -n ndt.switch.ch.

Public Servers

There are many public NDT server installations available on the Web. You should choose between the servers according to distance (in terms of RTT) and the throughput range you are interested in. The limits of your local connectivity can best be tested by connecting to a server with a fast(er) connection that is close by. If you want to tune your TCP parameters for "LFNs", use a server that is far away from you, but still reachable over a path without bottlenecks.

Development

Development of NDT is now hosted on Google Code, and as of July 2011 there is some activity. Notably, there is now documentation about the architecture as well as the detailed protocol of NDT. One of the stated goals is to allow outside parties to write compatible NDT clients without consulting the NDT source code.

References

-- SimonLeinen - 30 Jun 2005 - 22 Nov 2011

Active Measurement Tools

Active measurement injects traffic into a network to measure properties about that network.

  • ping - a simple RTT and loss measurement tool using ICMP ECHO messages
  • fping - ping variant supporting concurrent measurement to multiple destinations
  • SmokePing - nice graphical representation of RTT distribution and loss rate from periodic "pings"
  • OWAMP - a protocol and tool for one-way measurements

There are several permanent infrastructures performing active measurements:

  • HADES, the DFN-developed HADES Active Delay Evaluation System, deployed in GEANT/GEANT2 and DFN
  • RIPE TTM
  • QoSMetricsBoxes used in RENATER's Métrologie project

-- SimonLeinen - 29 Mar 2006

ping

Ping sends ICMP echo request messages to a remote host and waits for ICMP echo replies to determine the latency between those hosts. The output shows the Round Trip Time (RTT) between the host machine and the remote target. Ping is often used to determine whether a remote host is reachable. Unfortunately, it is quite common these days for ICMP traffic to be blocked by packet filters / firewalls, so a ping timing out does not necessarily mean that a host is unreachable.

(Another method for checking remote host availability is to telnet to a port that you know to be accessible, such as port 80 (HTTP) or 25 (SMTP). If the connection times out, then the host is probably not reachable, often because of a filter/firewall silently blocking the traffic. When there is no service listening on that port, the attempt usually results in a "Connection refused" message. If a connection is made to the remote host, it can be ended by typing the escape character (usually 'CTRL+]') and then 'quit'.)

The -f flag may be specified to send ping packets as fast as they come back, or 100 times per second, whichever is more frequent. As this option can be very hard on the network, only a super-user (the root account on *NIX machines) is allowed to specify this flag in a ping command.

The -c flag specifies the number of echo request messages sent to the remote host by ping. If this flag isn't used, ping continues to send echo request messages until the user types CTRL+C. If the ping is cancelled after only a few messages have been sent, the RTT summary statistics displayed at the end of the ping output will be based on too few samples to be really accurate. To gain an accurate representation of the RTT, it is recommended to set a count of 100 pings, as in the example below. The MS Windows implementation of ping sends just 4 echo request messages by default.
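
For example (the host name is illustrative):

ping -c 100 www.example.org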

Ping6 is the IPv6 implementation of the ping program. It works in the same way, but sends ICMPv6 Echo Request packets and waits for ICMPv6 Echo Reply packets to determine the RTT between two hosts. There are no discernible differences between the Unix implementations and the MS Windows implementation.

As with traceroute, Solaris integrates IPv4 and IPv6 functionality in a single ping binary, and provides options to select between them (-A <AF>).

See fping for a ping variant that supports concurrent measurements to multiple destinations and is easier to use in scripts.

Implementation-specific notes

-- TobyRodwell - 06 Apr 2005
-- MarcoMarletta - 15 Nov 2006 (Added the difference in size between Juniper and Cisco)

fping

fping is a ping-like program which uses Internet Control Message Protocol (ICMP) echo requests to determine if a host is up. fping is different from ping in that you can specify any number of hosts on the command line, or specify a file containing a list of hosts to ping. Instead of trying one host until it times out or replies, fping will send out a ping packet and move on to the next host in a round-robin fashion. If a host replies, it is noted and removed from the list of hosts to check. If a host does not respond within a certain time limit and/or retry limit, it is considered unreachable.

Unlike ping, fping is meant to be used in scripts and its output is easy to parse. It is often used as a probe for packet loss and round-trip time in SmokePing.

fping version 3

The original fping was written by Roland Schemers in 1992, and stopped being updated in 2002. In December 2011, David Schweikert decided to take up maintainership of fping, and increased the major release number to 3, mainly to reflect the change of maintainer. Changes from earlier versions include:

  • integration of all Debian patches, including IPv6 support (2.4b2-to-ipv6-16)
  • an optimized main loop that is claimed to bring performance close to the theoretical maximum
  • patches to improve SmokePing compatibility

The Web site location has changed to fping.org, and the source is now maintained on GitHub.

Since July 2013, there is also a Mailing List, which will be used to announce new releases.

Examples

Simple example of usage:

# fping -c 3 -s www.man.poznan.pl www.google.pl
www.google.pl     : [0], 96 bytes, 8.81 ms (8.81 avg, 0% loss)
www.man.poznan.pl : [0], 96 bytes, 37.7 ms (37.7 avg, 0% loss)
www.google.pl     : [1], 96 bytes, 8.80 ms (8.80 avg, 0% loss)
www.man.poznan.pl : [1], 96 bytes, 37.5 ms (37.6 avg, 0% loss)
www.google.pl     : [2], 96 bytes, 8.76 ms (8.79 avg, 0% loss)
www.man.poznan.pl : [2], 96 bytes, 37.5 ms (37.6 avg, 0% loss)

www.man.poznan.pl : xmt/rcv/%loss = 3/3/0%, min/avg/max = 37.5/37.6/37.7
www.google.pl     : xmt/rcv/%loss = 3/3/0%, min/avg/max = 8.76/8.79/8.81

       2 targets
       2 alive
       0 unreachable
       0 unknown addresses

       0 timeouts (waiting for response)
       6 ICMP Echos sent
       6 ICMP Echo Replies received
       0 other ICMP received

 8.76 ms (min round trip time)
 23.1 ms (avg round trip time)
 37.7 ms (max round trip time)
        2.039 sec (elapsed real time)

IPv6 Support

Jeroen Massar has added IPv6 support to fping. This has been implemented as a compile-time variant, so that there are separate fping (for IPv4) and fping6 (for IPv6) binaries. The IPv6 patch has been partially integrated into the fping version on www.fping.com as of release "2.4b2_to-ipv6" (thus also integrated in fping 3.0). Unfortunately his modifications to the build routine seem to have been lost in the integration, so that the fping.com version only installs the IPv6 version as fping. Jeroen's original version doesn't have this problem, and can be downloaded from his IPv6 Web page.

ICMP Sequence Number handling

Older versions of fping used the Sequence Number field in ICMP ECHO requests in a peculiar way: they used a different sequence number for each destination host, but the same sequence number for all requests to a given host. There have been reports of specific systems that suppress (or rate-limit) ICMP ECHO requests with repeated sequence numbers, which causes high loss rates to be reported by tools that use fping, such as SmokePing. Another issue is that fping could not distinguish a perfect link from one that drops every other packet and duplicates every other packet.

Newer fping versions such as 3.0 or 2.4.2b2_to (on Debian GNU/Linux) include a change to sequence number handling attributed to Stephan Fuhrmann. These versions increment sequence numbers for every probe sent, which should solve both of these problems.

References

-- BartoszBelter - 2005-07-14 - 2005-07-26
-- SimonLeinen - 2008-05-19 - 2013-07-26


SmokePing

Smokeping is a software-based measurement framework that uses various software modules (called probes) to measure round-trip times, packet loss and availability at layer 3 (IPv4 and IPv6) and even application latencies. Layer 3 measurements are based on the ping tool, and for the analysis of applications there are probes such as those measuring DNS lookup and RADIUS authentication latencies. The measurements are centrally controlled on a single host, from which the software probes are started; these in turn emit active measurement flows. The load impact of these streams on the network is usually negligible.
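
As an illustration, a minimal Targets stanza for the standard FPing probe might look like the sketch below (all names are illustrative and not taken from an actual installation):

*** Targets ***
probe = FPing
menu = Top
title = Latency measurements

+ ExampleSite
menu = Example site
title = Example site

++ www
menu = www
title = Web server
host = www.example.org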

As with MRTG, the results are stored in RRD databases at the original polling intervals for 24 hours and are then aggregated over time; peak and mean values over 24-hour intervals are stored for more than one year. Like MRTG, the results are usually displayed in a web browser, using HTML-embedded graphs over daily, weekly, monthly and yearly timeframes. A particular strength of SmokePing is the graphical manner in which it displays the statistical distribution of latency values over time.

The tool also has an alarm feature that, based on flexible threshold rules, either sends out emails or runs external scripts.

The following picture shows an example of a weekly graph. The background colour indicates the link availability, and the foreground lines display the mean round-trip times. The shadows around the lines graphically indicate the statistical distribution of the measured round-trip times.

  • Example SmokePing graph:
    Example SmokePing graph

The 2.4* versions of Smokeping included SmokeTrace, an AJAX-based traceroute tool similar to mtr, but browser/server based. This was removed in 2.5.0 in favor of a separate, more general, tool called remOcular.

In August 2013, Tobi Oetiker announced that he received funding for the development of SmokePing 3.0. This will use Extopus as a front-end. This new version will be developed in public on Github. Another plan is to move to an event-based design that will make Smokeping more efficient and allow it to scale to a large number of probes.

Related Tools

The OpenNMS network management platform includes a tool called StrafePing which is heavily inspired by SmokePing.

References

-- SimonLeinen - 2006-04-07 - 2013-11-03

OWAMP (One-Way Active Measurement Tool)

OWAMP is a command line client application and a policy daemon used to determine one-way latencies between hosts. It is an implementation of the OWAMP protocol (see references), which has been standardized by the IPPM WG in the IETF as RFC 4656.

With roundtrip-based measurements, it is hard to isolate the direction in which congestion is experienced. One-way measurements solve this problem and make the direction of congestion immediately apparent. Since traffic can be asymmetric at many sites that are primarily producers or consumers of data, this allows for more informative measurements. One-way measurements allow the user to better isolate the effects of specific parts of a network on the treatment of traffic.

The current OWAMP implementation (V3.1) supports IPv4 and IPv6.

Example using owping (OWAMP v3.1) on a Linux host:

welti@atitlan:~$ owping ezmp3
Approximately 13.0 seconds until results available

--- owping statistics from [2001:620:0:114:21b:78ff:fe30:2974]:52887 to [ezmp3]:41530 ---
SID:    feae18e9cd32e3ee5806d0d490df41bd
first:  2009-02-03T16:40:31.678
last:   2009-02-03T16:40:40.894
100 sent, 0 lost (0.000%), 0 duplicates
one-way delay min/median/max = 2.4/2.5/3.39 ms, (err=0.43 ms)
one-way jitter = 0 ms (P95-P50)
TTL not reported
no reordering


--- owping statistics from [ezmp3]:41531 to [2001:620:0:114:21b:78ff:fe30:2974]:42315 ---
SID:    fe302974cd32e3ee7ea1a5004fddc6ff
first:  2009-02-03T16:40:31.479
last:   2009-02-03T16:40:40.685
100 sent, 0 lost (0.000%), 0 duplicates
one-way delay min/median/max = 1.99/2.1/2.06 ms, (err=0.43 ms)
one-way jitter = 0 ms (P95-P50)
TTL not reported
no reordering

References

-- ChrisWelti - 03 Apr 2006 - 03 Feb 2009
-- SimonLeinen - 29 Mar 2006 - 01 Dec 2008

Active Measurement Boxes

-- FrancoisXavierAndreu & SimonLeinen - 06 Jun 2005 - 06 Jul 2005

RIPE TTM Boxes

Summary

The Test Traffic Measurements service has been offered by the RIPE NCC as a service since the year 2000. The system continuously records one-way delay and packet-loss measurements as well as router-level paths ("traceroutes") between a large set of probe machines ("test boxes"). The test boxes are hosted by many different organisations, including NRENs, commercial ISPs, universities and others, and usually maintained by RIPE NCC as part of the service. While the vast majority of the test boxes is in Europe, there are a couple of machines in other continents, including outside the RIPE NCC's service area. Every test box includes a precise time source - typically a GPS receiver - so that accurate one-way delay measurements can be provided. Measurement data are entered in a central database at RIPE NCC's premises every night. The database is based on CERN's ROOT system. Measurement results can be retrieved over the Web using various presentations, both pre-generated and "on-demand".

Applicability

RIPE TTM data is often useful to find out "historical" quality information about a network path of interest, provided that TTM test boxes are deployed near (in the topological sense, not just geographically) the ends of the path. For research network paths throughout Europe, the coverage of the RIPE TTM infrastructure is not complete, but quite comprehensive. When one suspects that there is (non-temporary) congestion on a path covered by the RIPE TTM measurement infrastructure, the TTM graphs can be used to easily verify that, because such congestion will show up as delay and/or loss if present.

The RIPE TTM system can also be used by operators to precisely measure - but only after the fact - the impact of changes in the network. These changes can include disturbances such as network failures or misconfigurations, but also things such as link upgrades or routing improvements.

Notes

The TTM project has provided extremely high-quality and stable delay and loss measurements for a large set of network paths (IPv4 and recently also IPv6) throughout Europe. These paths cover an interesting mix of research and commercial networks. The Web interface to the collected measurements supports in-depth exploration quite well, such as looking at the delay/loss evolution of a specific path over a wide range of intervals, from the very short to the very long. On the other hand, it is hard to get useful "overview" pictures. The RIPE NCC's DNS Server Monitoring (DNSMON) service provides such overviews for specific sets of locations.

The raw data collected by the RIPE TTM infrastructure is not generally available, in part because of restrictions on data disclosure. This is understandable in that RIPE NCC's main member/customer base consists of commercial ISPs, who presumably wouldn't allow full disclosure for competitive reasons. But it also means that in practice, only the RIPE NCC can develop new tools that make use of these data, and their resources for doing so are limited, in part due to lack of support from their ISP membership. On a positive note, the RIPE NCC has, on several occasions, given scientists access to the measurement infrastructure for research. The TTM team also offers to extract raw data (in ROOT format) from the TTM database manually on demand.

Another drawback of RIPE TTM is that the central TTM database is only updated once a day. This makes the system impossible to use for near-real-time diagnostic purposes. (For test box hosts, it is possible to access the data collected by one's hosted test boxes in near real time, but this provides only very limited functionality, because only information about "inbound" packets is available locally, and therefore one can neither correlate delays with path changes, nor can one compute loss figures without access to the sending host's data.) In contrast, RIPE NCC's Routing Information Service (RIS) infrastructure provides several ways to get at the collected data almost as it is collected. And because RIS' raw data is openly available in a well-defined and relatively easy-to-use format, it is possible for third parties to develop innovative and useful ways to look at that data.

Examples

The following screenshot shows an on-demand graph of the delays over a 24-hour period from one test box hosted by SWITCH (tt85 at ETH Zurich) to the other one (tt86 at the University of Geneva). The delay range has been narrowed so that the fine-grained distribution can be seen. Five packets are listed as "overflow" under "STATISTICS - Delay & Hops" because their one-way delays exceeded the range of the graph. This is an example of an almost completely uncongested path, showing no packet loss and the vast majority of delays very close to the "baseline" value.

ripe-ttm-sample-delay-on-demand.png

The following plot shows the impact of a routing change on a path to a test box in New Zealand (tt47). The hop count decreases by one, but the base one-way delay increased by about 20ms. The old and new routes can be retrieved from the Web interface too. Note also that the delay values are spread out over a fairly large range, which indicates that there is some congestion (and thus queueing) on the path. The loss rate over the entire day is 1.1%. It is somewhat interesting how much of this is due to the routing change and how much (if any) is due to "normal" congestion. The lower right graph shows "arrived/lost" packets over time and indicates that most packets were in fact lost during the path change - so although some congestion is visible in the delays, it is not severe enough to cause visible loss. On the other hand, the total number of probe packets sent (about 2850 per day) means that congestion loss would probably go unnoticed until it reaches a level of 0.1% or so.

day.tt85.tt47.gif

References

-- FrancoisXavierAndreu & SimonLeinen - 06 Jun 2005 - 06 Jul 2005

QoSMetrics Boxes

This measurement infrastructure was built as part of RENATER's Métrologie efforts. The infrastructure performs continuous delay measurement between a mesh of measurement points on the RENATER backbone. The measurements are sent to a central server every minute, and are used to produce a delay matrix and individual delay histories (IPv4 and IPv6 measurements, soon for each CoS).

Please note that all results (tables and graphs) presented on this page are generated by RENATER's scripts from the QoSMetrics MySQL database.

Screendump from RENATER site (http://pasillo.renater.fr/metrologie/get_qosmetrics_results.php)

new_Renater_IPPM_results_screendump.jpg

Example of asymmetric path:

(after the problem was resolved, only one path had its hop count decreased)

Graphs:

Paris_Toulouse_delay Toulouse_Paris_delay
Paris_Toulouse_jitter Toulouse_Paris_jitter
Paris_Toulouse_hop_number Toulouse_Paris_hop_number
Paris_Toulouse_pktsLoss Toulouse_Paris_pktsLoss

Traceroute results:

traceroute to 193.51.182.xx (193.51.182.xx), 30 hops max, 38 byte packets
1 209.renater.fr (195.98.238.209) 0.491 ms 0.440 ms 0.492 ms
2 gw1-renater.renater.fr (193.49.159.249) 0.223 ms 0.219 ms 0.252 ms
3 nri-c-g3-0-50.cssi.renater.fr (193.51.182.6) 0.726 ms 0.850 ms 0.988 ms
4 nri-b-g14-0-0-101.cssi.renater.fr (193.51.187.18) 11.478 ms 11.336 ms 11.102 ms
5 orleans-pos2-0.cssi.renater.fr (193.51.179.66) 50.084 ms 196.498 ms 92.930 ms
6 poitiers-pos4-0.cssi.renater.fr (193.51.180.29) 11.471 ms 11.459 ms 11.354 ms
7 bordeaux-pos1-0.cssi.renater.fr (193.51.179.254) 11.729 ms 11.590 ms 11.482 ms
8 toulouse-pos1-0.cssi.renater.fr (193.51.180.14) 17.471 ms 17.463 ms 17.101 ms
9 xx.renater.fr (193.51.182.xx) 17.598 ms 17.555 ms 17.600 ms

[root@CICT root]# traceroute 195.98.238.xx
traceroute to 195.98.238.xx (195.98.238.xx), 30 hops max, 38 byte packets
1 toulouse-g3-0-20.cssi.renater.fr (193.51.182.202) 0.200 ms 0.189 ms 0.111 ms
2 bordeaux-pos2-0.cssi.renater.fr (193.51.180.13) 16.850 ms 16.836 ms 16.850 ms
3 poitiers-pos1-0.cssi.renater.fr (193.51.179.253) 16.728 ms 16.710 ms 16.725 ms
4 nri-a-pos5-0.cssi.renater.fr (193.51.179.17) 22.969 ms 22.956 ms 22.972 ms
5 nri-c-g3-0-0-101.cssi.renater.fr (193.51.187.21) 22.972 ms 22.961 ms 22.844 ms
6 gip-nri-c.cssi.renater.fr (193.51.182.5) 17.603 ms 17.582 ms 17.616 ms
7 250.renater.fr (193.49.159.250) 17.836 ms 17.718 ms 17.719 ms
8 xx.renater.fr (195.98.238.xx) 17.606 ms 17.707 ms 17.226 ms

-- FrancoisXavierAndreu & SimonLeinen - 06 Jun 2005 - 06 Jul 2005 - 07 Apr 2006

Passive Measurement Tools

Network Usage:

  • SNMP-based tools retrieve network information such as state and utilization of links, router CPU loads, etc.: MRTG, Cricket
  • Netflow-based tools use flow-based accounting information from routers for traffic analysis, detection of routing and security problems, denial-of-service attacks etc.

Traffic Capture and Analysis Tools

-- FrancoisXavierAndreu - 06 Jun 2005
-- SimonLeinen - 05 Jan 2006

SNMP-based tools

The Simple Network Management Protocol (SNMP) is widely used for device monitoring over IP, especially for monitoring infrastructure components of those networks, such as routers, switches etc. There are many tools that use SNMP to retrieve network information such as the state and utilization of links, routers' CPU load, etc., and generate various types of visualizations, other reports, and alarms. (A minimal command-line polling sketch is shown after the list below.)

  • MRTG (Multi-Router Traffic Grapher) is a widely-used open-source tool that polls devices every five minutes using SNMP, and plots the results over day, week, month, and year timescales.
  • Cricket serves the same purpose as MRTG, but is configured in a different way - targets are organized in a tree, which allows inheritance of target specifications. Cricket has been using RRDtool from the start, although MRTG has picked that up (as an option) as well.
  • RRDtool is a round-robin database used to store periodic measurements such as SNMP readings. Recent values are stored at finer time granularity than older values, and RRDtool incrementally "consolidates" values as they age, and uses an efficient constant-size representation. RRDtool is agnostic to SNMP, but it is used by Cricket, MRTG (optionally), and many other SNMP tools.
  • Synagon is a Python-based tool that performs SNMP measurements (Juniper firewall counters) at much shorter timescales (seconds) and graphs them in real time. This can be used to look at momentary link utilizations, within the limits of how often devices actually update their SNMP values.
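
The basic polling loop that such tools implement can be sketched with the Net-SNMP and RRDtool command-line utilities (the community string, interface index and file name are illustrative):

# Create a round-robin database for an interface byte counter,
# polled every 300 seconds, keeping 288 five-minute averages (one day)
rrdtool create if1-in.rrd --step 300 DS:inoctets:COUNTER:600:0:U RRA:AVERAGE:0.5:1:288

# One polling step: read the counter via SNMP and store it
VALUE=$(snmpget -v2c -c public -Ovq router.example.org IF-MIB::ifHCInOctets.1)
rrdtool update if1-in.rrd N:$VALUE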

-- FrancoisXavierAndreu - 06 Jun 2005
-- SimonLeinen - 05 Jan 2006 - 16 Jul 2007

Synagon SNMP Graphing Tool

Overview

Synagon is a Python-based program that collects and graphs firewall counter values from Juniper routers. The tool first requires the user to select a router from a list. It will retrieve all the available firewall filters/counters from that router, and the user can select any of these to graph. It was developed by DANTE for internal use, but could be made available on request (on a case-by-case basis), in an entirely unsupported manner. Please contact operations@dante.org.uk for more details.

[The information below is taken from DANTE's internal documentation.]

Installation and Setup procedures

Synagon comes as a single compressed archive that can be copied into a directory and then invoked like any other standard Python script.

It requires the following python libraries to operate:

1) The Python SNMP framework (available at http://pysnmp.sourceforge.net)

The SNMP framework is required to talk SNMP to the routers. For guidance on how to install the package, please see the section below.

You also need to create a network.py file that describes the network topology that the tool will act upon. The configuration file should be located in the same directory as the tool itself. Please see the section below for more information on the format of the file.

  • Example Synagon screen (screenshot)

How to use Synagon

The tool can be invoked by double clicking on the file name (under windows) or by typing the following on the command shell:

python synagon.py

Prior to that, a network.py configuration file must have been created in the directory from which the tool is invoked. It describes the location of the routers and the SNMP community to use when contacting them.

Immediately after, a window with a single menu entry will appear. By clicking on the menu, a list of all routers described in the network.py will be given as submenus.

Under each router, a submenu shows the list of firewall filters for that router (depending on whether you have previously collected the list), together with a 'collect' submenu. The collect submenu initiates the collection of the firewall filters from the router and then updates the list of firewall filters in the submenu where the collect button is located. The list of firewall filters of each router is also stored on the local disk, so the next time the tool is invoked it is not necessary to interrogate the router for the firewall list; it is taken from the file instead. When the firewall filter list has been retrieved, a submenu with all counters and policers for each filter is displayed. A policer is marked with a [p] and a counter with a [c] just after its name.

When a counter or policer of a firewall filter of a router is selected, SNMP polling for that entity begins. After a few seconds, a window appears that plots the requested values.

The graph shows the rate of bytes dropped since monitoring started for a particular counter, or the rate of packets dropped since monitoring started for a particular policer. The save graph button can save the graph to a file in encapsulated PostScript format (.eps).

Because values are retrieved frequently, there are memory consumption and performance concerns. To avoid this, a configurable limit is placed on the number of samples plotted on the graph. The handle at the top of the window limits the number of samples plotted; it has an impact on performance if it is positioned at a high scale (thousands).

Developer’s guide

This section provides information about how the tool internally works.

The tool mainly consists of the felegktis.py file, but it also relies on the external csnmp.py and graph.py libraries. felegktis.py defines two threads of control; one initiates the application and is responsible for collecting the results, the other is responsible for the visual part of the application and the interaction with the user.

The GUI thread updates a shared structure (a dictionary) with contact information about all the firewalls the user has chosen by that time to monitor.

active_firewall_list = {}

The structure holds the following information:

router_name->filter_name->counter_name(collect, instance, type, bytes, epoch)

There is an entry per counter, and that entry records whether the counter should be retrieved (collect), the SNMP instance for that counter (instance), whether it is a policer or a counter (type), the last byte counter value (bytes), and the time the last value was received (epoch).

It operates as follows:

do forever:
   if the user has shut down the GUI thread:
      terminate

   for each router in active_firewall_list:
      for each filter of each router:
         for each counter/policer of each filter:
            retrieve counter value
            calculate change rate 
            pass rate of that counter to the GUI thread

It uses SNMP to request the values from the router.
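
As a rough illustration (not part of the actual tool), the collection loop described above might look like the following Python sketch. The active_firewall_list structure is the one described earlier; the snmp_get() function, the gui_queue (a queue.Queue) and the shutdown event (a threading.Event) are hypothetical placeholders standing in for the real csnmp.py and GUI plumbing.

import time

def collection_loop(active_firewall_list, gui_queue, shutdown, snmp_get):
    # snmp_get(router, instance) is assumed to return the current raw
    # counter value for one SNMP instance on one router.
    while not shutdown.is_set():
        for router, filters in active_firewall_list.items():
            for filter_name, counters in filters.items():
                for counter_name, entry in counters.items():
                    if entry['collect'] is None:
                        continue                      # window closed; skip
                    value = snmp_get(router, entry['instance'])
                    now = time.time()
                    if entry['bytes'] is not None:
                        # rate of change since the previous sample
                        rate = (value - entry['bytes']) / (now - entry['epoch'])
                        key = '%s/%s/%s' % (router, filter_name, counter_name)
                        gui_queue.put((key, rate))
                    entry['bytes'], entry['epoch'] = value, now
        time.sleep(1)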

The GUI thread (MainWindow class) first creates the user interface (based on the Tkinter Tcl/Tk module) and is passed the geant network topology found in the network.py file. It then creates a menu list with all routers found in the topology. Each router entry has a submenu with a 'collect' item. This item is bound to a function; if it is selected, the router is interrogated via SNMP on the Juniper firewall MIB, and the collection of filter names/counter names for that router is retrieved. Because the interrogation may take tens of seconds, the firewall filter list is serialised to the local directory using the cPickle built-in module. The file is given the name router_name.filters (e.g. uk1.filters for router uk1) and is stored in the local directory the tool was invoked from.

The next time the tool is invoked, it checks whether there is a file list for that router and, if so, it deserialises the list of firewall filters from the file and populates the router's submenu. The serialised firewall list can of course be out of date; the 'collect' submenu entry for each router is always there and can be invoked to replace the current list (both in memory and on disk) with the latest details. If the user selects a firewall filter counter or policer to monitor from the submenu list, it is added to the active_firewall_list, and the main thread will pick it up and begin retrieving data for that counter/policer.

The main thread passes the calculated rate to the MainWindow thread via a queue that the main thread populates and the GUI thread reads its data from. Because updates can come from a variety of different firewall filters, the queue is indexed with a text string which is a unique identifier based upon the router name, firewall filter name and counter/policer name. The GUI thread periodically (every 250ms) checks whether there are data in the above-mentioned queue. If the identifier hasn't been seen before, a new window is created where the values are plotted. If the window already exists, it is updated with that value and its contents are redrawn. A window is an instance of the class PlotList.

When a window is closed by the user, the GUI modifies the active_firewall_list entry for that counter/policer by setting the 'collect' attribute to None; the next time the main thread reaches that counter/policer, it will ignore it.

Creating a single synagon package

The tool consists of several files:

  • felegktis.py [main logic]
  • graph.py [graph library]
  • csnmp.py [ SNMP library]

Using the squeeze tool (available at http://www.pythonware.com/products/python/squeeze/index.htm), these files can be combined into a single compressed Python executable archive.

The build_synagon.bat script uses the squeeze tool to archive all of these files into one.

build_synagon.bat:
REM build the sinagon distribution
python squeezeTool.py -1 -o synagon -b felegktis felegktis.py graph.py csnmp.py

The command above builds synagon.py from files felegktis.py, graph.py and csnmp.py.

Appendices

A. network.py

network.py describes the network's topology - its format can be deduced by inspection. Currently the network topology name (graph name) should be set to 'geant' - the script could be adjusted to change this behaviour.

Compulsory router properties:

  • type: the type of the router [juniper, cisco, etc.] (the tool only operates on 'juniper' routers)
  • address: the address at which the router can be contacted
  • community: the SNMP community to authenticate on the router
Example:
# Geant network
geant = graph.Graph()

# Defaults for router
class Router(graph.Vertex):
    """ Common options for geant routers """

    def __init__(self):
        # Initialise base class
        graph.Vertex.__init__(self)
        # Set the default properties for these routers
        self.property['type'] = 'juniper'
        self.property['community'] = 'commpassword'

uk1 = Router()
gr1 = Router()


geant.vertex['uk1'] = uk1
geant.vertex['gr1'] = gr1

uk1.property['address'] = 'uk1.uk.geant.net'
gr1.property['address'] = 'gr1.gr.geant.net'

B. Python Package Installation


Python has built-in support for installing packages.

Packages usually come in source format and are compiled at installation time. Packages can be pure Python source code or have extensions in other languages (C, C++). Packages can also come in binary form.

Many Python packages that have extensions written in another language need to be compiled into a binary package before being distributed to platforms like Windows, because not all Windows machines have the C or C++ compilers that a package may require.

The Python SNMP framework library

The source version can be downloaded from http://pysnmp.sourceforge.net

When the file is uncompressed, the user should invoke the setup.py:

setup.py install

There is also a binary version for Windows, available on the NEP's wiki. It has a simple GUI; just keep pushing the 'next' button until the package is installed.

-- TobyRodwell - 23 Jan 2006

Netflow-based Tools

Netflow: Analysis tools to characterize the traffic in the network, detect routing problems, security problems (Denials of Service), etc.

There are different versions of NDE (NetFlow Data Export): v1 (the first), v5, v7, v8 and the latest version (v9), which is based on templates. This latest version makes it possible to analyse IPv6, MPLS and multicast flows.

  • NDE v9 description (diagram)

References

-- FrancoisXavierAndreu - 2005-06-06 - 2006-04-07
-- SimonLeinen - 2006-01-05 - 2015-07-18


Packet Capture and Analysis Tools

These tools detect protocol problems via the analysis of packets, and assist with troubleshooting.

  • tcpdump: Packet capture tool based on the libpcap library. http://www.tcpdump.org/
  • snoop: tcpdump equivalent in Sun's Solaris operating system
  • Wireshark (formerly called Ethereal): Extended tcpdump with a user-friendly GUI. Its plugin architecture allowed many protocol-specific "dissectors" to be written. Also includes some analysis tools useful for performance debugging.
  • libtrace: A library for trace processing; works with libpcap (tcpdump) and other formats.
  • Netdude: The Network Dump data Displayer and Editor is a framework for inspection, analysis and manipulation of tcpdump trace files.
  • jnettop: Captures traffic coming across the host it is running on and displays streams sorted by the bandwidth they use.
  • tcptrace: a statistics/analysis tool for TCP sessions using tcpdump capture files
  • capturing packets with Endace DAG cards (dagsnap, dagconvert, ...)

General Hints for Taking Packet Traces

Capture enough data - you can always throw away stuff later. Note that tcpdump's default capture length is small (96 bytes), so use -s 0 (or something like -s 1540) if you are interested in payloads. Wireshark and snoop capture entire packets by default. Seemingly unrelated traffic can impact performance. For example, Web pages from foo.example.com may load slowly because of the images from adserver.example.net. In situations of high background traffic, it may however be necessary to filter out unrelated traffic.

It can be extremely useful to collect packet traces from multiple points in the network

  • On the endpoints of the communication
  • Near "suspicious" intermediate points such as firewalls

Synchronized clocks (e.g. by NTP) are very useful for matching traces.

Address-to-name resolution can slow down display and causes additional traffic which confuses the trace. With tcpdump, consider using -n or tracing to file (-w file).

Request remote packet traces

When a packet trace from a remote site is required, this often means having to ask someone at that site to provide it. When requesting such a trace, consider making this as easy as possible for the person having to do it. Try to use a packet tracing tool that is already available - tcpdump for most BSD or Linux systems, snoop for Solaris machines. Windows doesn't come bundled with a packet capturing program, but you can direct the user to Wireshark, which is reasonably easy to install and use under Windows. Try to give clear instructions on how to call the packet capture program. It is usually best to ask the user to capture to a file and then send you the capture file, for example as an e-mail attachment.

References

  • Presentation slides from a short talk about packet capturing techniques given at the network performance section of the January 2006 GEANT2 technical workshop

-- FrancoisXavierAndreu - 06 Jun 2005
-- SimonLeinen - 05 Jan 2006-09 Apr 2006
-- PekkaSavola - 26 Oct 2006

tcpdump

One of the early diagnostic tools for TCP/IP that was written by Van Jacobson, Craig Leres, and Steven McCanne. Tcpdump can be used to capture and decode packets in real time, or to capture packets to files (in "libpcap" format, see below), and analyze (decode) them later.

There are now more elaborate and, in some ways, more user-friendly packet capturing programs, such as Wireshark (formerly called Ethereal), but tcpdump is widely available and widely used, so it is very useful to know how to use it.

Tcpdump/libpcap is still actively being maintained, although not by its original authors.

libpcap

Libpcap is a library that is used by tcpdump, and also names a file format for packet traces. This file format - usually used in files with the extension .pcap - is widely supported by packet capture and analysis tools.

Selected options

Some useful options to tcpdump include:

  • -s snaplen capture snaplen bytes of each frame. By default, tcpdump captures only the first 68 bytes, which is sufficient to capture IP/UDP/TCP/ICMP headers, but usually not payload or higher-level protocols. If you are interested in more than just headers, use -s 0 to capture packets without truncation.
  • -r filename read from a previously created capture file (see -w)
  • -w filename dump to a file instead of analyzing on-the-fly
  • -i interface capture on an interface other than the default (first "up" non-loopback interface). Under Linux, -i any can be used to capture on all interfaces, albeit with some restrictions.
  • -c count stop the capture after count packets
  • -n don't resolve addresses, port numbers, etc. to symbolic names - this avoids additional DNS traffic when analyzing a live capture.
  • -v verbose output
  • -vv more verbose output
  • -vvv even more verbose output

Also, a filter expression (in the pcap filter language) can be appended to the command so as to filter the captured packets. An expression is made up of one or more of "type", "direction" and "protocol".

  • type Can be host, net or port. host is presumed unless otherwise specified
  • dir Can be src, dst, src or dst, or src and dst
  • proto (for protocol) Common values are ether, ip, tcp, udp, arp ... If none is specified then all protocols for which the value is valid are considered.
Example expressions:
  • dst host <address>
  • src host <address>
  • udp dst port <number>
  • host <host> and not port ftp and not port ftp-data

Usage examples

Capture a single (-c 1) udp packet to file test.pcap:

: root@diotima[tmp]; tcpdump -c 1 -w test.pcap udp
tcpdump: listening on bge0, link-type EN10MB (Ethernet), capture size 96 bytes
1 packets captured
3 packets received by filter
0 packets dropped by kernel

This produces a binary file containing the captured packet as well as a small file header and a timestamp:

: root@diotima[tmp]; ls -l test.pcap
-rw-r--r-- 1 root root 114 2006-04-09 18:57 test.pcap
: root@diotima[tmp]; file test.pcap
test.pcap: tcpdump capture file (big-endian) - version 2.4 (Ethernet, capture length 96)

Analyze the contents of the previously created capture file:

: root@diotima[tmp]; tcpdump -r test.pcap
reading from file test.pcap, link-type EN10MB (Ethernet)
18:57:28.732789 2001:630:241:204:211:43ff:fee1:9fe0.32832 > ff3e::beac.10000: UDP, length: 12

Display the same capture file in verbose mode:

: root@diotima[tmp]; tcpdump -v -r test.pcap
reading from file test.pcap, link-type EN10MB (Ethernet)
18:57:28.732789 2001:630:241:204:211:43ff:fee1:9fe0.32832 > ff3e::beac.10000: [udp sum ok] UDP, length: 12 (len 20, hlim 118)

More examples with some advanced tcpdump use cases.

References

-- SimonLeinen - 2006-03-04 - 2016-03-17

Solaris snoop

Sun's Solaris operating system includes a utility called snoop for taking and analyzing packet traces. While snoop is similar in spirit to tcpdump, its options, filter syntax, and stored file formats are all slightly different. The file format is described in the Informational RFC 1761, and supported by some other tools, notably Wireshark.

Selected options

Useful options in connection with snoop include:

  • -i filename read from capture file. Corresponds to tcpdump's -r option.
  • -o filename dump to file. Corresponds to tcpdump's -w option.
  • -d device capture on device rather than on the default (first non-loopback) interface. Corresponds to tcpdump's -i option.
  • -c count stop capture after capturing count packets (same as for tcpdump)
  • -r do not resolve addresses to names. Corresponds to tcpdump's -n option. This is useful for suppressing DNS traffic when looking at a live trace.
  • -V verbose summary mode - intermediate between the (default) summary mode and full verbose mode
  • -v (full) verbose mode

Usage examples

Capture a single (-c 1) udp packet to file test.snoop:

: root@diotima[tmp]; snoop -o test.snoop -c 1 udp
Using device /dev/bge0 (promiscuous mode)
1 1 packets captured

This produces a binary file containing the captured packet as well as a small file header and a timestamp:

: root@diotima[tmp]; ls -l test.snoop
-rw-r--r-- 1 root root 120 2006-04-09 18:27 test.snoop
: root@diotima[tmp]; file test.snoop
test.snoop:     Snoop capture file - version 2

Analyze the contents of the previously created capture file:

: root@diotima[tmp]; snoop -i test.snoop
  1   0.00000 2001:690:1fff:1661::3 -> ff1e::1:f00d:beac UDP D=10000 S=32820 LEN=20

Display the same capture file in "verbose summary" mode:

: root@diotima[tmp]; snoop -V -i test.snoop
________________________________
  1   0.00000 2001:690:1fff:1661::3 -> ff1e::1:f00d:beac ETHER Type=86DD (IPv6), size = 74 bytes
  1   0.00000 2001:690:1fff:1661::3 -> ff1e::1:f00d:beac IPv6  S=2001:690:1fff:1661::3 D=ff1e::1:f00d:beac LEN=20 HOPS=116 CLASS=0x0 FLOW=0x0
  1   0.00000 2001:690:1fff:1661::3 -> ff1e::1:f00d:beac UDP D=10000 S=32820 LEN=20

Finally, in full verbose mode:

: root@diotima[tmp]; snoop -v -i test.snoop
ETHER:  ----- Ether Header -----
ETHER:
ETHER:  Packet 1 arrived at 18:27:16.81233
ETHER:  Packet size = 74 bytes
ETHER:  Destination = 33:33:f0:d:be:ac, (multicast)
ETHER:  Source      = 0:a:f3:32:56:0,
ETHER:  Ethertype = 86DD (IPv6)
ETHER:
IPv6:   ----- IPv6 Header -----
IPv6:
IPv6:   Version = 6
IPv6:   Traffic Class = 0
IPv6:   Flow label = 0x0
IPv6:   Payload length = 20
IPv6:   Next Header = 17 (UDP)
IPv6:   Hop Limit = 116
IPv6:   Source address = 2001:690:1fff:1661::3
IPv6:   Destination address = ff1e::1:f00d:beac
IPv6:
UDP:  ----- UDP Header -----
UDP:
UDP:  Source port = 32820
UDP:  Destination port = 10000
UDP:  Length = 20
UDP:  Checksum = 0635
UDP:

Note: the captured packet is an IPv6 multicast packet from a "beacon" system. Such traffic forms the majority of the background load on our office LAN at the moment.

References

  • RFC 1761, Snoop Version 2 Packet Capture File Format, B. Callaghan, R. Gilligan, February 1995
  • snoop(1) manual page for Solaris 10, April 2004, docs.sun.com

-- SimonLeinen - 09 Apr 2006


libtrace

libtrace is a library for trace processing. It supports multiple input methods, including device capture, raw and gz-compressed traces, and sockets; and multiple input formats, including pcap and DAG.

References

-- SimonLeinen - 04 Mar 2006

Netdude

Netdude (Network Dump data Displayer and Editor) is a framework for inspection, analysis and manipulation of tcpdump (libpcap) trace files.

References

-- SimonLeinen - 04 Mar 2006

jnettop

Captures traffic coming across the host it is running on and displays streams sorted by the bandwidth they use. The result is a listing of communication on the network, grouped by stream, showing transferred bytes and consumed bandwidth.

References

-- FrancoisXavierAndreu - 06 Jun 2005 -- SimonLeinen - 04 Mar 2006

Host and Application Measurement Tools

Network performance is only one part of a distributed system's performance. An equally important part is the performance of the actual application and the hardware it runs on.

  • Web100 - fine-grained instrumentation of TCP implementation, currently available for the Linux kernel.
  • SIFTR - TCP instrumentation similar to Web100, available for *BSD
  • DTrace - Dynamic tracing facility available in Solaris and *BSD
  • NetLogger - a framework for instrumentation and log collection from various components of networked applications

-- TobyRodwell - 22 Mar 2006
-- SimonLeinen - 28 Mar 2006

NetLogger

NetLogger (copyright Lawrence Berkeley National Laboratory) is both a set of tools and a methodology for analysing the performance of a distributed system. The methodology (below) can be implemented separately from the LBL-developed tools. For full information on NetLogger see its website, http://dsd.lbl.gov/NetLogger/

Methodology

From the NetLogger website

The NetLogger methodology is really quite simple. It consists of the following:

  1. All components must be instrumented to produce monitoring events. These components include application software, middleware, operating system, and networks. The more components that are instrumented the better.
  2. All monitoring events must use a common format and common set of attributes. Monitoring events must also all contain a precision timestamp which is in a single timezone (GMT) and globally synchronized via a clock synchronization method such as NTP.
  3. Log all of the following events: Entering and exiting any program or software component, and begin/end of all IO (disk and network).
  4. Collect all log data in a central location
  5. Use event correlation and visualization tools to analyze the monitoring event logs.

Toolkit

From the NetLogger website

The NetLogger Toolkit includes a number of separate components which are designed to help you do distributed debugging and performance analysis. You can use any or all of these components, depending on your needs.

These include:

  • NetLogger message format and data model: A simple, common message format for all monitoring events which includes high-precision timestamps
  • NetLogger client API library: C/C++, Java, PERL, and Python calls that you add to your existing source code to generate monitoring events. The destination and logging level of NetLogger messages are all easily controlled using an environment variable. These libraries are designed to be as lightweight as possible, and to never block or adversely affect application performance.
  • NetLogger visualization tool (nlv): a powerful, customizable X-Windows tool for viewing and analysis of event logs based on time correlated and/or object correlated events.
  • NetLogger host/network monitoring tools: a collection of NetLogger-instrumented host monitoring tools, including tools to interoperate with Ganglia and MonALisa.
  • NetLogger storage and retrieval tools, including a daemon that collects NetLogger events from several places at a single, central host; a forwarding daemon to forward all NetLogger files in a specified directory to a given location; and a NetLogger event archive based on mySQL.

References

-- TobyRodwell - 22 Mar 2006


Web100 Linux Kernel Extensions

The Web100 project was run by PSC (Pittsburgh Supercomputing Center), NCAR and NCSA. It was funded by the US National Science Foundation (NSF) between 2000 and 2003, although development and maintenance work extended well beyond the period of NSF funding. Its thrust was to close the "wizard gap" between what performance should be possible on modern research networks and what most users of these networks actually experience. The project focused on instrumentation for TCP to measure performance and find possible bottlenecks that limit TCP throughput. In addition, it included some work on "auto-tuning" of TCP buffer settings.

Most implementation work was done for Linux, and most of the auto-tuning code is now actually included in the mainline Linux kernel code (as of 2.6.17). The TCP kernel instrumentation is available as a patch from http://www.web100.org/, and usually tracks the latest "official" Linux kernel release pretty closely.

An interesting application of Web100 is NDT, which can be used from any Java-enabled browser to detect bottlenecks in TCP configurations and network paths, as well as duplex mismatches using active TCP tests against a Web100-enabled server.

In September 2010, the NSF agreed to fund a follow-on project called Web10G.

TCP Kernel Information Set (KIS)

A central component of Web100 is a set of "instruments" that permits the monitoring of many statistics about TCP connections (sockets) in the kernel. In the Linux implementation, these instruments are accessible through the proc filesystem.

TCP Extended Statistics MIB (TCP-ESTATS-MIB, RFC 4898)

The TCP-ESTATS-MIB (RFC 4898) includes a similar set of instruments, for access through SNMP. It has been implemented by Microsoft for the Vista operating system and later versions of Windows. In the Windows Server 2008 SDK, a tool called TcpAnalyzer.exe can be used to look at statistics of open TCP connections. IBM is also said to have an implementation of this MIB.

"Userland" Tools

Besides the kernel extension, Web100 comprises a small set of user-level tools which provide access to the TCP KIS. These tools include

  1. a libweb100 library written in C
  2. the command-line tools readvar, deltavar, writevar, and readall
  3. a set of GTK+-based GUI (graphical user interface) tools under the gutil command.

gutil

When started, gutil shows a small main panel with an entry field for specifying a TCP connection, and several graphical buttons for starting different tools on a connection once one has been selected.

gutil: main panel

The TCP connection can be chosen either by explicitly specifying its endpoints, or by selecting from a list of connections (using double-click):

gutil: TCP connection selection dialog

Once a connection of interest has been selected, a number of actions are possible. The "List" action provides a list of all kernel instruments for the connection. The list is updated every second, and "delta" values are displayed for those variables that have changed.

gutil: variable listing

Another action is "Display", which provides a graphical display of a KIS variable. The following screenshot shows the display of the DataBytesIn variable of an SSH connection.

gutil: graphical variable display

Related Work

Microsoft's Windows Software Development Kit for Server 2008 and .NET Version 3.5 contains a tool called TcpAnalyzer.exe, which is similar to gutil and uses Microsoft's RFC 4898 implementation.

The SIFTR module for FreeBSD can be used for similar applications, namely to understand what is happening inside TCP at fine granularity. Sun's DTrace would be an alternative on systems that support it, provided they offer suitable probes for the relevant actions and events within TCP. Both SIFTR and DTrace have very different user interfaces from Web100.

References

-- SimonLeinen - 27 Feb 2006 - 25 Mar 2011
-- ChrisWelti - 12 Jan 2010

End-System (Host) Tuning

This section contains hints for tuning end systems (hosts) for maximum performance.

References

-- SimonLeinen - 31 Oct 2004 - 23 Jul 2009
-- PekkaSavola - 17 Nov 2006

Calculating required TCP buffer sizes

Regardless of operating system, TCP buffers must be appropriately dimensioned if they are to support long distance, high data rate flows. Victor Reijs's TCP Throughput calculator is able to determine the maximum achievable TCP throughput based on RTT, expected packet loss, segment size, re-transmission timeout (RTO) and buffer size. It can be found at:

http://people.heanet.ie/~vreijs/tcpcalculations.htm
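
As a minimal illustration of this calculation (independent of the calculator above, which also takes loss and RTO into account), the bandwidth-delay product gives the smallest buffer that can keep a path full:

def required_tcp_buffer(bandwidth_bits_per_s, rtt_s):
    # Bandwidth-delay product: minimum TCP buffer (in bytes) needed
    # to keep a path of this bandwidth and round-trip time full.
    return bandwidth_bits_per_s / 8.0 * rtt_s

# Example: a 1 Gb/s path with 150 ms RTT needs roughly 19 MB of buffer.
print(required_tcp_buffer(1e9, 0.150))   # 18750000.0 bytes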

Tolerance to Packet Re-ordering

As explained in the section on packet re-ordering, packets can be re-ordered wherever there is parallelism in a network. If such parallelism exists in a network (e.g. as caused by the internal architecture of the Juniper M160s used in the GEANT network) then TCP should be set to tolerate a larger number of re-ordered packets than the default, which is 3 (see here for an example for Linux). For large flows across GEANT a value of 20 should be sufficient - an alternative workaround is to use multiple parallel TCP flows.

Modifying TCP Behaviour

TCP congestion control algorithms are often configured conservatively, to avoid 'slow start' overwhelming a network path. In a capacity-rich network this may not be necessary, and the default behaviour may be overridden.

-- TobyRodwell - 24 Jan 2005

Revision 1 - 2006-04-09 - SimonLeinen

Notes About Performance

This part of the PERT Knowledgebase contains general observations and hints about the performance of networked applications.

The mapping between network performance metrics and user-perceived performance is complex and often non-intuitive.

-- SimonLeinen - 13 Dec 2004 - 07 Apr 2006

User-Perceived Performance

User-perceived performance of network applications is made up of a number of mainly qualitative metrics, some of which are in conflict with each other. For some applications, a single metric will outweigh the others, such as responsiveness for video services or throughput for bulk transfer applications. More commonly, a combination of factors determines the experience of network performance for the end-user.

Responsiveness

One of the most important user experiences in networking applications is the perception of responsiveness. If end-users feel that an application is slow, it is often the case that it is slow to respond to them, rather than being directly related to network speed. This is a particular issue for real-time applications such as audio/video conferencing systems and must be prioritised in applications such as remote medical services and off-campus teaching facilities. It can be difficult to quantitatively define an acceptable figure for response times as the requirements may vary from application to application.

However, some applications have relatively well-defined "physiological" bounds beyond which the responsiveness feeling vanishes. For example, for voice conversations, a (round-trip) delay of 150ms is practically unnoticeable, but an only slightly larger delay is typically felt as very intrusive.

Throughput/Capacity/"Bandwidth"

Throughput per se is not directly perceived by the user, although a lack of throughput will increase waiting times and reduce the impression of responsiveness. However, "bandwidth" is widely used as a "marketing" metric to differentiate "fast" connections from "slow" ones, and many applications display throughput during long data transfers. Therefore users often have specific performance expectations in terms of "bandwidth", and are disappointed when the actual throughput figures they see are significantly lower than the advertised capacity of their network connection.

Reliability

Reliability is often the most important performance criterion for a user: The application must be available when the user requires it. Note that this doesn't necessarily mean that it is always available, although there are some applications - such as the public Web presence of a global corporation - that have "24x7" availability requirements.

The impression of reliability will also be heavily influenced by what happens (or is expected to happen) in cases of unavailability: Does the user have a possibility to build a workaround? In case of provider problems: Does the user have someone competent to call - or can they even be sure that the provider will notice the problem themselves, and fix it in due time? How is the user informed during the outage, in particular concerning the estimated time to repair?

Another aspect of reliability is the predictability of performance. It can be profoundly disturbing to a user to see large performance variations over time, even if the varying performance is still within the required performance range - who can guarantee that variations won't increase beyond the tolerable during some other time when the application is needed? E.g., a 10 Mb/s throughput that remains rock-stable over time can feel more reliable than throughput figures that vary between 200 and 600 Mb/s.

-- SimonLeinen - 07 Apr 2006
-- AlessandraScicchitano - 10 Feb 2012


The "Wizard Gap"

The Wizard Gap is an expression coined by Matt Mathis (then PSC) in 1999. It designates the difference between the performance that is "theoretically" possible on today's high-speed networks (in particular, research networks), and the performance that most users actually perceive. The idea is that today, the "theoretical" performance can only be (approximately) obtained by "wizards" with superior knowledge and skills concerning system tuning. Good examples for "Wizard" communities are the participants in Internet2 Land-Speed Record or SC Bandwidth Challenge competitions.

The Internet2 end-to-end performance initiative strives to reduce the Wizard Gap by user education as well as improved instrumentation (see e.g. Web100) of networking stacks. In the GÉANT community, PERTs focus on assistance to users (case management), as well as user education through resources such as this knowledge base. Commercial players are also contributing to closing the wizard gap, by improving "out-of-the-box" performance of hardware and software, so that their customers can benefit from faster networking.

References

  • M. Mathis, Pushing up Performance for Everyone, Presentation at the NLANR/Internet2 Joint Techs Workshop, 1999 (PPT)

-- SimonLeinen - 2006-02-28 - 2016-04-27

Why Latency Is Important

Traditionally, the metric of focus in networking has been bandwidth. As more and more parts of the Internet have their capacity upgraded, bandwidth is often not the main problem anymore. However, network-induced latency as measured in One-Way Delay (OWD) or Round-Trip Time often has noticeable impact on performance.

It's not just for gamers...

The one group of Internet users today that is most aware of the importance of latency are online gamers. It is intuitively obvious that in real-time multiplayer games over the network, players don't want to be put at a disadvantage because their actions take longer to reach the game server than their opponents'.

However, latency impacts the other users of the Internet as well, probably much more than they are aware of. At a given bottleneck bandwidth, a connection with lower latency will reach its achievable rate faster than one with higher latency. The effect of this should not be underestimated, since most connections on the Internet are short - in particular connections associated with Web browsing.

But even for long connections, latency often has a big impact on performance (throughput), because many of those long connections have their throughput limited by the window size that is available for TCP. When the window size is the bottleneck, throughput is inversely proportional to round-trip time. Furthermore, for a given (small) loss rate, RTT places an upper limit on the achievable throughput of a TCP connection, as shown by the Mathis Equation.
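
As a hedged illustration, the Mathis et al. formula bounds steady-state TCP throughput by (MSS/RTT) * (C/sqrt(p)), where p is the loss rate and C is a constant of order one whose exact value depends on the ACKing and loss-model assumptions. A small Python sketch:

from math import sqrt

def mathis_throughput_limit(mss_bytes, rtt_s, loss_rate, c=1.0):
    # Upper bound on steady-state TCP throughput in bits/s:
    #   rate <= (MSS / RTT) * (C / sqrt(p))
    return (mss_bytes * 8.0 / rtt_s) * (c / sqrt(loss_rate))

# Example: 1460-byte MSS, 150 ms RTT, 0.01% loss -> roughly 7.8 Mb/s
print(mathis_throughput_limit(1460, 0.150, 1e-4))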

But I thought latency was only important for multimedia!?

Common wisdom is that latency (and also jitter) is important for audio/video ("multimedia") applications. This is only partly true: Many applications of audio/video involve "on-demand" unidirectional transmission. In those applications, the real-time concerns can often be mitigated by clever buffering or transmission ahead of time. For conversational audio/video, such as "Voice over IP" or videoconferencing, the latency issue is very real. The principal sources of latency in these applications are not backbone-related, but related to compression/sampling rates (see packetization delay) and to transcoding devices such as H.323 MCUs (Multi-Channel Units).

References

-- SimonLeinen - 13 Dec 2004

TCP (Transmission Control Protocol)

The Transmission Control Protocol (TCP) is the prevalent transport protocol used on the Internet today. It uses window-based transmission to provide the service of a reliable byte stream, and adapts the rate of transfer to the state (of congestion) of the network and the receiver. Basic mechanisms include:

  • Segments that fit into IP packets, into which the byte-stream is split by the sender,
  • a checksum for each segment,
  • a Window, which bounds the amount of data "in flight" between the sender and the receiver,
  • Acknowledgements, by which the receiver tells the sender about segments that were received successfully,
  • a flow-control mechanism, which regulates data rate as a function of network and receiver conditions.

Originally specified in RFC 793 in September 1981, TCP was clarified, refined and extended in many documents. Perhaps most importantly, congestion control was introduced in "TCP Tahoe" in 1988, described in Van Jacobson's 1988 SIGCOMM article on "Congestion Avoidance and Control". It can be said that TCP's congestion control is what keeps the Internet working when links are overloaded. In today's Internet, the enhanced "Reno" variant of congestion control is probably the most widespread.

RFC 7323 (formerly RFC 1323) specifies a set of "TCP Extensions for High Performance", namely the Window Scaling Option, which provides for much larger windows than the original 64K, the Timestamp Option and the PAWS (Protection Against Wrapped Sequence numbers) mechanism. These extensions are supported by most contemporary TCP stacks, although they frequently must be activated explicitly (or implicitly by configuring large TCP windows).

Another widely implemented performance enhancement to TCP are selective acknowledgements (SACK, RFC 2018). In TCP as originally specified, the acknowledgements ("ACKs") sent by a receiver were always "cumulative", that is, they specified the last byte of the part of the stream that was completely received. This is advantageous with large TCP windows, in particular where chances are high that multiple segments within a single window are lost. RFC 2883 describes an extension to SACK which makes TCP more robust in the presence of packet reordering in the network.

In addition, RFC 3449 - TCP Performance Implications of Network Path Asymmetry provides excellent information since a vast majority of Internet connections are asymmetrical.

Further Reading

References

  • RFC 7414, A Roadmap for Transmission Control Protocol (TCP) Specification Documents, M. Duke, R. Braden, W. Eddy, E. Blanton, A. Zimmermann, February 2015
  • RFC 793, Transmission Control Protocol, J. Postel, September 1981
  • draft-ietf-tcpm-rfc793bis-10, Transmission Control Protocol Specification, Wesley M. Eddy, July 2018
  • RFC 7323, TCP Extensions for High Performance, D. Borman, B. Braden, V. Jacobson, B. Scheffenegger, September 2014 (obsoletes RFC 1323 from May 1993)
  • RFC 2018, TCP Selective Acknowledgement Options, M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, October 1996
  • RFC 5681, TCP Congestion Control, M. Allman, V. Paxson, E. Blanton, September 2009
  • RFC 3449, TCP Performance Implications of Network Path Asymmetry, H. Balakrishnan, V. Padmanabhan, G. Fairhurst, M. Sooriyabandara, December 2002

There is ample literature on TCP, in particular research literature on its performance and large-scale behaviour.

Juniper has a nice white paper, Supporting Differentiated Service Classes: TCP Congestion Control Mechanisms (PDF format) explaining TCP's congestion control as well as many of the enhancements proposed over the years.

-- Simon Leinen - 2004-10-27 - 2018-07-02 -- Ulrich Schmid - 2005-05-31

Window-Based Transmission

TCP is a sliding-window protocol. The receiver tells the sender the available buffer space at the receiver (TCP header field "window"). The total window size is the minimum of sender buffer size, advertised receiver window size and congestion window size.

The sender can transmit up to this amount of data before having to wait for a further buffer update from the receiver, and should not have more than this amount of data in transit in the network. The sender must buffer the sent data until it has been ACKed by the receiver, so that the data can be retransmitted if necessary. With each ACK, the acknowledged segments leave the window, and new segments can be sent as long as they fit within the (possibly updated) window.

Due to TCP's flow control mechanism, the TCP window size can limit the maximum theoretical throughput regardless of the bandwidth of the network path. Using too small a TCP window can degrade network performance below what is expected, and too large a window can cause similar problems during error recovery.

The TCP window size is the most important parameter for achieving maximum throughput across high-performance networks. To reach the maximum transfer rate, the TCP window should be no smaller than the bandwidth-delay product.

Window size (bytes) >= Bandwidth (bytes/sec) x Round-trip time (sec)

Example:

window size: 8192 bytes
round-trip time: 100ms
maximum throughput: < 0.62 Mbit/sec.
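
The relationship used in this example can be written as a small sketch (the original figure of 0.62 Mbit/sec corresponds to counting 1 Mbit as 2^20 bits):

def max_tcp_throughput(window_bytes, rtt_s):
    # Window-limited TCP throughput in bits/s: at most one full window
    # of data can be delivered per round-trip time.
    return window_bytes * 8.0 / rtt_s

# 8192-byte window over a 100 ms RTT path: about 0.66 Mb/s (0.62 "binary" Mb/s)
print(max_tcp_throughput(8192, 0.100))   # 655360.0 bits/s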

References

  • Calculating TCP goodput, JavaScript-based on-line calculator by V. Reijs, https://webmail.heanet.ie/~vreijs/tcpcalculations.htm

-- UlrichSchmid & SimonLeinen - 31 May-07 Jun 2005

Large TCP Windows

In order to achieve high data rates with TCP over "long fat networks", i.e. network paths with a large bandwidth-delay product, TCP sinks (that is, hosts receiving data transported by TCP) must advertise a large TCP receive window (referred to as just 'the window', since there is not an equivalent advertised 'send window').

The window is a 16-bit value (bytes 15 and 16 in the TCP header) and so, in TCP as originally specified, it is limited to a value of 65535 (64K). The receive window sets an upper limit on the sustained throughput achievable over a TCP connection, since it represents the maximum amount of unacknowledged data (in bytes) there can be on the TCP path. Mathematically, achievable throughput can never be more than WINDOW_SIZE/RTT, so for a trans-Atlantic link, with say an RTT (Round-Trip Time) of 150ms, the throughput is limited to a maximum of approximately 3.5 Mb/s. With the emergence of "long fat networks", the limit of 64K bytes (on some systems even just 32K bytes!) was clearly insufficient and so RFC 7323 laid down (amongst other things) a way of scaling the advertised window, such that the 16-bit window value can represent numbers larger than 64K.

RFC 7323 extensions

RFC 7323, TCP Extensions for High Performance, (and formerly RFC 1323) defines several mechanisms to enable high-speed transfers over LFNs: Window Scaling, TCP Timestamps, and Protection Against Wrapped Sequence numbers (PAWS).

The TCP window scaling option increases the maximum window size from 64KB to 1GByte, by shifting the window field left by up to 14 bits. The window scale option is used only during the TCP 3-way handshake (both sides must set the window scale option in their SYN segments if scaling is to be used in either direction).

It is important to use the TCP timestamps option with large TCP windows. With the TCP timestamps option, each segment contains a timestamp. The receiver returns that timestamp in each ACK, which allows the sender to estimate the RTT. The timestamps option also solves the problem of wrapped sequence numbers (PAWS - Protection Against Wrapped Sequences), which can occur with large windows.

(Auto) Configuration

In the past, most operating systems required manual tuning to use large TCP windows. The OS-specific tuning section contains information on how to do this for a number of operating systems.

Since around 2008-2010, many popular operating systems will use large windows and the necessary protocol options by default, thanks to TCP Buffer Auto-Tuning.

Can TCP Windows ever be too large?

There are several potential issues when TCP Windows are larger than necessary:

  1. When there are many active TCP connection endpoints (sockets) on a system - such as a popular Web or file server - then a large TCP window size will lead to high consumption of system (kernel) memory. This can have a number of negative consequences: The system may run out of buffer space so that no new connections can be opened, or the high occupation of kernel memory (which typically must reside in actual RAM and cannot be "paged out" to disk) can "starve" other processes of access to fast memory (cache and RAM)
  2. Large windows can cause large "bursts" of consecutive segments/packets. When there is a bottleneck in the path - because of a slower link or because of cross-traffic - these bursts will fill up buffers in the network device (router or switch) in front of that bottleneck. The larger these bursts, the higher are the risks that this buffer overflows and causes multiple segments to be dropped. So a large window can lead to "sawtooth" behavior and worse link utilisation than with a just-big-enough window where TCP could operate at a steady rate.

Several methods for automatic TCP buffer tuning have been developed to resolve these issues, some of which have been implemented in (at least) recent Linux versions.

References

-- SimonLeinen - 27 Oct 2004 - 27 Sep 2014

TCP Acknowledgements

In TCP's sliding-window scheme, the receiver acknowledges the data it receives, so that the sender can advance the window and send new data. As originally specified, TCP's acknowledgements ("ACKs") are cumulative: the receiver tells the sender how much consecutive data it has received. More recently, selective acknowledgements were introduced to allow more fine-grained acknowledgements of received data.

Delayed Acknowledgements and "ack every other"

RFC 813 first suggested a delayed acknowledgement (Delayed ACK) strategy, where a receiver doesn't always immediately acknowledge segments as it receives them. This recommendation was carried forward and specified in more detail in RFC 1122 and RFC 5681 (formerly known as RFC 2581). RFC 5681 mandates that an acknowledgement be sent for at least every other full-size segment, and that no more than 500ms elapse before any segment is acknowledged.

The resulting behavior is that, for longer transfers, acknowledgements are only sent for every two segments received ("ack every other"). This is in order to reduce the amount of reverse flow traffic (and in particular the number of small packets). For transactional (request/response) traffic, the delayed acknowledgement strategy often makes it possible to "piggy-back" acknowledgements on response data segments.

A TCP segment which is only an acknowledgement i.e. has no payload, is termed a pure ack.

Delayed Acknowledgments should be taken into account when doing RTT estimation. As an illustration, see this change note for the Linux kernel from Gavin McCullagh.

Critique on Delayed ACKs

John Nagle nicely explains problems with current implementations of delayed ACK in a comment to a thread on Hacker News:

Here's how to think about that. A delayed ACK is a bet. You're betting that there will be a reply, upon which an ACK can be piggybacked, before the fixed timer runs out. If the fixed timer runs out, and you have to send an ACK as a separate message, you lost the bet. Current TCP implementations will happily lose that bet forever without turning off the ACK delay. That's just wrong.

The right answer is to track wins and losses on delayed and non-delayed ACKs. Don't turn on ACK delay unless you're sending a lot of non-delayed ACKs closely followed by packets on which the ACK could have been piggybacked. Turn it off when a delayed ACK has to be sent.

Duplicate Acknowledgements

A duplicate acknowledgement (DUPACK) is one with the same acknowledgement number as its predecessor - it signifies that the TCP receiver has received a segment newer than the one it was expecting i.e. it has missed a segment. The missed segment might not be lost, it might just be re-ordered. For this reason the TCP sender will not assume data loss on the first DUPACK but (by default) on the third DUPACK, when it will, as per RFC 5681, perform a "Fast Retransmit", by sending the segment again without waiting for a timeout. DUPACKs are never delayed; they are sent immediately the TCP receiver detects an out-of-order segment.

"Quickack mode" in Linux

The TCP implementation in Linux has a special receive-side feature that temporarily disables delayed acknowledgements/ack-every-other when the receiver thinks that the sender is in slow-start mode. This speeds up the slow-start phase when RTT is high. This "quickack mode" can also be explicitly enabled or disabled with setsockopt() using the TCP_QUICKACK option. The Linux modification has been criticized because it makes Linux' slow-start more aggressive than other TCPs' (that follow the SHOULDs in RFC 5681), without sufficient validation of its effects.
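
As an illustrative sketch (Linux-specific, and only where the Python socket module exposes the constant), an application can toggle quickack mode itself; note that the kernel may clear the flag again as the connection progresses, so it typically has to be re-applied around the receive calls where it matters:

import socket

sock = socket.create_connection(("example.org", 80))   # placeholder host
if hasattr(socket, "TCP_QUICKACK"):                     # Linux only
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)  # enable
    # ... exchange data ...
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 0)  # disable
sock.close()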

"Stretch ACKs"

Techniques such as LRO and GRO (and to some level, Delayed ACKs as well as ACKs that are lost or delayed on the path) can cause ACKs to cover many more than the two segments suggested by the historic TCP standards. Although ACKs in TCP have always been defined as "cumulative" (with the exception of SACKs), some congestion control algorithms have trouble with ACKs that are "stretched" in this way. In January 2015, a patch set was submitted to the Linux kernel network development with the goal to improve the behavior of the Reno and CUBIC congestion control algorithms when faced with stretch ACKs.

References

  • TCP Congestion Control, RFC 5681, M. Allman, V. Paxson, E. Blanton, September 2009
  • RFC 813, Window and Acknowledgement Strategy in TCP, D. Clark, July 1982
  • RFC 1122, Requirements for Internet Hosts -- Communication Layers, R. Braden (Ed.), October 1989

-- TobyRodwell - 2005-04-05
-- SimonLeinen - 2007-01-07 - 2015-01-28

TCP Window Scaling Option

TCP as originally specified only allows for windows up to 65536 bytes, because the window-related protocol fields all have a width of 16 bits. This prevents the use of large TCP windows, which are necessary to achieve high transmission rates over paths with a large bandwidth*delay product.

The Window Scaling Option is one of several options defined in RFC 7323, "TCP Extensions for High Performance". It allows TCP to advertise and use a window larger than 65536 bytes. The way this works is that in the initial handshake, a TCP speaker announces a scaling factor (a power of two between 2^0, i.e. no scaling, and 2^14), which allows for an effective window of up to 2^30 bytes (one Gigabyte). Window Scaling only comes into effect when both ends of the connection advertise the option (even if just with a scaling factor of 2^0).
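
As a small sketch of the arithmetic (not a protocol implementation), the shift count needed for a desired window, and the effective window that results from a given 16-bit header value, can be computed as follows:

def window_scale_shift(desired_window_bytes):
    # Smallest shift count (0..14) whose scaled 16-bit window field
    # can represent the desired window (RFC 7323 Window Scale option).
    shift = 0
    while (65535 << shift) < desired_window_bytes and shift < 14:
        shift += 1
    return shift

def effective_window(header_window_field, shift):
    # Window actually offered: the 16-bit header value shifted left.
    return header_window_field << shift

print(window_scale_shift(10 * 1024 * 1024))   # a 10 MB window needs shift 8
print(effective_window(65535, 14))            # maximum: 1073725440 bytes (~1 GB)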

As explained under the large TCP windows topic, Window Scaling is typically used in connection with the TCP Timestamp option and the PAWS (Protection Against Wrapped Sequence numbers), both defined in RFC 7323.

32K limitations on some systems when Window Scaling is not used

Some systems, notably Linux, limit the usable window in the absence of Window Scaling to 32 Kilobytes. The reasoning behind this is that some (other) systems used in the past were broken by larger windows, because they erroneously interpreted the 16-bit values related to TCP windows as signed, rather than unsigned, integers.

This limitation can have the effect that users (or applications) that request window sizes between 32K and 64K, but don't have Window Scaling enabled, do not actually benefit from the desired window sizes.

The artificial limitation was finally removed from the Linux kernel sources on March 21, 2006. The old behavior (window limited to 32KB when Window Scaling isn't used) can still be enabled through a sysctl tuning parameter, in case one really wants to interoperate with TCP implementations that are broken in the way described above.

Problems with window scaling

Middlebox issues

MiddleBoxes which do not understand window scaling may cause very poor performance, as described in WindowScalingProblems. Windows Vista limits the scaling factor for HTTP transactions to 2 to avoid some of the problems, see WindowsOSSpecific.

Diagnostic issues

The Window Scaling option has an impact on how the offered window in TCP ACKs should be interpreted. This can cause problems when one is faced with incomplete packet traces that lack the initial SYN/SYN+ACK handshake where the options are negotiated: The SEQ/ACK sequence may look incorrect because scaling cannot be taken into account. One effect of this is that tcpgraph or similar analysis tools (included in tools such as WireShark) may produce incoherent results.

References

  • RFC 7323, TCP Extensions for High Performance, D. Borman, B. Braden, V. Jacobson, B. Scheffenegger, September 2014 (obsoletes RFC 1323 from May 1993)
  • git note about the Linux kernel modification that got rid of the 32K limit, R. Jones and D.S. Miller, March 2006. Can also be retrieved as a patch.

-- SimonLeinen - 21 Mar 2006 - 27 September 2014
-- PekkaSavola - 07 Nov 2006
-- AlexGall - 31 Aug 2007 (added reference to scaling limitiation in Windows Vista)

TCP Flow Control

Note: This topic describes the Reno enhancement of classical "Van Jacobson" or Tahoe congestion control. There have been many suggestions for improving this mechanism - see the topic on high-speed TCP variants.

TCP flow control and window size adjustment are mainly based on two key mechanisms: Slow Start and Additive Increase/Multiplicative Decrease (AIMD), also known as Congestion Avoidance (RFC 793 and RFC 5681).

Slow Start

To prevent a starting TCP connection from flooding the network, a Slow Start mechanism was introduced in TCP. This mechanism effectively probes to find the available bandwidth.

In addition to the window advertised by the receiver, a Congestion Window (cwnd) value is used and the effective window size is the lesser of the two. The starting value of the cwnd window is set initially to a value that has been evolving over the years, the TCP Initial Window. After each acknowledgment, the cwnd window is increased by one MSS. By this algorithm, the data rate of the sender doubles each round-trip time (RTT) interval (actually, taking into account Delayed ACKs, rate increases by 50% every RTT). For a properly implemented version of TCP this increase continues until:

  • the advertised window size is reached
  • congestion (packet loss) is detected on the connection.
  • there is no traffic waiting to take advantage of an increased window (i.e. cwnd should only grow if it needs to)

When congestion is detected, the TCP flow-control mode is changed from Slow Start to Congestion Avoidance. Note that some TCP implementations maintain cwnd in units of bytes, while others use units of full-sized segments.

Congestion Avoidance

Once congestion is detected (through timeout and/or duplicate ACKs), the data rate is reduced in order to let the network recover.

Slow Start uses an exponential increase in window size and thus also in data rate. Congestion Avoidance uses a linear growth function (additive increase). This is achieved by introducing - in addition to the cwnd window - a slow start threshold (ssthresh).

As long as cwnd is less than ssthresh, Slow Start applies. Once ssthresh is reached, cwnd is increased by at most one segment per RTT. The cwnd window continues to open with this linear rate until a congestion event is detected.

When congestion is detected, ssthresh is set to half the cwnd (or, to be strictly accurate, half the "Flight Size"; this distinction is important if the implementation lets cwnd grow beyond rwnd, the receiver's declared window). cwnd is either set to one segment if congestion was signalled by a timeout, forcing the sender to enter Slow Start, or to ssthresh if congestion was signalled by duplicate ACKs and the Fast Recovery algorithm has terminated. In either case, once the sender enters Congestion Avoidance, its rate has been reduced to half the value at the time of congestion. This multiplicative decrease causes the cwnd to close exponentially with each detected loss event.
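
To make the interplay of the two phases concrete, here is a minimal, illustrative Python sketch (cwnd and ssthresh counted in full-sized segments; a teaching aid under these simplifying assumptions, not a copy of any real stack):

class RenoState:
    """Toy model of a Reno-style sender; all window values in full-sized segments."""
    def __init__(self, initial_window=3, initial_ssthresh=64):
        self.cwnd = float(initial_window)       # congestion window (placeholder initial value)
        self.ssthresh = float(initial_ssthresh) # slow start threshold

    def on_ack(self):
        if self.cwnd < self.ssthresh:
            self.cwnd += 1                      # Slow Start: one segment per ACK (exponential per RTT)
        else:
            self.cwnd += 1.0 / self.cwnd        # Congestion Avoidance: about one segment per RTT

    def on_timeout(self, flight_size):
        self.ssthresh = max(flight_size / 2.0, 2)  # multiplicative decrease: half the flight size
        self.cwnd = 1                              # timeout: back to Slow Start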

Fast Retransmit

In Fast Retransmit, the arrival of three duplicate ACKs is interpreted as packet loss, and retransmission starts before the retransmission timer (RTO) expires.

The missing segment will be retransmitted immediately without going through the normal retransmission queue processing. This improves performance by eliminating delays that would suspend effective data flow on the link.

Fast Recovery

Fast Recovery is used to react quickly to a single packet loss. In Fast Recovery, the receipt of 3 duplicate ACKs, while being taken to mean a loss of a segment, does not result in a full Slow Start. This is because later segments obviously got through, and hence congestion is not stopping everything. In Fast Recovery, ssthresh is set to half of the current send window size, the missing segment is retransmitted (Fast Retransmit) and cwnd is set to ssthresh plus three segments. Each additional duplicate ACK indicates that one segment has left the network at the receiver, and cwnd is increased by one segment to allow the transmission of another segment if permitted by the new cwnd. When an ACK is received for new data, cwnd is reset to ssthresh, and TCP enters Congestion Avoidance mode.
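
As an illustration of the steps just described, here is a simplified Python sketch (window values in segments; it is an illustration only, and real stacks track much more state, e.g. which segments the SACK blocks cover):

from dataclasses import dataclass

@dataclass
class FlowState:
    cwnd: float
    ssthresh: float

def retransmit_missing_segment():
    pass   # placeholder: resend the segment the duplicate ACKs point at

def on_triple_dupack(s: FlowState, flight_size: float):
    s.ssthresh = max(flight_size / 2, 2)   # halve on loss signalled by duplicate ACKs
    retransmit_missing_segment()           # Fast Retransmit
    s.cwnd = s.ssthresh + 3                # inflate by the three segments known to have left the network

def on_additional_dupack(s: FlowState):
    s.cwnd += 1                            # each further duplicate ACK means another segment left the network

def on_new_ack(s: FlowState):
    s.cwnd = s.ssthresh                    # deflate and continue in Congestion Avoidance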

References

-- UlrichSchmid - 07 Jun 2005 -- SimonLeinen - 27 Jan 2006 - 23 Jun 2011

High-Speed TCP Variants

There have been numerous ideas for improving TCP over the years. Some of those ideas have been adopted by mainstream implementations (after thorough review). Recently there has been an increase in work towards improving TCP's behavior with LongFatNetworks. It has been shown that the current congestion control algorithms limit the efficiency of network resource utilization. The various new TCP variants introduce changes in how the size of the congestion window is adjusted. Some of them require extra feedback from the network. Generally, such congestion control protocols can be divided into several classes.

An orthogonal technique for improving TCP's performance is automatic buffer tuning.

Comparative Studies

All papers about individual TCP-improvement proposals contain comparisons against older TCP to quantify the improvement. There are several studies that compare the performance of the various new TCP variants, including

Other References

  • Gigabit TCP, G. Huston, The Internet Protocol Journal, Vol. 9, No. 2, June 2006. Contains useful descriptions of many modern TCP enhancements.
  • Faster, G. Huston, June 2005. This article from ISP Column looks at various approaches to TCP transfers at very high rates.
  • Congestion Control in the RFC Series, M. Welzl, W. Eddy, July 2006, Internet-Draft (work in progress)
  • ICCRG Wiki, IRTF (Internet Research Task Force) ICCRG (Internet Congestion Control Research Group), includes bibliography on congestion control.
  • RFC 6077: Open Research Issues in Internet Congestion Control, D. Papadimitriou (Ed.), M. Welzl, M. Scharf, B. Briscoe, February 2011

Related Work

  • PFLDnet (International Workshop on Protocols for Fast Long-Distance Networks): 2003, 2004, 2005, 2006, 2007, 2009, 2010. (The 2008 pages seem to have been removed from the Net.)

-- Simon Leinen - 2005-05-28–2016-09-18 -- Orla Mc Gann - 2005-10-10 -- Chris Welti - 2008-02-25

HighSpeed TCP

HighSpeed TCP is a modification of the TCP congestion control mechanism, proposed by Sally Floyd of ICIR (The ICSI Center for Internet Research). The main problem with standard TCP is that it takes a very long time to recover fully from packet loss on high-bandwidth, long-distance connections. Like the other proposals, HighSpeed TCP modifies the congestion control mechanism for TCP connections with large congestion windows. For current standard TCP with a steady-state packet loss rate p, the average congestion window is 1.2/sqrt(p) segments. This places a serious constraint on achievable windows in realistic environments.

Achieving an average TCP throughput of B bps requires a loss event at most every BR/(12D) round-trip times, and an average congestion window W of BR/(8D) segments (R is the round-trip time and D the segment size in bytes). For a round-trip time R = 0.1 seconds and 1500-byte segments, a throughput of 10 Gbps would require an average congestion window of 83,000 segments, and a packet drop rate of at most about 2·10^-10, i.e. at most one congestion event every 5,000,000,000 packets, or equivalently at most one congestion event every 1 2/3 hours. This is an unrealistic constraint and has a huge impact on TCP connections with larger congestion windows. So what is needed is a way to achieve high per-connection throughput without requiring unrealistically low packet loss rates.
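
These numbers can be verified with a few lines of Python that simply reproduce the arithmetic of the paragraph above:

# Standard TCP constraint for 10 Gb/s, R = 0.1 s RTT, 1500-byte segments.
B, R, D_bits = 10e9, 0.1, 1500 * 8       # target rate (bit/s), RTT (s), segment size (bits)
W = B * R / D_bits                       # average window in segments: ~83,000
p = (1.2 / W) ** 2                       # loss rate from w = 1.2/sqrt(p): ~2e-10
packets_between_losses = 1 / p           # ~5e9 packets between congestion events
hours = packets_between_losses * D_bits / B / 3600   # ~1.6 hours between congestion events
print(round(W), p, round(packets_between_losses), round(hours, 2))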

HighSpeed TCP proposes changing the TCP response function w = 1.2/sqrt(p) to achieve high throughput with more realistic requirements on the steady-state packet drop rate. The modified HighSpeed response function uses three parameters: Low_Window, High_Window, and High_P. It behaves like the Standard TCP response function when the current congestion window is at most Low_Window, and switches to the HighSpeed response function when the current congestion window is greater than Low_Window. For an average congestion window w greater than Low_Window, the response function is

HSTCP_w.png

where,

  • Low_P is the packet drop rate corresponding to Low_Window, using the Standard TCP response function,
  • S is a constant as follows

HSTCP_S.png

The window needed to sustain 10 Gbps throughput over a HighSpeed TCP connection is 83000 segments, so High_Window can be set to this value. The packet drop rate needed in the HighSpeed response function to achieve an average congestion window of 83000 segments is 10^-7. For information on how to set the remaining parameters, see Sally Floyd's document HighSpeed TCP for Large Congestion Windows. With appropriately calibrated values, much larger congestion windows can be sustained with far fewer round trips between congestion events, and thus greater throughput can be achieved.
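
For illustration, the shape of the HighSpeed response function can be sketched in a few lines of Python. The parameter values below (Low_Window = 38, High_Window = 83000, High_P = 10^-7) are the defaults proposed in the HighSpeed TCP specification and are used here as assumptions; the function interpolates log-linearly between the Standard TCP and HighSpeed operating points:

import math

# Assumed defaults from the HighSpeed TCP proposal; treat as illustrative only.
LOW_WINDOW, HIGH_WINDOW, HIGH_P = 38, 83000, 1e-7
LOW_P = (1.2 / LOW_WINDOW) ** 2           # ~1e-3, from the Standard TCP response function

def hstcp_window(p: float) -> float:
    """Average congestion window (segments) for steady-state loss rate p."""
    if p >= LOW_P:                         # at or below Low_Window: behave like Standard TCP
        return 1.2 / math.sqrt(p)
    S = (math.log(HIGH_WINDOW) - math.log(LOW_WINDOW)) / (math.log(HIGH_P) - math.log(LOW_P))
    return LOW_WINDOW * (p / LOW_P) ** S   # log-linear interpolation between the two points

print(round(hstcp_window(1e-3)))   # ~38 segments, same as Standard TCP
print(round(hstcp_window(1e-7)))   # ~83000 segments at a far more realistic loss rate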

HighSpeed TCP also modifies the increase parameter a(w) and the decrease parameter b(w) of AIMD (Additive Increase, Multiplicative Decrease). For Standard TCP, a(w) = 1 and b(w) = 1/2, regardless of the value of w. HighSpeed TCP uses the same values of a(w) and b(w) for w lower than Low_Window. The parameters a(w) and b(w) for HighSpeed TCP are specified below.

HSTCP_aw.png

HSTCP_bw.png

High_Decrease = 0.1 means a decrease of 10% of the congestion window when a congestion event occurs.

For example, for w = 83000, parameters a(w) and b(w) for Standard TCP and HSTCP are specified in table below.

        TCP     HSTCP
a(w)    1       72
b(w)    0.5     0.1

Summary

Based on the results of HighSpeed TCP and Standard TCP connections sharing bandwidth, the following conclusions can be drawn. HSTCP flows share the available bandwidth fairly among themselves, although HSTCP may take longer to converge than Standard TCP. The great disadvantage is that a HighSpeed TCP flow can starve a Standard TCP flow even at relatively low/moderate bandwidths. The author of the HSTCP specification argues that few TCP connections effectively operate in this regime (with large congestion windows) today, and that therefore the benefits of the HighSpeed response function outweigh the unfairness experienced by Standard TCP in this regime. Another benefit of HSTCP is that it is easy to deploy and needs no router support.

Implementations

HS-TCP is included in Linux as a selectable option in the modular TCP congestion control framework. A few problems with the implementation in Linux 2.6.16 were found by D.X. Wei (see references). These problems were fixed as of Linux 2.6.19.2.

A proposed OpenSolaris project foresees the implementation of HS-TCP and several other congestion control algorithms for OpenSolaris.

References

-- WojtekSronek - 13 Jul 2005
-- HankNussbacher - 06 Oct 2005
-- SimonLeinen - 03 Dec 2006 - 17 Nov 2009

Hamilton TCP

Hamilton TCP (H-TCP), developed at the Hamilton Institute, is an enhancement to the existing TCP congestion control algorithm, with a sound mathematical and empirical basis. The authors' design goal is that H-TCP should behave like regular TCP in standard LAN and WAN networks where RTT and link capacity are not extremely high, but on fast long-distance networks H-TCP is far more aggressive and more flexible in adjusting to the available bandwidth while avoiding packet loss. Two modes are therefore used for congestion control, and mode selection depends on the time since the last congestion event. In the regular TCP window control algorithm, the window is increased by adding a constant value α and decreased by multiplying by β; the default values are 1 and 0.5, respectively. In H-TCP, window adjustment is a bit more complex, because both factors are not constants but are calculated during transmission.

The parameter values are estimated as follows. For each acknowledgment set:

HTCP_alpha.png

and then

HTCP_alpha2.png

On each congestion event set:

HTCP_beta.png

Where,

  • Δ_i is the time that elapsed since the last congestion experienced by i'th source,
  • Δ_L is the threshold for switching between fast and slow modes,
  • B_i(k) is the throughput of flow i immediately before the k'th congestion event,
  • RTT_max,i and RTT_min,i are maximum and minimum round trip time values for the i'th flow.

The α value depends on the time between experienced congestion events, while β depends on the RTT and the achieved bandwidth. The constants, as well as the Δ_L value, can be modified in order to adjust the algorithm's behavior to user expectations, but the default values were determined empirically and seem to be the most suitable in most cases. The figures below present the behavior of the congestion window, throughput and RTT during tests performed by the H-TCP authors. In the first case the β value was fixed at 0.5; in the second case the adaptive approach (explained with the equations above) was used. The throughput in the second case is far more stable and approaches its maximal value.
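
Since the equations above are only shown as images, the following Python sketch reconstructs the commonly cited form of the H-TCP rules. The constants used here (Δ_L = 1 s, the quadratic growth coefficients, the 0.5-0.8 clamp on β, and the 20% throughput-change test) are reconstructed from the published algorithm description and should be verified against the H-TCP paper rather than taken from this document:

def htcp_alpha(delta, delta_l=1.0, beta=0.5):
    """Increase factor as a function of the time (s) since the last congestion event."""
    if delta <= delta_l:
        return 1.0                                  # low-speed mode: behave like standard TCP
    d = delta - delta_l
    alpha = 1 + 10 * d + (d / 2) ** 2               # quadratic growth in high-speed mode (assumed form)
    return 2 * (1 - beta) * alpha                   # scaling to decouple alpha from the backoff factor

def htcp_beta(rtt_min, rtt_max, b_prev, b_now, delta_b=0.2):
    """Backoff factor on a congestion event, adapted to RTT and throughput history."""
    if b_prev and abs(b_now - b_prev) / b_prev > delta_b:
        return 0.5                                  # throughput changed a lot: fall back to 0.5
    return min(max(rtt_min / rtt_max, 0.5), 0.8)    # otherwise track RTTmin/RTTmax, clamped (assumed bounds)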

HTCP_figure1.jpg

HTCP_figure2.jpg

H-TCP congestion window and throughput behavior (source: "H-TCP: TCP for high-speed and long-distance networks", D. Leith, R. Shorten).

Summary

Another advantage of H-TCP is fairness and friendliness, which means that the algorithm is not "greedy" and will not consume the whole link capacity, but is able to share the bandwidth fairly with other H-TCP transmissions as well as with any other TCP-like transmissions. H-TCP's congestion window control improves the dynamics of the sending rate and therefore provides better overall throughput than the regular TCP implementation. More information, including papers, test results and algorithm implementations for Linux kernels 2.4 and 2.6, is available on the Hamilton Institute web page (http://www.hamilton.ie/net/htcp/).

References

-- WojtekSronek - 13 Jul 2005 -- SimonLeinen - 30 Jan 2006 - 14 Apr 2008

TCP Westwood

Current Standard TCP implementations rely on packet loss as an indicator of network congestion. The problem with TCP is that it cannot distinguish congestion loss from loss caused by noisy links; as a consequence, Standard TCP reacts to any loss with a drastic reduction of the congestion window. In wireless connections, overlapping radio channels, signal attenuation and additional noise have a huge impact on such losses.

TCP Westwood (TCPW) is a small modification of the Standard TCP congestion control algorithm. When the sender perceives that congestion has appeared, it uses the estimated available bandwidth to set the congestion window and the slow start threshold. TCP Westwood avoids huge reductions of these values and ensures both faster recovery and effective congestion avoidance. It does not require any support from lower or higher layers and does not need any explicit congestion feedback from the network.

Bandwidth measurement in TCP Westwood relies on the simple relationship between the amount of data sent by the sender and the time until the corresponding acknowledgment is received. The bandwidth estimation process is somewhat similar to the RTT estimation used in Standard TCP, with additional refinements. When no ACKs arrive because no packets were sent, the estimated value goes to zero. The formulas of the bandwidth estimation process are shown in many documents about TCP Westwood; such documents are listed on the UCLA Computer Science Department website.

There are some issues that can potentially disrupt the bandwidth estimation process, such as delayed or cumulative ACKs indicating out-of-order reception of segments. The source must therefore keep track of the number of duplicate ACKs and should be able to detect delayed ACKs and act accordingly.

As mentioned before, the general idea is to use the bandwidth estimate to set the congestion window and the slow start threshold after a congestion episode. The algorithm for setting these values is shown below.

if (n duplicate ACKs are received) {       /* loss signalled by duplicate ACKs */
   ssthresh = (BWE * RTTmin)/seg_size;     /* size ssthresh from the bandwidth estimate */
   if (cwin > ssthresh) cwin = ssthresh;   /* congestion avoid. */
}

where

  • BWE is the estimated bandwidth,
  • RTTmin is the smallest RTT value observed during the connection's lifetime,
  • seg_size is the length of the TCP segment payload.

Additional benefits

Test results for TCP Westwood show that its fairness is at least as good as that of the widely used Standard TCP, even if two flows have different round trip times. The main advantage of this TCP variant is that it does not starve other variants of TCP, so TCP Westwood can also be considered friendly.

References

-- WojtekSronek - 27 Jul 2005

TCP Selective Acknowledgements (SACK)

Selective Acknowledgements are a refinement of TCP's traditional "cumulative" acknowledgements.

SACKs allow a receiver to acknowledge non-consecutive data, so that the sender can retransmit only what is missing at the receiver's end. This is particularly helpful on paths with a large bandwidth-delay product (BDP).

TCP may experience poor performance when multiple packets are lost from one window of data. With the limited information available from cumulative acknowledgments, a TCP sender can only learn about a single lost packet per round trip time. An aggressive sender could choose to retransmit packets early, but such retransmitted segments may have already been successfully received.

A Selective Acknowledgment (SACK) mechanism, combined with a selective repeat retransmission policy, can help to overcome these limitations. The receiving TCP sends back SACK packets to the sender informing the sender of data that has been received. The sender can then retransmit only the missing data segments.

Multiple packet losses from a window of data can have a catastrophic effect on TCP throughput. TCP uses a cumulative acknowledgment scheme in which received segments that are not at the left edge of the receive window are not acknowledged. This forces the sender to either wait a roundtrip time to find out about each lost packet, or to unnecessarily retransmit segments which have been correctly received. With the cumulative acknowledgment scheme, multiple dropped segments generally cause TCP to lose its ACK-based clock, reducing overall throughput. Selective Acknowledgment (SACK) is a strategy which corrects this behavior in the face of multiple dropped segments. With selective acknowledgments, the data receiver can inform the sender about all segments that have arrived successfully, so the sender need retransmit only the segments that have actually been lost.

The selective acknowledgment extension uses two TCP options. The first is an enabling option, "SACK-permitted", which may be sent in a SYN segment to indicate that the SACK option can be used once the connection is established. The other is the SACK option itself, which may be sent over an established connection once permission has been given by SACK-permitted.

Blackholing Issues

Enabling SACK globally used to be somewhat risky, because in some parts of the Internet, TCP SYN packets offering/requesting the SACK capability were filtered, causing connection attempts to fail. By now, it seems that the increased deployment of SACK has caused most of these filters to disappear.
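
On Linux hosts, the global SACK setting can be inspected through the net.ipv4.tcp_sack sysctl; a trivial sketch that just reads the /proc interface:

# 1 means SACK is offered/used, 0 means it is disabled.
with open("/proc/sys/net/ipv4/tcp_sack") as f:
    print("net.ipv4.tcp_sack =", f.read().strip())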

Performance Issues

Sometimes it is not recommended to enable the SACK feature; for example, the Linux 2.4.x TCP SACK implementation suffers from significant performance degradation in the case of a burst of packet loss. People at CERN observed that a burst of packet loss considerably affects TCP connections with a large bandwidth-delay product (several MBytes), because the TCP connection doesn't recover as it should. After a burst of loss, the throughput measured in their testbed stayed close to zero for 70 seconds. This behavior is not compliant with the TCP RFCs: a timeout should occur after a few RTTs, because one of the losses couldn't be repaired quickly enough, and the sender should go back into slow start.

For more information take a look at http://sravot.home.cern.ch/sravot/Networking/TCP_performance/tcp_bug.htm

Additionally, work done at the Hamilton Institute found that SACK processing in the Linux kernel was inefficient even for later 2.6 kernels: on 1 Gbps networks, a long file transfer could lose about 100 Mbps of potential performance. Most of these issues should have been fixed in Linux kernel 2.6.16.

Detailed Explanation

The following is closely based on a mail that Baruch Even sent to the pert-discuss mailing list on 25 Jan '07:

The Linux TCP SACK handling code has in the past been extremely inefficient - there were multiple passes on a linked list which holds all the sent packets. This linked list on a large BDP link can span 20,000 packets. This meant multiple traversals of this list take longer than it takes for another packet to come. Pretty quickly after a loss with SACK the sender's incoming queue fills up and ACKs start getting dropped. There used to be an anti-DoS mechanism that would drop all packets until the queue had emptied - with a default of 1000 packets that took a long time and could easily drop all the outstanding ACKs resulting in a great degradation of performance. This value is set with /proc/sys/net/core/netdev_max_backlog.

This situation has been slightly improved in that though there is still a 1000 packet limit for the network queue, it acts as a normal buffer i.e. packets are accepted as soon as the queue dips below 1000 again.

Kernel 2.6.19 should be the preferred kernel now for high speed networks. It is believed it has the fixes for all former major issues (at least those fixes that were accepted to the kernel), but it should be noted that some other (more minor) bugs have appeared, and will need to be fixed in future releases.

Historical Note

Selective acknowledgement schemes were known long before they were added to TCP. Noel Chiappa mentioned that Xerox' PUP protocols had them in a posting to the tcp-ip mailing list in August 1986. Vint Cerf responds with a few notes about the thinking that led to the cumulative form of the original TCP acknowledgements.

References

-- UlrichSchmid & SimonLeinen - 02 Jun 2005 - 14 Jan 2007
-- BartoszBelter - 16 Dec 2005
-- BaruchEven - 05 Jan 2006

Automatic Tuning of TCP Buffers

Note: This mechanism is sometimes referred to as "Dynamic Right-Sizing" (DRS).

The issues mentioned under "Large TCP Windows" are arguments in favor of "buffer auto-tuning", a promising but relatively new approach to better TCP performance in operating systems. See the TCP auto-tuning zoo reference for a description of some approaches.

Microsoft introduced (receive-side) buffer auto-tuning in Windows Vista. This implementation is explained in a TechNet Magazine "Cable Guy" article.

FreeBSD introduced buffer auto-tuning as part of its 7.0 release.

Mac OS X introduced buffer auto-tuning in release 10.5.

Linux auto-tuning details

Some automatic buffer tuning is implemented in Linux 2.4 (sender-side), and Linux 2.6 implements it for both the send and receive directions.

In a post to the web100-discuss mailing list, John Heffner describes the Linux 2.6.16 (March 2006) implementation as follows:

For the sender, we explored separating the send buffer and retransmit queue, but this has been put on the back burner. This is a cleaner approach, but is not necessary to achieve good performance. What is currently implemented in Linux is essentially what is described in Semke '98, but without the max-min fair sharing. When memory runs out, Linux implements something more like congestion control for reducing memory. It's not clear that this is well-behaved, and I'm not aware of any literature on this. However, it's rarely used in practice.

For the receiver, we took an approach similar to DRS, but not quite the same. RTT is measured with timestamps (when available), rather than using a bounding function. This allows it to track a rise in RTT (for example, due to path change or queuing). Also, a subtle but important difference is that receiving rate is measured by the amount of data consumed by the application, not data received by TCP.

Matt Mathis reports on the end2end-interest mailing list (26/07/06):

Linux 2.6.17 now has sender and receiver side autotuning and a 4 MB DEFAULT maximum window size. Yes, by default it negotiates a TCP window scale of 7.
4 MB is sufficient to support about 100 Mb/s on a 300 ms path or 1 Gb/s on a 30 ms path, assuming you have enough data and an extremely clean (loss-less) network.
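
The quoted 4 MB default can be related to the bandwidth-delay product with a quick calculation, and the auto-tuning limits of a Linux host can be inspected in /proc (values are min, default and max in bytes); a small sketch, assuming a reasonably recent Linux kernel:

# Bandwidth-delay product: the window needed to keep a path full.
def bdp_bytes(rate_bps, rtt_s):
    return rate_bps * rtt_s / 8

print(bdp_bytes(100e6, 0.300))   # ~3.75 MB for 100 Mb/s at 300 ms
print(bdp_bytes(1e9, 0.030))     # ~3.75 MB for 1 Gb/s at 30 ms

# Inspect the Linux receive/send auto-tuning limits (min, default, max in bytes).
for name in ("tcp_rmem", "tcp_wmem"):
    with open(f"/proc/sys/net/ipv4/{name}") as f:
        print(name, "=", f.read().split())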

References

  • TCP auto-tuning zoo, Web page by Tom Dunegan of Oak Ridge National Laboratory, PDF
  • A Comparison of TCP Automatic Tuning Techniques for Distributed Computing, E. Weigle, W. Feng, 2002, PDF
  • Automatic TCP Buffer Tuning, J. Semke, J. Mahdavi, M. Mathis, SIGCOMM 1998, PS
  • Dynamic Right-Sizing in TCP, M. Fisk, W. Feng, Proc. of the Los Alamos Computer Science Institute Symposium, October 2001, PDF
  • Dynamic Right-Sizing in FTP (drsFTP): Enhancing Grid Performance in User-Space, M.K. Gardner, Wu-chun Feng, M. Fisk, July 2002, PDF. This paper describes an implementation of buffer-tuning at application level for FTP, i.e. outside of the kernel.
  • Socket Buffer Auto-Sizing for High-Performance Data Transfers, R. Prasad, M. Jain, C. Dovrolis, PDF
  • The Cable Guy: TCP Receive Window Auto-Tuning, J. Davies, January 2007, Microsoft TechNet Magazine
  • What's New in FreeBSD 7.0, F. Biancuzzi, A. Oppermann et al., February 2008, ONLamp
  • How to disable the TCP autotuning diagnostic tool, Microsoft Support, Article ID: 967475, February 2009

-- SimonLeinen - 04 Apr 2006 - 21 Mar 2011
-- ChrisWelti - 03 Aug 2006

Application Protocols

-- TobyRodwell - 28 Feb 2005

File Transfer

A common problem for many scientific applications is the replication of - often large - data sets (files) from one system to another. (For the generalized problem of transferring data sets from a source to multiple destinations, see DataDissemination.) Typically this requires reliable transfer (protection against transmission errors) such as provided by TCP, typically access control based on some sort of authentication, and sometimes confidentiality against eavesdroppers, which can be provided by encryption. There are many protocols that can be used for file transfer, some of which are outlined here.

  • FTP, the File Transfer Protocol, was one of the earliest protocols used on the ARPAnet and the Internet, and predates both TCP and IP. It supports simple file operations over a variety of operating systems and file abstractions, and has both a text and a binary mode. FTP uses separate TCP connections for control and data transfer.
  • HTTP, the Hypertext Transfer Protocol, is the basic protocol used by the World Wide Web. It is quite efficient for transferring files, but is typically used to transfer from a server to a client only.
  • RCP, the Berkeley Remote Copy Protocol, is a convenient protocol for transferring files between Unix systems, but lacks real security beyond address-based authentication and clear-text passwords. Therefore it has mostly fallen out of use.
  • SCP is a file-transfer application of the SSH protocol. It provides various modern methods of authentication and encryption, but its current implementations have some performance limitations over "long fat networks" that are addressed under the SSH topic.
  • BitTorrent is an example of a peer-to-peer file-sharing protocol. It employs local control mechanisms to optimize the global problem of replicating a large file to many recipients, by allowing peers to share partial copies as they receive them.
  • VFER is a tool for high-performance data transfer developed at Internet2. It is layered on UDP and implements its own delay-based rate control algorithm in user-space, which is designed to be "TCP friendly". Its security is based on SSH.
  • UDT is another UDP-based bulk transfer protocol, optimized for high-capacity (1 Gb/s and above) wide-area network paths. It has been used in the winning entry at the Supercomputing'06 Bandwidth Challenge.

Several high-performance file transfer protocols are used in the Grid community. The "comparative evaluation..." paper in the references compares FOBS, RBUDP, UDT, and bbFTP. Other protocols include GridFTP, Tsunami and FDT. The eVLBI community uses file transfer tools from the Mark5 software suite: File2Net and Net2File. The ESnet "Fasterdata" knowledge base has a very nice section on Data Transfer Tools, providing both general background information and information about several specific tools. Another useful document is Harry Mangalam's How to transfer large amounts of data via network—a nicely written general introduction to the problem of moving data with many usage examples of specialized tools, including performance numbers and tuning hints.

Network File Systems

Another possibility of exchanging files over the network involves networked file systems, which make remote files transparently accessible in a local system's normal file namespace. Examples for such file systems are:

  • NFS, the Network File System, was initially developed by Sun and is widely utilized on Unix-like systems. Very recently, NFS 4.1 added support for pNFS (parallel NFS), where data access can be striped over multiple data servers.
  • AFS, the Andrew File System from CMU, evolved into DFS (Distributed File System)
  • SMB (Server Message Block) or CIFS (Common Internet File System) is the standard protocol for connecting to "network shares" (remote file systems) in the Windows world.
  • GPFS (General Parallel File System) is a high-performance scalable network file system by IBM.
  • Lustre is an open-source file system for high-performance clusters, distributed by Sun.

References

-- TobyRodwell - 2005-02-28
-- SimonLeinen - 2005-06-26 - 2015-04-22

-- TobyRodwell - 28 Feb 2005

Secure Shell (SSH)

SSH is a widely used protocol for remote terminal access with secure authentication and data encryption. It is also used for file transfers, using tools such as scp (Secure Copy), sftp (Secure FTP), or rsync-over-ssh.

Performance Issues With SSH

Application Layer Window Limitation

When users use SSH to transfer large files, they often think that performance is limited by the processing power required for encryption and decryption. While this can indeed be an issue in a LAN context, the bottleneck over "long fat networks" (LFNs) is most likely a window limitation. Even when TCP parameters have been tuned to allow sufficiently large TCP windows, the most common SSH implementation (OpenSSH) has a hardwired window size at the application level. Until OpenSSH 4.7, the limit was ~64K; since then, the limit has been increased 16-fold (see below) and the window increase logic has been made more aggressive.

This limitation is replaced with a more advanced logic in a modification of the OpenSSH software provided by the Pittsburgh Supercomputing Center (see below).

The performance difference is substantial, especially as the RTT grows. In a test setup with 45 ms RTT, two Linux systems with 8 MB read/write buffers could achieve 1.5 MB/s with regular OpenSSH (3.9p1 + 4.3p1). Switching to OpenSSH 5.1p1 + HPN-SSH patches on both ends allowed up to 55-70 MB/s (no encryption) or 35/50 MB/s (aes128-cbc/ctr encryption), with the stable rate somewhat lower, the bottleneck being CPU on one end. By just upgrading the receiver (client) side, the transfer could still reach 50 MB/s (with encryption).
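
The throughput ceiling imposed by a fixed window is simply the window divided by the RTT; the following back-of-the-envelope sketch reproduces the order of magnitude of the figures above (it is arithmetic only, not a measurement):

# Throughput ceiling imposed by a fixed window over a given RTT.
def max_throughput(window_bytes, rtt_s):
    return window_bytes / rtt_s

rtt = 0.045                                         # 45 ms test path
print(max_throughput(64 * 1024, rtt) / 1e6)         # ~1.5 MB/s with the old ~64K SSH channel window
print(max_throughput(8 * 1024 * 1024, rtt) / 1e6)   # ~186 MB/s ceiling with 8 MB buffers; other
                                                    # bottlenecks (e.g. CPU) dominate well before that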

Crypto overhead

When the window-size limitation is removed, encryption/decryption performance may become the bottleneck again. So it is useful to choose a "cipher" (encryption/decryption method) that performs well, while still being regarded as sufficiently secure to protect the data in question. Here is a table that displays the performance of several ciphers supported by OpenSSH in a reference setting:

cipher throughput
3des-cbc 2.8MB/s
arcfour 24.4MB/s
aes192-cbc 13.3MB/s
aes256-cbc 11.7MB/s
aes128-ctr 12.7MB/s
aes192-ctr 11.7MB/s
aes256-ctr 11.3MB/s
blowfish-cbc 16.3MB/s
cast128-cbc 7.9MB/s
rijndael-cbc@lysator.liu.se 12.2MB/s

The High Performance Enabled SSH/SCP (HPN-SSH) version also adds an option to the scp program that allows use of the "none" cipher, for cases where confidentiality protection of the transferred data is not required. The program also supports a cipher-switch option, where password authentication is encrypted but the transferred data is not.

References

Basics

SSH Performance

-- ChrisWelti - 03 Apr 2006
-- SimonLeinen - 12 Feb 2005 - 26 Apr 2010
-- PekkaSavola - 01 Oct 2008

BitTorrent

BitTorrent is an example of a peer-to-peer file-sharing protocol. It employs local control mechanisms to optimize the global problem of replicating a large file to many recipients, by allowing peers to share partial copies as they receive them. It was developed by Bram Cohen, and is widely used to distribute large files over the Internet. Because much of this copying is of audio and video data and done without the consent of the rights owners of that material, BitTorrent has become a focus of attention of media interest groups such as the Motion Picture Association of America (MPAA). But BitTorrent is also used to distribute large software archives under "Free Software" or similar legal-redistribution regimes.

References

-- SimonLeinen - 26 Jun 2005

Troubleshooting Procedures

PERT Troubleshooting Procedures, as used in the earlier centralised PERT in GN2 (2004-2008), are laid down in GN2 Deliverable DS3.5.2 (see references section), which also contains the User Guide for using the (now-defunct) PERT Ticket System (PTS).

Standard tests

  • Ping - simple RTT/loss measurement using ICMP ECHO requests
  • Fping - ping variant with concurrent measurement of multiple destinations

End-user tests

The DSL Reports "tweak tools" page (http://www.dslreports.com/tweaks) can run a simple end-user test, checking such parameters as TCP options, receive window size and data transfer rates. It is all done through the user's web browser, making it a simple test for end users to perform.

Linux Tools and Checks

ip route show cache

Linux is able to apply specific conditions (MSS, ssthresh) to specific routes. Some parameters (such as MSS and ssthresh) can be set manually (with ip route add|replace ...) whilst others are changed automatically by Linux in response to what it learns from TCP (such parameters include the estimated rtt, cwnd and re-ordering). The learned information is stored in the route cache and can thus be shown with ip route show cache. Note that this learning behaviour can actually limit TCP performance - if the last transfer was poor, then the starting TCP parameters will be pessimistic. For this reason some tools, e.g. bwctl, always flush the route cache before starting a test.

References

  • PERT Troubleshooting Procedures - 2nd Edition, B. Belter, A. Harding, T. Rodwell, GN2 deliverable D3.5.2, August 2006 (PDF)

-- SimonLeinen - 18 Sep 2005 - 17 Aug 2010
-- BartoszBelter - 28 Mar 2006

Measurement Tools

Traceroute-like Tools: traceroute, MTR, PingPlotter, lft, tracepath, traceproto

There is a large and growing number of path-measurement tools derived from the well-known traceroute tool. Those tools all attempt to find the route a packet will take from the source (typically where the tool is running) to a given destination, and to find out some hop-wise performance parameters along the way.

Bandwidth Measurement Tools: pchar, Iperf, bwctl, nuttcp, Netperf, RUDE/CRUDE, ttcp, NDT, DSL Reports

Use: Evaluate the Bandwidth between two points in the network.

Active Measurement Boxes

Use: information about Delay, Jitter and packets loss between two probes.

Passive Measurement Tools

These tools perform their measurement by looking at existing traffic. They can be further classified into

References

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- HankNussbacher - 01 Jun 2005
-- SimonLeinen - 2005-07-15 - 2014-12-27
-- AlessandraScicchitano - 2013-01-30

traceroute-like Tools

Traceroute is used to determine the route a packet takes through the Internet to reach its destination; i.e. the sequence of gateways or "hops" it passes through. Since its inception, traceroute has been widely used for network diagnostics as well as for research in the widest sense.

The basic idea is to send out "probe" packets with artificially small TTL (time-to-live) values, eliciting ICMP "time exceeded" messages from routers in the network, and reconstructing the path from these messages. This is described in more detail under the VanJacobsonTraceroute topic. This original traceroute implementation was followed by many attempts at improving on the idea in various directions: More useful output (often graphically enhanced, sometimes trying to map the route geographically), more detailed measurements along the path, faster operation for many targets ("topology discovery"), and more robustness in the face of packet filters, using different types of suitable probe packets.

Traceroute Variants

Close cousins

Variants with more filter-friendly probe packets

Extensions

GUI (Graphical User Interface) traceroute variants

  • 3d Traceroute plots per-hop RTTs over time in 3D. Works on Windows, free demo available
  • LoriotPro graphical traceroute plugin. Several graphical representations.
  • Path Analyzer Pro, graphical traceroute tool for Windows and MacOS X. Includes a "synopsis" feature.
  • PingPlotter combines traceroute, ping and whois to collect data for the Windows platform.
  • VisualRoute performs traceroutes and provides various graphical representations and analysis. Runs on Windows, Web demo available.

More detailed measurements along the path

Other

  • Multicast Traceroute (mtrace) uses IGMP protocol extensions to allow "traceroute" functionality for multicast
  • Paris Traceroute claims to produce better results in the presence of "multipath" routing.
  • Scamper performs bulk measurements to large numbers of destinations, under configurable traffic limits.
  • tracefilter tries to detect where along a path packets are filtered.
  • Tracebox tries to detect middleboxes along the path
  • traceiface tries to find "both ends" of each link traversed using expanding-ring search.
  • Net::Traceroute::PurePerl is an implementation of traceroute in Perl that lends itself to experimentation.
  • traceflow is a proposed protocol to trace the path for traffic of a specific 3/5-tuple and collect diagnostic information

Another list of traceroute variants, with a focus on (geo)graphical display capabilities, can be found on the JRA1 Wiki.

Traceroute Servers

There are many TracerouteServers on the Internet that allow running traceroute from other parts of the network.

Traceroute Data Collection and Representation

Researchers from the network measurement community have created large collections of traceroute results to help understand the Internet's topology, i.e. the structure of its connectivity. Some of these collections are available for other researchers. Scamper is an example of a tool that can be used to efficiently obtain traceroute results towards a large number of destinations.

The IETF IPPM WG is standardizing an XML-based format to store traceroute results.

-- FrancoisXavierAndreu - SimonMuyal - 06 Jun 2005
-- SimonLeinen - 2005-05-06 - 2015-09-11

Original Van Jacobson/Unix/LBL Traceroute

The well known traceroute program was written by Van Jacobson in December 1988. In the header comment of the source, Van explains that he just implemented an idea he got from Steve Deering (and that could have come from Guy Almes or Matt Mathis). Traceroute sends "probe" packets with TTL (Time to Live) values incrementing from one, and uses ICMP "Time Exceeded" messages to detect router "hops" on the way to the specified destination. It also records "response" times for each hop, and displays losses and other types of failures in a compact way.

This traceroute is also known as "Unix traceroute" (although some Unix variants have departed from the code base, see e.g. Solaris traceroute) or "LBL traceroute" after Van Jacobson's employer at the time he wrote this.

Basic function

UDP packets are sent as probes to a high ephemeral port (usually in the range 33434--33525) with the Time-To-Live (TTL) field in the IP header increasing by one for each probe sent until the end host is reached. The originating host listens for ICMP Time Exceeded responses from each of the routers/hosts en-route. It knows that the packet's destination has been reached when it receives an ICMP Port Unreachable message; we expect a port unreachable message as no service should be listening for connections in this port range. If there is no response to the probe within a certain time period (typically 5 seconds), then a * is displayed.
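
As an illustration of this mechanism, here is a minimal Python sketch of a UDP-probe traceroute. It is an illustration only: it needs root privileges for the raw ICMP receive socket, assumes no IP options in received packets, and omits the per-probe port matching that the real tool performs.

import socket, time

def probe(dest_ip, max_hops=30, base_port=33434, timeout=3.0):
    # Raw socket to receive the ICMP Time Exceeded / Port Unreachable replies.
    recv = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)
    recv.settimeout(timeout)
    for ttl in range(1, max_hops + 1):
        send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        send.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
        t0 = time.time()
        send.sendto(b"", (dest_ip, base_port + ttl))   # vary the destination port per probe
        try:
            data, (hop, _) = recv.recvfrom(512)
            rtt = (time.time() - t0) * 1000.0
            icmp_type = data[20]        # ICMP type follows the 20-byte IP header (no options assumed)
            print(f"{ttl:2d}  {hop}  {rtt:.1f} ms")
            if icmp_type == 3:          # Destination (Port) Unreachable: the target was reached
                break                   # (type 11 = Time Exceeded from an intermediate router)
        except socket.timeout:
            print(f"{ttl:2d}  *")
        finally:
            send.close()
    recv.close()

probe("192.0.2.1")    # documentation address; replace with a real target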

Output

The output of the traceroute program shows each host that the packet passes through on its way to its destination and the RTT to each gateway en route. Occasionally, the Port Unreachable reply from the destination arrives with a TTL of 1 or less, typically because the destination reused the (nearly exhausted) TTL of the arriving probe for its reply; when this happens an ! is printed beside the RTT in the output (see also the corresponding note in the Traceroute6 section below).

Other error messages that may appear after the RTT in the output of a traceroute are:

!H
Host unreachable
!N
Network unreachable
!P
Protocol unreachable
!S
Source-route failed (that is to say, the router was not able to honour the source-route option set in an IP packet)
!F[pmtu]
Fragmentation needed. [pmtu] displays the Path MTU Discovery value, typically the "next-hop MTU" contained in the ICMP response packet.
!X
Administratively prohibited. The gateway prohibits these packets, but sends an ICMP message back to the source of the traceroute to inform them of this.
!V
Host precedence violation
!C
Precedence cut-off in effect
![num]
displays the ICMP unreachable code, as defined in a variety of RFCs, shown at http://www.iana.org/assignments/icmp-parameters

UDP Probes

As described under "Basic function" above, the original traceroute uses UDP datagrams sent to high, presumably unused ports (usually in the range 33434--33525) as its probes, increasing the TTL by one per probe until an ICMP Port Unreachable reply indicates that the destination has been reached.

Why UDP?

It would seem natural to use ICMP ECHO requests as probe packets, but Van Jacobson chose UDP packets to presumably unused ports instead. It is believed that this is because at that time, some gateways (as routers were called then) refused to send ICMP (TTL exceeded) messages in response to ICMP messages, as specified in the introduction of RFC 792, "Internet Control Message Protocol". Therefore the UDP variant was more robust.

These days, all gateways (routers) send ICMP TTL Exceeded messages about ICMP ECHO request packets (as specified in RFC 1122, "Requirements for Internet Hosts -- Communication Layers"), so more recent traceroute versions (such as Windows tracert) do indeed use ICMP probes, and newer Unix traceroute versions allow ICMP probes to be selected with the -I option.

Why increment the UDP port number?

Traceroute varies (increments) the UDP destination port number for each probe sent out, in order to reliably match ICMP TTL Exceeded messages to individual probes. Because the UDP ports occur right after the IP header, they can be relied on to be included in the "original packet" portion of the ICMP TTL Exceeded messages, even though the ICMP standards only mandate that the first eight octets following the IP header of the original packet be included in ICMP messages (it is allowed to send more though).

When ICMP ECHO requests are used, the probes can be disambiguated by using the sequence number field, which also happens to be located just before that 8-octet boundary.

Filters

Note that either or both of ICMP and UDP may be blocked by firewalls, so this must be taken into account when troubleshooting. As an alternative, one can often use traceroute variants such as lft or tcptraceroute to work around these filters, provided that the destination host has at least one TCP port open towards the source.

References

  • ftp://ftp.ee.lbl.gov/ - the original distribution site. The latest traceroute version found as of February 2006 is traceroute-1.4a12 from December 2000.
  • original announcement from the archives of the end2end-interest mailing list

-- SimonLeinen - 26 Feb 2006

Traceroute6

Traceroute6 uses the Hop-Limit field of the IPv6 protocol to elicit an ICMPv6 Time Exceeded message from each gateway ("hop") along the path to some host. Just as with traceroute, it prints the route to the given destination and the RTT to each gateway/router.
The following is a list of possible errors that may appear after the RTT for a gateway (especially for OSes that use the KAME IPv6 network stack, such as the BSDs):

!N
No route to host
!P
Administratively prohibited (i.e. Blocked by a firewall, but the firewall issues an ICMPv6 message to the originating host to inform them of this)
!S
Not a Neighbour
!A
Address unreachable
!
The hop-limit is <= 1 on a Port Unreachable ICMPv6 message. This means that the packet got to its destination, but that the reply's hop-limit was only just large enough to allow it to get back to the source of the traceroute6. This was more interesting in IPv4, where bugs in some implementations of the IP stack could be identified by this behaviour.

Traceroute6 can also be specified to use ICMPv6 Echo messages to send the probe packets, instead of the default UDP probes, by specifying the -I flag when running the program. This may be useful in situations where UDP packets are blocked by a packet filter or firewall, while ICMP ECHO requests are permitted.

Note that on some systems, notably Sun's Solaris (see SolarisTraceroute), IPv6 functionality is integrated in the normal traceroute program.

-- TobyRodwell - 06 Apr 2005
-- SimonLeinen - 26 Feb 2006

Solaris traceroute (includes IPv6 functionality)

On Sun's Solaris, IPv6 traceroute functionality is included in the normal traceroute program, so a separate "traceroute6" isn't necessary. Sun's traceroute program has additional options to select between IPv4 and IPv6:

-A [inet|inet6]
Resolve hostnames to an IPv4 (inet) or IPv6 (inet6) address only

-a
Perform traceroutes to all addresses in all address families found for the given destination name.

Examples

Here are a few examples of the address-family selection switches for Solaris' traceroute. By default, IPv6 is preferred:

: leinen@diotima[leinen]; traceroute cemp1
traceroute: Warning: cemp1 has multiple addresses; using 2001:620:0:114:20b:cdff:fe1b:3d1a
traceroute: Warning: Multiple interfaces found; using 2001:620:0:4:203:baff:fe4c:d751 @ bge0:1
traceroute to cemp1 (2001:620:0:114:20b:cdff:fe1b:3d1a), 30 hops max, 60 byte packets
 1  swiLM1-V4.switch.ch (2001:620:0:4::1)  0.683 ms  0.757 ms  0.449 ms
 2  swiNM1-V610.switch.ch (2001:620:0:c047::1)  0.576 ms  0.576 ms  0.463 ms
 3  swiCS3-G3-3.switch.ch (2001:620:0:c046::1)  0.461 ms  0.334 ms  0.340 ms
 4  swiEZ2-P1.switch.ch (2001:620:0:c03f::2)  0.467 ms  0.332 ms  0.348 ms
 5  swiLS2-10GE-1-1.switch.ch (2001:620:0:c03c::1)  3.976 ms  3.825 ms  3.729 ms
 6  swiCE2-10GE-1-3.switch.ch (2001:620:0:c006::1)  4.817 ms  4.703 ms  4.740 ms
 7  cemp1-eth1.switch.ch (2001:620:0:114:20b:cdff:fe1b:3d1a)  4.583 ms  4.566 ms  4.590 ms

If IPv4 is desired, this can be selected using -A inet:

: leinen@diotima[leinen]; traceroute -A inet cemp1
traceroute to cemp1 (130.59.35.130), 30 hops max, 40 byte packets
 1  swiLM1-V4.switch.ch (130.59.4.1)  0.643 ms  0.539 ms  0.465 ms
 2  swiNM1-V610.switch.ch (130.59.15.229)  0.453 ms  0.553 ms  0.470 ms
 3  swiCS3-G3-3.switch.ch (130.59.15.238)  0.590 ms  0.426 ms  0.476 ms
 4  swiEZ2-P1.switch.ch (130.59.36.222)  0.463 ms  0.307 ms  0.352 ms
 5  swiLS2-10GE-1-1.switch.ch (130.59.36.205)  3.723 ms  3.755 ms  3.743 ms
 6  swiCE2-10GE-1-3.switch.ch (130.59.37.1)  4.677 ms  4.678 ms  4.690 ms
 7  cemp1-eth1.switch.ch (130.59.35.130)  5.028 ms  4.555 ms  4.567 ms

The -a switch can be used to trace to all addresses:

: leinen@diotima[leinen]; traceroute -a cemp1
traceroute: Warning: Multiple interfaces found; using 2001:620:0:4:203:baff:fe4c:d751 @ bge0:1
traceroute to cemp1 (2001:620:0:114:20b:cdff:fe1b:3d1a), 30 hops max, 60 byte packets
 1  swiLM1-V4.switch.ch (2001:620:0:4::1)  0.684 ms  0.515 ms  0.457 ms
 2  swiNM1-V610.switch.ch (2001:620:0:c047::1)  0.580 ms  0.848 ms  0.561 ms
 3  swiCS3-G3-3.switch.ch (2001:620:0:c046::1)  0.304 ms  0.428 ms  0.315 ms
 4  swiEZ2-P1.switch.ch (2001:620:0:c03f::2)  0.455 ms  0.516 ms  0.397 ms
 5  swiLS2-10GE-1-1.switch.ch (2001:620:0:c03c::1)  3.853 ms  3.826 ms  3.874 ms
 6  swiCE2-10GE-1-3.switch.ch (2001:620:0:c006::1)  5.071 ms  4.654 ms  4.702 ms
 7  cemp1-eth1.switch.ch (2001:620:0:114:20b:cdff:fe1b:3d1a)  4.581 ms  4.564 ms  4.585 ms

traceroute to cemp1 (130.59.35.130), 30 hops max, 40 byte packets
 1  swiLM1-V4.switch.ch (130.59.4.1)  0.677 ms  0.523 ms  0.716 ms
 2  swiNM1-V610.switch.ch (130.59.15.229)  0.462 ms  0.558 ms  0.470 ms
 3  swiCS3-G3-3.switch.ch (130.59.15.238)  0.340 ms  0.309 ms  0.352 ms
 4  swiEZ2-P1.switch.ch (130.59.36.222)  0.341 ms  0.307 ms  0.351 ms
 5  swiLS2-10GE-1-1.switch.ch (130.59.36.205)  3.722 ms  3.684 ms  3.719 ms
 6  swiCE2-10GE-1-3.switch.ch (130.59.37.1)  4.794 ms  4.695 ms  4.658 ms
 7  cemp1-eth1.switch.ch (130.59.35.130)  4.645 ms  4.653 ms  4.587 ms

-- SimonLeinen - 26 Feb 2006

NANOG Traceroute

Ehud Gavron maintains this version of traceroute. It is derived from the original traceroute program, but adds a few features such as AS (Autonomous System) number lookup, and detection of TOS (Type-of-Service) changes along the path.

TOS Change Detection

This feature can be used to check whether a network path is DSCP-transparent. Use the -t option to select the TOS byte, and watch out for TOS byte changes indicated with TOS=x!.

As shown in the following example, GEANT2 accepts TOS=32/DSCP=8/CLS=1 - this corresponds to "LBE"

: root@diotima[nanog]; ./traceroute -t 32 www.dfn.de.
traceroute to sirius.dfn.de (192.76.176.5), 30 hops max, 40 byte packets
1 swiLM1-V4.switch.ch (130.59.4.1) 1 ms 1 ms 0 ms
2 swiNM1-V610.switch.ch (130.59.15.229) 0 ms 1 ms 0 ms
3 swiCS3-G3-3.switch.ch (130.59.15.238) 0 ms 0 ms 0 ms
4 swiEZ2-P1.switch.ch (130.59.36.222) 0 ms 0 ms 0 ms
5 swiLS2-10GE-1-1.switch.ch (130.59.36.205) 4 ms 4 ms 4 ms
6 swiCE2-10GE-1-3.switch.ch (130.59.37.1) 5 ms 5 ms 5 ms
7 switch.rt1.gen.ch.geant2.net (62.40.124.21) 5 ms 5 ms 5 ms
8 so-7-2-0.rt1.fra.de.geant2.net (62.40.112.22) 13 ms 13 ms 13 ms
9 dfn-gw.rt1.fra.de.geant2.net (62.40.124.34) 13 ms 13 ms 13 ms
10 cr-berlin1-po1-0.x-win.dfn.de (188.1.18.53) 25 ms 25 ms 25 ms
11 ar-berlin1-ge6-1.x-win.dfn.de (188.1.20.34) 25 ms 25 ms 25 ms
12 zpl-gw.dfn.de (192.76.176.253) 25 ms 25 ms 25 ms
13 * ^C

On the other hand, TOS=64/DSCP=16/CLS=2 is rewritten to TOS=0 at hop 7 (the rewrite only shows up at hop 8):

: root@diotima[nanog]; ./traceroute -t 64 www.dfn.de.
traceroute to sirius.dfn.de (192.76.176.5), 30 hops max, 40 byte packets
1 swiLM1-V4.switch.ch (130.59.4.1) 2 ms 1 ms 0 ms
2 swiNM1-V610.switch.ch (130.59.15.229) 0 ms 0 ms 0 ms
3 swiCS3-G3-3.switch.ch (130.59.15.238) 0 ms 0 ms 0 ms
4 swiEZ2-P1.switch.ch (130.59.36.222) 0 ms 0 ms 0 ms
5 swiLS2-10GE-1-1.switch.ch (130.59.36.205) 4 ms 4 ms 4 ms
6 swiCE2-10GE-1-3.switch.ch (130.59.37.1) 5 ms 5 ms 5 ms
7 switch.rt1.gen.ch.geant2.net (62.40.124.21) 5 ms 5 ms 5 ms
8 so-7-2-0.rt1.fra.de.geant2.net (62.40.112.22) 13 ms (TOS=0!) 13 ms 13 ms
9 dfn-gw.rt1.fra.de.geant2.net (62.40.124.34) 13 ms 13 ms 13 ms
10 cr-berlin1-po1-0.x-win.dfn.de (188.1.18.53) 25 ms 25 ms 25 ms
11 ar-berlin1-ge6-1.x-win.dfn.de (188.1.20.34) 25 ms 25 ms 25 ms
12 zpl-gw.dfn.de (192.76.176.253) 25 ms 25 ms 25 ms
13 * ^C

References

-- SimonLeinen - 26 Feb 2006

Windows tracert

All Windows versions that come with TCP/IP support include a tracert command, which is a simple traceroute client that uses ICMP probes. The (few) options are slightly different from the Unix variants of traceroute. In short, -d is used to suppress address-to-name resolution (-n in Unix traceroute), -h specifies the maximum hopcount (-m in the Unix version), -j is used to specify loose source routes (-g in Unix), and although the same -w option is used to specify the timeout, tracert interprets it in milliseconds, while in Unix it is specified in seconds.

References

-- SimonLeinen - 26 Feb 2006

TCP Traceroute

TCPTraceroute is a traceroute implementation that uses TCP packets instead of UDP or ICMP packets to send its probes. TCPtraceroute can be used in situations where a firewall blocks ICMP and UDP traffic. It is based on the "half-open scanning" technique that is used by NMAP: it sends a TCP packet with the SYN flag set and waits for a SYN/ACK (which indicates that something is listening on this port for connections). When it receives a response, the tcptraceroute program sends a packet with the RST flag set to close the connection.

References

-- TobyRodwell - 06 Apr 2005

LFT (Layer Four Traceroute)

lft is a variant of traceroute that by default uses TCP probes to port 80, in order to pass through packet-filter based firewalls.

Example

142:/home/andreu# lft -d 80 -m 1 -M 3 -a 5 -c 20 -t 1000 -H 30 -s 53 www.cisco.com     
Tracing _________________________________.
TTL  LFT trace to www.cisco.com (198.133.219.25):80/tcp
1   129.renater.fr (193.49.159.129) 0.5ms
2   gw1-renater.renater.fr (193.49.159.249) 0.4ms
3   nri-a-g13-0-50.cssi.renater.fr (193.51.182.6) 1.0ms
4   193.51.185.1 0.6ms
5   PO11-0.pascr1.Paris.opentransit.net (193.251.241.97) 7.0ms
6   level3-1.GW.opentransit.net (193.251.240.214) 0.8ms
7   ae-0-17.mp1.Paris1.Level3.net (212.73.240.97) 1.1ms
8   so-1-0-0.bbr2.London2.Level3.net (212.187.128.42) 10.6ms
9   as-0-0.bbr1.NewYork1.Level3.net (4.68.128.106) 72.1ms
10   as-0-0.bbr1.SanJose1.Level3.net (64.159.1.133) 158.7ms
11   ge-7-0.ipcolo1.SanJose1.Level3.net (4.68.123.9) 159.2ms
12   p1-0.cisco.bbnplanet.net (4.0.26.14) 159.4ms
13   sjck-dmzbb-gw1.cisco.com (128.107.239.9) 159.0ms
14   sjck-dmzdc-gw2.cisco.com (128.107.224.77) 159.1ms
15   [target] www.cisco.com (198.133.219.25):80 159.2ms

References

  • Home page: http://pwhois.org/lft/
  • Debian package: lft - display the route packets take to a network host/socket

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- SimonLeinen - 21 May 2006

traceproto

Another traceroute variant which allows different protocols and ports to be used. "It currently supports tcp, udp, and icmp traces with the possibility of others in the future." It comes with a wrapper script called HopWatcher, which can be used to quickly detect when a path has changed.

Example

142:/home/andreu# traceproto www.cisco.com
traceproto: trace to www.cisco.com (198.133.219.25), port 80
ttl  1:  ICMP Time Exceeded from 129.renater.fr (193.49.159.129)
        6.7040 ms       0.28100 ms      0.28600 ms
ttl  2:  ICMP Time Exceeded from gw1-renater.renater.fr (193.49.159.249)
        0.16900 ms      6.0140 ms       0.25500 ms
ttl  3:  ICMP Time Exceeded from nri-a-g13-0-50.cssi.renater.fr (193.51.182.6)
        6.8280 ms       0.58200 ms      0.52100 ms
ttl  4:  ICMP Time Exceeded from 193.51.185.1 (193.51.185.1)
        6.6400 ms       7.4230 ms       6.7690 ms
ttl  5:  ICMP Time Exceeded from PO11-0.pascr1.Paris.opentransit.net (193.251.241.97)
        0.58100 ms      0.64100 ms      0.54700 ms
ttl  6:  ICMP Time Exceeded from level3-1.GW.opentransit.net (193.251.240.214)
        6.9390 ms       0.62200 ms      6.8990 ms
ttl  7:  ICMP Time Exceeded from ae-0-17.mp1.Paris1.Level3.net (212.73.240.97)
        7.0790 ms       7.0250 ms       0.79400 ms
ttl  8:  ICMP Time Exceeded from so-1-0-0.bbr2.London2.Level3.net (212.187.128.42)
        10.362 ms       10.100 ms       16.384 ms
ttl  9:  ICMP Time Exceeded from as-0-0.bbr1.NewYork1.Level3.net (4.68.128.106)
        109.93 ms       78.367 ms       80.352 ms
ttl  10:  ICMP Time Exceeded from as-0-0.bbr1.SanJose1.Level3.net (64.159.1.133)
        156.61 ms       179.35 ms
          ICMP Time Exceeded from ae-0-0.bbr2.SanJose1.Level3.net (64.159.1.130)
        148.04 ms
ttl  11:  ICMP Time Exceeded from ge-7-0.ipcolo1.SanJose1.Level3.net (4.68.123.9)
        153.59 ms
         ICMP Time Exceeded from ge-11-0.ipcolo1.SanJose1.Level3.net (4.68.123.41)
        142.50 ms
         ICMP Time Exceeded from ge-7-1.ipcolo1.SanJose1.Level3.net (4.68.123.73)
        133.66 ms
ttl  12:  ICMP Time Exceeded from p1-0.cisco.bbnplanet.net (4.0.26.14)
        150.13 ms       191.24 ms       156.89 ms
ttl  13:  ICMP Time Exceeded from sjck-dmzbb-gw1.cisco.com (128.107.239.9)
        141.47 ms       147.98 ms       158.12 ms
ttl  14:  ICMP Time Exceeded from sjck-dmzdc-gw2.cisco.com (128.107.224.77)
        188.85 ms       148.17 ms       152.99 ms
ttl  15:no response     no response     

hop :  min   /  ave   /  max   :  # packets  :  # lost
      -------------------------------------------------------
  1 : 0.28100 / 2.4237 / 6.7040 :   3 packets :   0 lost
  2 : 0.16900 / 2.1460 / 6.0140 :   3 packets :   0 lost
  3 : 0.52100 / 2.6437 / 6.8280 :   3 packets :   0 lost
  4 : 6.6400 / 6.9440 / 7.4230 :   3 packets :   0 lost
  5 : 0.54700 / 0.58967 / 0.64100 :   3 packets :   0 lost
  6 : 0.62200 / 4.8200 / 6.9390 :   3 packets :   0 lost
  7 : 0.79400 / 4.9660 / 7.0790 :   3 packets :   0 lost
  8 : 10.100 / 12.282 / 16.384 :   3 packets :   0 lost
  9 : 78.367 / 89.550 / 109.93 :   3 packets :   0 lost
 10 : 148.04 / 161.33 / 179.35 :   3 packets :   0 lost
 11 : 133.66 / 143.25 / 153.59 :   3 packets :   0 lost
 12 : 150.13 / 166.09 / 191.24 :   3 packets :   0 lost
 13 : 141.47 / 149.19 / 158.12 :   3 packets :   0 lost
 14 : 148.17 / 163.34 / 188.85 :   3 packets :   0 lost
 15 : 0.0000 / 0.0000 / 0.0000 :   0 packets :   2 lost
     ------------------------Total--------------------------
total 0.0000 / 60.540 / 191.24 :  42 packets :   2 lost

References

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005

MTR (Matt's TraceRoute)

mtr (Matt's TraceRoute) combines the functionality of the traceroute and ping programs in a single network diagnostic tool.

Example

mtr.png
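
In addition to the interactive curses display shown above, mtr can produce a plain-text report suitable for scripts or e-mail (a sketch; the host name is a placeholder and the exact columns depend on the mtr version):

mtr --report --report-cycles 10 www.example.org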

References

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005

-- SimonLeinen - 06 May 2005 - 26 Feb 2006

pathping

pathping is similar to mtr, but for Windows systems. It is included in at least some modern versions of Windows.
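
Example

A minimal invocation from a Windows command prompt (a sketch; the host name is a placeholder). pathping first lists the hops and then spends some time collecting per-hop loss and latency statistics:

C:\> pathping www.example.org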

References

  • "pathping", Windows XP Professional Product Documentation,
http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/pathping.mspx

-- SimonLeinen - 26 Feb 2006

PingPlotter

PingPlotter by Nessoft LLC combines traceroute, ping and whois to collect data for the Windows platform.

Example

A nice example of the tool can be found in the "Screenshot" section of its Web page. This contains not only the screenshot (included below by reference) but also a description of what you can read from it.

nessoftgraph.gif

References

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- SimonLeinen - 26 Feb 2006

Traceroute Mesh Server

Traceroute Mesh combines traceroutes from many sources at once and builds a large map of the interconnections from the Internet to a specific IP address. Very cool!

References

Example

Partial view of traceroute map created:

partial.png

-- HankNussbacher - 10 Jul 2005

tracepath

tracepath and tracepath6 trace the path to a network host, discovering MTU and asymmetry along this path. As described below, their applicability for path asymmetry measurements is quite limited, but the tools can still measure MTU rather reliably.

Methodology and Caveats

A path is considered asymmetric if the number of hops to a router differs from how much the TTL was decremented while the ICMP error message was forwarded back from that router. The latter depends on knowing the original TTL the router used when sending the ICMP error; the tools guess TTL values of 64, 128 and 255. Obviously, a path might be asymmetric even if the forward and return paths are equally long, so the tool only catches one class of path asymmetry.

A major operational issue with this approach is that at least Juniper's M/T-series routers decrement the TTL for ICMP errors they originate (e.g., the first-hop router returns the ICMP error with TTL=254 instead of TTL=255) as if they were forwarding the packet. This shows up as path asymmetry.

Path MTU is measured by sending UDP packets with the DF bit set. The initial packet size is the MTU of the host's outgoing link, or the Path MTU already cached (from earlier Path MTU Discovery) for the given destination address. If a link MTU along the path is lower than the probed size, the resulting ICMP error reports the new path MTU, which is then used in subsequent probes.

As explained in tracepath(8), if MTU changes along the path, then the route will probably erroneously be declared as asymmetric.

Examples

IPv4 Example

This example shows a path from a host with 9000-byte "jumbo" MTU support to a host on a traditional 1500-byte Ethernet.

: leinen@cemp1[leinen]; tracepath diotima
 1:  cemp1.switch.ch (130.59.35.130)                        0.203ms pmtu 9000
 1:  swiCE2-G5-2.switch.ch (130.59.35.129)                  1.024ms
 2:  swiLS2-10GE-1-3.switch.ch (130.59.37.2)                1.959ms
 3:  swiEZ2-10GE-1-1.switch.ch (130.59.36.206)              5.287ms
 4:  swiCS3-P1.switch.ch (130.59.36.221)                    5.456ms
 5:  swiCS3-P1.switch.ch (130.59.36.221)                  asymm  4   5.467ms pmtu 1500
 6:  swiLM1-V610.switch.ch (130.59.15.230)                  4.864ms
 7:  swiLM1-V610.switch.ch (130.59.15.230)                asymm  6   5.209ms !H
     Resume: pmtu 1500

The router (interface) swiCS3-P1.switch.ch occurs twice; on the first line (hop 4), it returns an ICMP TTL Exceeded error, on the next (hop 5) it returns an ICMP "fragmentation needed and DF bit set" error. Unfortunately this causes tracepath to miss the "real" hop 5, and also to erroneously assume that the route is asymmetric at that point. One could consider this a bug, as tracepath could distinguish these different ICMP errors, and refrain from incrementing TTL when it reduces MTU (in response to the "fragmentation needed..." error).

When the tracepath is retried, the discovered Path MTU for the destination has already been cached by the host, and a different result is obtained:

: leinen@cemp1[leinen]; tracepath diotima
 1:  cemp1.switch.ch (130.59.35.130)                        0.211ms pmtu 1500
 1:  swiCE2-G5-2.switch.ch (130.59.35.129)                  0.384ms
 2:  swiLS2-10GE-1-3.switch.ch (130.59.37.2)                1.214ms
 3:  swiEZ2-10GE-1-1.switch.ch (130.59.36.206)              4.620ms
 4:  swiCS3-P1.switch.ch (130.59.36.221)                    4.623ms
 5:  swiNM1-G1-0-25.switch.ch (130.59.15.237)               5.861ms
 6:  swiLM1-V610.switch.ch (130.59.15.230)                  4.845ms
 7:  swiLM1-V610.switch.ch (130.59.15.230)                asymm  6   5.226ms !H
     Resume: pmtu 1500

Note that hop 5 now shows up correctly and without an "asymm" warning. There is still an "asymm" warning at the end of the path, because a filter on the last-hop router swiLM1-V610.switch.ch prevents the UDP probes from reaching the final destination.

IPv6 Example

Here is the same path for IPv6, using tracepath6. Because of more relaxed UDP filters, the final destination is actually reached in this case:

: leinen@cemp1[leinen]; tracepath6 diotima
 1?: [LOCALHOST]                      pmtu 9000
 1:  swiCE2-G5-2.switch.ch                      1.654ms
 2:  swiLS2-10GE-1-3.switch.ch                  2.235ms
 3:  swiEZ2-10GE-1-1.switch.ch                  5.616ms
 4:  swiCS3-P1.switch.ch                        5.793ms
 5:  swiCS3-P1.switch.ch                      asymm  4   5.872ms pmtu 1500
 5:  swiNM1-G1-0-25.switch.ch                   5. 47ms
 6:  swiLM1-V610.switch.ch                      5. 79ms
 7:  diotima.switch.ch                          4.766ms reached
     Resume: pmtu 1500 hops 7 back 7

Again, once the Path MTU has been cached, tracepath6 starts out with that MTU, and will discover the correct path:

: leinen@cemp1[leinen]; tracepath6 diotima
 1?: [LOCALHOST]                      pmtu 1500
 1:  swiCE2-G5-2.switch.ch                      0.703ms
 2:  swiLS2-10GE-1-3.switch.ch                  8.786ms
 3:  swiEZ2-10GE-1-1.switch.ch                  4.904ms
 4:  swiCS3-P1.switch.ch                        4.979ms
 5:  swiNM1-G1-0-25.switch.ch                   4.989ms
 6:  swiLM1-V610.switch.ch                      6.578ms
 7:  diotima.switch.ch                          5.191ms reached
     Resume: pmtu 1500 hops 7 back 7

References

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- SimonLeinen - 26 Feb 2006
-- PekkaSavola - 31 Aug 2006

Bandwidth Measurement Tools

(Click on blue headings to link through to detailed tool descriptions and examples)

Iperf

Iperf is a tool to measure maximum TCP bandwidth, allowing the tuning of various parameters and UDP characteristics. Iperf reports bandwidth, delay jitter, and datagram loss. http://sourceforge.net/projects/iperf/

BWCTL

"BWCTL is a command line client application and a scheduling and policy daemon that wraps Iperf"

Home page: http://e2epi.internet2.edu/bwctl/

Example
http://e2epi.internet2.edu/pipes/pmp/pmp-switch.htm

nuttcp

A measurement tool very similar to iperf. It can be found at http://www.nuttcp.net/nuttcp/Welcome%20Page.html

NDT

Web100-based TCP tester that can be used from a Java applet.

Pchar

Hop-by-hop capacity measurements along a network path.

Netperf

A client/server network performance benchmark.

Home page: http://www.netperf.org/netperf/NetperfPage.html

Thrulay

A tool that performs TCP throughput tests and RTT measurements at the same time.

RUDE/CRUDE

RUDE is a package of applications to generate and measure UDP traffic between two points. The rude tool generates traffic to the network, which can be received and logged on the other side of the network with crude. The traffic pattern can be defined by the user. The tool is available at http://rude.sourceforge.net/

NEPIM

nepim stands for network pipemeter, a tool for measuring available bandwidth between hosts. nepim is also useful to generate network traffic for testing purposes.

nepim operates in client/server mode, is able to handle multiple parallel traffic streams, reports periodic partial statistics along the testing, and supports IPv6.

The tool is available at http://www.nongnu.org/nepim/

TTCP

TTCP (Test TCP) is a utility for benchmarking UDP and TCP performance. A utility for Unix and Windows is available at http://www.pcausa.com/Utilities/pcattcp.htm

DSL Reports

DSL Reports doesn't require any installation. It checks the speed of your connection via a Java applet. One can choose from over 300 sites in the world from which to run the tests: http://www.dslreports.com/stest

* Sample report produced by DSL Reports:
dslreport.png

Other online bandwidth measurement sites:

* http://myspeed.visualware.com/

* http://www.toast.net/performance/

* http://www.beelinebandwidthtest.com/

-- FrancoisXavierAndreu, SimonMuyal & SimonLeinen - 06-30 Jun 2005

-- HankNussbacher - 07 Jul 2005 & 15 Oct 2005 (DSL and online Reports section)

pchar

Characterizes the bandwidth, latency and loss on network links. (See the notes below the example for information on how pchar works.)

(debian package : pchar)

Example:

pchar to 193.51.180.221 (193.51.180.221) using UDP/IPv4
Using raw socket input
Packet size increments from 32 to 1500 by 32
46 test(s) per repetition
32 repetition(s) per hop
 0: 193.51.183.185 (netflow-nri-a.cssi.renater.fr)
    Partial loss:      0 / 1472 (0%)
    Partial char:      rtt = 0.124246 ms, (b = 0.000206 ms/B), r2 = 0.997632
                       stddev rtt = 0.001224, stddev b = 0.000002
    Partial queueing:  avg = 0.000158 ms (765 bytes)
    Hop char:          rtt = 0.124246 ms, bw = 38783.892367 Kbps
    Hop queueing:      avg = 0.000158 ms (765 bytes)
 1: 193.51.183.186 (nri-a-g13-1-50.cssi.renater.fr)
    Partial loss:      0 / 1472 (0%)
    Partial char:      rtt = 1.087330 ms, (b = 0.000423 ms/B), r2 = 0.991169
                       stddev rtt = 0.004864, stddev b = 0.000006
    Partial queueing:  avg = 0.005093 ms (23535 bytes)
    Hop char:          rtt = 0.963084 ms, bw = 36913.554996 Kbps
    Hop queueing:      avg = 0.004935 ms (22770 bytes)
 2: 193.51.179.122 (nri-n3-a2-0-110.cssi.renater.fr)
    Partial loss:      5 / 1472 (0%)
    Partial char:      rtt = 697.145142 ms, (b = 0.032136 ms/B), r2 = 0.999991
                       stddev rtt = 0.011554, stddev b = 0.000014
    Partial queueing:  avg = 0.009681 ms (23679 bytes)
    Hop char:          rtt = 696.057813 ms, bw = 252.261443 Kbps
    Hop queueing:      avg = 0.004589 ms (144 bytes)
 3: 193.51.180.221 (caledonie-S1-0.cssi.renater.fr)
    Path length:       3 hops
    Path char:         rtt = 697.145142 ms r2 = 0.999991
    Path bottleneck:   252.261443 Kbps
    Path pipe:         21982 bytes
    Path queueing:     average = 0.009681 ms (23679 bytes)
    Start time:        Mon Jun  6 11:38:54 2005
    End time:          Mon Jun  6 12:15:28 2005
If you do not have access to a Unix system, you can run pchar remotely via: http://noc.greatplains.net/measurement/pchar.php

The README text below was written by Bruce A Mah and is taken from http://www.kitchenlab.org/www/bmah/Software/pchar/README, where there is information on how to obtain and install pchar.

PCHAR:  A TOOL FOR MEASURING NETWORK PATH CHARACTERISTICS
Bruce A. Mah
<bmah@kitchenlab.org>
$Id: PcharTool.txt,v 1.3 2005/08/09 16:59:27 TobyRodwell Exp www-data $
---------------------------------------------------------

INTRODUCTION
------------

pchar is a reimplementation of the pathchar utility, written by Van
Jacobson.  Both programs attempt to characterize the bandwidth,
latency, and loss of links along an end-to-end path through the
Internet.  pchar works in both IPv4 and IPv6 networks.

As of pchar-1.5, this program is no longer under active development,
and no further releases are planned.

...

A FEW NOTES ON PCHAR'S OPERATION
--------------------------------

pchar sends probe packets into the network of varying sizes and
analyzes ICMP messages produced by intermediate routers, or by the
target host.  By measuring the response time for packets of different
sizes, pchar can estimate the bandwidth and fixed round-trip delay
along the path.  pchar varies the TTL of the outgoing packets to get
responses from different intermediate routers.  It can use UDP or ICMP
packets as probes; either or both might be useful in different
situations.

At each hop, pchar sends a number of packets (controlled by the -R flag)
of varying sizes (controlled by the -i and -m flags).  pchar determines
the minimum response times for each packet size, in an attempt to
isolate jitter caused by network queueing.  It performs a simple
linear regression fit to the resulting minimum response times.  This
fit yields the partial path bandwidth and round-trip time estimates.

To yield the per-hop estimates, pchar computes the differences in the
linear regression parameter estimates for two adjacent partial-path
datasets.  (Earlier versions of pchar differenced the minima for the
datasets, then computed a linear regressions.)  The -a flag selects
between one of (currently) two different algorithms for performing the
linear regression, either a least squares fit or a nonparametric
method based on Kendall's test statistic.

Using the -b option causes pchar to send small packet bursts,
consisting of a string of back-to-back ICMP ECHO_REPLY packets
followed by the actual probe.  This can be useful in probing switched
networks.

CAVEATS
-------

Router implementations may very well forward a packet faster than they
can return an ICMP error message in response to a packet.  Because of
this fact, it's possible to see faster response times from longer
partial paths; the result is a seemingly non-sensical, negative
estimate of per-hop round-trip time.

Transient fluctuations in the network may also cause some odd results.

If all else fails, writing statistics to a file will give all of the
raw data that pchar used for its analysis.

Some types of networks are intrinsically difficult for pchar to
measure.  Two notable examples are switched networks (with multiple
queues at Layer 2) or striped networks.  We are currently
investigating methods for trying to measure these networks.

pchar needs superuser access due to its use of raw sockets.

...

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005

-- HankNussbacher - 10 Jul 2005 (Great Plains server)

-- TobyRodwell - 09 Aug 2005 (added Bruce A. Mah's README text)

Iperf

Iperf is a tool to measure TCP throughput and available bandwidth, allowing the tuning of various parameters and UDP characteristics. Iperf reports bandwidth, delay variation, and datagram loss.

The popular Iperf 2 releases were developed by NLANR/DAST (http://dast.nlanr.net/Projects/Iperf/) and maintained at http://sourceforge.net/projects/iperf. As of April 2014, the last released version was 2.0.5, from July 2010.

A script that automates starting and then stopping iperf servers is available here. It can be invoked from a remote machine (say, a NOC workstation) to simplify starting, and more importantly stopping, an iperf server.

Iperf 3

ESnet and the Lawrence Berkeley National Laboratory have developed a from-scratch reimplementation of Iperf called Iperf 3. It has a Github repository. It is not compatible with iperf 2, and has additional interesting features such as a zero-copy TCP mode (-Z flag), JSON output (-J), and reporting of TCP retransmission counts and CPU utilization (-V). It also supports SCTP in addition to UDP and TCP. Since December 2013, various public releases have been made at http://stats.es.net/software/.
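
A minimal Iperf 3 run could look as follows (a sketch; the host name is a placeholder). The flags correspond to the features mentioned above: -Z selects the zero-copy TCP mode and -J requests JSON output:

server$ iperf3 -s
client$ iperf3 -c server.example.org -t 10 -Z -J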

Usage Examples

TCP Throughput Test

The following shows a TCP throughput test, which is iperf's default action. The following options are given:

  • -s - server mode. In iperf, the server will receive the test data stream.
  • -c server - client mode. The name (or IP address) of the server should be given. The client will transmit the test stream.
  • -i interval - display interval. Without this option, iperf will run the test silently, and only write a summary after the test has finished. With -i, the program will report intermediate results at given intervals (in seconds).
  • -w windowsize - select a non-default TCP window size. To achieve high rates over paths with a large bandwidth-delay product, it is often necessary to select a larger TCP window size than the (operating system) default.
  • -l buffer length - specify the length of send or receive buffer. In UDP, this sets the packet size. In TCP, this sets the send/receive buffer length (possibly using system defaults). Using this may be important especially if the operating system default send buffer is too small (e.g. in Windows XP).

NOTE: The -c and -s arguments must be given first; otherwise some configuration options are ignored.

The -i 1 option was given to obtain intermediate reports every second, in addition to the final report at the end of the ten-second test. The TCP buffer size was set to 2 Megabytes (4 Megabytes effective, see below) in order to permit close to line-rate transfers. The systems haven't been fully tuned; otherwise up to 7 Gb/s of TCP throughput should be possible. Normal background traffic on the 10 Gb/s backbone is on the order of 30-100 Mb/s. Note that in iperf, by default it is the client that transmits to the server.

Server Side:

welti@ezmp3:~$ iperf -s -w 2M -i 1
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 4.00 MByte (WARNING: requested 2.00 MByte)
------------------------------------------------------------
[  4] local 130.59.35.106 port 5001 connected with 130.59.35.82 port 41143
[  4]  0.0- 1.0 sec    405 MBytes  3.40 Gbits/sec
[  4]  1.0- 2.0 sec    424 MBytes  3.56 Gbits/sec
[  4]  2.0- 3.0 sec    425 MBytes  3.56 Gbits/sec
[  4]  3.0- 4.0 sec    422 MBytes  3.54 Gbits/sec
[  4]  4.0- 5.0 sec    424 MBytes  3.56 Gbits/sec
[  4]  5.0- 6.0 sec    422 MBytes  3.54 Gbits/sec
[  4]  6.0- 7.0 sec    424 MBytes  3.56 Gbits/sec
[  4]  7.0- 8.0 sec    423 MBytes  3.55 Gbits/sec
[  4]  8.0- 9.0 sec    424 MBytes  3.56 Gbits/sec
[  4]  9.0-10.0 sec    413 MBytes  3.47 Gbits/sec
[  4]  0.0-10.0 sec  4.11 GBytes  3.53 Gbits/sec

Client Side:

welti@mamp1:~$ iperf -c ezmp3 -w 2M -i 1
------------------------------------------------------------
Client connecting to ezmp3, TCP port 5001
TCP window size: 4.00 MByte (WARNING: requested 2.00 MByte)
------------------------------------------------------------
[  3] local 130.59.35.82 port 41143 connected with 130.59.35.106 port 5001
[  3]  0.0- 1.0 sec    405 MBytes  3.40 Gbits/sec
[  3]  1.0- 2.0 sec    424 MBytes  3.56 Gbits/sec
[  3]  2.0- 3.0 sec    425 MBytes  3.56 Gbits/sec
[  3]  3.0- 4.0 sec    422 MBytes  3.54 Gbits/sec
[  3]  4.0- 5.0 sec    424 MBytes  3.56 Gbits/sec
[  3]  5.0- 6.0 sec    422 MBytes  3.54 Gbits/sec
[  3]  6.0- 7.0 sec    424 MBytes  3.56 Gbits/sec
[  3]  7.0- 8.0 sec    423 MBytes  3.55 Gbits/sec
[  3]  8.0- 9.0 sec    424 MBytes  3.56 Gbits/sec
[  3]  0.0-10.0 sec  4.11 GBytes  3.53 Gbits/sec

UDP Test

In the following example, we send a 300 Mb/s UDP test stream. No packets were lost along the path, although one arrived out-of-order. Another interesting result is jitter, which is displayed as 27 or 28 microseconds (apparently there is some rounding error or other imprecision that prevents the client and server from agreeing on the value). According to the documentation, "Jitter is the smoothed mean of differences between consecutive transit times."

Server Side

: leinen@mamp1[leinen]; iperf -s -u
------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size: 64.0 KByte (default)
------------------------------------------------------------
[  3] local 130.59.35.82 port 5001 connected with 130.59.35.106 port 38750
[  3]  0.0-10.0 sec    359 MBytes    302 Mbits/sec  0.028 ms    0/256410 (0%)
[  3]  0.0-10.0 sec  1 datagrams received out-of-order

Client Side

: leinen@ezmp3[leinen]; iperf -c mamp1-eth0 -u -b 300M
------------------------------------------------------------
Client connecting to mamp1-eth0, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size: 64.0 KByte (default)
------------------------------------------------------------
[  3] local 130.59.35.106 port 38750 connected with 130.59.35.82 port 5001
[  3]  0.0-10.0 sec    359 MBytes    302 Mbits/sec
[  3] Sent 256411 datagrams
[  3] Server Report:
[  3]  0.0-10.0 sec    359 MBytes    302 Mbits/sec  0.027 ms    0/256410 (0%)
[  3]  0.0-10.0 sec  1 datagrams received out-of-order

As you would expect, during a UDP test traffic is only sent from the client to the server; see here for an example with tcpdump.

Problem isolation procedures using iperf

TCP Throughput measurements

Typically, end users report throughput problems as they see them with their applications, such as unexpectedly slow file transfer times. Some users may already report TCP throughput results as measured with iperf. In any case, network administrators should validate the throughput problem, and it is recommended that this be done with memory-to-memory iperf measurements in TCP mode between the end systems. The window size of the TCP measurement should follow the bandwidth*delay product rule and should therefore be set to at least the measured round-trip time multiplied by the path's bottleneck speed. If the actual bottleneck is not known (because of a lack of knowledge of the end-to-end path), it should be assumed that the bottleneck is the slower of the two end systems' network interface cards.

For instance if one system is connected with Gigabit Ethernet, but the other one with Fast Ethernet and the measured round trip time is 150ms, then the window size should be set to 100 Mbit/s * 0.150s / 8 = 1875000 bytes, so setting the TCP window to a value of 2 MBytes would be a good choice.
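
Following this rule, such a test could be run as follows (a sketch; the host name and the 30-second duration are placeholders, and the -w, -i and -t options are as described in the usage examples above):

remote$ iperf -s -w 2M
local$  iperf -c remote.example.org -w 2M -t 30 -i 5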

In theory, TCP throughput can reach, but not exceed, the available bandwidth on an end-to-end path. Knowledge of that network metric is therefore important for distinguishing between issues with the end systems' TCP stacks and network-related problems.

Available bandwidth measurements

Iperf can be used in UDP mode for measuring the available bandwidth. Only short measurements, on the order of 10 seconds, should be done so as not to disturb other production flows. The goal of UDP measurements is to find the maximum UDP sending rate that results in almost no packet loss on the end-to-end path; in good practice the packet loss threshold is 1%. UDP data transfers that result in higher packet loss are likely to disturb TCP production flows and should therefore be avoided. A practical procedure for finding the available bandwidth is to start with UDP data transfers of 10s duration and interim result reports at one-second intervals, at a data rate slightly below the reported TCP throughput. If the measured packet loss values are below the threshold, a new measurement with a slightly increased data rate can be started. This procedure of short UDP data transfers with increasing data rates is repeated until the packet loss threshold is exceeded. Depending on the required accuracy, further tests can be run starting at the highest data rate that kept packet loss below the threshold, with smaller rate increments. In the end, the maximum data rate that caused packet loss below the threshold can be taken as a good estimate of the available bandwidth on the end-to-end path.
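
As an illustration of this stepping procedure (a sketch; the host name and rates are placeholders, assuming a previously measured TCP throughput of roughly 90 Mb/s), the client rate is increased run by run while watching the reported loss, with the server started as iperf -s -u -i 1:

iperf -c remote.example.org -u -b 85M -t 10 -i 1
iperf -c remote.example.org -u -b 90M -t 10 -i 1
iperf -c remote.example.org -u -b 95M -t 10 -i 1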

By comparing the reported application throughput with the measured TCP throughput and the measured available bandwidth, it is possible to distinguish between application problems, TCP stack problems, and network issues. Note, however, that the differing nature of UDP and TCP flows means that their measurements should not be compared directly. Iperf sends UDP datagrams at a constant, steady rate, whereas TCP tends to send packet trains. This means that TCP is likely to suffer from congestion effects at a lower data rate than UDP.

In the case of unexpectedly low available-bandwidth measurements on the end-to-end path, network administrators are interested in locating the bandwidth bottleneck. The best way to get this value is to retrieve it from passively measured link utilisations and the provisioned capacities of all links along the path. However, if the path crosses multiple administrative domains this is often not possible, because of restrictions on getting those values from other domains. Therefore, it is common practice to use measurement workstations along the end-to-end path, and thus split the end-to-end path into segments on which available bandwidth measurements are done. This way it is possible to identify the segment on which the bottleneck occurs and to concentrate on it during further troubleshooting.

Other iperf use cases

Besides the capability of measuring TCP throughput and available bandwidth, in UDP mode iperf can report on packet reordering and delay jitter.

Other use cases for measurements using iperf are IPv6 bandwidth measurements and IP multicast performance measurements. More information on iperf's features, as well as source and binary code for different UNIXes and Microsoft Windows operating systems, can be retrieved from the Iperf Web site.

Caveats and Known Issues

Impact on other traffic

As Iperf sends real full data streams it can reduce the available bandwidth on a given path. In TCP mode, the effect to the co-existing production flows should be negligible, assuming the number of production flows is much greater than the number of test data flows, which is normally a valid assumption on paths through a WAN. However, in UDP mode iperf has the potential to disturb production traffic, and in particular TCP streams, if the sender's data rate exceeds the available bandwidth on a path. Therefore, one should take particular care whenever running iperf tests in UDP mode.

TCP buffer allocation

On Linux systems, if you request a specific TCP buffer size with the "-w" option, the kernel will always try to allocate twice as many bytes as you specified.

Example: when you request 2MB window size you'll receive 4MB:

welti@mamp1:~$ iperf -c ezmp3 -w 2M -i 1
------------------------------------------------------------
Client connecting to ezmp3, TCP port 5001
TCP window size: 4.00 MByte (WARNING: requested 2.00 MByte)    <<<<<<
------------------------------------------------------------

Counter overflow

Some versions seem to suffer from a 32-bit integer overflow which will lead to wrong results.

e.g.:

[ 14]  0.0-10.0 sec  953315416 Bytes  762652333 bits/sec
[ 14] 10.0-20.0 sec  1173758936 Bytes  939007149 bits/sec
[ 14] 20.0-30.0 sec  1173783552 Bytes  939026842 bits/sec
[ 14] 30.0-40.0 sec  1173769072 Bytes  939015258 bits/sec
[ 14] 40.0-50.0 sec  1173783552 Bytes  939026842 bits/sec
[ 14] 50.0-60.0 sec  1173751696 Bytes  939001357 bits/sec
[ 14]  0.0-60.0 sec  2531115008 Bytes  337294201 bits/sec
[ 14] MSS size 1448 bytes (MTU 1500 bytes, ethernet)

As you can see, the 0-60 second summary doesn't match the average one would expect. This is because the total number of bytes is incorrect as a result of a counter wrap.

If you're experiencing this kind of effects, upgrade to the latest version of iperf, which should have this bug fixed.

UDP buffer sizing

UDP buffer sizing (the -w parameter) is also required for high-speed UDP transmissions. Otherwise the UDP receive buffer will typically overflow, and this will look like packet loss (even though tcpdump traces or interface counters reveal that all the data arrived). Such overflows show up in the UDP statistics (for example, on Linux with 'netstat -s' under Udp: ... receive packet errors). See more information at: http://www.29west.com/docs/THPM/udp-buffer-sizing.html
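
For example (a sketch; the 4 MByte buffer is an assumption that must fit within the operating system's socket buffer limits), the receiving side could be started with an enlarged buffer, and the UDP statistics checked afterwards on Linux:

iperf -s -u -w 4M
netstat -su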

Control of measurements

There are two typical deployment scenarios, which differ in the kind of access the operator has to the sender and receiver instances. A measurement between well-placed measurement workstations within an administrative domain, e.g. a campus network, allows network administrators full control over the server and client configurations (including test schedules), and allows them to retrieve full measurement results. Measurements on paths that extend beyond the administrative domain's borders require access to, or collaboration with administrators of, the far-end systems. Iperf implements two features that simplify its use in this scenario, so that the operator does not need to have an interactive login account on the far-end system:

  • The server instance may run as a daemon (option -D) listening on a configurable transport protocol port, and
  • It is possible to run bidirectional tests, either one after the other (option -r) or simultaneously (option -d); see the sketch below.
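
A sketch of such a setup (the host name is a placeholder): the far-end administrator starts a long-running server once, and the near-end operator drives both directions of the test from the local side:

far-end$  iperf -s -D
near-end$ iperf -c far-end.example.org -r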

Screen

Another method of running iperf on a *NIX device is to use 'screen'. Screen is a utility that lets you keep a session running even once you have logged out. It is described more fully here in its man pages, but a simple sequence applicable to iperf would be as follows:

[user@host]$screen -d -m iperf -s -p 5002 
This starts iperf -s -p 5002 as a 'detached' session

[user@host]$screen -ls
There is a screen on:
        25278..mp1      (Detached)
1 Socket in /tmp/uscreens/S-admin.
'screen -ls' shows open sessions.

'screen -r' reconnects to a running session. When in that session, keying 'CTRL+a', then 'd' detaches the screen again. You can, if you wish, log out, log back in again, and re-attach. To end the iperf session (and its screen) just hit 'CTRL+c' whilst attached.

Note that BWCTL offers additional control and resource limitation features that make it more suitable for use over administrative domains.

Related Work

Public iperf servers

There used to be some public iperf servers available, but at present none are known. Similar services are provided by BWCTL (see below) and by public NDT servers.

BWCTL

BWCTL (BandWidth test ConTroLler) is a wrapper around iperf that provides scheduling and remote control of measurements.

Instrumented iperf (iperf100)

The iperf code provided by NLANR/DAST was instrumented in order to provide more information to the user. Iperf100 displays various web100 variables at the end of a transfer.

Patches are available at http://www.csm.ornl.gov/~dunigan/netperf/download/

The instrumented iperf requires a machine running a kernel.org Linux 2.X.XX kernel with the latest Web100 patches applied (http://www.web100.org).

Jperf

Jperf is a Java-based graphical front-end to iperf. It is now included as part of the iperf project on SourceForge. This was once available as a separate project called xjperf, but that seems to have been given up in favor of iperf/SourceForge integration.

iPerf for Android

An Android version of iperf appeared on Google Play (formerly the Android Market) in 2010.

nuttcp

Similar to iperf, but with an additional control connection that makes it somewhat more versatile.

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- HankNussbacher - 10 Jul 2005 (Great Plains server)
-- AnnHarding & OrlaMcGann - Aug 2005 (DS3.3.2 content)
-- SimonLeinen - 08 Feb 2006 (OpenSS7 variant, BWCTL pointer)
-- BartoszBelter - 28 Mar 2006 (iperf100)
-- ChrisWelti - 11 Apr 2006 (examples, 32-bit overflows, buffer allocation)
-- SimonLeinen - 01 Jun 2006 (integrated DS3.3.2 text from Ann and Orla)
-- SimonLeinen - 17 Sep 2006 (tracked iperf100 pointer)
-- PekkaSavola - 26 Mar 2008 (added warning about -c/s having to be first, a common gotcha)
-- PekkaSavola - 05 Jun 2008 (added discussion of '-l' parameter and its significance)
-- PekkaSavola - 30 Apr 2009 (added discussion of UDP (receive) buffer sizing significance)
-- SimonLeinen - 23 Feb 2012 (removed installation instructions, obsolete public iperf servers)
-- SimonLeinen - 22 May 2012 (added notes about Iperf 3, Jperf and iPerf for Android)
-- SimonLeinen - 01 Feb 2013 (added pointer to Android app; cross-reference to Nuttcp)
-- SimonLeinen - 06 April 2014 (updated Iperf 3 section: now on Github and with releases)
-- SimonLeinen - 05 May 2014 (updated Iperf 3 section: documented more new features)

BWCTL

BWCTL is a command line client application and a scheduling and policy daemon that wraps throughput measurement tools such as iperf, thrulay, and nuttcp (versions prior to 1.3 only support iperf).

More Information: For configuration, common problems, examples etc
Description of bwctld.limits parameters

A typical BWCTL result looks like one of the two print outs below. The first print out shows a test run from the local host to 193.136.3.155 ('-c' stands for 'collector'). The second print out shows a test run to the localhost from 193.136.3.155 ('-s' stands for 'sender').

[user@ws4 user]# bwctl -c 193.136.3.155
bwctl: 17 seconds until test results available
RECEIVER START
3339497760.433479: iperf -B 193.136.3.155 -P 1 -s -f b -m -p 5001 -t 10
------------------------------------------------------------
Server listening on TCP port 5001
Binding to local address 193.136.3.155
TCP window size: 65536 Byte (default)
------------------------------------------------------------
[ 14] local 193.136.3.155 port 5001 connected with 62.40.108.82 port 35360
[ 14]  0.0-10.0 sec  16965632 Bytes  13547803 bits/sec
[ 14] MSS size 1448 bytes (MTU 1500 bytes, ethernet)
RECEIVER END


[user@ws4 user]# bwctl -s 193.136.3.155
bwctl: 17 seconds until test results available
RECEIVER START
3339497801.610690: iperf -B 62.40.108.82 -P 1 -s -f b -m -p 5004 -t 10
------------------------------------------------------------
Server listening on TCP port 5004
Binding to local address 62.40.108.82
TCP window size: 87380 Byte (default)
------------------------------------------------------------
[ 16] local 62.40.108.82 port 5004 connected with 193.136.3.155 port 5004
[ ID] Interval       Transfer     Bandwidth
[ 16]  0.0-10.0 sec  8298496 Bytes  6622281 bits/sec
[ 16] MSS size 1448 bytes (MTU 1500 bytes, ethernet)
[ 16] Read lengths occurring in more than 5% of reads:
[ 16]  1448 bytes read  5708 times (99.8%)
RECEIVER END

The read lengths are shown when the test is for received traffic. In his e-mail to TobyRodwell on 28 March 2005 Stanislav Shalunov of Internet2 writes:

"The values are produced by Iperf and reproduced by BWCTL.

An Iperf server gives you several most frequent values that the read() system call returned during a TCP test, along with the percentages of all read() calls for which the given read() lengths accounted.

This can let you judge the efficiency of interrupt coalescence implementations, kernel scheduling strategies and other such esoteric applesauce. Basically, reads shorter than 8kB can put quite a bit of load on the CPU (each read() call involves several microseconds of context switching). If the target rate is beyond a few hundred megabits per second, one would need to pay attention to read() lengths."

References (Pointers)

-- TobyRodwell - 28 Oct 2005
-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- SimonLeinen - 22 Apr 2008 - 07 Dec 2008

Netperf

A client/server network performance benchmark (debian package: netperf).

Example

coming soon

Flent (previously netperf-wrapper)

For his measurement work on fq_codel, Toke Høiland-Jørgensen developed Python wrappers for Netperf and added some tests, notably the RRUL (Realtime Response Under Load) test, which runs bulk TCP transfers in parallel with UDP and ICMP response-time measurements. This can be used to expose Bufferbloat on a path.

These tools used to be known as "netperf-wrappers", but were renamed to Flent (the Flexible Network Tester) in early 2015.

References

-- FrancoisXavierAndreu & SimonMuyal - 2005-06-06
-- SimonLeinen - 2013-03-30 - 2015-06-01

RUDE/CRUDE

RUDE is a package of applications to generate and measure UDP traffic between two endpoints (hosts). The rude tool generates traffic to the network, which can be received and logged on the other side of the network with crude. The traffic pattern can be defined by the user. This tool is available under http://rude.sourceforge.net/

Example

The following rude script will cause a constant-rate packet stream to be sent to a destination for 4 seconds, starting one second (1000 milliseconds) from the time the script was read, stopping five seconds (5000 milliseconds) after the script was read. The script can be called using rude -s example.rude.

START NOW
1000 0030 ON 3002 10.1.1.1:10001 CONSTANT 200 250
5000 0030 OFF

Explanation of the values: 1000 and 5000 are relative timestamps (in milliseconds) for the actions. 0030 is the identification of a traffic flow. At 1000, a CONSTANT-rate flow is started (ON) with source port 3002, destination address 10.1.1.1 and destination port 10001. CONSTANT 200 250 specifies a fixed rate of 200 packets per second (pps) of 250-byte UDP datagrams.

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- SimonLeinen - 28 Sep 2008

TTCP

TTCP (Test TCP) is a utility for benchmarking UDP and TCP performance. A utility for Unix and Windows is available at http://www.pcausa.com/Utilities/pcattcp.htm

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005

NDT (Network Diagnostic Tool)

NDT can be used to check the TCP configuration of any host that can run Java applets. The client connects to a Web page containing a special applet. The Web server that serves the applet must run a kernel with the Web100 extensions for TCP measurement instrumentation. The applet performs TCP memory-to-memory tests between the client and the Web100-instrumented server, and then uses the measurements from the Web100 instrumentation to find out about the TCP configuration of the client. NDT also detects common configuration errors such as duplex mismatch.

Besides the applet, there is also a command-line client called web100clt that can be used without a browser. The client works on Linux without any need for Web100 extensions, and can be compiled on other Unix systems as well. It doesn't require a Web server on the server side, just the NDT server web100srv - which requires a Web100-enhanced kernel.

Applicability

Since NDT performs memory-to-memory tests, it avoids end-system bottlenecks such as file system or disk limitations. Therefore it is useful for estimating "pure" TCP throughput limitations for a system, better than, for instance, measuring the throughput of a large file retrieval from a public FTP or WWW server. In addition, NDT servers are supposed to be well-connected and lightly loaded.

When trying to tune a host for maximum throughput, it is a good idea to start testing it against an NDT server that is known to be relatively close to the host in question. Once the throughput with this server is satisfactory, one can try to further tune the host's TCP against a more distant NDT server. Several European NRENs as well as many sites in the U.S. operate NDT servers.

Example

Applet (tcp100bw.html)

This example shows the results (overview) of an NDT measurement between a client in Switzerland (Gigabit Ethernet connection) and an IPv6 NDT server in Ljubljana, Slovenia (Gigabit Ethernet connection).

Screen_shot_2011-07-25_at_18.11.44_.png

Command-line client

The following is an example run of the web100clt command-line client application. In this case, the client runs on a small Linux-based home router connected to a commercial broadband-over-cable Internet connection. The connection is marketed as "5000 kb/s downstream, 500 kb/s upstream", and it can be seen that this nominal bandwidth can in fact be achieved for an individual TCP connection.

$ ./web100clt -n ndt.switch.ch
Testing network path for configuration and performance problems  --  Using IPv6 address
Checking for Middleboxes . . . . . . . . . . . . . . . . . .  Done
checking for firewalls . . . . . . . . . . . . . . . . . . .  Done
running 10s outbound test (client to server) . . . . .  509.00 kb/s
running 10s inbound test (server to client) . . . . . . 5.07 Mb/s
Your host is connected to a Cable/DSL modem
Information [C2S]: Excessive packet queuing detected: 10.18% (local buffers)
Information [S2C]: Excessive packet queuing detected: 73.20% (local buffers)
Server 'ndt.switch.ch' is not behind a firewall. [Connection to the ephemeral port was successful]
Client is probably behind a firewall. [Connection to the ephemeral port failed]
Information: Network Middlebox is modifying MSS variable (changed to 1440)
Server IP addresses are preserved End-to-End
Client IP addresses are preserved End-to-End

Note the remark "-- Using IPv6 address". The example used the new version 3.3.12 of the NDT software, which includes support for both IPv4 and IPv6. Because both the client and the server support IPv6, the test is run over IPv6. To run an IPv4 test in this situation, one could have called the client with the -4 option, i.e. web100clt -4 -n ndt.switch.ch.

Public Servers

There are many public NDT server installations available on the Web. You should choose between the servers according to distance (in terms of RTT) and the throughput range you are interested in. The limits of your local connectivity can best be tested by connecting to a server with a fast(er) connection that is close by. If you want to tune your TCP parameters for "LFNs", use a server that is far away from you, but still reachable over a path without bottlenecks.

Development

Development of NDT is now hosted on Google Code, and as of July 2011 there is some activity. Notably, there is now documentation about the architecture as well as the detailed protocol of NDT. One of the stated goals is to allow outside parties to write compatible NDT clients without consulting the NDT source code.

References

-- SimonLeinen - 30 Jun 2005 - 22 Nov 2011

Active Measurement Tools

Active measurement injects traffic into a network to measure properties of that network.

  • ping - a simple RTT and loss measurement tool using ICMP ECHO messages
  • fping - ping variant supporting concurrent measurement to multiple destinations
  • SmokePing - nice graphical representation of RTT distribution and loss rate from periodic "pings"
  • OWAMP - a protocol and tool for one-way measurements

There are several permanent infrastructures performing active measurements:

  • HADES - the DFN-developed HADES Active Delay Evaluation System, deployed in GEANT/GEANT2 and DFN
  • RIPE TTM
  • QoSMetricsBoxes used in RENATER's Métrologie project

-- SimonLeinen - 29 Mar 2006

ping

Ping sends ICMP echo request messages to a remote host and waits for ICMP echo replies to determine the latency between those hosts. The output shows the Round Trip Time (RTT) between the host machine and the remote target. Ping is often used to determine whether a remote host is reachable. Unfortunately, it is quite common these days for ICMP traffic to be blocked by packet filters / firewalls, so a ping timing out does not necessarily mean that a host is unreachable.

(Another method for checking remote host availability is to telnet to a port that you know to be accessible, such as port 80 (HTTP) or 25 (SMTP). If the connection still times out, then the host is probably not reachable - often because a filter/firewall is silently blocking the traffic. When the host is reachable but no service is listening on that port, this usually results in a "Connection refused" message. If a connection is made to the remote host, it can be ended by typing the escape character (usually 'CTRL+]') and then quit.)
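
For example (the host name is a placeholder), a quick reachability check against a remote web server could look like this:

telnet www.example.org 80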

The -f flag may be specified to send ping packets as fast as they come back, or 100 times per second - whichever is more frequent. As this option can be very hard on the network, only a super-user (the root account on *NIX machines) is allowed to specify this flag in a ping command.

The -c flag specifies the number of Echo Request messages sent to the remote host by ping. If this flag isn't used, ping continues to send Echo Request messages until the user types CTRL+C. If the ping is cancelled after only a few messages have been sent, the RTT summary statistics displayed at the end of the ping output are based on too few samples to be fully accurate. To gain an accurate representation of the RTT, it is recommended to set a count of 100 pings. The MS Windows implementation of ping sends just 4 Echo Request messages by default.
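
For example (the host name is a placeholder), a 100-probe measurement on a Unix-like system:

ping -c 100 www.example.org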

Ping6 is the IPv6 implementation of the ping program. It works in the same way, but sends ICMPv6 Echo Request packets and waits for ICMPv6 Echo Reply packets to determine the RTT between two hosts. There are no discernible differences between the Unix implementations and the MS Windows implementation.

As with traceroute, Solaris integrates IPv4 and IPv6 functionality in a single ping binary, and provides options to select between them (-A <AF>).

See fping for a ping variant that supports concurrent measurements to multiple destinations and is easier to use in scripts.

Implementation-specific notes

-- TobyRodwell - 06 Apr 2005
-- MarcoMarletta - 15 Nov 2006 (Added the difference in size between Juniper and Cisco)

fping

fping is a ping-like program which uses Internet Control Message Protocol (ICMP) Echo Requests to determine if a host is up. fping differs from ping in that you can specify any number of hosts on the command line, or specify a file containing the list of hosts to ping. Instead of trying one host until it times out or replies, fping sends out a ping packet and moves on to the next host in a round-robin fashion. If a host replies, it is noted and removed from the list of hosts to check. If a host does not respond within a certain time limit and/or retry limit, it is considered unreachable.

Unlike ping, fping is meant to be used in scripts and its output is easy to parse. It is often used as a probe for packet loss and round-trip time in SmokePing.

fping version 3

The original fping was written by Roland Schemers in 1992, and stopped being updated in 2002. In December 2011, David Schweikert decided to take up maintainership of fping, and increased the major release number to 3, mainly to reflect the change of maintainer. Changes from earlier versions include:

  • integration of all Debian patches, including IPv6 support (2.4b2-to-ipv6-16)
  • an optimized main loop that is claimed to bring performance close to the theoretical maximum
  • patches to improve SmokePing compatibility

The Web site location has changed to fping.org, and the source is now maintained on GitHub.

Since July 2013, there is also a Mailing List, which will be used to announce new releases.

Examples

Simple example of usage:

# fping -c 3 -s www.man.poznan.pl www.google.pl
www.google.pl     : [0], 96 bytes, 8.81 ms (8.81 avg, 0% loss)
www.man.poznan.pl : [0], 96 bytes, 37.7 ms (37.7 avg, 0% loss)
www.google.pl     : [1], 96 bytes, 8.80 ms (8.80 avg, 0% loss)
www.man.poznan.pl : [1], 96 bytes, 37.5 ms (37.6 avg, 0% loss)
www.google.pl     : [2], 96 bytes, 8.76 ms (8.79 avg, 0% loss)
www.man.poznan.pl : [2], 96 bytes, 37.5 ms (37.6 avg, 0% loss)

www.man.poznan.pl : xmt/rcv/%loss = 3/3/0%, min/avg/max = 37.5/37.6/37.7
www.google.pl     : xmt/rcv/%loss = 3/3/0%, min/avg/max = 8.76/8.79/8.81

       2 targets
       2 alive
       0 unreachable
       0 unknown addresses

       0 timeouts (waiting for response)
       6 ICMP Echos sent
       6 ICMP Echo Replies received
       0 other ICMP received

 8.76 ms (min round trip time)
 23.1 ms (avg round trip time)
 37.7 ms (max round trip time)
        2.039 sec (elapsed real time)

IPv6 Support

Jeroen Massar has added IPv6 support to fping. This has been implemented as a compile-time variant, so that there are separate fping (for IPv4) and fping6 (for IPv6) binaries. The IPv6 patch has been partially integrated into the fping version on www.fping.com as of release "2.4b2_to-ipv6" (thus also integrated in fping 3.0). Unfortunately his modifications to the build routine seem to have been lost in the integration, so that the fping.com version only installs the IPv6 version as fping. Jeroen's original version doesn't have this problem, and can be downloaded from his IPv6 Web page.

ICMP Sequence Number handling

Older versions of fping used the Sequence Number field in ICMP ECHO requests in a peculiar way: they used a different sequence number for each destination host, but the same sequence number for all requests to a specific host. There have been reports of specific systems that suppress (or rate-limit) ICMP ECHO requests with repeated sequence numbers, which causes high loss rates reported from tools that use fping, such as SmokePing. Another issue is that fping could not distinguish a perfect link from one that drops every other packet and duplicates every other packet.

Newer fping versions such as 3.0 or 2.4.2b2_to (on Debian GNU/Linux) include a change to sequence number handling attributed to Stephan Fuhrmann. These versions increment sequence numbers for every probe sent, which should solve both of these problems.

References

-- BartoszBelter - 2005-07-14 - 2005-07-26
-- SimonLeinen - 2008-05-19 - 2013-07-26

SmokePing

SmokePing is a software-based measurement framework that uses various software modules (called probes) to measure round-trip times, packet loss and availability at layer 3 (IPv4 and IPv6), and even application latencies. Layer 3 measurements are based on the ping tool, and for application analysis there are probes for measuring, for example, DNS lookup and RADIUS authentication latencies. The measurements are centrally controlled on a single host from which the software probes are started; these in turn emit active measurement flows. The load these streams impose on the network is usually negligible.

As with MRTG, the results are stored in RRD databases at the original polling intervals for 24 hours and are then aggregated over time; peak and mean values over 24h intervals are stored for more than one year. Like MRTG, the results are usually displayed in a web browser, using HTML-embedded graphs in daily, weekly, monthly and yearly timeframes. A particular strength of SmokePing is the graphical manner in which it displays the statistical distribution of latency values over time.

The tool also has an alarm feature that, based on flexible threshold rules, either sends out emails or runs external scripts.
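
A minimal sketch of the probe and target definitions in a SmokePing configuration file, assuming fping is installed as /usr/bin/fping; the menu entries, titles and host name are placeholders:

*** Probes ***

+ FPing
binary = /usr/bin/fping

*** Targets ***

probe = FPing
menu = Top
title = Network Latency Graphs

+ ExampleHost
menu = Example host
title = Latency to www.example.org
host = www.example.org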

The following picture shows example output of a weekly graph. The background colour indicates the link availability, and the foreground lines display the mean round-trip times. The shadows around the lines graphically indicate the statistical distribution of the measured round-trip times.

  • Example SmokePing graph:
    Example SmokePing graph

The 2.4* versions of Smokeping included SmokeTrace, an AJAX-based traceroute tool similar to mtr, but browser/server based. This was removed in 2.5.0 in favor of a separate, more general, tool called remOcular.

In August 2013, Tobi Oetiker announced that he received funding for the development of SmokePing 3.0. This will use Extopus as a front-end. This new version will be developed in public on Github. Another plan is to move to an event-based design that will make Smokeping more efficient and allow it to scale to a large number of probes.

Related Tools

The OpenNMS network management platform includes a tool called StrafePing which is heavily inspired by SmokePing.

References

-- SimonLeinen - 2006-04-07 - 2013-11-03

OWAMP (One-Way Active Measurement Tool)

OWAMP is a command line client application and a policy daemon used to determine one way latencies between hosts. It is an implementation of the OWAMP protocol (see references) that is currently going through the standardization process within the IPPM WG in the IETF.

With roundtrip-based measurements, it is hard to isolate the direction in which congestion is experienced. One-way measurements solve this problem and make the direction of congestion immediately apparent. Since traffic can be asymmetric at many sites that are primarily producers or consumers of data, this allows for more informative measurements. One-way measurements allow the user to better isolate the effects of specific parts of a network on the treatment of traffic.

The current OWAMP implementation (V3.1) supports IPv4 and IPv6.

Example using owping (OWAMP v3.1) on a Linux host:

welti@atitlan:~$ owping ezmp3
Approximately 13.0 seconds until results available

--- owping statistics from [2001:620:0:114:21b:78ff:fe30:2974]:52887 to [ezmp3]:41530 ---
SID:    feae18e9cd32e3ee5806d0d490df41bd
first:  2009-02-03T16:40:31.678
last:   2009-02-03T16:40:40.894
100 sent, 0 lost (0.000%), 0 duplicates
one-way delay min/median/max = 2.4/2.5/3.39 ms, (err=0.43 ms)
one-way jitter = 0 ms (P95-P50)
TTL not reported
no reordering


--- owping statistics from [ezmp3]:41531 to [2001:620:0:114:21b:78ff:fe30:2974]:42315 ---
SID:    fe302974cd32e3ee7ea1a5004fddc6ff
first:  2009-02-03T16:40:31.479
last:   2009-02-03T16:40:40.685
100 sent, 0 lost (0.000%), 0 duplicates
one-way delay min/median/max = 1.99/2.1/2.06 ms, (err=0.43 ms)
one-way jitter = 0 ms (P95-P50)
TTL not reported
no reordering

References

-- ChrisWelti - 03 Apr 2006 - 03 Feb 2009
-- SimonLeinen - 29 Mar 2006 - 01 Dec 2008

Active Measurement Boxes

-- FrancoisXavierAndreu & SimonLeinen - 06 Jun 2005 - 06 Jul 2005

RIPE TTM Boxes

Summary

The Test Traffic Measurements service has been offered by the RIPE NCC as a service since the year 2000. The system continuously records one-way delay and packet-loss measurements as well as router-level paths ("traceroutes") between a large set of probe machines ("test boxes"). The test boxes are hosted by many different organisations, including NRENs, commercial ISPs, universities and others, and usually maintained by RIPE NCC as part of the service. While the vast majority of the test boxes is in Europe, there are a couple of machines in other continents, including outside the RIPE NCC's service area. Every test box includes a precise time source - typically a GPS receiver - so that accurate one-way delay measurements can be provided. Measurement data are entered in a central database at RIPE NCC's premises every night. The database is based on CERN's ROOT system. Measurement results can be retrieved over the Web using various presentations, both pre-generated and "on-demand".

Applicability

RIPE TTM data is often useful to find out "historical" quality information about a network path of interest, provided that TTM test boxes are deployed near (in the topological sense, not just geographically) the ends of the path. For research network paths throughout Europe, the coverage of the RIPE TTM infrastructure is not complete, but quite comprehensive. When one suspects that there is (non-temporary) congestion on a path covered by the RIPE TTM measurement infrastructure, the TTM graphs can be used to easily verify that, because such congestion will show up as delay and/or loss if present.

The RIPE TTM system can also be used by operators to precisely measure - but only after the fact - the impact of changes in the network. These changes can include disturbances such as network failures or misconfigurations, but also things such as link upgrades or routing improvements.

Notes

The TTM project has provided extremely high-quality and stable delay and loss measurements for a large set of network paths (IPv4 and recently also IPv6) throughout Europe. These paths cover an interesting mix of research and commercial networks. The Web interface to the collected measurements supports in-depth exploration quite well, such as looking at the delay/loss evolution of a specific path over a wide range of intervals, from the very short to the very long. On the other hand, it is hard to get at useful "overview" pictures. The RIPE NCC's DNS Server Monitoring (DNSMON) service provides such overviews for specific sets of locations.

The raw data collected by the RIPE TTM infrastructure is not generally available, in part because of restrictions on data disclosure. This is understandable in that RIPE NCC's main member/customer base consists of commercial ISPs, who presumably wouldn't allow full disclosure for competitive reasons. But it also means that in practice, only the RIPE NCC can develop new tools that make use of these data, and their resources for doing so are limited, in part due to lack of support from their ISP membership. On a positive note, the RIPE NCC has, on several occasions, given scientists access to the measurement infrastructure for research. The TTM team also offers to extract raw data (in ROOT format) from the TTM database manually on demand.

Another drawback of RIPE TTM is that the central TTM database is only updated once a day. This makes the system impossible to use for near-real-time diagnostic purposes. (For test box hosts, it is possible to access the data collected by one's hosted test boxes in near real time, but this provides only very limited functionality, because only information about "inbound" packets is available locally, and therefore one can neither correlate delays with path changes, nor can one compute loss figures without access to the sending host's data.) In contrast, RIPE NCC's Routing Information Service (RIS) infrastructure provides several ways to get at the collected data almost as it is collected. And because RIS' raw data is openly available in a well-defined and relatively easy-to-use format, it is possible for third parties to develop innovative and useful ways to look at that data.

Examples

The following screenshot shows an on-demand graph of the delays over a 24-hour period from one test box hosted by SWITCH (tt85 at ETH Zurich) to the other one (tt86 at the University of Geneva). The delay range has been narrowed so that the fine-grained distribution can be seen. Five packets are listed as "overflow" under "STATISTICS - Delay & Hops" because their one-way delays exceeded the range of the graph. This is an example of an almost completely uncongested path, showing no packet loss and the vast majority of delays very close to the "baseline" value.

ripe-ttm-sample-delay-on-demand.png

The following plot shows the impact of a routing change on a path to a test box in New Zealand (tt47). The hop count decreased by one, but the base one-way delay increased by about 20 ms. The old and new routes can be retrieved from the Web interface too. Note also that the delay values are spread out over a fairly large range, which indicates that there is some congestion (and thus queueing) on the path. The loss rate over the entire day is 1.1%. It would be interesting to know how much of this is due to the routing change and how much (if any) is due to "normal" congestion. The lower right graph shows "arrived/lost" packets over time and indicates that most packets were in fact lost during the path change - so although some congestion is visible in the delays, it is not severe enough to cause visible loss. On the other hand, the total number of probe packets sent (about 2850 per day) means that congestion loss would probably go unnoticed until it reaches a level of 0.1% or so.

day.tt85.tt47.gif

References

-- FrancoisXavierAndreu & SimonLeinen - 06 Jun 2005 - 06 Jul 2005

QoSMetrics Boxes

This measurement infrastructure was built as part of RENATER's Métrologie efforts. The infrastructure performs continuous delay measurement between a mesh of measurement points on the RENATER backbone. The measurements are sent to a central server every minute, and are used to produce a delay matrix and individual delay histories (IPv4 and IPv6 measurements, soon for each CoS).

Please note that all results (tables and graphs) presented on this page are generated by RENATER's scripts from the QoSMetrics MySQL database.

Screendump from RENATER site (http://pasillo.renater.fr/metrologie/get_qosmetrics_results.php)

new_Renater_IPPM_results_screendump.jpg

Example of an asymmetric path:

(after the problem was resolved, only one path had its hop count decreased)

Graphs:

Paris_Toulouse_delay Toulouse_Paris_delay
Paris_Toulouse_jitter Toulouse_Paris_jitter
Paris_Toulouse_hop_number Toulouse_Paris_hop_number
Paris_Toulouse_pktsLoss Toulouse_Paris_pktsLoss

Traceroute results:

traceroute to 193.51.182.xx (193.51.182.xx), 30 hops max, 38 byte packets
1 209.renater.fr (195.98.238.209) 0.491 ms 0.440 ms 0.492 ms
2 gw1-renater.renater.fr (193.49.159.249) 0.223 ms 0.219 ms 0.252 ms
3 nri-c-g3-0-50.cssi.renater.fr (193.51.182.6) 0.726 ms 0.850 ms 0.988 ms
4 nri-b-g14-0-0-101.cssi.renater.fr (193.51.187.18) 11.478 ms 11.336 ms 11.102 ms
5 orleans-pos2-0.cssi.renater.fr (193.51.179.66) 50.084 ms 196.498 ms 92.930 ms
6 poitiers-pos4-0.cssi.renater.fr (193.51.180.29) 11.471 ms 11.459 ms 11.354 ms
7 bordeaux-pos1-0.cssi.renater.fr (193.51.179.254) 11.729 ms 11.590 ms 11.482 ms
8 toulouse-pos1-0.cssi.renater.fr (193.51.180.14) 17.471 ms 17.463 ms 17.101 ms
9 xx.renater.fr (193.51.182.xx) 17.598 ms 17.555 ms 17.600 ms

[root@CICT root]# traceroute 195.98.238.xx
traceroute to 195.98.238.xx (195.98.238.xx), 30 hops max, 38 byte packets
1 toulouse-g3-0-20.cssi.renater.fr (193.51.182.202) 0.200 ms 0.189 ms 0.111 ms
2 bordeaux-pos2-0.cssi.renater.fr (193.51.180.13) 16.850 ms 16.836 ms 16.850 ms
3 poitiers-pos1-0.cssi.renater.fr (193.51.179.253) 16.728 ms 16.710 ms 16.725 ms
4 nri-a-pos5-0.cssi.renater.fr (193.51.179.17) 22.969 ms 22.956 ms 22.972 ms
5 nri-c-g3-0-0-101.cssi.renater.fr (193.51.187.21) 22.972 ms 22.961 ms 22.844 ms
6 gip-nri-c.cssi.renater.fr (193.51.182.5) 17.603 ms 17.582 ms 17.616 ms
7 250.renater.fr (193.49.159.250) 17.836 ms 17.718 ms 17.719 ms
8 xx.renater.fr (195.98.238.xx) 17.606 ms 17.707 ms 17.226 ms

-- FrancoisXavierAndreu & SimonLeinen - 06 Jun 2005 - 06 Jul 2005 - 07 Apr 2006

Passive Measurement Tools

Network Usage:

  • SNMP-based tools retrieve network information such as state and utilization of links, router CPU loads, etc.: MRTG, Cricket
  • Netflow-based tools use flow-based accounting information from routers for traffic analysis, detection of routing and security problems, denial-of-service attacks etc.

Traffic Capture and Analysis Tools

-- FrancoisXavierAndreu - 06 Jun 2005 -- SimonLeinen - 05 Jan 2006

SNMP-based tools

The Simple Network Management Protocol (SNMP) is widely used for device monitoring over IP, especially for monitoring infrastructure components of those networks, such as routers and switches. There are many tools that use SNMP to retrieve network information such as the state and utilization of links, routers' CPU load, etc., and generate various types of visualizations, other reports, and alarms.

  • MRTG (Multi-Router Traffic Grapher) is a widely-used open-source tool that polls devices every five minutes using SNMP, and plots the results over day, week, month, and year timescales.
  • Cricket serves the same purpose as MRTG, but is configured in a different way - targets are organized in a tree, which allows inheritance of target specifications. Cricket has been using RRDtool from the start, although MRTG has picked that up (as an option) as well.
  • RRDtool is a round-robin database used to store periodic measurements such as SNMP readings (see the sketch after this list). Recent values are stored at finer time granularity than older values; RRDtool incrementally "consolidates" values as they age and uses an efficient constant-size representation. RRDtool is agnostic to SNMP, but it is used by Cricket, MRTG (optionally), and many other SNMP tools.
  • Synagon is a Python-based tool that performs SNMP measurements (Juniper firewall counters) at much shorter timescales (seconds) and graphs them in real time. This can be used to look at momentary link utilizations, within the limits of how often devices actually update their SNMP values.
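
As an illustration of how RRDtool stores such periodic measurements, a round-robin database for a counter polled every five minutes could be created and updated as in the following sketch (file name, data-source name and consolidation parameters are arbitrary examples, not a recommendation):

# one COUNTER data source polled every 300s; keep 600 raw samples
# plus 700 consolidated 30-minute averages
rrdtool create link.rrd --step 300 \
    DS:ifInOctets:COUNTER:600:0:U \
    RRA:AVERAGE:0.5:1:600 \
    RRA:AVERAGE:0.5:6:700

# store one reading ("N" means "now")
rrdtool update link.rrd N:1234567890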

-- FrancoisXavierAndreu - 06 Jun 2005 -- SimonLeinen - 05 Jan 2006 - 16 Jul 2007

Synagon SNMP Graphing Tool

Overview

Synagon is a Python-based program that collects and graphs firewall values from Juniper routers. The tool first requires the user to select a router from a list. It will retrieve all the available firewall filters/counters from that router and the user can select any of these to graph. It was developed by DANTE for internal use, but could be made available on request (on a case by case basis), in an entirely unsupported manner. Please contact operations@dante.org.uk for more details.

[The information below is taken from DANTE's internal documentation.]

Installation and Setup procedures

Synagon comes as a single compressed archive that can be copied into a directory and then invoked like any other standard Python script.

It requires the following python libraries to operate:

1) The Python SNMP framework (available at http://pysnmp.sourceforge.net)

The SNMP framework is required to communicate with the routers via SNMP. For guidance on how to install the packages, please see the section below.

You also need to create a network.py file that describes the network topology the tool will act upon. The configuration files should be located in the same directory as the tool itself. Please look at the section below for more information on the format of the file.

  • Example Synagon screen:
    Example Synagon screen

How to use Synagon

The tool can be invoked by double clicking on the file name (under windows) or by typing the following on the command shell:

python synagon.py

Prior to that, a network.py configuration file must have been created in the directory from which the tool is invoked. It describes the location of the routers and the SNMP community to use when contacting them.

Immediately after, a window with a single menu entry will appear. By clicking on the menu, a list of all routers described in the network.py will be given as submenus.

Under each router, a submenu shows the list of firewall filters for that router (depending on whether you have previously collected the list) together with a ‘collect’ submenu. The collect submenu initiates the collection of the firewall filters from the router and then updates the list of firewall filters in the submenu where the collect button is located. The list of firewall filters of each router is also stored on the local disk, so the next time the tool is invoked it won’t be necessary to interrogate the router to obtain the firewall list; it will be read from the file instead. When the firewall filter list is retrieved, a submenu with all counters and policers for each filter will be displayed. A policer is marked with [p] and a counter with [c] just after its name.

When a counter or policer of a firewall filter of a router is selected, the SNMP polling for that entity begins. After a few seconds, a window will appear that plots the requested values.

The graph shows the rate of bytes dropped since monitoring started for a particular counter, or the rate of packets dropped since monitoring started for a particular policer. The save graph button can create a file for the graph in Encapsulated PostScript format (.eps).

Because values are frequently retrieved, there are memory consumption and performance concerns. To avoid this, a configurable limit is placed on the number of samples plotted on the graph. The handle at the top of the window limits the number of samples plotted and has an impact on performance if it is positioned at a high scale (thousands).

Developer’s guide

This section provides information about how the tool internally works.

The tool mainly consists of the felegktis.py file, but it also relies on the external csnmp.py and graph.py libraries. felegktis.py defines two threads of control: one initiates the application and is responsible for collecting the results; the other is responsible for the visual part of the application and the interaction with the user.

The GUI thread updates a shared structure (a dictionary) with contact information about all the firewalls the user has chosen by that time to monitor.

active_firewall_list = {}

The structure holds the following information:

router_name->filter_name->counter_name(collect, instance, type, bytes, epoch)

There is an entry per counter and that entry has information about whether the counter should be retrieved (collect), the SNMP instance (instance) for that counter, whether it is a policer or a counter (type), the last byte counter value (bytes), the time the last value was received (epoch).

It operates as follows:

do forever:
   if the user has shutdown the GUI thread:
      terminate

   for each router in active_firewall_list:
      for each filter of each router:
         for each counter/policer of each filter:
            retrieve counter value
            calculate change rate 
            pass rate of that counter to the GUI thread

It uses SNMP to request the values from the router.

The GUI thread (MainWindow class) first creates the user interface (based on the Tkinter Tcl/Tk module) and is passed the GEANT network topology found in the network.py file. It then creates a menu list with all routers found in the topology. Each router entry has a submenu with an item of ‘collect’. This item is bound to a function; if it is selected, the router is interrogated via SNMP on the Juniper Firewall MIB, and the collection of filter names/counter names for that router is retrieved. Because the interrogation may take tens of seconds, the firewall filter list is serialised to the local directory using the cPickle builtin module. The file is given the name router_name.filters (e.g. uk1.filters for router uk1) and is stored in the directory from which the tool was invoked. The next time the tool is invoked, it will check if there is a filter list file for that router, and if so, it will deserialise the list of firewall filters from the file and populate the router’s submenu. It can of course be the case that the serialised firewall list is out of date, but the ‘collect’ submenu entry for each router is always there and can be invoked to replace the current list (both in memory and on disk) with the latest details. If the user selects a firewall counter or policer to monitor from the submenu list, it will populate the active_firewall_list with that counter/policer, and the main thread will take it up and begin retrieving data for that counter/policer.
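
The caching pattern described above can be sketched as follows (a simplified assumption of how the tool behaves, not the actual Synagon code; the file naming follows the router_name.filters convention mentioned above):

import os
import cPickle   # Python 2, as used by the tool

def save_filters(router_name, filters):
    # serialise the firewall filter list retrieved via SNMP
    with open(router_name + '.filters', 'wb') as f:
        cPickle.dump(filters, f)

def load_filters(router_name):
    # reuse the cached list if present; otherwise the caller
    # must trigger a fresh 'collect' from the router
    path = router_name + '.filters'
    if not os.path.exists(path):
        return None
    with open(path, 'rb') as f:
        return cPickle.load(f)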

The main thread passes the calculated rate to the MainWindow thread via a queue that the main thread populates and the GUI thread reads from. Because the updates can come from a variety of different firewall filters, the queue is indexed with a text string which is a unique identifier based upon router name, firewall filter name and counter/policer name. The GUI thread periodically (every 250 ms) checks if there are data in the above-mentioned queue. If the identifier hasn’t been seen before, a new window is created where the values are plotted. If the window already exists, it is updated with that value and its contents are redrawn. A window is an instance of the class PlotList.

When a window is closed by the user, the GUI modifies the active_firewall_list entry for that counter/policer by setting the ‘collect’ attribute to None; the next time the main thread reaches that counter/policer, it will ignore it.

Creating a single synagon package

The tool consists of several files:

  • felegktis.py [main logic]
  • graph.py [graph library]
  • csnmp.py [ SNMP library]

It is possible, by using the squeeze tool (available at http://www.pythonware.com/products/python/squeeze/index.htm), to combine these files into a compressed, executable Python archive.

build_synagon.bat uses the squeeze tool to archive all these files into one.

build_synagon.bat:
REM build the sinagon distribution
python squeezeTool.py -1 -o synagon -b felegktis felegktis.py graph.py csnmp.py

The command above builds synagon.py from files felegktis.py, graph.py and csnmp.py.

Appendices

A. network.py

network.py describes the network's topology - its format can be deduced by inspection. Currently the network topology name (graph name) should be set to ‘geant’; the script could be adjusted to change this behaviour.

Compulsory router properties:

  • type: the type of the router [juniper, cisco, etc.] (the tool only operates on ‘juniper’ routers)
  • address: the address at which the router can be contacted
  • community: the SNMP community to authenticate on the router
Example:
# Geant network
geant = graph.Graph()

# Defaults for router
class Router(graph.Vertex):
    """ Common options for geant routers """

    def __init__(self):
        # Initialise base class
        graph.Vertex.__init__(self)
        # Set the default properties for these routers
        self.property['type'] = 'juniper'
        self.property['community'] = 'commpassword'

uk1 = Router()
gr1 = Router()


geant.vertex['uk1'] = uk1
geant.vertex['gr1'] = gr1

uk1.property['address'] = 'uk1.uk.geant.net'
gr1.property['address'] = 'gr1.gr.geant.net'

B. Python Package Installation

Python package installation

Python has a built in support for installing packages.

Packages usually come in a source format and are compiled at the time of installation. Packages can be pure Python source code or have extensions in other languages (C, C++). Packages can also come in binary form.

Many Python packages which have extensions written in some other language need to be compiled into a binary package before being distributed to platforms like Windows, because not all Windows machines have the C or C++ compilers that a package may require.

The Python SNMP framework library

The source version can be downloaded from http://pysnmp.sourceforge.net

When the file is uncompressed, the user should invoke the setup.py:

setup.py install

There is also a binary version for Windows that exists in the NEP’s wiki. It has a simple GUI. Just keep pushing the ‘next’ button until the package is installed.

-- TobyRodwell - 23 Jan 2006

Netflow-based Tools

NetFlow-based analysis tools can be used to characterize the traffic in the network and to detect routing problems, security problems (denial-of-service attacks), etc.

There are different versions of NDE (NetFlow Data Export): v1 (the first), v5, v7, v8 and the latest version (v9), which is based on templates. This version makes it possible to analyze IPv6, MPLS and multicast flows.

  • NDE v9 description:
    NDE v9 description

References

-- FrancoisXavierAndreu - 2005-06-06 - 2006-04-07
-- SimonLeinen - 2006-01-05 - 2015-07-18

Packet Capture and Analysis Tools:

These tools help detect protocol problems via the analysis of packets and support troubleshooting:

  • tcpdump: Packet capture tool based on the libpcap library. http://www.tcpdump.org/
  • snoop: tcpdump equivalent in Sun's Solaris operating system
  • Wireshark (formerly called Ethereal): Extended tcpdump with a user-friendly GUI. Its plugin architecture allowed many protocol-specific "dissectors" to be written. Also includes some analysis tools useful for performance debugging.
  • libtrace: A library for trace processing; works with libpcap (tcpdump) and other formats.
  • Netdude: The Network Dump data Displayer and Editor is a framework for inspection, analysis and manipulation of tcpdump trace files.
  • jnettop: Captures traffic coming across the host it is running on and displays streams sorted by the bandwidth they use.
  • tcptrace: a statistics/analysis tool for TCP sessions using tcpdump capture files
  • capturing packets with Endace DAG cards (dagsnap, dagconvert, ...)

General Hints for Taking Packet Traces

Capture enough data - you can always throw away stuff later. Note that tcpdump's default capture length is small (96 bytes), so use -s 0 (or something like -s 1540) if you are interested in payloads. Wireshark and snoop capture entire packets by default.

Seemingly unrelated traffic can impact performance. For example, Web pages from foo.example.com may load slowly because of the images from adserver.example.net. In situations of high background traffic, it may however be necessary to filter out unrelated traffic.

It can be extremely useful to collect packet traces from multiple points in the network

  • On the endpoints of the communication
  • Near "suspicious" intermediate points such as firewalls

Synchronized clocks (e.g. by NTP) are very useful for matching traces.

Address-to-name resolution can slow down the display and cause additional traffic which confuses the trace. With tcpdump, consider using -n or tracing to a file (-w file).

Request remote packet traces

When a packet trace from a remote site is required, this often means having to ask someone at that site to provide it. When requesting such a trace, consider making this as easy as possible for the person having to do it. Try to use a packet tracing tool that is already available - tcpdump for most BSD or Linux systems, snoop for Solaris machines. Windows doesn't seem to come bundled with a packet capturing program, but you can direct the user to Wireshark, which is reasonably easy to install and use under Windows. Try to give clear instructions on how to call the packet capture program. It is usually best to ask the user to capture to a file, and then send you the capture file, e.g. as an e-mail attachment.

References

  • Presentation slides from a short talk about packet capturing techniques given at the network performance section of the January 2006 GEANT2 technical workshop

-- FrancoisXavierAndreu - 06 Jun 2005
-- SimonLeinen - 05 Jan 2006-09 Apr 2006
-- PekkaSavola - 26 Oct 2006

tcpdump

Tcpdump is one of the early diagnostic tools for TCP/IP; it was written by Van Jacobson, Craig Leres, and Steven McCanne. It can be used to capture and decode packets in real time, or to capture packets to files (in "libpcap" format, see below) and analyze (decode) them later.

There are now more elaborate and, in some ways, more user-friendly packet capturing programs, such as Wireshark (formerly called Ethereal), but tcpdump is widely available and widely used, so it is very useful to know how to use it.

Tcpdump/libpcap is still actively being maintained, although not by its original authors.

libpcap

Libpcap is a library that is used by tcpdump, and also names a file format for packet traces. This file format - usually used in files with the extension .pcap - is widely supported by packet capture and analysis tools.

Selected options

Some useful options to tcpdump include:

  • -s snaplen capture snaplen bytes of each frame. By default, tcpdump captures only the first 68 bytes, which is sufficient to capture IP/UDP/TCP/ICMP headers, but usually not payload or higher-level protocols. If you are interested in more than just headers, use -s 0 to capture packets without truncation.
  • -r filename read from a previously created capture file (see -w)
  • -w filename dump to a file instead of analyzing on-the-fly
  • -i interface capture on an interface other than the default (first "up" non-loopback interface). Under Linux, -i any can be used to capture on all interfaces, albeit with some restrictions.
  • -c count stop the capture after count packets
  • -n don't resolve addresses, port numbers, etc. to symbolic names - this avoids additional DNS traffic when analyzing a live capture.
  • -v verbose output
  • -vv more verbose output
  • -vvv even more verbose output

Also, a pcap filter (BPF) expression can be appended to the command to filter the captured packets (a complete example invocation is shown after the lists below). An expression is made up of one or more of "type", "direction" and "protocol".

  • type Can be host, net or port. host is presumed unless otherwise specified
  • dir Can be src, dst, src or dst or src and dst
  • proto (for protocol) Common types are ether, ip, tcp, udp, arp ... If none is specified then all protocols for which the value is valid are considered.
Example expressions:
  • dst host <address>
  • src host <address>
  • udp dst port <number>
  • host <host> and not port ftp and not port ftp-data
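
Combining the options and filter syntax above, a full-sized capture to a file for one host, excluding SSH traffic, might look like this (the interface name and address are placeholders):

tcpdump -i eth0 -n -s 0 -w trace.pcap 'host 192.0.2.10 and not port 22'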

Usage examples

Capture a single (-c 1) udp packet to file test.pcap:

: root@diotima[tmp]; tcpdump -c 1 -w test.pcap udp
tcpdump: listening on bge0, link-type EN10MB (Ethernet), capture size 96 bytes
1 packets captured
3 packets received by filter
0 packets dropped by kernel

This produces a binary file containing the captured packet as well as a small file header and a timestamp:

: root@diotima[tmp]; ls -l test.pcap
-rw-r--r-- 1 root root 114 2006-04-09 18:57 test.pcap
: root@diotima[tmp]; file test.pcap
test.pcap: tcpdump capture file (big-endian) - version 2.4 (Ethernet, capture length 96)

Analyze the contents of the previously created capture file:

: root@diotima[tmp]; tcpdump -r test.pcap
reading from file test.pcap, link-type EN10MB (Ethernet)
18:57:28.732789 2001:630:241:204:211:43ff:fee1:9fe0.32832 > ff3e::beac.10000: UDP, length: 12

Display the same capture file in verbose mode:

: root@diotima[tmp]; tcpdump -v -r test.pcap
reading from file test.pcap, link-type EN10MB (Ethernet)
18:57:28.732789 2001:630:241:204:211:43ff:fee1:9fe0.32832 > ff3e::beac.10000: [udp sum ok] UDP, length: 12 (len 20, hlim 118)

More examples with some advanced tcpdump use cases.

References

-- SimonLeinen - 2006-03-04 - 2016-03-17

Solaris snoop

Sun's Solaris operating system includes a utility called snoop for taking and analyzing packet traces. While snoop is similar in spirit to tcpdump, its options, filter syntax, and stored file formats are all slightly different. The file format is described in the Informational RFC 1761, and supported by some other tools, notably Wireshark.

Selected options

Useful options in connection with snoop include:

  • -i filename read from capture file. Corresponds to tcpdump's -r option.
  • -o filename dump to file. Corresponds to tcpdump's -w option.
  • -d device capture on device rather than on the default (first non-loopback) interface. Corresponds to tcpdump's -i option.
  • -c count stop capture after capturing count packets (same as for tcpdump)
  • -r do not resolve addresses to names. Corresponds to tcpdump's -n option. This is useful for suppressing DNS traffic when looking at a live trace.
  • -V verbose summary mode - between (default) summary mode and verbose mode
  • -v (full) verbose mode

Usage examples

Capture a single (-c 1) udp packet to file test.snoop:

: root@diotima[tmp]; snoop -o test.snoop -c 1 udp
Using device /dev/bge0 (promiscuous mode)
1 1 packets captured

This produces a binary file containing the captured packet as well as a small file header and a timestamp:

: root@diotima[tmp]; ls -l test.snoop
-rw-r--r-- 1 root root 120 2006-04-09 18:27 test.snoop
: root@diotima[tmp]; file test.snoop
test.snoop:     Snoop capture file - version 2

Analyze the contents of the previously created capture file:

: root@diotima[tmp]; snoop -i test.snoop
  1   0.00000 2001:690:1fff:1661::3 -> ff1e::1:f00d:beac UDP D=10000 S=32820 LEN=20

Display the same capture file in "verbose summary" mode:

: root@diotima[tmp]; snoop -V -i test.snoop
________________________________
  1   0.00000 2001:690:1fff:1661::3 -> ff1e::1:f00d:beac ETHER Type=86DD (IPv6), size = 74 bytes
  1   0.00000 2001:690:1fff:1661::3 -> ff1e::1:f00d:beac IPv6  S=2001:690:1fff:1661::3 D=ff1e::1:f00d:beac LEN=20 HOPS=116 CLASS=0x0 FLOW=0x0
  1   0.00000 2001:690:1fff:1661::3 -> ff1e::1:f00d:beac UDP D=10000 S=32820 LEN=20

Finally, in full verbose mode:

: root@diotima[tmp]; snoop -v -i test.snoop
ETHER:  ----- Ether Header -----
ETHER:
ETHER:  Packet 1 arrived at 18:27:16.81233
ETHER:  Packet size = 74 bytes
ETHER:  Destination = 33:33:f0:d:be:ac, (multicast)
ETHER:  Source      = 0:a:f3:32:56:0,
ETHER:  Ethertype = 86DD (IPv6)
ETHER:
IPv6:   ----- IPv6 Header -----
IPv6:
IPv6:   Version = 6
IPv6:   Traffic Class = 0
IPv6:   Flow label = 0x0
IPv6:   Payload length = 20
IPv6:   Next Header = 17 (UDP)
IPv6:   Hop Limit = 116
IPv6:   Source address = 2001:690:1fff:1661::3
IPv6:   Destination address = ff1e::1:f00d:beac
IPv6:
UDP:  ----- UDP Header -----
UDP:
UDP:  Source port = 32820
UDP:  Destination port = 10000
UDP:  Length = 20
UDP:  Checksum = 0635
UDP:

Note: the captured packet is an IPv6 multicast packet from a "beacon" system. Such traffic forms the majority of the background load on our office LAN at the moment.

References

  • RFC 1761, Snoop Version 2 Packet Capture File Format, B. Callaghan, R. Gilligan, February 1995
  • snoop(1) manual page for Solaris 10, April 2004, docs.sun.com

-- SimonLeinen - 09 Apr 2006

libtrace

libtrace is a library for trace processing. It supports multiple input methods, including device capture, raw and gz-compressed traces, and sockets; and multiple input formats, including pcap and DAG.

References

-- SimonLeinen - 04 Mar 2006

Netdude

Netdude (Network Dump data Displayer and Editor) is a framework for inspection, analysis and manipulation of tcpdump (libpcap) trace files.

References

-- SimonLeinen - 04 Mar 2006

jnettop

Captures traffic coming across the host it is running on and displays streams sorted by the bandwidth they use. The result is a nice listing of communication on the network, grouped by stream, showing transported bytes and consumed bandwidth.

References

-- FrancoisXavierAndreu - 06 Jun 2005 -- SimonLeinen - 04 Mar 2006

Host and Application Measurement Tools

Network performance is only one part of a distributed system's performance. An equally important part is the performance of the actual application and the hardware it runs on.

  • Web100 - fine-grained instrumentation of TCP implementation, currently available for the Linux kernel.
  • SIFTR - TCP instrumentation similar to Web100, available for *BSD
  • DTrace - Dynamic tracing facility available in Solaris and *BSD
  • NetLogger - a framework for instrumentation and log collection from various components of networked applications

-- TobyRodwell - 22 Mar 2006
-- SimonLeinen - 28 Mar 2006

NetLogger

NetLogger (copyright Lawrence Berkeley National Laboratory) is both a set of tools and a methodology for analysing the performance of a distributed system. The methodology (below) can be implemented separately from the LBL-developed tools. For full information on NetLogger see its website, http://dsd.lbl.gov/NetLogger/

Methodology

From the NetLogger website

The NetLogger methodology is really quite simple. It consists of the following:

  1. All components must be instrumented to produce monitoring events. These components include application software, middleware, operating systems, and networks. The more components that are instrumented, the better.
  2. All monitoring events must use a common format and common set of attributes. Monitoring events must also all contain a precision timestamp in a single timezone (GMT), globally synchronized via a clock synchronization method such as NTP.
  3. Log all of the following events: Entering and exiting any program or software component, and begin/end of all IO (disk and network).
  4. Collect all log data in a central location
  5. Use event correlation and visualization tools to analyze the monitoring event logs.

Toolkit

From the NetLogger website

The NetLogger Toolkit includes a number of separate components which are designed to help you do distributed debugging and performance analysis. You can use any or all of these components, depending on your needs.

These include:

  • NetLogger message format and data model: A simple, common message format for all monitoring events which includes high-precision timestamps
  • NetLogger client API library: C/C++, Java, PERL, and Python calls that you add to your existing source code to generate monitoring events. The destination and logging level of NetLogger messages are all easily controlled using an environment variable. These libraries are designed to be as lightweight as possible, and to never block or adversely affect application performance.
  • NetLogger visualization tool (nlv): a powerful, customizable X-Windows tool for viewing and analysis of event logs based on time correlated and/or object correlated events.
  • NetLogger host/network monitoring tools: a collection of NetLogger-instrumented host monitoring tools, including tools to interoperate with Ganglia and MonALisa.
  • NetLogger storage and retrieval tools, including a daemon that collects NetLogger events from several places at a single, central host; a forwarding daemon to forward all NetLogger files in a specified directory to a given location; and a NetLogger event archive based on mySQL.

References

-- TobyRodwell - 22 Mar 2006

Web100 Linux Kernel Extensions

The Web100 project was run by PSC (Pittsburgh Supercomputing Center), NCAR and NCSA. It was funded by the US National Science Foundation (NSF) between 2000 and 2003, although development and maintenance work extended well beyond the period of NSF funding. Its thrust was to close the "wizard gap" between what performance should be possible on modern research networks and what most users of these networks actually experience. The project focused on instrumentation for TCP to measure performance and find possible bottlenecks that limit TCP throughput. In addition, it included some work on "auto-tuning" of TCP buffer settings.

Most implementation work was done for Linux, and most of the auto-tuning code is now actually included in the mainline Linux kernel code (as of 2.6.17). The TCP kernel instrumentation is available as a patch from http://www.web100.org/, and usually tracks the latest "official" Linux kernel release pretty closely.

An interesting application of Web100 is NDT, which can be used from any Java-enabled browser to detect bottlenecks in TCP configurations and network paths, as well as duplex mismatches using active TCP tests against a Web100-enabled server.

In September 2010, the NSF agreed to fund a follow-on project called Web10G.

TCP Kernel Information Set (KIS)

A central component of Web100 is a set of "instruments" that permits the monitoring of many statistics about TCP connections (sockets) in the kernel. In the Linux implementation, these instruments are accessible through the proc filesystem.

TCP Extended Statistics MIB (TCP-ESTATS-MIB, RFC 4898)

The TCP-ESTATS-MIB (RFC 4898) includes a similar set of instruments, for access through SNMP. It has been implemented by Microsoft for the Vista operating system and later versions of Windows. In the Windows Server 2008 SDK, a tool called TcpAnalyzer.exe can be used to look at statistics of open TCP connections. IBM is also said to have an implementation of this MIB.

"Userland" Tools

Besides the kernel extension, Web100 comprises a small set of user-level tools which provide access to the TCP KIS. These tools include

  1. a libweb100 library written in C
  2. the command-line tools readvar, deltavar, writevar, and readall
  3. a set of GTK+-based GUI (graphical user interface) tools under the gutil command.

gutil

When started, gutil shows a small main panel with an entry field for specifying a TCP connection, and several graphical buttons for starting different tools on a connection once one has been selected.

gutil: main panel

The TCP connection can be chosen either by explicitly specifying its endpoints, or by selecting from a list of connections (using double-click):

gutil: TCP connection selection dialog

Once a connection of interest has been selected, a number of actions are possible. The "List" action provides a list of all kernel instruments for the connection. The list is updated every second, and "delta" values are displayed for those variables that have changed.

gutil: variable listing

Another action is "Display", which provides a graphical display of a KIS variable. The following screenshot shows the display of the DataBytesIn variable of an SSH connection.

gutil: graphical variable display

Related Work

Microsoft's Windows Software Development Kit for Server 2008 and .NET Version 3.5 contains a tool called TcpAnalyzer.exe, which is similar to gutil and uses Microsoft's RFC 4898 implementation.

The SIFTR module for FreeBSD can be used for similar applications, namely to understand what is happening inside TCP at fine granularity. Sun's DTrace would be an alternative on systems that support it, provided they offer suitable probes for the relevant actions and events within TCP. Both SIFTR and DTrace have user interfaces that are very different from Web100's.

References

-- SimonLeinen - 27 Feb 2006 - 25 Mar 2011
-- ChrisWelti - 12 Jan 2010

End-System (Host) Tuning

This section contains hints for tuning end systems (hosts) for maximum performance.

References

-- SimonLeinen - 31 Oct 2004 - 23 Jul 2009
-- PekkaSavola - 17 Nov 2006

Calculating required TCP buffer sizes
Regardless of operating system, TCP buffers must be appropriately dimensioned if they are to support long distance, high data rate flows.  Victor Reijs's TCP Throughput calculator is able to determine the maximum achievable TCP throughput based on RTT, expected packet loss, segment size, re-transmission timeout (RTO) and buffer size.  It can be found at:

http://people.heanet.ie/~vreijs/tcpcalculations.htm
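
As a quick cross-check independent of the calculator, the minimum buffer needed to keep a path busy is the bandwidth-delay product. A small sketch (the values are just an example):

def required_buffer_bytes(bandwidth_bps, rtt_seconds):
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    return bandwidth_bps * rtt_seconds / 8.0

# a 1 Gbit/s path with 100 ms RTT needs roughly 12.5 MB of TCP buffer
print(required_buffer_bytes(1e9, 0.1))   # 12500000.0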

Tolerance to Packet Re-ordering

As explained in the section on packet re-ordering, packets can be re-ordered wherever there is parallelism in a network. If such parallelism exists in a network (e.g. as caused by the internal architecture of the Juniper M160s used in the GEANT network) then TCP should be set to tolerate a larger number of re-ordered packets than the default, which is 3 (see below for a Linux example). For large flows across GEANT a value of 20 should be sufficient - an alternative workaround is to use multiple parallel TCP flows.
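
On Linux, for example, this can be adjusted via sysctl (the same parameter appears again in the Linux tuning section below):

sysctl -w net.ipv4.tcp_reordering=20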

Modifying TCP Behaviour

TCP congestion control algorithms are often configured conservatively, to avoid 'slow start' overwhelming a network path. In a capacity-rich network this may not be necessary, and the defaults may be overridden.

-- TobyRodwell - 24 Jan 2005


[The following information is courtesy of Jeff Boote, from Internet2, and was forwarded to me by Nicolas Simar.]

Operating Systems (OSs) have a habit of reconfiguring network interface settings back to their default, even when the correct values have been written in a specific config file. This is the result of bugs, and they appear in almost all OSs. Sometimes they get fixed in a given release but then get broken again in a later release. It is not known why this is the case but it may be partly that driver programmers don't test their products under conditions of large latency. It is worth noting that experience shows 'ifconfig' works well for 'txqueuelength' and 'MTU'.

Operating-System Specific Configuration Hints

This topic points to configuration hints for specific operating systems.

Note that the main perspective of these hints is how to achieve good performance in a big bandwidth-delay-product environment, typically with a single stream. See the ServerScaling topic for some guidance on tuning servers for many concurrent connections.

-- SimonLeinen - 27 Oct 2004 - 02 Nov 2008
-- PekkaSavola - 02 Oct 2008

OS-Specific Configuration Hints: BSD Variants

Performance tuning

By default, earlier BSD versions use very modest buffer sizes and don't enable Window Scaling by default. See the references below how to do so.

In contrast, FreeBSD 7.0 introduced TCP buffer auto-tuning, and thus should provide good TCP performance out of the box even over LFNs. This release also implements large-send offload (LSO) and large-receive offload (LRO) for some Ethernet adapters. FreeBSD 7.0 also announces the following in its presentation of technological advances:

10Gbps network optimization: With optimized device drivers from all major 10gbps network vendors, FreeBSD 7.0 has seen extensive optimization of the network stack for high performance workloads, including auto-scaling socket buffers, TCP Segment Offload (TSO), Large Receive Offload (LRO), direct network stack dispatch, and load balancing of TCP/IP workloads over multiple CPUs on supporting 10gbps cards or when multiple network interfaces are in use simultaneously. Full vendor support is available from Chelsio, Intel, Myricom, and Neterion.

Recent TCP Work

The FreeBSD Foundation has granted a project "Improvements to the FreeBSD TCP Stack" to Lawrence Stewart at Swinburne University. Goals for this project include support for Appropriate Byte Counting (ABC, RFC 3465), merging SIFTR into the FreeBSD codebase, and improving the implementation of the reassembly queue. Information is available on http://caia.swin.edu.au/urp/newtcp/. Other improvements done by this group are support for modular congestion control, implementations of CUBIC, H-TCP, TCP Vegas, the SIFTR TCP implementation tool, and a testing framework including improvements to iperf for better buffer size control.

The CUBIC implementation for FreeBSD was announced on the ICCRG mailing list in September 2009. Currently available as patches for the 7.0 and 8.0 kernels, it is planned to be merged "into the mainline FreeBSD source tree in the not too distant future".

In February 2010, a set of Software for FreeBSD TCP R&D was announced by the Swinburne group: This includes modular TCP congestion control, Hamilton TCP (H-TCP), a newer "Hamilton Delay" Congestion Control Algorithm v0.1, Vegas Congestion Control Algorithm v0.1, as well as a kernel helper/hook framework ("Khelp") and a module (ERTT) to improve RTT estimation.

Another release in August 2010 added a "CAIA-Hamilton Delay" congestion control algorithm as well as revised versions of the other components.

QoS tools

On BSD systems ALTQ implements a couple of queueing/scheduling algorithms for network interfaces, as well as some other QoS mechanisms.

To use ALTQ on a FreeBSD 5.x or 6.x box, the following are the necessary steps:

  1. build a kernel with ALTQ
    • options ALTQ and some others beginning with ALTQ_ should be added to the kernel config. Please refer to the ALTQ(4) man page.
  2. define QoS settings in /etc/pf.conf (see the sketch after this list)
  3. use pfctl to apply those settings
    • Set pf_enable to YES in /etc/rc.conf (set as well the other variables related to pf according to your needs) in order to apply the QoS settings every time the host boots.
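
For step 2, a minimal /etc/pf.conf using ALTQ might look like the following sketch (the interface name, bandwidths and queue names are arbitrary examples, not a recommended policy):

altq on em0 cbq bandwidth 100Mb queue { q_default, q_ssh }
queue q_default bandwidth 90% cbq(default)
queue q_ssh     bandwidth 10% priority 7

pass out on em0 proto tcp from any to any port 22 keep state queue q_ssh

The settings are then loaded with pfctl -f /etc/pf.conf (step 3).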

References

-- SimonLeinen - 27 Jan 2005 - 18 Aug 2010
-- PekkaSavola - 17 Nov 2006

Linux-Specific Network Performance Tuning Hints

Linux has its own implementation of the TCP/IP Stack. With recent kernel versions, the TCP/IP implementation contains many useful performance features. Parameters can be controlled via the /proc interface, or using the sysctl mechanism. Note that although some of these parameters have ipv4 in their names, they apply equally to TCP over IPv6.

A typical configuration for high TCP throughput over paths with a high bandwidth*delay product would include settings in /etc/sysctl.conf along the lines of the following consolidated sketch (drawn from the values discussed in the subsections below):
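
# maximum socket buffer sizes
net/core/rmem_max = 16777216
net/core/wmem_max = 16777216
# TCP autotuning limits (min, default, max)
net/ipv4/tcp_rmem = 8192 87380 16777216
net/ipv4/tcp_wmem = 8192 65536 16777216
# optionally select a high-speed congestion control algorithm
net/ipv4/tcp_congestion_control = cubic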

A description of each parameter listed below can be found in section Linux IP Parameters.

Basic tuning

TCP Socket Buffer Tuning

See the EndSystemTcpBufferSizing topic for general information about sizing TCP buffers.

Since the 2.6.17 kernel, buffers have sensible, automatically calculated values for most uses. Unless a very high RTT, loss, or performance requirement (200+ Mbit/s) is present, buffer settings may not need to be tuned at all.

Nonetheless, the following values may be used:

net/core/rmem_max=16777216
net/core/wmem_max=16777216
net/ipv4/tcp_rmem="8192 87380 16777216"
net/ipv4/tcp_wmem="8192 65536 16777216"

With kernel < 2.4.27 or < 2.6.7, receive-side autotuning may not be implemented, and the default (middle value) should be increased (at the cost of higher, by-default memory consumption):

net/ipv4/tcp_rmem="8192 16777216 16777216"

NOTE: If you have a server with hundreds of connections, you might not want to use a large default value for TCP buffers, as memory may quickly run out.

There is a subtle but important implementation detail in the socket buffer management of Linux. When setting either the send or receive buffer sizes via the SO_SNDBUF and SO_RCVBUF socket options via setsockopt(2), the value passed in the system call is doubled by the kernel to accommodate buffer management overhead. Reading the values back with getsockopt(2) returns this modified value, but the effective buffer available to TCP payload is still the original value.

The values net/core/rmem and net/core/wmem apply to the argument to setsockopt(2).

In contrast, the maximum values of net/ipv4/tcp_rmem / net/ipv4/tcp_wmem apply to the total buffer sizes including the factor of 2 for the buffer management overhead. As a consequence, those values must be chosen twice as large as required by a particular BandwidthDelayProduct. Also note that the values net/core/rmem and net/core/wmem do not apply to the TCP autotuning mechanism.
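
The doubling can be observed directly from an application; a minimal sketch (Linux-specific behaviour):

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1048576)
# Linux doubles the requested value to account for bookkeeping
# overhead, so this typically prints 2097152.
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))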

Interface queue lengths

InterfaceQueueLength describes how to adjust interface transmit and receive queue lengths. This tuning is typically needed with GE or 10GE transfers.
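
As an illustration (the interface name and values are placeholders; check what your driver supports), the transmit queue and the receive ring of a NIC can be enlarged with:

ip link set dev eth0 txqueuelen 10000
ethtool -G eth0 rx 4096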

Host/adapter architecture implications

When going for 300 Mbit/s performance, it is worth verifying that the host architecture (e.g., the PCI bus) is fast enough. PCI Express is usually fast enough to no longer be the bottleneck in 1Gb/s and even 10Gb/s applications.

For the older PCI/PCI-X buses, when going for 2+ Gbit/s performance, the Maximum Memory Read Byte Count (MMRBC) usually needs to be increased using setpci.

Many network adapters support features such as checksum offload. In some cases, however, these may even decrease performance. In particular, TCP Segment Offload may need to be disabled, with:

ethtool -K eth0 tso off
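
The currently active offload settings can be inspected with ethtool as well, e.g.:

ethtool -k eth0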

Advanced tuning

Sharing congestion information across connections/hosts

2.4 series kernels have a TCP/IP weakness in that their interface buffers' maximum window size is based on the experience of previous connections - if you have loss at any point (or a bad end host on the same route) you limit your future TCP connections. So, you may have to flush the route cache to improve performance.

net.ipv4.route.flush=1

2.6 kernels also remember some performance characteristics across connections. In benchmarks and other tests, this might not be desirable.

# don't cache ssthresh from previous connection
net.ipv4.tcp_no_metrics_save=1

Other TCP performance variables

If there is packet reordering in the network, reordering could end up being interpreted as packet loss too easily. Increasing the tcp_reordering parameter might help in that case:

net/ipv4/tcp_reordering=20   # (default=3)

Several variables already have good default values, but it may make sense to check that these defaults haven't been changed:

net/ipv4/tcp_timestamps=1
net/ipv4/tcp_window_scaling=1
net/ipv4/tcp_sack=1
net/ipv4/tcp_moderate_rcvbuf=1

TCP Congestion Control algorithms

Linux 2.6.13 introduced pluggable congestion modules, which allow you to select one of the high-speed TCP congestion control variants, e.g. CUBIC:

net/ipv4/tcp_congestion_control = cubic

Alternative values include highspeed (HS-TCP), scalable (Scalable TCP), htcp (Hamilton TCP), bic (BIC), reno ("Reno" TCP), and westwood (TCP Westwood).

Note that on Linux 2.6.19 and later, CUBIC is already used as the default algorithm.
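
The algorithms currently available on a given kernel can be listed, and additional ones loaded as modules, for example:

sysctl net.ipv4.tcp_available_congestion_control
modprobe tcp_htcp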

Web100 kernel tuning

If you are using a web100 kernel, the following parameters seem to improve networking performance even further:

# web100 tuning
# turn off using txqueuelen as part of congestion window computation
net/ipv4/WAD_IFQ = 1

QoS tools

Modern Linux kernels have flexible traffic shaping built in.

See the Linux traffic shaping example for an illustration of how these mechanisms can be used to solve a real performance problem.
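
As a trivial illustration of the built-in shaping (separate from the example referenced above; the interface and rate are placeholders), outgoing traffic on an interface can be rate-limited with a token bucket filter:

tc qdisc add dev eth0 root tbf rate 500mbit burst 64kb latency 50ms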

References

See also the reference to Congestion Control in Linux in the FlowControl topic.

-- SimonLeinen - 27 Oct 2004 - 23 Jul 2009
-- ChrisWelti - 27 Jan 2005
-- PekkaSavola - 17 Nov 2006
-- AlexGall - 28 Nov 2016

OS-Specific Configuration Hints: Mac OS X

As Mac OS X is mainly a BSD derivative, you can use similar mechanisms to tune the TCP stack - see under BsdOSSpecific.

TCP Socket Buffer Tuning

See the EndSystemTcpBufferSizing topic for general information about sizing TCP buffers.

For testing temporary improvements, you can directly use sysctl in a terminal window: (you have to be root to do that)

sysctl -w kern.ipc.maxsockbuf=8388608
sysctl -w net.inet.tcp.rfc1323=1
sysctl -w net.inet.tcp.sendspace=1048576
sysctl -w net.inet.tcp.recvspace=1048576
sysctl -w kern.maxfiles=65536
sysctl -w net.inet.udp.recvspace=147456
sysctl -w net.inet.udp.maxdgram=57344
sysctl -w net.local.stream.recvspace=65535
sysctl -w net.local.stream.sendspace=65535

For permanent changes that survive a reboot, insert the appropriate configuration into /etc/sysctl.conf. If this file does not exist, you must create it. So, for the above, just add the following lines to sysctl.conf:

kern.ipc.maxsockbuf=8388608
net.inet.tcp.rfc1323=1
net.inet.tcp.sendspace=1048576
net.inet.tcp.recvspace=1048576
kern.maxfiles=65536
net.inet.udp.recvspace=147456
net.inet.udp.maxdgram=57344
net.local.stream.recvspace=65535
net.local.stream.sendspace=65535

Note: This only works for Mac OS X 10.3 or later! For earlier versions you need to use /etc/rc, where you can enter whole sysctl commands.

Users that are unfamiliar with terminal windows can also use the GUI tool "TinkerTool System" and use its Network Tuning option to set the TCP buffers.

TinkerTool System is available from:

-- ChrisWelti - 30 Jun 2005

OS-Specific Configuration Hints: Solaris (Sun Microsystems)

Planned Features

Pluggable Congestion Control for TCP and SCTP

A proposed OpenSolaris project foresees the implementation of pluggable congestion control for both TCP and SCTP, along with HS-TCP and several other congestion control algorithms for OpenSolaris. This includes implementations of the HighSpeed, CUBIC, Westwood+, and Vegas congestion control algorithms, as well as ipadm subcommands and socket options to get and set congestion control parameters.

On 15 December 2009, Artem Kachitchkine posted an initial draft of a work-in-progress design specification for this feature on the OpenSolaris networking-discuss forum. According to this proposal, the API for setting the congestion control mechanism for a specific TCP socket will be compatible with Linux: There will be a TCP_CONGESTION socket option to set and retrieve a socket's congestion control algorithm, as well as a TCP_INFO socket option to retrieve various kinds of information about the current congestion control state.

The entire pluggable congestion control mechanism will be implemented for SCTP in addition to TCP. For example, there will also be an SCTP_CONGESTION socket option. Note that congestion control in SCTP is somewhat trickier than in TCP, because a single SCTP socket can have multiple underlying paths through SCTP's "multi-homing" feature. Congestion control state must be kept separately for each path (address pair). This also means that there is no direct SCTP equivalent to TCP_INFO. The current proposal adds a subset of TCP_INFO's information to the result of the existing SCTP_GET_PEER_ADDR_INFO option for getsockopt().

The internal structure of the code will be somewhat different to what is in the Linux kernel. In particular, the general TCP code will only make calls to the algorithm-specific congestion control modules, not vice versa. The proposed Solaris mechanism also contains ipadm properties that can be used to set the default congestion control algorithm either globally or for a specific zone. The proposal also suggests "observability" features; for example, pfiles output should include the congestion algorithm used for a socket, and there are new kstat statistics that count certain congestion-control events.

Useful Features

TCP Multidata Transmit (MDT, aka LSO)

Solaris 10, and Solaris 9 with patches, supports TCP Multidata Transmit (MDT), which is Sun's name for (software-only) Large Send Offload (LSO). In Solaris 10, this is enabled by default, but in Solaris 9 (with the required patches for MDT support), the kernel and driver have to be reconfigured to be able to use MDT. See the following pointers for more information from docs.sun.com:

Solaris 10 "FireEngine"

The TCP/IP stack in Solaris 10 has been largely rewritten from previous versions, mostly to improve performance. In particular, it supports Interrupt Coalescence, integrates TCP and IP more closely in the kernel, and provides multiprocessing enhancements to distribute work more efficiently over multiple processors. Ongoing work includes UDP/IP integration for better performance of UDP applications, and a new driver architecture that can make use of flow classification capabilities in modern network adapters.

Solaris 10: New Network Device Driver Architecture

Solaris 10 introduces GLDv3 (project "Nemo"), a new driver architecture that generally improves performance, and adds support for several performance features. Some, but not all, Ethernet device drivers were ported over to the new architecture and benefit from those improvements. Notably, the bge driver was ported early, and the new "Neptune" adapters ("multithreaded" dual-port 10GE and four-port GigE with on-board connection de-multiplexing hardware) used it from the start.

Darren Reed has posted a small C program that lists the active acceleration features for a given interface. Here's some sample output:

$ sudo ./ifcapability
lo0 inet
bge0 inet +HCKSUM(version=1 +full +ipv4hdr) +ZEROCOPY(version=1 flags=0x1) +POLL
lo0 inet6
bge0 inet6 +HCKSUM(version=1 +full +ipv4hdr) +ZEROCOPY(version=1 flags=0x1)

Displaying and setting link parameters with dladm

Another OpenSolaris project called Brussels unifies many aspects of network driver configuration under the dladm command. For example, link MTUs (for "Jumbo Frames") can be configured using

dladm set-linkprop -p mtu=9000 web1

The command can also be used to look at current physical parameters of interfaces:

$ sudo dladm show-phys
LINK         MEDIA                STATE      SPEED DUPLEX   DEVICE
bge0         Ethernet             up         1000 full      bge0

Note that Brussels is still being integrated into Solaris. Driver support was added since SXCE (Solaris Express Community Edition) build 83 for some types of adapters. Eventually this should be integrated into regular Solaris releases.

Setting TCP buffers

# To increase the maximum tcp window
# Rule-of-thumb: max_buf = 2 x cwnd_max (congestion window)
ndd -set /dev/tcp tcp_max_buf 4194304
ndd -set /dev/tcp tcp_cwnd_max 2097152

# To increase the DEFAULT tcp window size
ndd -set /dev/tcp tcp_xmit_hiwat 65536
ndd -set /dev/tcp tcp_recv_hiwat 65536

Pitfall when using asymmetric send and receive buffers

The documented default behaviour (tunable TCP parameter tcp_wscale_always = 0) of Solaris is to include the TCP window scaling option in an initial SYN packet when either the send or the receive buffer is larger than 64KiB. From the tcp(7P) man page:

          For all applications, use ndd(1M) to modify the  confi-
          guration      parameter      tcp_wscale_always.      If
          tcp_wscale_always is set to 1, the window scale  option
          will  always be set when connecting to a remote system.
          If tcp_wscale_always is 0, the window scale option will
          be set only if the user has requested a send or receive
          window  larger  than  64K.   The   default   value   of
          tcp_wscale_always is 0.

However, Solaris 8, 9 and 10 do not take the send window into account. This results in an unexpected behaviour for a bulk transfer from node A to node B when the bandwidth-delay product is larger than 64KiB and

  • A's receive buffer (tcp_recv_hiwat) < 64KiB
  • A's send buffer (tcp_xmit_hiwat) > 64KiB
  • B's receive buffer > 64KiB

A will not advertise the window scaling option, and according to RFC 1323 B will then not do so either. As a consequence, throughput will be limited by a maximum window of 64KiB.

As a workaround, the window scaling option can be forcibly advertised by setting

# ndd -set /dev/tcp tcp_wscale_always 1

A bug report has been filed with Sun Microsystems.

References

-- ChrisWelti - 11 Oct 2005, added section for setting default and maximum TCP buffers
-- AlexGall - 26 Aug 2005, added section on pitfall with asymmetric buffers
-- SimonLeinen - 27 Jan 2005 - 16 Dec 2009

Windows-Specific Host Tuning

Next Generation TCP/IP Stack in Windows Vista / Windows Server 2008 / Windows 7

According to http://www.microsoft.com/technet/community/columns/cableguy/cg0905.mspx, Windows Vista features a redesigned TCP/IP stack. Besides unifying IPv4/IPv6 dual-stack support, the new stack also claims much-improved performance for high-speed and asymmetric networks, as well as several auto-tuning features.

Windows Vista enables the TCP Window Scaling option by default (previous Windows versions had this option disabled). This can cause problems with various middleboxes; see WindowScalingProblems. As a consequence, the scaling factor is limited to 2 for HTTP traffic in Vista.

Another new feature of Vista is Compound TCP, a high-performance TCP variant that uses delay information to adapt its transmission rate (in addition to loss information). On Windows Server 2008 it is enabled by default, but it is disabled in Vista. You can enable it in Vista by running the following command as an Administrator on the command line.

netsh interface tcp set global congestionprovider=ctcp
(see CompoundTCP for more information).
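
To check which congestion control provider is currently active, or to revert to the default, the same netsh context can be used; the exact wording of the output differs between Windows versions:

rem Show the current global TCP settings, including the add-on congestion control provider
netsh interface tcp show global
rem Revert to the default provider
netsh interface tcp set global congestionprovider=none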

Although the new TCP/IP stack has integrated auto-tuning for receive buffers, the TCP send buffer still seems to be limited to 16KB by default. This means you will not be able to achieve good upload rates on connections with higher RTTs. However, using the registry hack described further below, you can manually change the default to a higher value.

Performance tuning for earlier Windows versions

Enabling TCP Window Scaling

The references below detail the various "whys" and "hows" of tuning Windows network performance. Of particular note, however, is the setting of scalable TCP receive windows (see LargeTcpWindows).

It appears that, by default, not only does Windows not enable RFC 1323 scalable windows, but the required key is not even present in the Windows registry. The key (Tcp1323Opts) can be added in one of two places, depending on the particular system:

[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\VxD\MSTCP] (Windows 95)
[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters] (Windows 2000/XP)

Value Name: Tcp1323Opts
Data Type: REG_DWORD (DWORD Value)
Value Data: 0, 1, 2 or 3

  • 0 = disable RFC 1323 options
  • 1 = window scale enabled only
  • 2 = time stamps enabled only
  • 3 = both options enabled

This can also be adjusted using the DrTCP GUI tool (see link below).
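
Alternatively, the key can be created from the command line with the built-in reg tool; a sketch for Windows 2000/XP, using the value 3 to enable both options (a reboot is typically required before the change takes effect):

reg add "HKLM\System\CurrentControlSet\Services\Tcpip\Parameters" /v Tcp1323Opts /t REG_DWORD /d 3 /f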

TCP Buffer Sizes

See the EndSystemTcpBufferSizing topic for general information about sizing TCP buffers.

Buffers for TCP Send Windows

Inquiry at Microsoft (thanks to Larry Dunn) has revealed that the default send window is 8KB and that there is no official support for configuring a system-wide default. However, the current Winsock implementation uses the following undocumented registry key for this purpose:

[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\AFD\Parameters]

Value Name: DefaultSendWindow
Data Type: REG_DWORD (DWORD Value)
Value Data: The window size in bytes. The maximum value is unknown.

According to Microsoft, this parameter may not be supported in future Winsock releases. However, this parameter is confirmed to be working with Windows XP, Windows 2000, Windows Vista and even the Windows 7 Beta 1.

This value needs to be adjusted manually (e.g., DrTCP cannot do it), or the application needs to set the send buffer itself (e.g., iperf with its '-w' option). It may be difficult to detect this as a bottleneck, because the host will advertise large windows but will only keep about 8.5KB of data in flight.
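
As a sketch, the value could be created from an administrative command prompt as follows; the 262144 (256KB) figure is purely illustrative and should be sized to the bandwidth-delay product of the paths in use (see EndSystemTcpBufferSizing), and a reboot is typically needed before it takes effect:

rem 262144 bytes (256KB) is an illustrative value only
reg add "HKLM\System\CurrentControlSet\Services\AFD\Parameters" /v DefaultSendWindow /t REG_DWORD /d 262144 /f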

Buffers for TCP Receive Windows

(Based on a Microsoft TechNet article.)

If Window Scaling is enabled, the following parameter can be set to a value between 1 and 1073741823 bytes:

[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters]

Value Name: TcpWindowSize
Data Type: REG_DWORD (DWORD Value)
Value Data: The window size in bytes. 1 to 65535 (or 1 to 1073741823 if Window scaling set).

DrTCP can also be used to adjust this parameter.
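
Again, a command-line sketch; the 4194304 (4MB) value is only an example, and values above 65535 are only useful when window scaling has been enabled via Tcp1323Opts as described above:

rem 4194304 bytes (4MB) is an illustrative value only
reg add "HKLM\System\CurrentControlSet\Services\Tcpip\Parameters" /v TcpWindowSize /t REG_DWORD /d 4194304 /f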

Windows XP SP2/SP3 simultaneous connection limit

Windows XP SP2 and SP3 limit the number of simultaneous TCP connection attempts (the so-called half-open state) to 10 by default. How exactly this is enforced is not clear, but it may affect some kinds of applications; file-sharing applications such as BitTorrent in particular can suffer from this behaviour. You can check whether your performance is affected by looking at the event logs:

EventID 4226: TCP/IP has reached the security limit imposed on the number of concurrent TCP connect attempts

A patcher has been developed to adjust this limit.

References

-- TobyRodwell - 28 Feb 2005 (initial version)
-- SimonLeinen - 06 Apr 2005 (removed dead pointers; added new ones; descriptions)
-- AlexGall - 17 Jun 2005 (added send window information)
-- HankNussbacher - 19 Jun 2005 (added new MS webcast for 2003 Server)
-- SimonLeinen - 12 Sep 2005 (added information and pointer about new TCP/IP in Vista)
-- HankNussbacher - 22 Oct 2006 (added Cisco IOS firewall upgrade)
-- AlexGall - 31 Aug 2007 (replaced IOS issue with a link to WindowScalingProblems, added info for scaling limitation in Vista)
-- SimonLeinen - 04 Nov 2007 (information on how to enable Compound TCP on Vista)
-- PekkaSavola - 05 Jun 2008 (major reorganization to improve the clarity wrt send buffer adjustment, add a link to DrTCP; add XP SP3 connection limiting discussion)
-- ChrisWelti - 02 Feb 2009 (added new information regarding the send window in Windows Vista / Server 2008 / 7)

Network Adapter and Driver Issues

One aspect that causes many performance problems is compatibility issues between switches and NICs (full- vs. half-duplex mismatches, for example). The document "Troubleshooting Cisco Catalyst Switches to NIC Compatibility Issues" from Cisco covers many vendor NICs.

Performance Impacts of Host Architecture

The host architecture has an impact on performance, especially at higher rates. Important factors include:

Performance-Friendly Adapter and Driver Features

There are several techniques that enhance network performance by moving critical functions to the network adapter. These techniques typically require both special hardware support in the adapter and support at the device-driver level to make use of that hardware; a notable exception is GSO, which is implemented in software.

References

-- SimonLeinen - 2005-02-02 - 2015-01-28
-- PekkaSavola - 2006-11-15

-- SimonLeinen - 09 Apr 2006

 