Chapter 1: Introduction
The TCP/IP protocol suite forms the basis of the Internet (a wide area network or WAN).
TCP/IP is considered a 4-layer system (networking protocols are usually developed in layers).
Each layer has a different responsibility:
- the link layer (also known as the data-link layer or network interface layer) normally includes the device driver in the operating system and the corresponding network interface card in the computer.
- the network layer (also called the internet layer) handles the movement of packets around the network. IP (Internet Protocol), ICMP (Internet Control Message Protocol), and IGMP (Internet Group Management Protocol) provide the network layer in the TCP/IP protocol suite.
- the transport layer provides a flow of data between two hosts for the application layer above. There are two different transport protocols: TCP (Transmission Control Protocol) and UDP (User Datagram Protocol).
- the application layer handles the details for the particular application.
Every interface on an internet must have a unique internet address (also called an IP address). These addresses are 32-bit numbers.
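A 32-bit address is normally written in dotted-decimal form, one decimal number per byte. The sketch below converts between the two forms using the standard socket and struct modules; the helper names dotted_to_int and int_to_dotted are made up for illustration, and 192.0.2.1 is a documentation address, not one from these notes.

```python
import socket
import struct


def dotted_to_int(dotted):
    """Convert a dotted-decimal IPv4 address to its 32-bit integer value."""
    return struct.unpack("!I", socket.inet_aton(dotted))[0]


def int_to_dotted(value):
    """Convert a 32-bit integer back to dotted-decimal notation."""
    return socket.inet_ntoa(struct.pack("!I", value))


print(hex(dotted_to_int("192.0.2.1")))   # 0xc0000201
print(int_to_dotted(0xFFFFFFFF))         # 255.255.255.255
```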
The Domain Name System (DNS) is a distributed database that provides the mapping between IP addresses and hostnames.
Encapsulation: when an application sends data using TCP, the data is sent down the protocol stack, through each layer, until it is sent as a stream of bits across the network. Each layer adds information to the data by prepending headers (and sometimes adding trailer information) to the data that it receives. The unit of data that TCP sends to IP is called a TCP segment. The unit of data that IP sends to the network interface is called an IP datagram. The stream of bits that flows across the Ethernet is called a frame.
Demultiplexing: when an Ethernet frame is received at the destination host it starts its way up the protocol stack and all the headers are removed by the appropriate protocol box. Each protocol box looks at certain identifiers in its header to determine which box in the next upper layer receives the data. This is called demultiplexing.
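The demultiplexing decision can be sketched as two lookup tables. This is only an illustration of the idea, not how any kernel is organised: the function name demultiplex and the table names are hypothetical, while the numbers are the standard Ethernet frame type and IP protocol values.

```python
# Standard Ethernet frame type values (the identifier in the Ethernet header).
ETHER_TYPES = {0x0800: "IP", 0x0806: "ARP", 0x8035: "RARP"}

# Standard IP protocol numbers (the protocol field in the IP header).
IP_PROTOCOLS = {1: "ICMP", 2: "IGMP", 6: "TCP", 17: "UDP"}


def demultiplex(ether_type, ip_protocol=None):
    """Return the chain of protocol boxes a received frame passes through."""
    link_target = ETHER_TYPES.get(ether_type)
    if link_target is None:
        return ["Ethernet driver", "discard"]
    path = ["Ethernet driver", link_target]
    if link_target == "IP":
        # IP looks at its own header to pick the next box up the stack.
        path.append(IP_PROTOCOLS.get(ip_protocol, "discard"))
    return path


print(demultiplex(0x0800, 6))   # frame carrying a TCP segment
print(demultiplex(0x0806))      # ARP request or reply
```

TCP and UDP repeat the same step one layer higher, using the destination port number to choose the application.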
The de facto standard for TCP/IP implementations is the one from the Computer Systems Research Group at the University of California at Berkeley. Historically this has been distributed with the 4.x BSD system (Berkeley Software Distribution), and with the "BSD networking releases."
Two popular application programming interfaces (APIs) for applications using the TCP/IP protocols are called sockets and TLI (Transport Layer Interface). The former is sometimes called "Berkeley sockets."
Chapter 2: Link Layer
The loopback interface allows a client and a server on the same host to communicate with each other using TCP/IP.
MTU refers to the maximum transmission unit: the limit the link layer places on the size of the frame used for encapsulation. The path MTU is the smallest MTU of any link on the path between two hosts.
netstat -in provides information on the interfaces on the system.
Chapter 3: Internet Protocol (IP)
All TCP, UDP, ICMP, and IGMP data gets transmitted as IP datagrams.
IP provides an unreliable, connectionless datagram delivery service.
Unreliable: it does not guarantee successful delivery of an IP datagram, rather IP provides a best effort service. Any required reliability must be provided by a higher layer, for example TCP.
Connectionless: IP does not contain state information about successive datagrams. Datagrams can be delivered out of order.
IP uses big endian byte ordering (also called network byte ordering). Systems that use little endian byte ordering must convert header values into big endian or network byte order before transmitting data.
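As a small illustration of the conversion, the sketch below packs the same 16-bit value in both byte orders using the standard struct module, and round-trips it through htons and ntohs. The value 0x1234 is arbitrary.

```python
import socket
import struct

value = 0x1234  # a 16-bit header field, for example a port number

big = struct.pack("!H", value)     # "!" selects network (big endian) order
little = struct.pack("<H", value)  # "<" selects little endian order

print(big.hex())     # 1234
print(little.hex())  # 3412

# htons/ntohs convert between host and network order. A round trip
# returns the original value whatever the host's own byte order is.
round_trip = socket.ntohs(socket.htons(value))
print(round_trip == value)
```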
IP routing is simple, especially for a host. If the destination is directly connected to the host (e.g., a point-to-point link) or on a shared network (e.g., Ethernet or token ring), then the IP datagram is sent directly to the destination. Otherwise the host sends the datagram to a default router, and lets the router deliver the datagram to its destination.
The IP layer has a routing table in memory that it searches each time it receives a datagram to send. When a datagram is received from a network interface, IP first checks if the destination IP address is one of its own IP addresses or an IP broadcast address. If so, the datagram is delivered to the protocol module specified by the protocol field in the IP header. If the datagram is not destined for this IP layer, then (1) if the IP layer was configured to act as a router the packet is forwarded (that is, handled as an outgoing datagram as described below), else (2) the datagram is silently discarded.
A host address has a nonzero host ID and identifies one particular host. A network address has a host ID of 0 and identifies all the hosts on that network (e.g., Ethernet, token ring).
All hosts are now required to support subnet addressing. Instead of considering an IP address as just a network ID and host ID, the host ID portion is divided into a subnet ID and a host ID.
The ability to specify a route to a network, and not have to specify a route to every host, is another fundamental feature of IP routing.
A different link-layer header can be used on each link, and the link-layer destination address (if present) always contains the link-layer address of the next hop. Ethernet addresses are normally obtained using Address Resolution Protocol (ARP).
Whether or not to subnet (after obtaining an IP network ID of a certain class) is up to the system administrator.
A subnet mask specifies at boot time how many bits are to be used for the subnet ID and how many bits are for the host ID.
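The split can be worked through with bit masks. The sketch below assumes a hypothetical class B address, 172.16.1.2, with a subnet mask of 255.255.255.0, so that 16 bits are the network ID, 8 bits the subnet ID, and 8 bits the host ID. None of these values come from the notes.

```python
import ipaddress

address = int(ipaddress.IPv4Address("172.16.1.2"))         # hypothetical class B host
subnet_mask = int(ipaddress.IPv4Address("255.255.255.0"))  # assumed mask
class_b_mask = 0xFFFF0000                                  # class B: 16-bit network ID

network_id = address & class_b_mask
# The subnet ID is the part covered by the subnet mask but not by the
# class mask. The shift of 8 matches the 8 host bits this mask leaves.
subnet_id = (address & subnet_mask & ~class_b_mask) >> 8
host_id = address & ~subnet_mask & 0xFFFFFFFF

print(ipaddress.IPv4Address(network_id))  # 172.16.0.0
print(subnet_id)                          # 1
print(host_id)                            # 2
```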
ifconfig configures or queries a network interface for use by TCP/IP.
netstat provides information about the interfaces on the system: -i shows interface information, and -n prints IP addresses instead of hostnames.
Chapter 4: Address Resolution Protocol (ARP)
IP addresses are only relevant to the TCP/IP suite. The link layer has its own addressing scheme, which other protocol suites can use at the same time as TCP/IP: hosts using TCP/IP and hosts using some other PC network software can share the same physical Ethernet cable.
When an Ethernet frame is sent from one host on a LAN to another, it is the 48-bit Ethernet address that determines for which interface the frame is destined. The device driver software never looks at the destination IP address in the IP datagram.
The Address Resolution Protocol provides a mapping from 32-bit IP addresses to link layer addresses (for example 48-bit Ethernet addresses), while the Reverse Address Resolution Protocol (RARP) maps in the other direction, from the link layer address to the IP address.
The fundamental concept behind ARP is that the network interface has a hardware address (a 48-bit value for an Ethernet or token ring interface). Frames exchanged at the hardware level must be addressed to the correct interface. But TCP/IP works with its own addresses: 32-bit IP addresses. Knowing a host's IP address doesn't let the kernel send a frame to that host. The kernel (i.e., the Ethernet driver) must know the destination's hardware address to send it data. The function of ARP is to provide a dynamic mapping between 32-bit IP addresses and the hardware addresses used by various network technologies.
The ARP cache can be examined with the arp command; -a shows all entries.
tcpdump can be used to examine packets.
Chapter 5: Reverse Address Resolution Protocol (RARP)
A system without a disk needs a way to obtain its IP address (a system with a disk usually reads its IP address from a configuration file at boot).
Each system on a network has a unique hardware address, assigned by the manufacturer of the network interface. The principle of RARP is for the diskless system to read its unique hardware address from the interface card and send an RARP request (a broadcast frame on the network) asking for someone to reply with the diskless system's IP address (in an RARP reply).
Chapter 6: Internet Control Message Protocol (ICMP)
ICMP is often considered part of the IP layer. It communicates error messages and other conditions that require attention. ICMP messages are usually acted on by either the IP layer or the higher layer protocol (TCP or UDP). Some ICMP messages cause errors to be returned to user processes.
There are 15 different values for the type field, which identify the particular ICMP message. Some types of ICMP messages then use different values of the code field to further specify the condition.
Chapter 7: Ping Program
Ping tests whether another host is reachable. It sends an ICMP echo request message to a host, expecting an ICMP echo reply to be returned. Ping also measures the round-trip time to the host.
The server echoes the identifier and sequence number fields, along with any additional data sent by the client.
On some systems (BSD/386 version 0.9.4 for example) the time may be listed as 0 ms because the available timer only provides 10 ms accuracy.
-r sets the IP record route (RR) option in the outgoing IP datagram that contains the ICMP echo request message. This option records the route the datagram takes to its destination.
This functionality can also be obtained using
Chapter 8: Traceroute Program
Traceroute lets us see the route that IP datagrams follow from one host to another. Traceroute also lets us use the IP source route option.
The program sends UDP datagrams starting with TTL 1, increasing the TTL by 1 to locate each router in the path. An ICMP time exceeded is returned by each router when it discards the UDP datagram, and an ICMP port unreachable is generated by the final destination.
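The TTL mechanism can be illustrated with a small simulation. This sketch sends nothing on the network: it takes a hypothetical list of routers on the path and reports which ICMP message each probe would provoke. The function name simulate_traceroute and the host names are made up.

```python
def simulate_traceroute(routers, destination, max_ttl=30):
    """Return (ttl, responder, icmp_message) for each probe sent."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        if ttl <= len(routers):
            # The TTL reaches 0 at router number `ttl`, which discards the
            # datagram and returns an ICMP time exceeded.
            hops.append((ttl, routers[ttl - 1], "time exceeded"))
        else:
            # The datagram reaches the destination, where the unused UDP
            # port provokes an ICMP port unreachable.
            hops.append((ttl, destination, "port unreachable"))
            break
    return hops


for hop in simulate_traceroute(["router1", "router2"], "target"):
    print(hop)
```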
-g implements loose and strict source routing, to specify a forced route.
Chapter 9: IP Routing
Datagrams to be routed can be generated on the local host or on some other host. In the latter case the local host must be configured to act as a router; otherwise the datagrams it receives for other destinations are dropped.
The routing table can be updated by the route command or when ICMP redirect messages are received. The information contained in the routing table drives all decisions made by IP. The kernel maintains the routing table.
The steps IP performs when searching its routing table:
- search for a matching host address
- search for a matching network address
- search for a default entry (usually specified as a network entry with a network ID of 0)
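The three search steps above can be sketched with the standard ipaddress module. The routing table below is entirely hypothetical (documentation addresses and an invented interface name), and real implementations use more efficient data structures than a list scan.

```python
import ipaddress

# Hypothetical routing table: (destination, gateway, flags, interface).
ROUTES = [
    ("192.0.2.65/32", "192.0.2.35", "UGH", "eth0"),  # host route
    ("192.0.2.32/27", "192.0.2.33", "U", "eth0"),    # directly connected network
    ("0.0.0.0/0", "192.0.2.34", "UG", "eth0"),       # default route
]


def lookup(destination, routes=ROUTES):
    """Search host entries, then network entries, then the default."""
    dest = ipaddress.ip_address(destination)
    # Step 1: a matching host address (entries with the H flag).
    for route in routes:
        if "H" in route[2] and dest == ipaddress.ip_network(route[0]).network_address:
            return route
    # Step 2: a matching network address.
    for route in routes:
        if "H" not in route[2] and route[0] != "0.0.0.0/0":
            if dest in ipaddress.ip_network(route[0]):
                return route
    # Step 3: the default entry, if there is one.
    for route in routes:
        if route[0] == "0.0.0.0/0":
            return route
    return None  # no route: the datagram is undeliverable


print(lookup("192.0.2.65"))    # matches the host route
print(lookup("192.0.2.40"))    # matches the network route
print(lookup("198.51.100.1"))  # falls through to the default
```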
IP performs the routing mechanism (searching the routing table to decide which interface to send a packet out), while routing policy is set by a routing daemon.
netstat -rn // -r displays the routing table, -n prints IP addresses in numeric format not names.
U the route is up.
G the route is to a gateway (router). If not set, the destination is directly connected.
H the route is to a host, that is, the destination is a complete host address. If not set, the route is to a network and the destination is a network address.
D the route was created by an ICMP redirect.
M the route was modified by an ICMP redirect.
Use displays the number of packets sent through that route.
lo0 is the loopback interface.
Whenever an interface is initialised a direct route is automatically created for that interface. Routes to hosts or networks that are not directly connected must be entered into the routing table. The route command may be executed from initialisation files at system boot.
route add <destination> <gateway/router> 1 // final argument is the routing metric: if 0 the route is set without the G flag, if greater than 0 it is set with the G flag.
Routing tables are also initialised by running a routing daemon or via the router discovery protocol.
The ICMP "host unreachable" error message is sent by a router when it receives an IP datagram that it cannot deliver or forward.
Hosts are not supposed to forward IP datagrams unless they have been specifically configured as a router. On Unix systems this is controlled via the ipforwarding kernel variable.
ICMP redirect errors allow a host with minimal routing knowledge to build up a better routing table over time.
Chapter 10: Dynamic Routing Protocols
There are two basic types of routing protocols: Interior Gateway Protocols (IGPs), for routers within an autonomous system, and Exterior Gateway Protocols (EGPs), for routers to communicate with routers in other autonomous systems.
Dynamic Routing Protocols are used by routers to communicate with each other. The most common is the Routing Information Protocol (RIP). The process running the routing protocol is called a routing daemon.
Dynamic routing occurs when routers talk to adjacent routers, informing each other of what networks each router is currently connected to.
Many different routing protocols are used on the internet. The internet is organized into a collection of autonomous systems (ASs), each of which is normally administered by a single entity. A corporation or university campus often defines an autonomous system.
The Unix routing daemon is usually routed. This daemon uses RIP exclusively and is intended for small to medium sized networks. An alternative is gated, which supports both IGPs and EGPs.
RIP messages are carried in UDP datagrams, usually on UDP port 520.
ripquery can be used to query routers for their routing table.
Open Shortest Path First: OSPF is a newer alternative to RIP as an interior gateway protocol. It was designed to overcome the limitations of RIP.
Border Gateway Protocol: BGP is an exterior gateway protocol for communication between routers in different autonomous systems.
Chapter 11: User Datagram Protocol (UDP)
UDP is a simple, datagram-oriented, transport layer protocol: each output operation by a process produces exactly one UDP datagram, which causes one IP datagram to be sent.
This is different from a stream-oriented protocol such as TCP where the amount of data written by an application may have little relationship to what actually gets sent in a single IP datagram.
UDP provides no reliability: it sends the datagrams that the application writes to the IP layer, but there is no guarantee that they ever reach their destination.
The application needs to worry about the size of the resulting IP datagram. If it exceeds the network's MTU (path MTU) the IP datagram is fragmented.
The UDP checksum covers the UDP header and the UDP data. The checksum is optional in UDP (not TCP).
If the sender did compute a checksum and the receiver detects a checksum error, the UDP datagram is silently discarded. No error message is generated. (This is also what happens if an IP header checksum error is detected by IP.)
This UDP checksum is an end-to-end checksum. It is calculated by the sender, and then verified by the receiver. It is designed to catch any modification of the UDP header or data anywhere between the sender and receiver.
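The arithmetic behind the checksum is the 16-bit one's complement sum used throughout the TCP/IP suite. The sketch below implements only that arithmetic over a byte string; a real UDP checksum is also computed over a pseudo-header taken from the IP header, which is left out here. The function name internet_checksum is an illustrative choice.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum of the data."""
    if len(data) % 2:
        data += b"\x00"          # pad odd-length data with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:           # fold any carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


message = b"hello world!"        # even length, so the checksum stays aligned
checksum = internet_checksum(message)

# The receiver sums the data together with the transmitted checksum.
# An undamaged message yields 0; a modified one does not.
received_ok = message + checksum.to_bytes(2, "big")
received_bad = b"jello world!" + checksum.to_bytes(2, "big")
print(internet_checksum(received_ok) == 0)   # True
print(internet_checksum(received_bad) == 0)  # False
```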
Checksums should always be enabled. To check if a receiver has checksums enabled, watch with tcpdump after sending it a single UDP datagram.
There is no communication between the sender and receiver before the first datagram is sent (unlike TCP). Also, there are no acknowledgments by the receiver when the data is received (unlike TCP).
The physical network layer normally imposes an upper limit on the size of the frame that can be transmitted. Whenever the IP layer receives an IP datagram to send, it determines which local interface the datagram is being sent on (routing), and queries that interface to obtain its MTU. IP compares the MTU with the datagram size and performs fragmentation, if necessary. Fragmentation can take place either at the original sending host or at an intermediate router.
When an IP datagram is fragmented, it is not reassembled until it reaches its final destination.
If one fragment is lost the entire datagram must be retransmitted. IP itself has no timeout and retransmission is the responsibility of the higher layers. (TCP performs timeout and retransmission, UDP doesn't).
Also note the terminology: an IP datagram is the unit of end-to-end transmission at the IP layer (before fragmentation and after reassembly), and a packet is the unit of data passed between the IP layer and the link layer. A packet can be a complete IP datagram or a fragment of an IP datagram.
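The comparison of datagram size against MTU can be sketched as follows. This is an illustration only: it assumes a 20-byte IP header with no options, and it relies on the fact (not stated in these notes) that the fragment offset field counts 8-byte units, so every fragment except the last must carry a multiple of 8 bytes. The function name fragment is hypothetical.

```python
def fragment(payload_len, mtu, ip_header_len=20):
    """Return (offset_in_bytes, data_length, more_fragments) per packet."""
    # Largest amount of data per packet, rounded down to a multiple of 8.
    max_data = (mtu - ip_header_len) // 8 * 8
    fragments = []
    offset = 0
    while offset < payload_len:
        length = min(max_data, payload_len - offset)
        more = offset + length < payload_len
        fragments.append((offset, length, more))
        offset += length
    return fragments


# 1472 bytes of UDP data plus the 8-byte UDP header just fits an Ethernet MTU.
print(fragment(1472 + 8, mtu=1500))  # [(0, 1480, False)]
# One more byte of data forces a second, 1-byte fragment.
print(fragment(1473 + 8, mtu=1500))  # [(0, 1480, True), (1480, 1, False)]
```

Note that only the first fragment contains the UDP header, since IP fragments the datagram without looking inside it.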
ICMP unreachable error (fragmentation required): another variation of the ICMP unreachable error occurs when a router receives a datagram that requires fragmentation, but the don't fragment (DF) flag is turned on in the IP header.
Theoretically, the maximum size of an IP datagram is 65535 bytes, imposed by the 16-bit total length field in the IP header. With an IP header of 20 bytes and a UDP header of 8 bytes, this leaves a maximum of 65507 bytes of user data in a UDP datagram. Most implementations, however, provide less than this maximum.
UDP programming interfaces allow the application to specify the maximum number of bytes to return each time. What happens if the received datagram exceeds the size the application is prepared to deal with? Unfortunately this depends on the implementation. The traditional Berkeley version of the sockets API truncates the datagram, discarding any excess data.
Chapter 12: Broadcasting and Multicasting
Broadcasting and multicasting only apply to UDP, where it makes sense for an application to send a single message to multiple recipients. TCP is a connection-oriented protocol that implies a connection between two hosts (specified by IP addresses) and one process on each host (specified by port numbers).
There are times, however, when a host wants to send a frame to every other host on the cable; this is called a broadcast. We saw this with ARP and RARP. Multicasting fits between unicasting and broadcasting: the frame should be delivered to a set of hosts that belong to a multicast group.
The limited broadcast address is 255.255.255.255.
What routers and hosts do with broadcasting depends on the type of broadcast address, the application, the TCP/IP implementation, and possible configuration switches.
IP multicasting provides two services to applications: delivery to multiple destinations (with TCP these applications would have to deliver a separate copy to each destination), and solicitation of servers by clients.
Multicast group addresses are a separate class of IP address. A multicast group address is the combination of the high-order 4 bits of 1110 and the 28-bit multicast group ID. These are normally written as dotted-decimal numbers and are in the range 224.0.0.0 through 239.255.255.255.
The set of hosts listening to a particular IP multicast address is called a host group. A host group can span multiple networks. Membership in a host group is dynamic: hosts may join and leave host groups at will. There is no restriction on the number of hosts in a group, and a host does not have to belong to a group to send a message to that group.
This allocation allows for 23 bits in the Ethernet address to correspond to the IP multicast group ID. The mapping places the low-order 23 bits of the multicast group ID into these 23 bits of the Ethernet address. Since the upper 5 bits of the 28-bit group ID are ignored, the mapping is not unique (32 group IDs map to each Ethernet address), so the device driver or the IP module must perform filtering.
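The mapping can be worked through in a few lines. The sketch assumes the standard Ethernet multicast prefix 01:00:5e, which these notes do not state explicitly; the function name multicast_mac is made up for illustration.

```python
import ipaddress


def multicast_mac(group):
    """Map an IP multicast group address to its Ethernet multicast address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not a multicast group address")
    low23 = int(addr) & 0x7FFFFF        # keep the low-order 23 bits only
    mac = (0x01005E << 24) | low23      # assumed prefix 01:00:5e, then 0 + 23 bits
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


print(multicast_mac("224.0.0.1"))      # 01:00:5e:00:00:01
# Two different groups that differ only in the ignored upper bits collide:
print(multicast_mac("224.128.64.32"))  # 01:00:5e:00:40:20
print(multicast_mac("224.0.64.32"))    # 01:00:5e:00:40:20
```

The collision in the last two lines is why the receiver has to filter on the full IP address after the interface accepts the frame.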
Multicasting on a single physical network is simple. The sending process specifies a destination IP address that is a multicast address, the device driver converts this to the corresponding Ethernet address, and sends it. The receiving processes must notify their IP layers that they want to receive datagrams destined for a given multicast address, and the device driver must somehow enable reception of these multicast frames. This is called "joining a multicast group."
Chapter 13: Internet Group Management Protocol (IGMP)
IGMP is used by hosts and routers that support multicasting to let all systems on a physical network know which hosts currently belong to which multicast groups.
Chapter 14: the Domain Name System (DNS)
The Domain Name System is a distributed database used by TCP/IP applications to map between hostnames and IP addresses, and to provide electronic mail routing information. The database is distributed because no single site on the internet knows all the information.
From an application's point of view, access to the DNS is through a resolver. On Unix hosts the resolver is accessed primarily through two library functions, gethostbyname(3) and gethostbyaddr(3), which are linked with the application when the application is built. The resolver is normally part of the application. It is not part of the operating system kernel as are the TCP/IP protocols. An application must convert a hostname to an IP address before it can ask TCP to open a connection or send a datagram using UDP. The TCP/IP protocols within the kernel know nothing about the DNS.
The most commonly used implementation of the DNS, both resolver and name server, is called BIND (the Berkeley Internet Name Domain server). The server is called named.
A domain name that ends with a period is called an absolute domain name or a fully qualified domain name (FQDN). If the domain name does not end with a period, it is assumed that the name needs to be completed.
The delegation of responsibility within the DNS: no single entity manages every label in the DNS tree. Instead, one entity (the NIC) maintains a portion of the tree (the top-level domains) and delegates responsibility to others for specific zones.
Once the authority for a zone is delegated, it is up to the person responsible for the zone to provide multiple name servers for that zone. Whenever a new system is installed in a zone, the DNS administrator for the zone allocates a name and an IP address for the new system and enters these into the name server's database.
What does a name server do when it doesn't contain the information requested? It must contact another name server.
The most common query type is an A type, which means an IP address is desired for the query name. A PTR query requests the names corresponding to an IP address. Port 53 is the well-known name server port. A and PTR are types of resource records (RRs).
host queries a name server to look up an IP address. nslookup and dig may also be used.
Pointer queries (PTR queries) take an IP address and return the name (or names) corresponding to that address. Domain names are written starting at the bottom of the DNS tree and working upward. An FQDN such as sun.tuc.noao.edu. therefore begins with the host label, and the in-addr.arpa name for an address lists the bytes of the address in reverse order (for example 33.13.252.140.in-addr.arpa. for the address 140.252.13.33). If there was not a separate branch of the DNS tree for handling this address-to-name translation, there would be no way to do the reverse translation other than starting at the root of the tree and trying every top-level domain.
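The name used in a pointer query can be derived mechanically from the address. The sketch below uses the reverse_pointer attribute of the standard ipaddress module and also builds the same name by hand; 192.0.2.33 is a documentation address chosen for illustration, and ptr_name is a made-up helper name.

```python
import ipaddress


def ptr_name(dotted):
    """Build the in-addr.arpa name for an IPv4 address by reversing its bytes."""
    return ".".join(reversed(dotted.split("."))) + ".in-addr.arpa"


address = "192.0.2.33"  # documentation address, not from the notes
print(ptr_name(address))                              # 33.2.0.192.in-addr.arpa
print(ipaddress.ip_address(address).reverse_pointer)  # same name from the library
```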
Some servers (for example anonymous FTP) require the client's IP address to have a pointer record in the DNS. Other servers, such as the Rlogin server, not only require that the client's IP address have a pointer record, but then ask the DNS for the IP addresses corresponding to the name returned in the PTR response, and require that one of the returned addresses match the source IP address in the received datagram.
There are about 20 types of RRs. For example:
- CNAME: canonical name, used to define an alias.
- MX: mail exchange record.
- NS: name server record
All name servers use a cache to reduce traffic on the internet. In the Unix implementation, the cache is maintained in the server, not the resolver. If the host command reports a non-authoritative answer, the answer has been obtained from the cache.
DNS generally uses UDP; however, TCP may be used if the size of the response exceeds 512 bytes.
Chapter 15: Trivial File Transfer Protocol (TFTP)
TFTP is intended to be used when bootstrapping diskless systems. Unlike the File Transfer Protocol (FTP), which uses TCP, TFTP was designed to use UDP, to make it simple and small. Implementations of TFTP (and its required UDP, IP, and a device driver) can fit in read-only memory.
TFTP is an example of a stop-and-wait protocol. Each data packet contains a block number that is later used in an acknowledgment packet.
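Stop-and-wait can be illustrated without any network traffic. The sketch below simulates a loss-free transfer: each data packet carries a block number, and the sender does not move on until that block has been acknowledged. The opcodes (3 for data, 4 for acknowledgment) and the 512-byte block size are the standard TFTP values, which the notes do not list; the function names are hypothetical and timeouts are ignored.

```python
import struct

OP_DATA, OP_ACK = 3, 4   # TFTP opcodes for data and acknowledgment packets
BLOCK_SIZE = 512         # every data packet except the last carries 512 bytes


def data_packets(contents):
    """Yield data packets; a block shorter than 512 bytes ends the transfer."""
    block, offset = 1, 0
    while True:
        chunk = contents[offset:offset + BLOCK_SIZE]
        yield struct.pack("!HH", OP_DATA, block) + chunk
        if len(chunk) < BLOCK_SIZE:
            return
        block += 1
        offset += BLOCK_SIZE


def ack_for(packet):
    """Build the acknowledgment for a data packet, echoing its block number."""
    opcode, block = struct.unpack("!HH", packet[:4])
    if opcode != OP_DATA:
        raise ValueError("not a data packet")
    return struct.pack("!HH", OP_ACK, block)


def transfer(contents):
    """Return (data length, acknowledged block) for each exchange."""
    log = []
    for packet in data_packets(contents):
        ack = ack_for(packet)  # the sender waits for this before continuing
        _, acked_block = struct.unpack("!HH", ack)
        log.append((len(packet) - 4, acked_block))
    return log


print(transfer(b"x" * 700))   # [(512, 1), (188, 2)]
print(transfer(b"x" * 1024))  # [(512, 1), (512, 2), (0, 3)]
```

A file whose size is an exact multiple of 512 bytes needs a final empty data packet, as the second example shows.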
Chapter 16: Bootstrap Protocol (BOOTP)
BOOTP uses UDP and normally works in conjunction with TFTP. There are two well-known ports for BOOTP: 67 for the server and 68 for the client.
Chapter 17: Transmission Control Protocol (TCP)
Even though TCP and UDP use the same network layer (IP), TCP provides a totally different service to the application layer than UDP does. TCP provides a connection-oriented, reliable, byte stream service.
Connection-oriented: the two applications using TCP (normally considered a client and a server) must establish a TCP connection with each other before they can exchange data.
TCP provides reliability by the following:
- application data is broken into what TCP considers the best sized chunks to send. This is totally different from UDP, where each write by the application generates a UDP datagram of that size. The unit of information passed by TCP to IP is called a segment.
- When TCP sends a segment it maintains a timer, waiting for the other end to acknowledge reception of the segment. If an acknowledgment isn't received in time, the segment is retransmitted.
- When TCP receives data from the other end of the connection, it sends an acknowledgment.
- TCP maintains a checksum on its header and data. This is an end-to-end checksum whose purpose is to detect any modification of the data in transit. If a segment arrives with an invalid checksum, TCP discards it and doesn't acknowledge receiving it.
- Since TCP segments are transmitted as IP datagrams, and since IP datagrams can arrive out of order, TCP segments can arrive out of order. A receiving TCP resequences the data if necessary, passing the received data in the correct order to the application.
- Since IP datagrams can get duplicated, a receiving TCP must discard duplicate data.
- TCP also provides flow control. Each end of a TCP connection has a finite amount of buffer space. A receiving TCP only allows the other end to send as much data as the receiver has buffers for. This prevents a fast host from taking all the buffers on a slower host.
A stream of 8-bit bytes is exchanged across the TCP connection between the two applications. There are no record markers automatically inserted by TCP. This is what we called a byte stream service. If the application on one end writes 10 bytes, followed by a write of 20 bytes, followed by a write of 50 bytes, the application at the other end of the connection cannot tell what size the individual writes were. The other end may read the 80 bytes in four reads of 20 bytes at a time. One end puts a stream of bytes into TCP and the same, identical stream of bytes appears at the other end.
Also, TCP does not interpret the contents of the bytes at all. TCP has no idea if the data bytes being exchanged are binary data, ASCII characters, EBCDIC characters, or whatever. The interpretation of this byte stream is up to the applications on each end of the connection.
This treatment of the byte stream by TCP is similar to the treatment of a file by the Unix operating system. The Unix kernel does no interpretation whatsoever of the bytes that an application reads or writes; that is up to the applications. There is no distinction to the Unix kernel between a binary file and a file containing lines of text.
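The byte stream behaviour described above (writes of 10, 20, and 50 bytes read back as four reads of 20 bytes) can be tried over the loopback interface. This is a sketch that assumes loopback TCP connections are permitted on the machine; the helper names recv_exact and demo are made up.

```python
import socket


def recv_exact(sock, count):
    """Read exactly count bytes, since a single recv may return fewer."""
    chunks = []
    while count > 0:
        chunk = sock.recv(count)
        if not chunk:
            raise ConnectionError("connection closed early")
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)


def demo():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    try:
        for size in (10, 20, 50):     # three writes of different sizes...
            client.sendall(b"x" * size)
        # ...read back as four reads of 20 bytes each.
        return [len(recv_exact(server, 20)) for _ in range(4)]
    finally:
        for sock in (client, server, listener):
            sock.close()


print(demo())  # [20, 20, 20, 20]
```

The reader cannot recover the original write sizes; an application that needs record boundaries has to encode them in the data itself.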
Each TCP segment contains the source and destination port number to identify the sending and receiving application. These two values, along with the source and destination IP addresses in the IP header, uniquely identify each connection.
The combination of an IP address and a port number is sometimes called a socket (a term used in the Berkeley-derived programming interface). It is the socket pair (the 4-tuple consisting of the client IP address, client port number, server IP address, and server port number) that specifies the two end points and uniquely identifies each TCP connection in an internet.
The sequence number identifies the byte in the stream of data from the sending TCP to the receiving TCP that the first byte of data in this segment represents. If we consider the stream of bytes flowing in one direction between two applications, TCP numbers each byte with a sequence number.
Since every byte that is exchanged is numbered, the acknowledgment number contains the next sequence number that the sender of the acknowledgment expects to receive. This is therefore the sequence number plus 1 of the last successfully received byte of data.
TCP provides a full-duplex service to the application layer. This means that data can be flowing in each direction, independent of the other direction.
There are six flag bits in the TCP header.
- URG: the urgent pointer is valid
- ACK: the acknowledgment number is valid
- PSH: the receiver should pass this data to the application as soon as possible
- RST: reset the connection
- SYN: synchronize sequence numbers to initiate a connection
- FIN: the sender is finished sending data
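The six flags occupy the low-order six bits of one byte in the TCP header. The sketch below decodes that byte using the standard bit positions, which the list above does not give; the function name decode_flags is an illustrative choice.

```python
# Standard bit masks for the six flags, from high-order to low-order bit.
TCP_FLAGS = [
    (0x20, "URG"), (0x10, "ACK"), (0x08, "PSH"),
    (0x04, "RST"), (0x02, "SYN"), (0x01, "FIN"),
]


def decode_flags(bits):
    """Return the names of the flags that are set in a flags byte."""
    return [name for mask, name in TCP_FLAGS if bits & mask]


print(decode_flags(0x02))  # ['SYN']
print(decode_flags(0x12))  # ['ACK', 'SYN']
print(decode_flags(0x11))  # ['ACK', 'FIN']
```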
The data portion of the segment is optional. During connection establishment and termination, segments are exchanged without data.
Chapter 18: TCP Connection Establishment and Termination
TCP is a connection-oriented protocol. Before either end can send data to the other, a connection must be established between them.
tcpdump abbreviates the flags S (SYN), F (FIN), R (RST), P (PSH) and . (no flags).
1415531521:1415531521 (0) means the sequence number of the packet was 1415531521 and the number of data bytes in the segment was 0.
win shows the window size.
<mss 1024> shows the maximum segment size (MSS) option specified by the sender. The sender does not want to receive TCP segments larger than this value.
To establish a TCP connection:
- The requesting end (normally called the client) sends a SYN segment specifying the port number of the server that the client wants to connect to, and the client's initial sequence number (ISN, 1415531521 in this example). This is segment 1.
- The server responds with its own SYN segment containing the server's initial sequence number (segment 2). The server also acknowledges the client's SYN by ACKing the client's ISN plus one. A SYN consumes one sequence number.
- The client must acknowledge this SYN from the server by ACKing the server's ISN plus one (segment 3).
Three segments complete the connection establishment. This is often called the three-way handshake.
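The sequence and acknowledgment numbers in the three segments follow directly from the two ISNs. The sketch below computes them; the client ISN is the one quoted above, while the server ISN is an assumed value for illustration. Sequence numbers are 32 bits wide, so the arithmetic wraps modulo 2^32.

```python
MOD = 2 ** 32  # sequence numbers are 32-bit values and wrap around


def three_way_handshake(client_isn, server_isn):
    """Return the seq and ack values of the three handshake segments."""
    return [
        {"from": "client", "flags": "SYN", "seq": client_isn, "ack": None},
        {"from": "server", "flags": "SYN,ACK", "seq": server_isn,
         "ack": (client_isn + 1) % MOD},
        {"from": "client", "flags": "ACK", "seq": (client_isn + 1) % MOD,
         "ack": (server_isn + 1) % MOD},
    ]


# 1823083521 is an assumed server ISN, not a value taken from the notes.
for segment in three_way_handshake(1415531521, 1823083521):
    print(segment)
```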
It takes four segments to terminate a connection. Since a TCP connection is full-duplex (that is, data can be flowing in each direction independently of the other direction), each direction must be shut down independently. Either end can send a FIN when it is done sending data.
The receipt of a FIN only means there will be no more data flowing in that direction. A TCP can still send data after receiving a FIN. While it's possible for an application to take advantage of this half-close, in practice few TCP applications use it.
Most Berkeley-derived systems set a time limit of 75 seconds on the establishment of a new connection.
The maximum segment size (MSS) is the largest "chunk" of data that TCP will send to the other end. When a connection is established, each end can announce its MSS. The resulting IP datagram is normally 40 bytes larger: 20 bytes for the TCP header and 20 bytes for the IP header.
If the destination IP address is "nonlocal," the MSS normally defaults to 536.
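The relationship between MSS and datagram size is simple arithmetic. The sketch below assumes 20-byte IP and TCP headers with no options; 1500 is the Ethernet MTU, and 576 is the datagram size that the 536-byte default corresponds to.

```python
IP_HEADER = 20   # bytes, assuming no IP options
TCP_HEADER = 20  # bytes, assuming no TCP options


def mss_for_mtu(mtu):
    """Largest TCP data size that fits in one unfragmented datagram."""
    return mtu - IP_HEADER - TCP_HEADER


print(mss_for_mtu(1500))  # 1460, for an Ethernet MTU
print(mss_for_mtu(576))   # 536, the default for nonlocal destinations
```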
TCP provides the ability for one end of a connection to terminate its output, while still receiving data from the other end. This is called a half-close.
The TIME_WAIT state is also called the 2MSL wait state. Every implementation must choose a value for the maximum segment lifetime (MSL). It is the maximum amount of time any segment can exist in the network before being discarded.
Given the MSL value for an implementation, the rule is: when TCP performs an active close, and sends the final ACK, that connection must stay in the TIME_WAIT state for twice the MSL. This lets TCP resend the final ACK in case this ACK is lost (in which case the other end will time out and retransmit its final FIN).
RFC 793 states that TCP should not create any connections for MSL seconds after rebooting. This is called the quiet time.
A reset (RST) is sent by TCP whenever a segment arrives that doesn't appear correct for the referenced connection.
A common case for generating a reset is when a connection request arrives and no process is listening on the destination port. In the case of UDP, an ICMP port unreachable is generated when a datagram arrives for a destination port that is not in use. TCP uses a reset instead.
It is possible to abort a connection by sending a reset instead of a FIN. This is sometimes called an abortive release. Aborting a connection provides two features to the application: (1) any queued data is thrown away and the reset is sent immediately, and (2) the receiver of the RST can tell that the other end did an abort instead of a normal close. The API being used by the application must provide a way to generate the abort instead of a normal close.
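In the sockets API, the usual way to request an abort is the SO_LINGER option with lingering enabled and a linger time of zero. The sketch below assumes a Unix-like system where struct linger is two C ints (the "ii" packing differs on some platforms); abortive_close is a hypothetical helper name.

```python
import socket
import struct


def abortive_close(sock: socket.socket) -> None:
    """Close `sock` so that TCP sends a RST instead of a FIN.

    Assumption: struct linger is laid out as two C ints (l_onoff, l_linger),
    which holds on Linux and the BSDs.
    """
    linger = struct.pack("ii", 1, 0)  # l_onoff=1, l_linger=0
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, linger)
    sock.close()  # queued data is discarded and the reset is sent
```

On the receiving side, a subsequent read typically fails with a "connection reset by peer" error rather than returning end-of-file, which is how the receiver can tell an abort from a normal close.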
A TCP connection is said to be half-open if one end has closed or aborted the connection without the knowledge of the other end. This can happen any time one of the two hosts crashes. As long as there is no attempt to transfer data across a half-open connection, the end that's still up won't detect that the other end has crashed.
It is possible, although improbable, for two applications to both perform an active open to each other at the same time. Each end must transmit a SYN, and the SYNs must pass each other on the network. It also requires each end to have a local port number that is well known to the other end. This is called a simultaneous open.
The TCP header can contain options, such as the MSS.
Chapter 19: TCP Interactive Data Flow
TCP segments may contain bulk data (for example FTP or email) or interactive data (for example Telnet or Rlogin).
Each interactive keystroke may be sent from the client to the server one byte at a time, or a line at a time (depending on the application's configuration). Sometimes the remote system (server) will echo the character. This would generate four segments including the ACKs: the keystroke from the client, the server's ACK of the keystroke, the echo from the server, and the client's ACK of the echo.
Chapter 20: TCP Bulk Data Flow
TCP uses a form of flow control called a sliding window protocol. It allows the sender to transmit multiple packets before it stops and waits for an acknowledgment.
With TCP, the ACKs are cumulative: they acknowledge that the receiver has correctly received all bytes up through the acknowledged sequence number minus one. A single ACK may therefore acknowledge two or more segments.
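As a rough illustration of cumulative acknowledgment, the sketch below computes the ACK number a receiver would send given the segments it holds. The function name and the (sequence number, length) representation are assumptions for the example, and sequence-number wraparound is ignored.

```python
def cumulative_ack(next_expected: int, segments: list[tuple[int, int]]) -> int:
    """Return the ACK number: the next byte the receiver expects.

    `segments` holds (sequence_number, length) pairs already received.
    Everything below the returned value has been received without gaps.
    """
    for seq, length in sorted(segments):
        if seq > next_expected:
            break  # a gap: later segments cannot be acknowledged yet
        next_expected = max(next_expected, seq + length)
    return next_expected


# Two back-to-back 1024-byte segments are covered by a single ACK.
print(cumulative_ack(1, [(1, 1024), (1025, 1024)]))  # 2049
# With a missing segment in between, the ACK stops at the gap.
print(cumulative_ack(1, [(1, 1024), (2049, 1024)]))  # 1025
```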
Three terms are used to describe the movement of the right and left edges of the sliding window:
- The window closes as the left edge advances to the right. This happens when data is sent and acknowledged.
- The window opens when the right edge moves to the right, allowing more data to be sent. This happens when the receiving process on the other end reads acknowledged data, freeing up space in its TCP receive buffer.
- The window shrinks when the right edge moves to the left. The Host Requirements RFC strongly discourages this, but TCP must be able to cope with a peer that does this.
If the left edge reaches the right edge, it is called a zero window. This stops the sender from transmitting any data.
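The three window movements and the zero window can be modeled with two integers. This is a simplified sketch, not how any particular TCP implements it: the class name SendWindow and the on_ack method are hypothetical, and the right edge is taken to be the acknowledged sequence number plus the advertised window.

```python
from dataclasses import dataclass


@dataclass
class SendWindow:
    left: int   # first byte not yet acknowledged
    right: int  # first byte the sender is not yet allowed to send

    def on_ack(self, ack: int, window: int) -> list[str]:
        """Apply an ACK and report how the window edges moved."""
        new_left, new_right = ack, ack + window
        events = []
        if new_left > self.left:
            events.append("closes")   # left edge advanced to the right
        if new_right > self.right:
            events.append("opens")    # right edge moved to the right
        elif new_right < self.right:
            events.append("shrinks")  # right edge moved to the left
        self.left, self.right = new_left, new_right
        if self.left == self.right:
            events.append("zero window")  # sender must stop transmitting
        return events


w = SendWindow(left=1, right=4097)
print(w.on_ack(1025, 3072))  # ['closes']
print(w.on_ack(1025, 4096))  # ['opens']
print(w.on_ack(5121, 0))     # ['closes', 'zero window']
```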
The size of the window offered by the receiver can usually be controlled by the receiving process. This can affect the TCP performance.
The sockets API allows a process to set the sizes of the send buffer and the receive buffer. The size of the receive buffer is the maximum size of the advertised window for that connection. Some applications change the socket buffer sizes to increase performance.
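In the sockets API, the buffer sizes are set with the SO_SNDBUF and SO_RCVBUF socket options. The sketch below shows one way to request a receive buffer size and read back what the system granted; the helper name and the 65536-byte request are arbitrary choices for the example, and the granted value may differ from the request (some systems round or double it).

```python
import socket


def set_receive_buffer(sock: socket.socket, size: int) -> int:
    """Request a receive buffer of `size` bytes and return the granted size."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Typically done before connect() or listen(), so the setting is in
    # place when the connection is established.
    print(set_receive_buffer(s, 65536))
    s.close()
```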
The PUSH flag is a notification from the sender to the receiver for the receiver to pass all the data that it has to the receiving process. This data would consist of whatever is in the segment with the PUSH flag, along with any other data the receiving TCP has collected for the receiving process.
Most Berkeley-derived implementations automatically set the PUSH flag if the data in the segment being sent empties the send buffer. This means we normally see the PUSH flag set for each application write, because data is usually sent when it's written.
TCP is now required to support an algorithm called slow start. It operates by observing that the rate at which new packets should be injected into the network is the rate at which the acknowledgments are returned by the other end.
Slow start adds another window to the sender's TCP: the congestion window, called cwnd. When a new connection is established with a host on another network, the congestion window is initialized to one segment (i.e., the segment size announced by the other end). Each time an ACK is received, the congestion window is increased by one segment. (cwnd is maintained in bytes, but slow start always increments it by the segment size.) The sender can transmit up to the minimum of the congestion window and the advertised window. The congestion window is flow control imposed by the sender, while the advertised window is flow control imposed by the receiver.
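The growth of cwnd can be traced with a few lines of code. This is a simplified sketch of the rule as stated above (one segment added per ACK, no loss and no congestion avoidance); the function name and the example numbers are assumptions.

```python
def slow_start(mss: int, advertised_window: int, acks: int) -> list[int]:
    """Return how many bytes the sender may have outstanding after each ACK.

    cwnd starts at one segment and grows by one segment per ACK received.
    The sender is limited to min(cwnd, advertised_window).
    """
    cwnd = mss
    allowed = [min(cwnd, advertised_window)]
    for _ in range(acks):
        cwnd += mss  # cwnd is kept in bytes, incremented by the segment size
        allowed.append(min(cwnd, advertised_window))
    return allowed


print(slow_start(mss=512, advertised_window=2048, acks=5))
# [512, 1024, 1536, 2048, 2048, 2048]
```

Once cwnd exceeds the advertised window, the receiver's limit takes over and the amount the sender may transmit stops growing.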
TCP provides what it calls urgent mode, allowing one end to tell the other end that "urgent data" of some form has been placed into the normal stream of data. The other end is notified that this urgent data has been placed into the data stream, and it's up to the receiving end to decide what to do.
The notification from one end to the other that urgent data exists in the data stream is done by setting two fields in the TCP header. The URG bit is turned on and the 16-bit urgent pointer is set to a positive offset that must be added to the sequence number field in the TCP header to obtain the sequence number of the last byte of urgent data.
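The urgent pointer calculation described above is a single addition. A minimal sketch follows, assuming 32-bit sequence numbers that wrap around; the function name is illustrative.

```python
SEQ_SPACE = 2**32  # TCP sequence numbers are 32 bits and wrap around


def last_urgent_byte(seq: int, urgent_pointer: int) -> int:
    """Sequence number of the last byte of urgent data.

    `seq` is the sequence number field of the segment with URG set and
    `urgent_pointer` is the 16-bit positive offset from the header.
    """
    return (seq + urgent_pointer) % SEQ_SPACE


print(last_urgent_byte(1000, 5))       # 1005
print(last_urgent_byte(2**32 - 2, 5))  # 3 (wrapped past the end of the space)
```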
Unfortunately, many implementations incorrectly call TCP's urgent mode out-of-band data. Part of the confusion arises because the predominant programming interface, the sockets API, maps TCP's urgent mode onto what sockets calls out-of-band data.
Chapter 21: TCP Timeout and Retransmission
TCP provides a reliable transport layer. One of the ways it provides reliability is for each end to acknowledge the data it receives from the other end. But data segments and acknowledgments can get lost. TCP handles this by setting a timeout when it sends data, and if the data isn't acknowledged when the timeout expires, it retransmits the data.
TCP manages four different timers for each connection:
- A retransmission timer is used when expecting an acknowledgment from the other end.
- A persist timer keeps window size information flowing even if the other end closes its receive window.
- A keepalive timer detects when the other end on an otherwise idle connection crashes or reboots.
- A 2MSL timer measures the time a connection has been in the TIME_WAIT state.
Fundamental to TCP's timeout and retransmission is the measurement of the round-trip time (RTT) experienced on a given connection.
When TCP times out and retransmits, it does not have to retransmit the identical segment again. Instead, TCP is allowed to perform repacketization, sending a bigger segment, which can increase performance.
Chapter 22: TCP Persist Timer
If an acknowledgment carrying a window update is lost, we could end up with both sides waiting for the other: the receiver waiting to receive data (since it provided the sender with a nonzero window) and the sender waiting to receive the window update allowing it to send. To prevent this form of deadlock from occurring, the sender uses a persist timer that causes it to query the receiver periodically to find out if the window has been increased.
Chapter 23: TCP Keepalive Timer
No data flows across an idle TCP connection. As long as neither connected host reboots, the connection remains established. The keepalive timer provides a server with the capability to know if the client's host has either crashed and is down, or crashed and rebooted. If there is no activity on a given connection for 2 hours, the server sends a probe segment to the client (the client may also implement keepalive).
Keepalive is not part of the TCP specification, and the feature is controversial. Protocol experts continue to debate whether it belongs in the transport layer or should be handled entirely by the application.
It operates by sending a probe packet across a connection after the connection has been idle for 2 hours. Four different scenarios can occur: the other end is still there, the other end has crashed, the other end has crashed and rebooted, or the other end is currently unreachable.
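With the sockets API, keepalive is off by default and is enabled per socket with the SO_KEEPALIVE option. The sketch below only turns the option on; the helper name is hypothetical, and the idle interval is left at the system default (platform-specific options for tuning it are deliberately omitted).

```python
import socket


def enable_keepalive(sock: socket.socket) -> bool:
    """Turn on TCP keepalive probes for `sock` and confirm the setting."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print(enable_keepalive(s))
    s.close()
```

Leaving the decision to the application in this way reflects the debate mentioned above: only connections that ask for probes get them.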
Chapter 24: TCP Futures and Performance
Chapter 25: Simple Network Management Protocol (SNMP)
Chapter 26: Telnet and Rlogin
Chapter 27: File Transfer Protocol (FTP)
Chapter 28: Simple Mail Transfer Protocol (SMTP)
The exchange of mail using TCP is performed by a Message Transfer Agent (MTA). The most common MTA for Unix systems is Sendmail. SMTP is how two MTAs communicate with each other across a single TCP connection.