This course contains the basics of Linux



Sending the Data from the Socket through UDP and TCP

In this chapter, we look at what happens in the transport layer as data is transmitted. When a user writes data into an open socket, the transport layer allocates socket buffers, which travel through the transport layer to IP, where they are routed and passed to the device drivers for sending. Specifying SOCK_DGRAM in the socket call invokes the UDP protocol, and specifying SOCK_STREAM invokes the TCP protocol. For SOCK_DGRAM type sockets, the process is relatively simple, but it is far more complicated for SOCK_STREAM type sockets. We will examine both UDP and TCP and look at the functions that interface each protocol to the socket layer. Next, we will focus on the sendmsg function for each of the protocols and follow the data as it flows through the transport layer.

Socket Layer Glue

Before we start looking at the internals of each of the transport layer protocols, we should look at how service functions in the transport layer protocols are associated with the socket layer functions. Through this mechanism, the application program is able to direct the actions of the transport layer for each of the socket types, SOCK_STREAM and SOCK_DGRAM. As explained in "Linux Sockets," each of the two transport protocols is registered with the socket layer.

The Proto Structure

The key to this registration process is the data structure, proto, which is defined in linux/include/net/sock.h. Most of the fields in the proto structure are function pointers, each corresponding to a specific function in a transport protocol. A transport protocol does not have to implement every function; for example, UDP does not have a shutdown function. UDP and TCP do implement most of the functions in the proto structure. The first eight functions, close through shutdown, are described in the sections for UDP and for TCP.

struct proto {
void (*close)(struct sock *sk,long timeout);
int (*connect)(struct sock *sk, struct sockaddr *uaddr, int addr_len);
int (*disconnect)(struct sock *sk, int flags);
struct sock *(*accept)(struct sock *sk, int flags, int *err);
int (*ioctl)(struct sock *sk, int cmd,unsigned long arg);
int (*init)(struct sock *sk);
int (*destroy)(struct sock *sk);
void (*shutdown)(struct sock *sk, int how);

The getsockopt and setsockopt functions and options for both UDP and TCP are discussed in detail later in this section.

int (*setsockopt)(struct sock *sk, int level, int optname, char *optval, int optlen);
int (*getsockopt)(struct sock *sk, int level, int optname, char *optval, int *option);

The sendmsg function is discussed in the sections for UDP and TCP.

int (*sendmsg)(struct sock *sk,struct msghdr *msg, int len);

Recvmsg is covered in "Receiving the Data in the Transport Layer, UDP and TCP."

int (*recvmsg)(struct sock *sk, struct msghdr *msg, int len, int noblock, int flags, int *addr_len);

The bind function is not implemented by either TCP or UDP within the transport protocols themselves. Instead, it is implemented at the socket layer, covered in "Linux Sockets and Socket Layer Programming."

int(*bind)(struct sock *sk,struct sockaddr *uaddr,int addr_len);

Backlog_rcv is implemented by TCP. Refer to "Receiving Data in the Transport Layer, UDP and TCP" to see what happens when backlog_rcv is executed.

int(*backlog_rcv)(struct sock *sk,struct sk_buff *skb);

The hash and unhash functions are for manipulating hash tables. These tables are for associating the endpoint addresses (port numbers) with open sockets. The tables map transport protocol port numbers to instances of struct sock. Hash places a reference to the sock structure, sk, in the hash table.

void (*hash)(struct sock *sk);

Unhash removes the reference to sk from the hash table.

void (*unhash)(struct sock *sk);

Get_port obtains a local port for the sock structure, sk. Snum is the requested port number; a value of zero asks the protocol to select any available port. Generally, the port is looked up in one of the protocol’s port hash tables.

int (*get_port)(struct sock *sk,unsigned short snum);

This field contains the name of the protocol, either “UDP” or “TCP”.

char name[32];
struct {
int inuse;
u8 __pad[SMP_CACHE_BYTES - sizeof(int)];
} stats[NR_CPUS];
} ;

Neither UDP nor TCP implements all of the functions in the proto structure. As we saw earlier, the AF_INET family provides pointers to default functions that are called from the socket layer when the specific transport protocol doesn’t implement a particular function. Each of the transport protocols registers a set of functions by initializing a data structure of type struct proto, defined in the file sock.h.
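The fallback behavior described above can be illustrated in user space. The following is only a minimal sketch, not kernel code: the types proto_sketch and sock_sketch, the two protocol instances, and every function name are hypothetical stand-ins chosen for this example. It shows a socket-layer entry point that dispatches through a per-protocol function table and uses a family-level default when the protocol leaves the slot empty.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-ins for struct sock and struct proto. */
struct sock_sketch;

struct proto_sketch {
    int (*connect)(struct sock_sketch *sk);
    int (*shutdown)(struct sock_sketch *sk, int how);
    char name[32];
};

struct sock_sketch {
    const struct proto_sketch *prot;
    int shut_how;   /* records what the last shutdown request asked for */
};

/* Protocol-specific handler supplied by the "stream" protocol. */
static int stream_shutdown(struct sock_sketch *sk, int how)
{
    sk->shut_how = how;
    return 1;                 /* 1 = handled by the protocol itself */
}

/* Family-level default used when the protocol leaves a slot empty. */
static int family_default_shutdown(struct sock_sketch *sk, int how)
{
    sk->shut_how = how;
    return 0;                 /* 0 = handled by the default */
}

static const struct proto_sketch stream_proto = {
    .shutdown = stream_shutdown,
    .name     = "STREAM",
};

static const struct proto_sketch dgram_proto = {
    .name = "DGRAM",          /* no shutdown handler registered */
};

/* Socket-layer entry point: dispatch through the table, else fall back. */
static int socket_layer_shutdown(struct sock_sketch *sk, int how)
{
    assert(sk != NULL && sk->prot != NULL);
    if (sk->prot->shutdown)
        return sk->prot->shutdown(sk, how);
    return family_default_shutdown(sk, how);
}
```

Calling socket_layer_shutdown on a socket whose prot points at dgram_proto takes the default path, while a socket using stream_proto reaches stream_shutdown. The return values 1 and 0 exist only to make the chosen path visible in this sketch.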

The Msghdr Structure

All the socket layer read and write functions are translated into calls to either recvmsg or sendmsg, a BSD type message communication method. Internally, the socket layer functions use the msghdr structure, defined in file linux/include/linux/socket.h, to pass data to and from the underlying protocols.

struct msghdr {

The msg_name field is also known as the socket "name"; it holds the destination address for this message. Generally, this field is cast into a pointer to a sockaddr_in. The msg_namelen field is the length of the address in msg_name.

void * msg_name;
int msg_namelen;

Msg_iov points to an array of data blocks passed either to the kernel from the application or from the kernel to the application. Msg_iovlen holds the number of data blocks pointed to by msg_iov. The msg_control field is for BSD style file descriptor passing. Msg_controllen is the length, in bytes, of the control message buffer.

struct iovec * msg_iov;
__kernel_size_t msg_iovlen;
void * msg_control;
__kernel_size_t msg_controllen;
unsigned msg_flags;
} ;
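From the application side, the same structure is filled in before calling sendmsg. The sketch below is a user-space illustration under stated assumptions, not part of the kernel code discussed here: the helper name send_two_parts is hypothetical, and it assumes the socket is already connected so msg_name can stay NULL. It gathers two separate buffers into one message through msg_iov and msg_iovlen.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send two separate buffers as one message using scatter/gather I/O.
 * Returns the byte count reported by sendmsg, or -1 on error. */
static ssize_t send_two_parts(int fd, const char *head, const char *body)
{
    struct iovec iov[2];
    struct msghdr msg;

    assert(head != NULL && body != NULL);

    iov[0].iov_base = (void *)head;
    iov[0].iov_len  = strlen(head);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = strlen(body);

    memset(&msg, 0, sizeof(msg));
    msg.msg_name    = NULL;   /* connected socket: no destination address */
    msg.msg_namelen = 0;
    msg.msg_iov     = iov;    /* array of data blocks */
    msg.msg_iovlen  = 2;      /* number of data blocks */

    return sendmsg(fd, &msg, 0);
}
```

A pair of sockets created with socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) is enough to try it: sending on sv[0] and calling recv on sv[1] should deliver both blocks as a single datagram.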

UDP Socket Glue

As we saw earlier, the transport protocols register with the socket layer by adding a pointer to a proto structure. UDP creates an instance of struct proto at compile time in the file linux/net/ipv4/udp.c and initializes it with values from Table.

Protocol Block Functions for UDP, Struct proto

The UDP protocol is invoked when the application layer specifies SOCK_DGRAM in the type field of the socket call. SOCK_DGRAM type sockets are fairly simple. There is no connection management or buffering. A call to one of the send functions in the application layer causes the data to be sent out immediately as a single datagram. Table shows the UDP protocol functions mapped to each of the fields in the proto structure described earlier.

TCP Socket Glue

Like UDP, TCP registers a set of functions with the socket layer. As in UDP, this is done at compile time: tcp_prot, an instance of the proto structure, is initialized in the file linux/net/ipv4/tcp_ipv4.c with the function pointers shown in Table.

Protocol Block Functions for TCP, Struct proto

Socket Options for TCP

In general, TCP is very configurable. The discussion of the internals of the TCP protocol later in this chapter refers to various options and how they affect the performance or operation of the protocol. The TCP options structure, shown in a separate section, holds the values of many of the socket options. In this section, however, the TCP socket options and ioctl configuration options are gathered together in one place. Although most of these are covered in some fashion in the tcp(7) man page, this section lists applicable internal constants and internal variables as well as any references to other sections in the text. The following options are set via the setsockopt system call or read back with the getsockopt system call.

TCP_CORK: If this option is set, TCP doesn’t send out frames until there is enough data to fill the maximum segment size. It allows the application to stop transmission if the route MTU is less than the Maximum Segment Size (MSS). This option is unique to Linux, and application code using it will not be portable to other operating systems (OSs). This option is held in the nonagle field in the TCP options structure, which is set to the number two. TCP_CORK is mutually exclusive with the TCP_NODELAY option.

TCP_DEFER_ACCEPT: The application caller may sleep until data arrives at the socket, at which time it is awakened. The socket is also awakened when it times out. The caller specifies the number of seconds to wait for data to arrive. This option is unique to Linux, and application code using it will not be portable to other OSs. The option value is converted to the number of ticks and is kept in the defer_accept field of the TCP option structure.

TCP_INFO: The caller using this option can retrieve lots of configuration information about the socket. This is a Linux-unique option, and code using it will not necessarily be portable to other OSs. The information is returned in the tcp_info structure, defined in file


struct tcp_info {

The first field, tcpi_state, contains the current TCP state for the connection. The other fields in this structure contain statistics about the TCP connection.

__u8 tcpi_state;
__u8 tcpi_ca_state;
__u8 tcpi_retransmits;
__u8 tcpi_probes;
__u8 tcpi_backoff;
__u8 tcpi_options;
__u8 tcpi_snd_wscale : 4, tcpi_rcv_wscale : 4;
__u32 tcpi_rto;
__u32 tcpi_ato;
__u32 tcpi_snd_mss;
__u32 tcpi_rcv_mss;
__u32 tcpi_unacked;
__u32 tcpi_sacked;
__u32 tcpi_lost;
__u32 tcpi_retrans;
__u32 tcpi_fackets;

The following four fields are event time stamps; however, we don’t actually remember when an ack was sent in all circumstances.

__u32 tcpi_last_data_sent;
__u32 tcpi_last_ack_sent;
__u32 tcpi_last_data_recv;
__u32 tcpi_last_ack_recv;

The last fields are TCP metrics, such as negotiated MTU, send threshold, round-trip time, and congestion window.

__u32 tcpi_pmtu;
__u32 tcpi_rcv_ssthresh;
__u32 tcpi_rtt;
__u32 tcpi_rttvar;
__u32 tcpi_snd_ssthresh;
__u32 tcpi_snd_cwnd;
__u32 tcpi_advmss;
__u32 tcpi_reordering;
} ;

TCP_KEEPCNT: By using this option, the caller can set the number of keepalive probes that TCP will send for this socket before dropping the connection. This option is unique to Linux and should not be used in portable code. The field keepalive_probes in the tcp_opt structure is set to the value of this option. For this option to be effective, the socket level option SO_KEEPALIVE must also be set.

TCP_KEEPIDLE: With this option, the caller may specify the number of seconds that the connection will stay idle before TCP starts to send keepalive probe packets. This option is only effective if the socket option SO_KEEPALIVE is also set for this socket. This is also a nonportable Linux option. The value of this option is stored in the keepalive_time field in the TCP options structure. The value is normally set to a default of two hours.

TCP_KEEPINTVL: This option, also a nonportable Linux option, is used to specify the number of seconds between transmissions of keepalive probes. The value of this option is stored in the keepalive_intvl field in the TCP options structure and is initialized to a value of 75 seconds.

TCP_LINGER2: This option may be set to specify how long an orphaned socket in the FIN_WAIT_2 state should be kept alive. The option is unique to Linux and therefore is not portable. If the value is set to zero, the option is turned off and Linux uses normal processing for the FIN_WAIT_2 and TIME_WAIT states. One aspect of this option is not documented anywhere: if the value is less than zero, the socket proceeds immediately to the CLOSED state from the FIN_WAIT_2 state without passing through the TIME_WAIT state. The value associated with this option is kept in the linger2 field of the tcp_opt structure. The default value is determined by the sysctl, tcp_fin_timeout.

TCP_MAXSEG: This option specifies the maximum segment size set for a TCP socket before the connection is established. The advertised MSS value sent to the peer is determined by this option but won’t exceed the interface’s MTU. The two TCP peers for this connection may renegotiate the segment size. See Section for more details on how MSS is used by tcp_sendmsg.

TCP_NODELAY: When set, this option disables the Nagle algorithm. The value is stored in the nonagle field of the tcp_opt structure. This option may not be used if the option TCP_CORK is set. When TCP_NODELAY is set, TCP will send out data as soon as possible without waiting for enough data to fill a segment.

TCP_QUICKACK: This option may be used to turn off delayed acknowledgment by setting the value to one, or enable delayed acknowledgment by setting to a zero. Delayed acknowledgment is the normal mode of operation for Linux TCP. With delayed acknowledgment, ACKs are delayed until they can be combined with a segment waiting to be sent in the reverse direction. If the value of this option is one, the pingpong field in the ack part of tcp_opt is set to zero, which disables delayed acknowledgment. The TCP_QUICKACK option only temporarily affects the behavior of the TCP protocol. If delayed acknowledgment mode is disabled, it could eventually be "automatically" re-enabled depending on the acknowledgment timeout processing and other factors.

TCP_SYNCNT: The caller may use this option to specify the number of SYN retransmits that should be sent before aborting an attempt to establish a connection. This option is unique to Linux and should not be used for portable code. The value is stored in the syn_retries field of the tcp_opt structure.

TCP_WINDOW_CLAMP: By setting this option, the caller may specify the maximum advertised window size for this socket. The minimum allowed for the advertised window is the value SOCK_MIN_RCVBUF divided by two, which is 128 bytes. The value of this option is held in the window_clamp field of tcp_opt for this socket.
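As an illustration of how an application sets and reads back several of the options listed above, here is a minimal user-space sketch. It assumes a Linux system that provides TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT in netinet/tcp.h; the helper names set_keepalive and get_tcp_int_opt are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable keepalive on a TCP socket and tune the three Linux-specific knobs.
 * Returns 0 on success, -1 if any setsockopt call fails. */
static int set_keepalive(int fd, int idle_s, int intvl_s, int probes)
{
    int on = 1;

    assert(idle_s > 0 && intvl_s > 0 && probes > 0);

    /* SO_KEEPALIVE must be on, or the TCP_KEEP* values have no effect. */
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof(intvl_s)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) < 0)
        return -1;
    return 0;
}

/* Read one integer-valued TCP-level option back; returns -1 on error. */
static int get_tcp_int_opt(int fd, int optname)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, IPPROTO_TCP, optname, &val, &len) < 0)
        return -1;
    return val;
}
```

For example, after fd = socket(AF_INET, SOCK_STREAM, 0), a call such as set_keepalive(fd, 60, 10, 5) asks for the first probe after 60 idle seconds, one probe every 10 seconds after that, and at most 5 unanswered probes. The options can be set before the socket is connected.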

Transport Layer Socket Initialization

In this section, we cover how the transport protocols are initialized when a socket is created. The proto structure contains the mapping from the socket layer generic functions to the protocol specific functions. See the section on the proto structure for more details. We discussed earlier how an AF_INET socket of type SOCK_STREAM is created, and how the function pointed to by the init field is executed by the inet_create function. When a socket of type SOCK_DGRAM is created, the proto structure is filled in with specific values for UDP. UDP is relatively simple and provides only a datagram service that does not need any internal state information, so it does not require any protocol-specific socket initialization. The proto structure for UDP does not map any function to the init field. Therefore, no UDP specific socket initialization is done at socket creation time. However, for TCP, a SOCK_STREAM socket, the init field of this structure is set to point to the function tcp_v4_init_sock at socket initialization time.

TCP Socket Initialization

In this section, we discuss tcp_v4_init_sock, defined in file linux/net/ipv4/tcp_ipv4.c, to see how it completes initialization of a SOCK_STREAM type socket, that is, a TCP socket. Since the tcp_v4_init_sock function is called after the sock structure is created, the sock structure has many fields that are already initialized with the value zero and require no further initialization. Many of the values initialized by this function are fields in the TCP options structure discussed elsewhere.

static int tcp_v4_init_sock(struct sock *sk)

As in most other functions in this chapter, we must obtain a pointer to the TCP options structure.

struct tcp_opt *tp = tcp_sk(sk);

The out_of_order_queue is initialized. Unlike the other queues, this queue is unique to TCP and therefore has not been initialized by the socket layer. The transmit timers are initialized by calling tcp_init_xmit_timers. Refer to the section on TCP timers in this chapter for more information.

skb_queue_head_init(&tp->out_of_order_queue);
tcp_init_xmit_timers(sk);


The retransmit timeout, rto, and the mean deviation, mdev, which is used for Round Trip Time (RTT) measurement, are set to a value of three seconds.

tp->rto = TCP_TIMEOUT_INIT;
tp->mdev = TCP_TIMEOUT_INIT;

The send congestion window, cwnd, is initialized to two. It may seem strange that it is not zero, but the source code includes the following comment: "So many TCP implementations out there (incorrectly) count the initial SYN frame in their delayed-ACK and congestion control algorithms that we must have the following Band-Aid to talk efficiently to them. -DaveM"

tp->snd_cwnd = 2;

The send slow start threshold, snd_ssthresh, is set to the maximum signed 32-bit number, so slow start initially runs without an upper threshold. The send congestion window clamp, snd_cwnd_clamp, is set to the maximum 16-bit value. The field mss_cache caches the current effective maximum segment size for TCP and is initialized to the default value of 536 bytes [RFC 1122].

tp->snd_ssthresh = 0x7fffffff;
tp->snd_cwnd_clamp = ~0;
tp->mss_cache = 536;

The reordering field of the TCP options structure is initialized to its configured system control value. The socket state, kept in the state field in the sock structure, is initialized to the closed state.

tp->reordering = sysctl_tcp_reordering;
sk->state = TCP_CLOSE;

The write_space field of the sock structure, sk, is a pointer to a callback function, which is called when buffers are available in the socket’s write queue. It is initialized to point to the function tcp_write_space. The use_write_queue field of the sock structure is set to one to indicate that this protocol (which is TCP, of course) uses the socket’s write queue.

sk->write_space = tcp_write_space;
sk->use_write_queue = 1;

The af_specific field of the TCP options structure is set to a set of AF_INET specific functions used by the TCP protocol, which is covered in the next section.

sk->tp_pinfo.af_tcp.af_specific =  &ipv4_specific;

The sndbuf and rcvbuf fields of the sock structure hold the socket options SO_SNDBUF and SO_RCVBUF, which determine the size of the socket’s send and receive buffers, respectively. They are initialized to the system control values here as the socket is being initialized, but setsockopt may change them later. Tcp_sockets_allocated is a global defined in tcp.c that holds the number of open TCP sockets; it is incremented here.

sk->sndbuf = sysctl_tcp_wmem[1];
sk->rcvbuf = sysctl_tcp_rmem[1];
atomic_inc(&tcp_sockets_allocated);
return 0;
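The initial values described in this walkthrough can be collected into a small user-space sketch. Everything here is hypothetical scaffolding for illustration: the struct tcp_opt_sketch holds only a handful of fields, the field widths are assumptions taken from the text (a 16-bit congestion window clamp), and SKETCH_HZ is an assumed tick rate, not the kernel's configured value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_HZ            100                 /* assumed tick rate */
#define SKETCH_TIMEOUT_INIT  (3 * SKETCH_HZ)     /* three seconds, in ticks */

/* Hypothetical subset of the per-socket TCP state. */
struct tcp_opt_sketch {
    uint32_t rto;             /* retransmit timeout, in ticks */
    uint32_t mdev;            /* RTT deviation estimate, in ticks */
    uint32_t snd_cwnd;        /* send congestion window, in segments */
    uint32_t snd_ssthresh;    /* slow start threshold */
    uint16_t snd_cwnd_clamp;  /* upper bound on snd_cwnd */
    uint16_t mss_cache;       /* cached segment size, in bytes */
};

/* Apply the initial values discussed in the text to a fresh structure. */
static void tcp_init_sketch(struct tcp_opt_sketch *tp)
{
    assert(tp != NULL);
    tp->rto = SKETCH_TIMEOUT_INIT;
    tp->mdev = SKETCH_TIMEOUT_INIT;
    tp->snd_cwnd = 2;
    tp->snd_ssthresh = 0x7fffffff;
    tp->snd_cwnd_clamp = (uint16_t)~0;   /* all ones in 16 bits: 65535 */
    tp->mss_cache = 536;
}
```

The point of the cast in the clamp assignment is that ~0 stored into a 16-bit field yields 65535, which is why the text describes the clamp as the maximum 16-bit value.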

The Tcp_func Structure for TCP

As we saw previously, TCP socket initialization includes setting the af_specific pointer in the TCP options part of the sock structure to ipv4_specific, which is an instance of a tcp_func structure. The tcp_func structure, defined in file linux/include/net/tcp.h, contains a set of IPv4-specific functions for TCP that are dependent on the AF_INET address family. Its purpose is to facilitate port sharing between IPv4 and IPv6. Table shows the mapping between these fields and the specific values for TCP, which are initialized as ipv4_specific in file linux/net/ipv4/tcp_ipv4.c.

IPv4 Specific Values for Tcp_func

struct tcp_func {
int (*queue_xmit)(struct sk_buff *skb);
void (*send_check)(struct sock *sk, struct tcphdr *th, int len, struct sk_buff *skb);
int (*rebuild_header)(struct sock *sk);
int (*conn_request)(struct sock *sk, struct sk_buff *skb);
struct sock *(*syn_recv_sock)(struct sock *sk, struct sk_buff *skb,
 struct open_request *req, struct dst_entry *dst);
int (*remember_stamp)(struct sock *sk);
__u16 net_header_len;
int (*setsockopt)(struct sock *sk, int level, int optname, char *optval, int optlen);
int (*getsockopt)(struct sock *sk, int level, int optname, char *optval, int *optlen);
void (*addr2sockaddr)(struct sock *sk, struct sockaddr *);
int sockaddr_len;
} ;
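To show why a per-address-family table is useful, the sketch below keeps only two data fields and lets family-independent code consult them. It is a hypothetical user-space illustration: af_ops_sketch, the two instances, and tcp_headroom are names invented for this example, and the header sizes are the fixed IPv4 (without options) and IPv6 header lengths.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>

/* Hypothetical two-field subset of an address-family operations table. */
struct af_ops_sketch {
    unsigned short net_header_len;  /* fixed network header size, in bytes */
    int            sockaddr_len;    /* size of the family's socket address */
};

static const struct af_ops_sketch ipv4_ops_sketch = {
    .net_header_len = 20,           /* IPv4 header without options */
    .sockaddr_len   = sizeof(struct sockaddr_in),
};

static const struct af_ops_sketch ipv6_ops_sketch = {
    .net_header_len = 40,           /* fixed IPv6 header */
    .sockaddr_len   = sizeof(struct sockaddr_in6),
};

/* Family-independent code: header space to reserve ahead of the payload. */
static int tcp_headroom(const struct af_ops_sketch *af, int tcp_header_len)
{
    assert(af != NULL && tcp_header_len >= 20);
    return af->net_header_len + tcp_header_len;
}
```

The caller never tests the address family directly; it only follows the pointer it was given, which is the same design idea as the af_specific field.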

Initiating a Connection

As we are aware, UDP is a protocol for sending individual datagrams and doesn’t actually maintain connections between peer hosts. There is no state information maintained between subsequent UDP packet transmissions. In contrast, TCP is connection oriented, and much of the processing associated with TCP is related to the setup and breakdown of connections.

Connections are initiated at the client side by calling the connect socket call. As Table shows, when connect is called from the application code, the tcp_v4_connect function is executed for TCP. Even though UDP doesn’t support connections, the connect socket call is supported for datagram sockets so that subsequent send calls can omit the destination address. In this section, we discuss how the connect socket call is processed for both TCP and UDP.

The connect Call and UDP

The connect call can be made for a UDP socket. When connect is called by the user, the destination address is specified. The main purpose of connect for UDP is to establish the route to the destination and enter it in the routing cache. Once a route is established, subsequent packet transmissions through the UDP socket can use the cached route information. This is called the fast path for a connected socket. When connect is called on an open SOCK_DGRAM type socket, the function udp_connect, in file linux/net/ipv4/udp.c, is called by the socket layer. Sk is a pointer to the sock structure for the open socket, and uaddr is the destination address to which we want to create a route.

int udp_connect(struct sock *sk,
 struct sockaddr *uaddr, int addr_len)

IPv4-specific address and option information is in the inet_opt structure.

struct inet_opt *inet = inet_sk(sk);
struct sockaddr_in *usin =(struct  sockaddr_in *) uaddr;

Rt is a pointer to a route cache entry. See "The Network Layer, IP," for more information about the route cache. Oif is the index of the output network interface that will carry the packet using the route to the destination.

struct rtable *rt;
u32 saddr;
int oif;
int err;

First, we make sure that the specified address is in the correct format for an Internet address.

if (addr_len < sizeof(*usin))
return -EINVAL;
if (usin->sin_family != AF_INET)
return -EAFNOSUPPORT;

We call sk_dst_reset to free any old destination cache entry pointed to by the dst field in the sock structure, sk.

sk_dst_reset(sk);

Since this may be a bound socket, the network interface may already be known. If so, oif will be the index of the outgoing network interface, and if not, it will be zero. Next, we check to see if the destination address in usin is a multicast address. If it is multicast, it doesn’t mean that the user is trying to connect to a multicast address; instead, it means that the user will be sending subsequent packets to the same address. In addition, if the destination address is multicast, we get the output interface and source address from the inet_opt structure in the sock, sk.

oif = sk->bound_dev_if;
saddr = sk->saddr;
if (MULTICAST(usin->sin_addr.s_addr)) {
if (!oif)
oif = sk->protinfo.af_inet.mc_index;
if (!saddr)
saddr = sk->protinfo.af_inet.mc_addr;
}

This is the most important call in this function. Ip_route_connect gets a route to the destination address in usin and sets rt to point to the new cached route. If it returns a nonzero value, it wasn’t able to find a route or add a new one to the routing cache, so we return the error. If it found a broadcast route and the socket does not have the broadcast flag set, we return an error.

err = ip_route_connect(&rt, usin->sin_addr.s_addr, saddr,
RT_CONN_FLAGS(sk), oif, IPPROTO_UDP, inet->sport,
usin->sin_port, sk);
if (err)
return err;
if ((rt->rt_flags & RTCF_BROADCAST) && !sock_flag(sk, SOCK_BROADCAST)) {
ip_rt_put(rt);
return -EACCES;
}

Here we update the source address and destination address for outgoing packets from the fields in the route cache entry, rt. The destination port is specified by the user.

sk->saddr = rt->rt_src;
sk->rcv_saddr = rt->rt_src;
sk->daddr = rt->rt_dst;
sk->dport = usin->sin_port;

We set the socket state to TCP_ESTABLISHED to indicate that there is a cached route. This value is misleading for a UDP socket, but it is merely to show that the socket has a cached route associated with it. Later, when the user tries to transmit a packet and the destination address is missing in the send call, we consider it OK because the state indicates that there is a route established.

sk->state = TCP_ESTABLISHED;
inet->id = jiffies;

Finally, a pointer to the route cache entry is placed in the destination field, dst, of the socket.

sk_dst_set(sk, &rt->u.dst);
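Seen from the application, the effect of connecting a datagram socket is that later send calls no longer need a destination address. The following user-space sketch illustrates this over the loopback interface. The helper names udp_bind_loopback and udp_connect_and_send are hypothetical, and the example assumes that loopback networking is available.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a UDP socket bound to 127.0.0.1 on a kernel-chosen port.
 * The bound address is written to *addr. Returns the fd, or -1 on error. */
static int udp_bind_loopback(struct sockaddr_in *addr)
{
    socklen_t len = sizeof(*addr);
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    memset(addr, 0, sizeof(*addr));
    addr->sin_family = AF_INET;
    addr->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr->sin_port = 0;                       /* let the kernel pick */
    if (bind(fd, (struct sockaddr *)addr, sizeof(*addr)) < 0 ||
        getsockname(fd, (struct sockaddr *)addr, &len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Connect a UDP socket to dest, then send without naming the destination. */
static ssize_t udp_connect_and_send(const struct sockaddr_in *dest,
                                    const void *buf, size_t len)
{
    ssize_t n;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    if (connect(fd, (const struct sockaddr *)dest, sizeof(*dest)) < 0) {
        close(fd);
        return -1;
    }
    n = send(fd, buf, len, 0);                /* no address argument needed */
    close(fd);
    return n;
}
```

No packet is exchanged by the connect call itself; it only records the peer address and route, which matches the description of udp_connect above.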

Tcp_v4_connect—Requesting a TCP Connection

Applications using SOCK_STREAM type sockets are classified as either clients or servers. The client requests connections with a server and the server responds to connection requests. The socket call provided to the client to request connections is connect. When connect is executed on an open socket, the socket layer calls a function in the protocol to process the connection request. For SOCK_STREAM type protocols in the AF_INET address family, the function called by the socket layer is tcp_v4_connect defined in file linux/net/ipv4/tcp_ipv4.c.

int tcp_v4_connect(struct sock *sk,
 struct sockaddr *uaddr, int addr_len)
struct inet_opt *inet = inet_sk(sk);

Tp points to the TCP options, tcp_opt, in the sock structure. The TCP options are discussed in a separate section. Rt is a route table cache entry. Later, it will point to the cached route for packets being sent through this socket.

struct tcp_opt *tp = &(sk->tp_pinfo.af_tcp);
struct sockaddr_in *usin =(struct  sockaddr_in *)uaddr;
struct rtable *rt;

Daddr is the IP destination address, and nexthop is the IP address of the gateway router for the route if there is one.

u32 daddr, nexthop;
int tmp;
int err;

After making sure the specified address is in the correct format, we initialize nexthop and the destination address to the user-specified destination address. Later, nexthop may be changed depending on the route.

if (addr_len < sizeof(struct sockaddr_in))
return -EINVAL;
if (usin->sin_family != AF_INET)
return -EAFNOSUPPORT;
nexthop = daddr = usin->sin_addr.s_addr;

This is a check to see if the source routing option is set for this socket.

if (inet->opt && inet->opt->srr) {
if (daddr == 0)
return -EINVAL;
nexthop = inet->opt->faddr;
}

Now we attempt to get a route to the destination address given the destination address, which for now is in nexthop, the source address, and the network interface. Ip_route_connect will return a zero if successful. It will set rt to point to the new cached route if it found one.

tmp = ip_route_connect(&rt, nexthop, inet->saddr,
RT_CONN_FLAGS(sk), sk->bound_dev_if, IPPROTO_TCP,
inet->sport, usin->sin_port, sk);
if (tmp < 0)
return tmp;

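We don’t allow TCP to connect to nonunicast addresses; therefore, if the route in the route cache is not a unicast route, we return an error.

if (rt->rt_flags & (RTCF_MULTICAST | RTCF_BROADCAST)) {
ip_rt_put(rt);
return -ENETUNREACH;
}

Next, unless source routing is in effect, we take the destination address from the route. If the socket has no source address yet, we take that from the route as well.

if (!inet->opt || !inet->opt->srr)
daddr = rt->rt_dst;
if (!inet->saddr)
inet->saddr = rt->rt_src;
inet->rcv_saddr = inet->saddr;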
if (tp->ts_recent_stamp &&  inet->daddr != daddr) {

Here, we reset the inherited state.

/* Reset inherited state */
tp->ts_recent = 0;
tp->ts_recent_stamp = 0;
tp->write_seq = 0;
}
if (sysctl_tcp_tw_recycle &&
!tp->ts_recent_stamp && rt->rt_dst == daddr) {
struct inet_peer *peer = rt_get_peer(rt);
if (peer && peer->tcp_ts_stamp + TCP_PAWS_MSL >= xtime.tv_sec) {
tp->ts_recent_stamp = peer->tcp_ts_stamp;
tp->ts_recent = peer->tcp_ts;
}
}

Refer to "Receiving Data in the Transport Layer, UDP and TCP" for more information about the TIME_WAIT state. The comments in the code state that this timestamp saving idea comes from Van Jacobson. The last used timestamp values are saved in the inet_peer structure associated with the route entry when the socket is in the TIME_WAIT state. Then, when a new connection is requested, the recent received timestamp and the timestamp fields in the tcp_opt structure are restored from the saved values in the inet_peer structure. This is a method for detecting duplicate segments once the connection is reestablished.

inet->dport = usin->sin_port;
inet->daddr = daddr;
tp->ext_header_len = 0;
if (inet->opt)
tp->ext_header_len =  inet->opt->optlen;

The MSS clamp is initialized to 536, the default MSS value that applies until the peer advertises its own.

tp->mss_clamp = 536;

Although we are in the process of doing an active open, the identity of the socket is not known. We don’t know the source port, sport, because we have not yet assigned the ephemeral port. However, we put ourselves in the TCP_SYN_SENT state and enter sk into the TCP connection’s hash table. The remainder of the work for initializing the connection will be finished later.

tcp_set_state(sk, TCP_SYN_SENT);

We call tcp_v4_hash_connect to enter the socket in the connect hash table. This assigns an ephemeral port for the connection, and puts it in the sport field of the sock. The ephemeral port number serves as the hash code for finding the connected socket in the hash table. Later, when the socket is in ESTABLISHED state and incoming packets arrive for the port, TCP can quickly find the correct open socket.

err = tcp_v4_hash_connect(sk);
if (err)
goto failure;
err = ip_route_newports(&rt,  inet->sport, inet->dport, sk);
if (err)
goto failure;

We commit the destination for this connection by setting the dst field of the sock structure, sk, to point to the destination associated with the route. When a route is for a packet sent to an external destination, dst will point to the gateway. However, if the destination is a directly connected host, dst will point to information about the host.

__sk_dst_set(sk, &rt->u.dst);
tcp_v4_setup_caps(sk, &rt->u.dst);
tp->ext2_header_len =  rt->u.dst.header_len;

We initialize the first sequence number.

if (!tp->write_seq)
tp->write_seq =secure_tcp_sequence_number(inet->saddr,
inet->id = tp->write_seq ^ jiffies;

We call tcp_connect to complete the work of setting up the connection including transmitting the SYN.

err = tcp_connect(sk);
rt = NULL;
if (err)
goto failure;
return 0;

If we failed, we take the socket out of the hash table for connected sockets and release the local port.

failure:
tcp_set_state(sk, TCP_CLOSE);
sk->sk_route_caps = 0;
inet->dport = 0;
return err;
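The assignment of the ephemeral port during connect can be observed from user space. The sketch below is an illustration under assumptions, not a description of the kernel path: the helper names local_port and ports_around_connect are hypothetical, and the example assumes loopback networking is available. It reads the client's local port before and after a connect to a listening socket on 127.0.0.1.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Return the local port of fd in host byte order, or -1 on error. */
static int local_port(int fd)
{
    struct sockaddr_in a;
    socklen_t len = sizeof(a);

    if (getsockname(fd, (struct sockaddr *)&a, &len) < 0)
        return -1;
    return ntohs(a.sin_port);
}

/* Record the client's local port before and after connecting to a
 * loopback listener. Returns 0 on success, -1 on any socket error. */
static int ports_around_connect(int *before, int *after)
{
    struct sockaddr_in srv;
    socklen_t len = sizeof(srv);
    int rc = -1;
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int cfd = socket(AF_INET, SOCK_STREAM, 0);

    assert(before != NULL && after != NULL);
    if (lfd < 0 || cfd < 0)
        goto out;

    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(lfd, (struct sockaddr *)&srv, sizeof(srv)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&srv, &len) < 0)
        goto out;

    *before = local_port(cfd);      /* unbound socket: no port yet */
    if (connect(cfd, (struct sockaddr *)&srv, sizeof(srv)) < 0)
        goto out;
    *after = local_port(cfd);       /* ephemeral port now assigned */
    rc = 0;
out:
    if (cfd >= 0)
        close(cfd);
    if (lfd >= 0)
        close(lfd);
    return rc;
}
```

The listener never calls accept; the handshake completes in the kernel and the connection waits in the listen queue, which is enough for the client's connect to return.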

Sending Data From a Socket via UDP

In this section, we show how the UDP protocol processes a request from the application layer to send a datagram. When a user program writes data to a SOCK_DGRAM type socket by calling sendmsg or any other application layer sending function, the UDP protocol will process the request. UDP is fairly simple, and mostly consists of a framing layer. It doesn’t do much more than prepend the UDP header to the user data. Unlike TCP, there is no buffering or connection management. To construct the UDP header, the destination and source port are required. Usually, the destination port and IP address are specified together in the peer socket’s name. Generally, when the port is known, the IP address is known, too. If the destination IP address is specified in the socket name, it is saved before the packet is passed to IP. In addition, before passing the packet on to IP, UDP checks whether the destination is internal or whether IP must determine a route for the packet. Moreover, the destination address may be a multicast or broadcast address, in which case IP does not need to route the packet. Before UDP is done, the source and destination ports are placed in the UDP header and the IP address is kept for later processing by IP.

It is interesting to note how the Linux TCP/IP stack is built for efficiency. All efforts have been taken to avoid copying or unnecessary processing. Almost all the complexity of processing is outside the data path of the packet. There is only one copy of the actual data and only one level of processing in UDP.

When the application calls one of the write functions on an open socket, the socket layer calls the function pointed to by the sendmsg field in the proto structure. See "Linux Sockets" for detailed information about what happens at the socket layer. This results in a call to udp_sendmsg, which is found in file linux/net/ipv4/udp.c. This is the sending function that is executed for SOCK_DGRAM type sockets. See Table for a list of all the protocol block functions for UDP.

int udp_sendmsg(struct sock *sk,
 struct  msghdr *msg, int len)

We retrieve the protocol specific parts of the sock structure.

struct inet_opt *inet = inet_sk(sk);
struct udp_opt *up = udp_sk(sk);

Ipc will hold the return from the function ip_cmsg_send, which is discussed separately.

int ulen = len;
struct ipcm_cookie ipc;
struct rtable *rt = NULL;
int free = 0;
int connected = 0;

Daddr is for the destination IP address, and tos is for the TOS field of the IP header.

u32 daddr, faddr, saddr;
u16 dport;
u8 tos;
int err;
int corkreq = up->corkflag ||msg->msg_flags&MSG_MORE;

The first thing udp_sendmsg does is check for an out-of-range value in the len field and check whether the caller has requested any flags that are illegal for this type of socket. The only illegal flag for UDP is MSG_OOB, which is used only for SOCK_STREAM type sockets.

if (len < 0 || len > 0xFFFF)
    return -EMSGSIZE;
if (msg->msg_flags&MSG_OOB)
    return -EOPNOTSUPP;
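
The two sanity checks can be tried out in user space. The following is a minimal sketch, not kernel code: udp_check_request is a hypothetical helper name, and the error code returned for MSG_OOB is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>

/* Hypothetical helper (not a kernel function) that mirrors the two
 * checks made at the top of udp_sendmsg. */
static int udp_check_request(long len, int msg_flags)
{
    /* The UDP length field is 16 bits wide, so 0xFFFF is the ceiling. */
    if (len < 0 || len > 0xFFFF)
        return -EMSGSIZE;
    /* Out-of-band data only applies to SOCK_STREAM sockets.
     * The error code used here is an assumption. */
    if (msg_flags & MSG_OOB)
        return -EOPNOTSUPP;
    return 0;
}
```

A request of 100 bytes with no flags passes, while a length of 0x10000 is rejected because it cannot be represented in the 16-bit length field.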

Now, check to see if there are any pending frames. The socket lock must be held while the socket is corked for processing the pending frames.

if (up->pending) {
    if (likely(up->pending)) {
        if (unlikely(up->pending != AF_INET)) {
            return -EINVAL;
        }
        goto do_append_data;
    }
}

We add the UDP header length to the length of the user data.

ulen += sizeof(struct udphdr);

Next, udp_sendmsg checks to make sure that either there is a valid destination address specified or the socket is in a connected state. It verifies the address by checking the name field in the msghdr structure. If the application caller invoked the sendto socket API call, then the name field of the msghdr structure will contain the address originally specified in the to field of the sendto call.

if (msg->msg_name) {
struct sockaddr_in * usin =(struct sockaddr_in*)msg->msg_name;
if (msg->msg_namelen < sizeof(*usin))
return -EINVAL;
if (usin->sin_family != AF_INET) {
if (usin->sin_family != AF_UNSPEC)
return -EINVAL;

At this point, we get the destination and source addresses. The port is placed in the dest field of the udphdr structure pointed to by uh in the fakehdr structure.

daddr = usin->sin_addr.s_addr;
dport = usin->sin_port;
if (dport == 0)
return -EINVAL;
} else {

We know that the destination address specified was NULL. However, we can allow the packet to be transmitted if the socket is a connected UDP socket, which is indicated when the state field is set to TCP_ESTABLISHED. If the socket is connected, we assume that the destination address is already known.

if (sk->state != TCP_ESTABLISHED)
daddr = inet->daddr;
dport = inet->dport;

If the socket is connected, then routing in the IP layer can use the “fast path” to bypass a routing table lookup and use the destination cache entry directly.

connected = 1;
. . .

The Udphdr structure, defined in file linux/include/linux/udp.h, contains the actual UDP header including the source port, destination port, length, and checksum. In the udp_sendmsg function, the UDP header is a field in ufh, the UDP fake header structure.

struct udphdr {
__u16 source;
__u16 dest;
__u16 len;
__u16 check;
} ;
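
As an illustration of how the four header fields are filled, here is a minimal user-space sketch. The structure udp_header_sketch and the function udp_fill_header are hypothetical stand-ins for this example, not kernel definitions. All fields are stored in network byte order, and the length field covers the 8-byte header plus the payload.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stdint.h>

/* Local stand-in for the kernel's struct udphdr. */
struct udp_header_sketch {
    uint16_t source;   /* source port, network byte order */
    uint16_t dest;     /* destination port, network byte order */
    uint16_t len;      /* header plus payload length */
    uint16_t check;    /* checksum, filled in later */
};

/* Hypothetical helper: fill the header for a payload of payload_len bytes. */
static void udp_fill_header(struct udp_header_sketch *uh, uint16_t sport,
                            uint16_t dport, uint16_t payload_len)
{
    uh->source = htons(sport);
    uh->dest = htons(dport);
    uh->len = htons((uint16_t)(sizeof(*uh) + payload_len));
    uh->check = 0;     /* zero until the checksum has been computed */
}
```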

If no valid destination address is found, udp_sendmsg returns EINVAL.

Handling Control Messages

Let’s continue with the udp_sendmsg function to see how it handles the control messages. The next thing udp_sendmsg does is determine if the argument msg points to a control message. It checks the field, msg_controllen, for a nonzero value. The structure ipcm_cookie, pointed to by ipc, holds the result of the control message processing. Initially, some fields of ipc are initialized, such as the IP options and the interface (if there is one bound to this socket).

ipc.addr = inet->saddr;
ipc.oif = sk->bound_dev_if;
if (msg->msg_controllen) {

The control messages are also called ancillary data and are part of the IPv6 sockets interface specification. For more information on control messages, see the man page, cmsg(3), and "Advanced Sockets API for IPv6," [RFC 2292].

If msg contains a pointer to a control message, the function ip_cmsg_send, defined in file linux/net/ipv4/ip_sockglue.c, processes the request. These control messages are a way of setting and retrieving UDP information such as the address and port. Fields in the inet options structure, which contains the addressing information, can be retrieved directly by the control message.

err = ip_cmsg_send(msg, &ipc);
if (err)
return err;
if (ipc.opt)
free = 1;
connected = 0;
if (!ipc.opt)
ipc.opt = inet->opt;
saddr = ipc.addr;
ipc.addr = faddr = daddr;
if (ipc.opt && ipc.opt->srr) {
if (!daddr)
return -EINVAL;
faddr = ipc.opt->faddr;
connected = 0;
. . .

Ip_cmsg_send returns the results in ipc, which points to a structure called ipcm_cookie, defined in linux/include/linux/ip.h.

struct ipcm_cookie {
u32 addr;
int oif;
struct ip_options *opt;
} ;

Two types of control messages are processed by UDP. IP_RETOPTS retrieves the options field from the IP header and returns a pointer to the options in the opt field of ipc. If IP_PKTINFO was specified in the control message, ip_cmsg_send returns the interface index in the oif field of ipc and the interface’s IP address in the addr field.
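
From the application side, a control message is attached to the msghdr structure before sendmsg is called. The following is a minimal user-space sketch of building an IP_PKTINFO control message with the standard CMSG macros described in cmsg(3). The function name attach_pktinfo is hypothetical, and the caller is assumed to supply a suitably aligned buffer.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical helper: attach an IP_PKTINFO control message that names
 * the outgoing interface index. buf must hold at least
 * CMSG_SPACE(sizeof(struct in_pktinfo)) bytes and be aligned for
 * struct cmsghdr. */
static void attach_pktinfo(struct msghdr *msg, void *buf, size_t buflen,
                           int ifindex)
{
    struct in_pktinfo info;
    struct cmsghdr *cmsg;

    assert(buflen >= CMSG_SPACE(sizeof(info)));
    memset(buf, 0, buflen);
    msg->msg_control = buf;
    msg->msg_controllen = CMSG_SPACE(sizeof(info));

    cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = IPPROTO_IP;
    cmsg->cmsg_type = IP_PKTINFO;
    cmsg->cmsg_len = CMSG_LEN(sizeof(info));

    memset(&info, 0, sizeof(info));
    info.ipi_ifindex = ifindex;          /* interface to send through */
    memcpy(CMSG_DATA(cmsg), &info, sizeof(info));
}
```

When such a message reaches the kernel, msg_controllen is nonzero, which is the condition udp_sendmsg tests before calling ip_cmsg_send.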

Passing the Packet to IP Output

We continue with our examination of the udp_sendmsg function. At this point, most of the information needed for the UDP header is established. Now, we will check to see how this packet is supposed to be routed.

. . .
tos = RT_TOS(sk->protinfo.af_inet.tos);
if (sk->localroute ||
(msg->msg_flags&MSG_DONTROUTE) ||
(ipc.opt &&  ipc.opt->is_strictroute)) {
tos |= RTO_ONLINK;
connected = 0;

In addition, if the packet is for transmission through a connected socket, it is assumed that the destination is already known and there is already a route to the destination address in this packet. If the destination address is a multicast address, there is no need to route the packet either.

if (MULTICAST(daddr)) {
if (!ipc.oif)
ipc.oif = inet->mc_index;
if (!saddr)
saddr = inet->mc_addr;
connected = 0;
if (connected)
rt = (struct rtable*)sk_dst_check(sk, 0);
if (rt == NULL) {

If we don’t have a route, we prepare for a search of the routing table by building a flow information structure.

struct flowi fl = { .oif = ipc.oif,
.nl_u = { .ip4_u =
{ .daddr = faddr,.saddr = saddr,.tos = tos } } ,
.proto = IPPROTO_UDP,
.uli_u = { .ports =
{ .sport = inet->sport,
.dport = dport } } } ;

We call ip_route_output_flow to try to come up with a route.

err = ip_route_output_flow(&rt, &fl, sk, !(msg->msg_flags&MSG_DONTWAIT));
if (err)
goto out;
err = -EACCES;

If the routing cache entry indicates broadcast but the socket did not have the SO_BROADCAST flag set, then it is an error.

if ((rt->rt_flags&RTCF_BROADCAST) &&
    !sock_flag(sk, SOCK_BROADCAST))
    goto out;

Now that there is a route, if the socket is connected, we set a pointer to the route in the destination cache.

if (connected)
sk_dst_set(sk, dst_clone(&rt->u.dst));

Udp_sendmsg checks if the application requested routing confirmation by looking for the flag MSG_CONFIRM in the flags field of the msghdr structure. If required, dst_confirm is called to confirm the route.

if (msg->msg_flags&MSG_CONFIRM)
goto do_confirm;

At this point, ip_build_xmit is called to pass the packet on to the IP layer.

saddr = rt->rt_src;
if (!ipc.addr)
daddr = ipc.addr = rt->rt_dst;
if (unlikely(up->pending)) {

Check whether the socket is already corked. If frames are already pending at this point, it indicates an application bug involving corking.

NETDEBUG(if (net_ratelimit()) printk(KERN_DEBUG "udp cork app bug 2\n"));
err = -EINVAL;
goto out;

Cork the socket in order to append additional data.

inet->cork.fl.fl4_dst = daddr;
inet->cork.fl.fl_ip_dport = dport;
inet->cork.fl.fl4_src = saddr;
inet->cork.fl.fl_ip_sport =  inet->sport;
up->pending = AF_INET;

Now we send the data to IP by calling ip_append_data, which builds a large datagram from individual pieces of data. The second argument, ip_generic_getfrag, is a callback function executed by IPv4 when it is ready to copy the actual data from user space to the datagram.

up->len += ulen;
err = ip_append_data(sk, ip_generic_getfrag, msg->msg_iov, ulen,
        sizeof(struct udphdr), &ipc, rt,
        corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
if (err)
    udp_flush_pending_frames(sk);
else if (!corkreq)
    err = udp_push_pending_frames(sk, up);
. . .
if (free)
    kfree(ipc.opt);
if (!err) {
    . . .
    return len;
}
return err;

If a neighbor cache entry exists for this route, dst_confirm timestamps an existing entry in the neighbor cache.


If the MSG_PROBE flag was set for this socket, the transmit path should be probed for maximum MTU but the packet should not be sent.

if (!(msg->msg_flags&MSG_PROBE) ||len)
goto back_from_confirm;
err = 0;
goto out;

Copying Data from User Space to a Datagram

This section covers the function that gets pieces of data from user space to append to a datagram, ip_generic_getfrag. This function is defined in linux/net/ipv4/ip_output.c. The function is called as a callback from the ip_append_data function, which is in the same file.

ip_append_data is executed whether or not checksums have been disabled. The user data is organized in one or more buffer pointers referenced by iov. Generally, there is only one buffer for each UDP datagram.

int ip_generic_getfrag(void *from, char *to,  int offset, int len,
int odd, struct sk_buff *skb)

The UDP header does not need to be copied from user space to kernel space. It has already been built in kernel space by the udp_sendmsg function. We call one of two functions to copy the user data depending on whether our output hardware has hardware checksum capability.

One of the two functions, csum_partial_copy_fromiovecend, calculates the partial checksum while copying. The other function, memcpy_fromiovecend, is the one we use if the checksum will be calculated later in hardware, because this function does not calculate a checksum. A UDP checksum includes parts of the IP header: the source and destination IP addresses, the IP header protocol field, and the length field. Therefore, the checksum calculation is done in three stages: once for the data, again for the actual UDP header, and then a third time with the fields from the IP header.

struct iovec *iov = from;
if (skb->ip_summed == CHECKSUM_HW) {
if (memcpy_fromiovecend(to, iov, offset, len)  < 0)
return -EFAULT;
} else {
unsigned int csum = 0;

The data is copied from user space to kernel space by the iovec utility routine. The checksum is calculated on the data while copying for efficiency. It is important to avoid manipulating the packet contents twice. The data must be copied from user space via the iov pointer to a location within a sk_buff in kernel space referenced by the to argument.

if (csum_partial_copy_fromiovecend(to,iov,offset,len,&csum)< 0)
return -EFAULT;
skb->csum = csum_block_add(skb->csum, csum, odd);
return 0;
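
The three-stage checksum calculation described in this section can be illustrated with a small user-space sketch. This is not the kernel’s optimized code. The function names are hypothetical, bytes are summed as big-endian 16-bit words, and the UDP header passed in is assumed to have its check field set to zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add a byte range to a running one's-complement accumulator. An odd
 * trailing byte is padded with zero on the right. */
static uint32_t csum_add_bytes(uint32_t sum, const uint8_t *p, size_t n)
{
    while (n > 1) {
        sum += (uint32_t)((p[0] << 8) | p[1]);
        p += 2;
        n -= 2;
    }
    if (n)
        sum += (uint32_t)(p[0] << 8);
    return sum;
}

/* Fold the carries back into 16 bits and complement the result. */
static uint16_t csum_fold_sketch(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Stage 1: user data. Stage 2: the UDP header. Stage 3: the fields
 * taken from the IP header (addresses, protocol, UDP length). */
static uint16_t udp_checksum_sketch(const uint8_t saddr[4],
                                    const uint8_t daddr[4],
                                    const uint8_t udp_hdr[8],
                                    const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    sum = csum_add_bytes(sum, data, len);    /* stage 1 */
    sum = csum_add_bytes(sum, udp_hdr, 8);   /* stage 2 */
    sum = csum_add_bytes(sum, saddr, 4);     /* stage 3 */
    sum = csum_add_bytes(sum, daddr, 4);
    sum += 17;                               /* IPPROTO_UDP */
    sum += (uint32_t)(8 + len);              /* UDP length */
    return csum_fold_sketch(sum);
}
```

Because one’s-complement addition is commutative, the stages can be summed in any order, which is what allows the data portion to be summed while it is being copied.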

Sending Data From a Socket via TCP

The best way to describe the implementation of a complex protocol like TCP is to follow the data as it flows through the protocol. For the remaining part of this chapter, the send-side processing of TCP will be examined. It is important to keep in mind that TCP is probably the most complex part of the TCP/IP protocol suite. Because TCP provides a connection-oriented service at the transport layer, it is far more complicated than UDP. TCP must manage the relationship between the local host and the remote host, including retransmitting bad and lost packets, managing the connection state machine, buffering the data, and setting up and breaking down the connection. In addition, as will be seen, it does much more than this. Linux TCP implements all of the enhancements for security, reliability, and performance that have been developed in the more than 20 years since TCP was first specified in RFC 793. This section focuses primarily on the connection with the socket and on how data is removed from the application and placed in queues for transmission. Next, Section discusses the actual packet transmission, covers the data structures used to keep control variables and configuration options, and covers the TCP timers used primarily on the send side of the connection. Figure shows the TCP connection state for the sending side.

TCP send state diagram


Let’s look at how the send side of the protocol handles write information passed from the socket layer. Once an application layer program opens a socket of type SOCK_STREAM, the TCP protocol is invoked to process all write requests for data transfers through the open socket. The transmission of data through TCP is controlled primarily by the state of the connection with the peer and the availability of data in the send buffer. Because of the asynchronous nature of TCP, actual data transmission is independent of the application layer as it writes data into the socket layer buffers. In general, TCP gathers the data into segments and passes them to the peer machine when there is available bandwidth in the network. This is different from UDP, where the user-level process controls transmission, because with UDP, a write of data to the socket results directly in the transmission of a datagram.

When any of the socket layer write functions are invoked by the application, TCP copies the data into a list of socket buffers, which are queued up for later transmission. Most of the work associated with writing to a socket involves determining the type of socket buffer and managing the queue of buffers. It must be determined whether the transmission interface has scatter-gather capability, otherwise known as chained DMA. If the interface has scatter-gather support, TCP sets up a transfer of a chain of buffers to the networking device. In this case, TCP uses the fragment list in the shared info structure of the socket buffer. In addition, TCP maintains a queue of socket buffers, so the send function must also determine if there is room for more data in the current socket buffer or if a new one must be allocated.

The Tcp_sendmsg Function

The tcp_sendmsg function, defined in file linux/net/ipv4/tcp.c, is invoked when any user-level write or message sending function is invoked on an open SOCK_STREAM type socket. All the write functions are converted to calls to this function at the socket layer. The best way to show the operation of TCP on the send side is to examine this function, so we will follow it to see how it processes the data.
The socket layer is able to find tcp_sendmsg because it is referenced by the sendmsg field of the protocol block structure, prot, which is initialized at compile time in the file linux/net/ipv4/tcp_ipv4.c. The protocol block functions for TCP are shown in Table. Tcp_sendmsg collects all the TCP header information possible, copies the data into socket buffers, and queues the socket buffers for transmission. It makes heavy use of the TCP options structure described. Tcp_sendmsg also sets many fields in the TCP control block structure, described in Section, which is used to pass TCP header information to the transmission side of the TCP protocol.

int tcp_sendmsg(struct sock *sk,
 struct msghdr *msg, int size)

The variable iov is used to retrieve the IO vector pointer from the message header structure in the argument msg. Tp points to the TCP options that will be retrieved from the sock structure sk. Skb is a pointer to a socket buffer that will be allocated to hold the data to be transmitted. Iovlen is set to the number of elements in the iovec.

struct iovec *iov;

The TCP options are retrieved from the sock structure and the sock structure is locked.

struct tcp_opt *tp = tcp_sk(sk);
struct sk_buff *skb;
int iovlen, flags;

Mss_now holds the current maximum segment size (MSS) for this open socket.

int mss_now;
int err, copied;
long timeo;
flags = msg->msg_flags;

The value of the SO_SNDTIMEO option is put in timeo unless the MSG_DONTWAIT flag was set for the socket.

timeo = sock_sndtimeo(sk, flags&MSG_DONTWAIT);

The next thing tcp_sendmsg does is wait for a connection to be established before sending the packet. The state of the connection for this open socket is checked, and if the connection is not already in the TCPF_ESTABLISHED or TCPF_CLOSE_WAIT state, TCP is not ready to send data so it must wait for a connection to be established. The value of the timeout set with the SO_SNDTIMEO socket option is passed into the wait_for_tcp_connect function.

if ((1 << sk->state) &~(TCPF_ESTABLISHED|TCPF_CLOSE_WAIT))
if((err = wait_for_tcp_connect(sk, flags, &timeo)) != 0)
goto out_err;
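
The state test relies on each connection state having its own bit in a mask. Shifting one left by the state number produces that bit, and masking out the two permitted states leaves a nonzero value only when the socket is in some other state. The following sketch shows the idea in isolation. The enum and function names are local to this example, and the state numbers are assumed to match those used by Linux.

```c
#include <assert.h>

/* State numbers assumed to match Linux TCP; names are local to this sketch. */
enum sketch_tcp_state {
    ST_ESTABLISHED = 1,
    ST_SYN_SENT    = 2,
    ST_CLOSE_WAIT  = 8
};

/* Convert a state number into its bit in the state mask. */
#define STF(state) (1u << (state))

/* Returns nonzero when the caller must wait for the connection to be
 * established before queuing data. */
static int must_wait_for_connect(int state)
{
    return (STF(state) & ~(STF(ST_ESTABLISHED) | STF(ST_CLOSE_WAIT))) != 0;
}
```

Testing a set of states with one mask operation avoids a chain of comparisons against individual state values.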

The SOCK_ASYNC_NOSPACE bit in flags is cleared to indicate that this socket is not currently waiting for more memory.

clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);

Mss_now is set to the current MSS for this socket. The function tcp_current_mss takes into account MTU discovery, the negotiated MSS, and the TCP_MAXSEG option.

mss_now = tcp_current_mss(sk);

Now, set up to send the data. Get iov and iovlen from the msghdr structure. Get ready to start copying the data into socket buffers.

iovlen = msg->msg_iovlen;
iov = msg->msg_iov;
copied = 0;
err = -EPIPE;

If the application called shutdown with how set to one, this means that a socket shutdown on the send side of the socket has been requested, so get out.

if (sk->err || (sk->shutdown&SEND_SHUTDOWN))
goto do_error;

The IO vector structure is the mechanism by which data is retrieved from user space into kernel space. Iovlen is the number of iov buffers queued up by the socket send call.

while (--iovlen >= 0) {

Seglen is the length of each iov, and from points to the data to be copied.

int seglen=iov->iov_len;
unsigned char * from=iov->iov_base;
while (seglen > 0) {
int copy;
skb = sk->write_queue.prev;

If send_head is null, there are not yet any segments queued for sending. Mss_now is the current negotiated MSS.

if (tp->send_head == NULL || (copy = mss_now - skb->len) <= 0) {

If send_head is NULL, it is necessary to allocate a new segment. Tcp_alloc_pskb is used as a TCP-specific wrapper for alloc_skb. It allocates a paged socket buffer. Paged socket buffers include the socket buffer shared info and are used for efficient handling of TCP segments and IP fragments. They are particularly suitable for network interfaces that have scatter-gather DMA capability that can send a chain of buffers via DMA with minimum overhead. Tcp_sendmsg calls select_size to round off the requested size to the nearest TCP segment size or page size determined from the MSS. Skb_entail sets up the flags and sequence numbers in the TCP control block and puts the new skb on the write queue. Copy is the amount of data to copy to the segment. It is set to mss_now, the most recent negotiated value of MSS, which is the best indicator of segment size at this point.

if (!tcp_memory_free(sk))
goto wait_for_sndbuf;
skb = tcp_alloc_pskb(sk,
select_size(sk, tp),0,sk->allocation);
if (!skb)
goto wait_for_memory;

Check to see whether we can use the hardware checksum capability.

if (sk->sk_route_caps &
    (NETIF_F_IP_CSUM | NETIF_F_NO_CSUM | NETIF_F_HW_CSUM))
    skb->ip_summed = CHECKSUM_HW;
skb_entail(sk, tp, skb);
copy = mss_now;
. . .

Copying the Data from User Space to the Socket Buffer

If possible, tcp_sendmsg tries to squeeze the data into the header portion of the skb before allocating a new segment, so it tries to determine the actual address where the data will be copied.

 . . .
if (copy > seglen)
copy = seglen;
if (skb_tailroom(skb) > 0) {

Skb_tailroom returns the amount of room in the end of the skb, and copy is set to the amount of available space if there is any. Skb_add_data copies the data to the skb and calculates the checksum while it is doing the copy. The partial checksum in progress is placed in the csum field of the socket buffer, skb. In "Linux Memory Allocation and Skbuffs," Table includes a description of the socket buffer structure.

if (copy > skb_tailroom(skb))
copy = skb_tailroom(skb);
if ((err = skb_add_data(skb, from, copy)) !=  0)
goto do_fault;
} else {

If there was no tailroom in the main part of the socket buffer, skb, we try to find room in the last fragment attached to skb. If there is no room in the fragment, we allocate a new page and attach it by putting a pointer to it at the end of the frags array in the shared info part of the skb. Merge is set to one if there is any room in the last page, and i is set to the number of frags already in the socket buffer. Page is declared to point to a memory-mapped page, and off is set to the offset to the start of the data to be copied from the msghdr structure. TCP_PAGE is used several times in the tcp_sendmsg function. It is a macro that updates the sndmsg_page field in the tcp_options structure to hold a pointer to the current page in the frags array. The macro, TCP_OFF, also defined in linux/net/ipv4/tcp.c, updates the sndmsg_off field with an updated offset into the page.

Both these macros are defined in tcp.c.

int merge = 0;
int i = skb_shinfo(skb)->nr_frags;
struct page *page = TCP_PAGE(sk);
int off = TCP_OFF(sk);

Tcp_sendmsg checks to see if there is space in the last page in the socket buffer by calling can_coalesce, which returns a one if there is room. Next, the function determines if a new page is needed or a complete new socket buffer must be allocated. To do this, tcp_sendmsg checks if the frag slots in the skb are full or whether the outgoing interface has scatter-gather capability. SG means that the interface hardware can efficiently exploit socket buffers with attached pages by doing segmented DMA.

if (can_coalesce(skb, i, page, off) && off != PAGE_SIZE) {
merge = 1;

We check to see if there is any reason why we can’t add a new fragment.

} else if (i == MAX_SKB_FRAGS ||
(!i &&  !(sk->route_caps&NETIF_F_SG))) {

Perhaps all the page slots are used or the interface is not capable of scatter-gather DMA. In any case, we know that we can’t add more data to the current segment. Therefore, we call tcp_mark_push to set the TCPCB_FLAG_PSH flag in the control buffer for this connection. This will mean that the PSH flag will be set in the TCP header of the current skb, and Tcp_sendmsg jumps to the new_segment label to allocate a new skb.

tcp_mark_push(tp, skb);
goto new_segment;
} else if (page) {

If a page has already been allocated, off is rounded up to the next L1 cache line boundary. Then, as an extra check, the rounded offset is compared with PAGE_SIZE. If it has reached the end of the page, there is no room left, so the page is released and the pointer to it is cleared.

off =  (off+L1_CACHE_BYTES-1)&~(L1_CACHE_BYTES-1);
if (off == PAGE_SIZE) {
TCP_PAGE(sk) = page = NULL;
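
The rounding expression applied to off is a common idiom for rounding a value up to a multiple of a power of two. A minimal sketch follows. The function name align_up is hypothetical, and the values 32 and 4096 mentioned afterward are only example cache line and page sizes.

```c
#include <assert.h>

/* Round off up to the next multiple of align. The mask trick only
 * works when align is a power of two. */
static unsigned int align_up(unsigned int off, unsigned int align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (off + align - 1) & ~(align - 1);
}
```

With a 32-byte cache line, an offset of 33 becomes 64, and an offset that is already aligned is left unchanged. With a 4096-byte page, an offset of 4090 rounds up to 4096, which is the case where the page is considered full.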

A new page is allocated if necessary, and copy, the length of data, is corrected for the page size.

if (!page) {

Allocate the new cache page.

if (!(page=tcp_alloc_page(sk)))
goto wait_for_memory;
off = 0;
if (copy > PAGE_SIZE-off)
copy = PAGE_SIZE-off;

Finally, tcp_sendmsg is ready to copy the data from user space to kernel space. It calls tcp_copy_to_page to do the copying, which in turn calls csum_and_copy_from_user to copy the data efficiently by simultaneously calculating a partial checksum. If tcp_copy_to_page returns an error, the allocated but empty page is attached to the sock structure for this open socket, sk, in the sndmsg_page field so the page will be de-allocated when the socket is released.

err = tcp_copy_to_page(sk, from,skb,page, off,copy);
if (err) {
if (TCP_PAGE(sk) == NULL) {
TCP_PAGE(sk) = page;
TCP_OFF(sk) = 0;
goto do_error;

Now that the copy has been done, the skb is updated to reflect the new data. If merge is nonzero, it means that the data was merged into the last frag, so the size field in that frag must be updated. Otherwise, a new page was allocated, so fill_page_desc is called, which updates the frag array in the skb with information about the new page. The sndmsg_page field in the tcp_options structure is updated to point to the next page, and the sndmsg_off field gets an updated offset into the page.

if (merge) {
skb_shinfo(skb)->frags[i-1].size += copy;
} else {
fill_page_desc(skb, i, page, off, copy);
if (TCP_PAGE(sk)) {
} else if (off + copy < PAGE_SIZE) {
TCP_PAGE(sk) = page;
TCP_OFF(sk) = off + copy;
. . .

Tcp_sendmsg Completion

At this point, most of the work of tcp_sendmsg is done. The data has been copied from user space to the socket buffer. The socket buffer’s frags array has been updated with a list of pages ready for sending segments. A zero value in the variable copied indicates that this is the first time through the while loop for the initial segment so the PSH flag in the TCP header is set to zero. As we will see in this section, the TCP header is not set directly at this point. Instead, the intended values are saved in the TCP control block for later when the queued socket buffers are removed for transmission. Stevens has an excellent discussion in Section 20.5 of the use of the PSH flag in TCP [STEV94].

. . .
if (!copied)
TCP_SKB_CB(skb)->flags &=  ~TCPCB_FLAG_PSH;

The write_seq field in the TCP options structure is updated with the amount of data processed in this trip through the while loop. The end_seq field in the TCP control block is also updated with the amount of data that was processed. From, which points to the data source in the msg argument, and copied, which holds the total number of bytes processed so far, are updated for this iteration. Iovlen is the number of iovs remaining in the msghdr structure, and seglen is the number of bytes left to copy from the current iov. Each pass through the outer loop processes one iov.

tp->write_seq += copy;
TCP_SKB_CB(skb)->end_seq += copy;
from += copy;
copied += copy;
if ((seglen -= copy) == 0 && iovlen == 0)
    goto out;
if (skb->len != mss_now || (flags&MSG_OOB))
    continue;
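
The nested loop structure of tcp_sendmsg, with an outer loop over the iovs and an inner loop that fills segments up to the MSS, can be sketched in user space. The following is a simplified model, not the kernel logic: count_segments is a hypothetical function that only counts how many segments of at most mss bytes would be needed, ignoring memory limits, pages, and flags.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Count the segments of at most mss bytes needed to hold every byte
 * described by the iovec array, filling each segment before starting
 * the next one. */
static size_t count_segments(const struct iovec *iov, int iovlen, size_t mss)
{
    size_t segments = 0;
    size_t room = 0;            /* free bytes left in the current segment */

    assert(mss > 0);
    while (--iovlen >= 0) {     /* outer loop: one pass per iov */
        size_t seglen = iov->iov_len;

        iov++;
        while (seglen > 0) {    /* inner loop: drain the current iov */
            size_t copy;

            if (room == 0) {    /* current segment is full: start a new one */
                segments++;
                room = mss;
            }
            copy = seglen < room ? seglen : room;
            room -= copy;
            seglen -= copy;
        }
    }
    return segments;
}
```

Two 1000-byte buffers written with an MSS of 1460 fill one full segment and part of a second, which shows how data from separate iovs is packed into the same segment.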

At this point, we decide if small segments should be pushed out or held. We call forced_push to check if we must send now no matter what the segment size is. If so, the PSH flag is set and the segments are transmitted.

if (forced_push(tp)) {
tcp_mark_push(tp, skb);
__tcp_push_pending_frames(sk, tp, mss_now,TCP_NAGLE_PUSH);
} else if (skb == tp->send_head)
tcp_push_one(sk, mss_now);

This is where tcp_sendmsg ends up if there is an insufficient number of buffers or pages. It waits until a sufficient number of buffers is available.

if (copied)
tcp_push(sk, tp, flags&~MSG_MORE,mss_now, TCP_NAGLE_PUSH);
if ((err = wait_for_tcp_memory(sk,&timeo)) != 0)
goto do_error;
mss_now = tcp_current_mss(sk);

This is the label where the code jumped if it is time to push the remaining data and exit the function. The value of copied is returned, indicating to the application program how much data was transmitted.

if (copied)
tcp_push(sk, tp, flags, mss_now,tp->nonagle);
return copied;

These last three labels mean that we are at the end of tcp_sendmsg. This is where we finally end up if errors were detected in the process of copying and processing the data.

if (skb->len == 0) {
if (tp->send_head == skb)
tp->send_head = NULL;
__skb_unlink(skb, skb->list);
tcp_free_skb(sk, skb);
if (copied)
goto out;
err = tcp_error(sk, flags,err);
return err;

TCP Output

The previous discussion about TCP in Section focused primarily on how the TCP protocol was interfaced to the socket, and how data was removed from the application and placed in queues for transmission. In this section, we cover the actual packet transmission (see Figure).

TCP transmit sequence.

Transmit the TCP Segments, the Tcp_transmit_skb Function

Now it is time to transmit the TCP segments, and the function tcp_transmit_skb does the actual packet transmission. It sends the packets that are queued to the socket. It can be called from anywhere in TCP state processing when there is a request to send a segment. Earlier, we saw how tcp_sendmsg readied the segments for transmission and queued them to the socket’s write queue. In tcp_transmit_skb, we build the TCP packet header and pass the packet on to IP.

int tcp_transmit_skb(struct sock *sk,struct sk_buff *skb)

If we receive a NULL socket buffer, we do nothing.

if(skb != NULL) {

Inet points to the inet options structure, and tp points to the TCP options structure. It is in tcp_opt where the socket keeps most of the configuration and connection state information for TCP. Tcb points to the TCP control buffer containing most of the flags as well as the partially constructed TCP header. Th is a pointer to the TCP header. Later, it will point to the header part of the skb, and sysctl_flags is for some critical parameters configured via sysctl and setsockopt calls.

struct inet_opt *inet = inet_sk(sk);
struct tcp_opt *tp = tcp_sk(sk);
struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
int tcp_header_size = tp->tcp_header_len;
struct tcphdr *th;
int sysctl_flags;
int err;
sysctl_flags = 0;

Here, we check to see if this outgoing packet is a SYN packet, and if so, we check for the presence of certain TCP options in the flags field of the control buffer structure. This is because the TCP header length may need to be extended to account for certain TCP options, which generally include timestamps, window scaling, and Selective Acknowledgement (SACK) [RFC 2018].

if (tcb->flags & TCPCB_FLAG_SYN) {
    tcp_header_size = sizeof(struct tcphdr) + TCPOLEN_MSS;
    if (sysctl_tcp_timestamps) {
        tcp_header_size += TCPOLEN_TSTAMP_ALIGNED;
        sysctl_flags |= SYSCTL_FLAG_TSTAMPS;
    }
    if (sysctl_tcp_window_scaling) {
        tcp_header_size += TCPOLEN_WSCALE_ALIGNED;
        sysctl_flags |= SYSCTL_FLAG_WSCALE;
    }
    if (sysctl_tcp_sack) {
        sysctl_flags |= SYSCTL_FLAG_SACK;
        if (!(sysctl_flags & SYSCTL_FLAG_TSTAMPS))
            tcp_header_size += TCPOLEN_SACKPERM_ALIGNED;
    }
} else if (tp->eff_sacks) {

The following processing is for the SACK option. If we are sending SACKs in this segment, we increment the header to account for the number of SACK blocks that are being sent along with this packet. The header length is adjusted by eight for each SACK block.

tcp_header_size += (TCPOLEN_SACK_BASE_ALIGNED  +
(tp->eff_sacks * TCPOLEN_SACK_PERBLOCK));
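
To make the header size arithmetic concrete, the following sketch recomputes it in user space. The function names are hypothetical, and the option lengths are assumptions based on the usual aligned sizes of the TCP options (4 bytes for MSS, 12 for timestamps, 4 for window scale, 4 for SACK-permitted, and 4 plus 8 per block for SACK).

```c
#include <assert.h>

/* Assumed lengths in bytes; names are local to this sketch. */
enum {
    SK_TCPHDR_LEN             = 20,
    SK_OLEN_MSS               = 4,
    SK_OLEN_TSTAMP_ALIGNED    = 12,
    SK_OLEN_WSCALE_ALIGNED    = 4,
    SK_OLEN_SACKPERM_ALIGNED  = 4,
    SK_OLEN_SACK_BASE_ALIGNED = 4,
    SK_OLEN_SACK_PERBLOCK     = 8
};

/* Header size of a SYN segment for a given set of enabled options. */
static int syn_header_size(int tstamps, int wscale, int sack)
{
    int size = SK_TCPHDR_LEN + SK_OLEN_MSS;

    if (tstamps)
        size += SK_OLEN_TSTAMP_ALIGNED;
    if (wscale)
        size += SK_OLEN_WSCALE_ALIGNED;
    /* SACK-permitted needs its own space only when timestamps are off. */
    if (sack && !tstamps)
        size += SK_OLEN_SACKPERM_ALIGNED;
    return size;
}

/* Header size of a non-SYN segment carrying eff_sacks SACK blocks. */
static int data_header_size(int base_header_len, int eff_sacks)
{
    if (eff_sacks > 0)
        return base_header_len + SK_OLEN_SACK_BASE_ALIGNED +
               eff_sacks * SK_OLEN_SACK_PERBLOCK;
    return base_header_len;
}
```

With all three options enabled, a SYN header comes to 40 bytes under these assumptions, because the SACK-permitted option adds no extra length when timestamps are also present.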

Now that we know the size of the TCP header, we adjust the skb to allow for sufficient space.

th = (struct tcphdr *) skb_push(skb, tcp_header_size);
skb->h.th = th;
skb_set_owner_w(skb, sk);

At this point, the TCP header is built, space for the checksum field is reserved, and the header size is calculated. Some of the header fields used to build the header are in inet_opt structure, some are in the TCP control buffer, and some are in the TCP options structure.

th->source = inet->sport;
th->dest = inet->dport;
th->seq = htonl(tcb->seq);
th->ack_seq = htonl(tp->rcv_nxt);
*(((__u16 *)th)+6) = htons(((tcp_header_size >> 2) << 12) | tcb->flags);
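
The assignment to the seventh 16-bit word of the header packs two things into one store: the header length, expressed in 32-bit words, goes in the top four bits, and the flag bits occupy the low bits. The following sketch shows that packing in host byte order. The function name is hypothetical, and combining the flags with a bitwise OR is an assumption of this example.

```c
#include <assert.h>
#include <stdint.h>

/* Build the 16-bit word that holds the 4-bit data offset (header
 * length in 32-bit words) and the flag bits, in host byte order. */
static uint16_t tcp_offset_flags(unsigned int header_bytes, unsigned int flags)
{
    assert(header_bytes % 4 == 0 && header_bytes >= 20 && header_bytes <= 60);
    return (uint16_t)(((header_bytes >> 2) << 12) | (flags & 0x0FFF));
}
```

A 20-byte header with only the ACK bit (0x10) set yields 0x5010, since 20 bytes is five 32-bit words.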

The advertised window size is determined. If this packet is a SYN packet, the window size is not scaled; otherwise, the window size is scaled by calling tcp_select_window.

if (tcb->flags & TCPCB_FLAG_SYN) {
th->window = htons(tp->rcv_wnd);
} else {
th->window = htons(tcp_select_window(sk));

If urgent mode is set, we calculate the urgent pointer and set the URG flag in the TCP header.

if (tp->urg_mode && between(tp->snd_up, tcb->seq+1,
tcb->seq+0xFFFF)) {
th->urg_ptr =  htons(tp->snd_up-tcb->seq);
th->urg = 1;
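
The between test used for the urgent pointer has to work in sequence number space, which wraps around at 32 bits. A minimal sketch of such a comparison is shown here. The function names seq_between and urgent_offset are hypothetical, and the comparison relies on unsigned subtraction wrapping modulo 2^32.

```c
#include <assert.h>
#include <stdint.h>

/* True when seq2 <= seq1 <= seq3 in modulo-2^32 sequence space. */
static int seq_between(uint32_t seq1, uint32_t seq2, uint32_t seq3)
{
    return (uint32_t)(seq3 - seq2) >= (uint32_t)(seq1 - seq2);
}

/* Returns 1 and stores the 16-bit urgent offset when snd_up is within
 * reach of the segment starting at seq; returns 0 otherwise. */
static int urgent_offset(uint32_t snd_up, uint32_t seq, uint16_t *urg_ptr)
{
    if (!seq_between(snd_up, seq + 1, seq + 0xFFFF))
        return 0;
    *urg_ptr = (uint16_t)(snd_up - seq);
    return 1;
}
```

The range is limited to 0xFFFF because the urgent pointer field in the TCP header is only 16 bits wide.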

Here we actually build the TCP options part of the packet header. We check the sysctl_flags set earlier to see whether the timestamp, window scaling, or SACK options are enabled. If this is a SYN segment, we call tcp_syn_build_options, which can include the window scale option. For non-SYN packets, we call tcp_build_and_update_options, which does not include window scaling.

if (tcb->flags & TCPCB_FLAG_SYN) {
tcp_syn_build_options((__u32 *)(th + 1),
(sysctl_flags & SYSCTL_FLAG_TSTAMPS),
(sysctl_flags & SYSCTL_FLAG_SACK),
(sysctl_flags & SYSCTL_FLAG_WSCALE),
} else {
tcp_build_and_update_options((__u32 *)(th + 1),
tp, tcb->when);

We call TCP_ECN_send to send explicit congestion notification. This TCP modification [RFC 3168] changes the TCP header to make it slightly incompatible with the header specified by RFC 793.

 TCP_ECN_send(sk, tp, skb, tcp_header_size);

We calculate the checksum. We check the flags to see if we are sending an ACK or sending data, and we update the delayed acknowledgment status accordingly. If we are sending data, we update the congestion window and mark the send timestamp. In addition, we increment the counter to indicate the number of TCP segments that have been sent.

tp->af_specific->send_check(sk, th, skb->len, skb);
if (tcb->flags & TCPCB_FLAG_ACK)
    tcp_event_ack_sent(sk);
if (skb->len != tcp_header_size)
    tcp_event_data_sent(tp, skb);

Now we call the actual transmission function to send this segment. The segment will be queued up for processing by the next stage whether the packet has an internal or an external destination.

err = tp->af_specific->queue_xmit(skb);
if (err <= 0)
return err;

A return value less than zero tells the caller that the packet was dropped. We return all errors except the congestion notification, NET_XMIT_CN, which is mapped to zero because it doesn't indicate to the caller that the packet was dropped.

return err == NET_XMIT_CN ? 0 : err;
return -ENOBUFS;

Some Key TCP Data Structures

This section describes some important structures used in TCP. Most of the data structures are allocated on a per-socket or per-connection basis. The longest and most complicated of these structures is the TCP options structure, discussed below under TCP Options Structure. This structure holds the TCP options and most of the variables used to maintain the TCP state machine, and it is accessed through the sock structure.

TCP Control Buffer

TCP is completely asynchronous. All socket-level writes are insulated from the actual packet transmission. As the application writes data into the socket, TCP allocates socket buffers to hold the data. Each packet requires control information, which is passed from the tcp_sendmsg function to the transmission part of TCP. The TCP control block structure, tcp_skb_cb, is defined in the file linux/include/net/tcp.h.

struct tcp_skb_cb {
union {

Inet_skb_parm and inet6_skb_parm hold the IP options from an incoming packet: inet_skb_parm holds the options for IPv4, and inet6_skb_parm holds the options for IPv6.

struct inet_skb_parm h4;
#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
struct inet6_skb_parm h6;
#endif
} header;

Seq is the starting sequence number of the outgoing segment. End_seq is the sequence number just past the end of the segment: the starting sequence number, plus one if the segment carries a SYN, plus one if it carries a FIN, plus the length of the data in this segment.

end_seq = seq + SYN + FIN + length of the current segment

When is used for calculating RTT.

__u32 seq;
__u32 end_seq;
__u32 when;
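
The end_seq rule can be written out as a short sketch. The function name segment_end_seq is hypothetical; the flag values follow the bit positions of FIN and SYN in the TCP header, and the unsigned arithmetic lets the result wrap around naturally at the top of the sequence space.

```c
#include <assert.h>
#include <stdint.h>

#define FLAG_FIN 0x01u   /* same bit positions as in the TCP header */
#define FLAG_SYN 0x02u

/* Hypothetical helper: sequence number just past this segment.
 * SYN and FIN each occupy one unit of sequence space. */
static uint32_t segment_end_seq(uint32_t seq, uint8_t flags, uint32_t data_len)
{
    uint32_t end = seq + data_len;

    if (flags & FLAG_SYN)
        end++;
    if (flags & FLAG_FIN)
        end++;
    return end;   /* wraps modulo 2^32, like real sequence numbers */
}
```

For example, a bare SYN starting at sequence number 100 ends at 101, which is why the first data byte of a connection carries the initial sequence number plus one.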

Flags contains the TCP flags field in the TCP header. The values in this field should match, bit for bit, the flags in the actual TCP header; Table lists the values. The last two items in the table are for explicit congestion control and are not defined in the original TCP specification [RFC 793]. They are taken from the 6-bit "reserved" area of the TCP header.

__u8 flags;

The sacked field holds the state flags for the Selective Acknowledge (SACK) and Forward Acknowledge (FACK) states. The possible states and their values are listed in Table.

__u8 sacked;

Sacked State Flags

The next field, urg_ptr, is the value of the TCP header urgent pointer. The TCPCB_FLAG_URG must be set when this field is valid. The last field, ack_seq, is equivalent to the acknowledgment field in the TCP header.

__u16 urg_ptr;
__u32 ack_seq;
} ;

TCP Options Structure

The sock structure, discussed earlier, contains the support necessary to maintain connection state. Linux TCP, however, keeps its protocol-specific state in the TCP options structure, tcp_opt, which is defined in the file linux/include/linux/tcp.h. In addition to TCP options, it contains the send and receive sequence variables, TCP window management, and everything needed to manage slow start and congestion avoidance. Tcp_opt is presented in its own section because of its complexity and because it contains fields to help manage many aspects of the TCP protocol. Tcp_opt is allocated as part of the sock structure for a SOCK_STREAM type socket.

struct tcp_opt {
int tcp_header_len;

Pred_flags is used to determine if TCP header prediction should be done. Header prediction is used for fast path TCP reception.

__u32 pred_flags;

The following are the send and receive sequence variables [RFC 793]. These variables govern the sequence numbers exchanged between peers during data exchange through a connection. Rcv_nxt is the "receive next" sequence variable, RCV.NXT [RFC 793]. It represents the next sequence number that is expected in incoming segments.

__u32 rcv_nxt;

Snd_nxt is the "send next" sequence variable, SND.NXT. It is the next sequence number to be sent.

__u32 snd_nxt;

This is the send unacknowledged sequence variable, SND.UNA, the oldest unacknowledged sequence number. It should be the next byte for which we should receive an ACK.

__u32 snd_una;

Snd_sml is the last byte in the most recently transmitted small packet.

__u32 snd_sml;

Rcv_tstamp is the timestamp of the last received acknowledge. It is used for maintaining keepalives when the SO_KEEPALIVE socket option is set and for calculating RTT.

__u32 rcv_tstamp;

Lsndtime is the timestamp of the most recently sent data packet. It is used for restarting the congestion window after the connection has been idle.

__u32 lsndtime;

The ack structure is for controlling delayed acknowledgment. Delayed acknowledgment is implemented to increase TCP efficiency by holding back on acknowledging received data until there is return data to carry the ACK. This reduces the number of ACKs transmitted by combining the ACK with an outgoing data packet wherever possible. A comprehensive explanation of delayed acknowledgments is in Stevens, Section 19.3 [STEV94]. Pending contains the quick acknowledge state, shown in Table. Quick is the scheduled number of quick acknowledgments. If the field pingpong contains the value one, the normal delayed acknowledgment mode is enabled. When pingpong is zero, quick acknowledgment mode is enabled and ACKs are sent as soon as possible. Blocked indicates that sending of an ACK was blocked for some reason.

struct {
__u8 pending;
__u8 quick;
__u8 pingpong;
__u8 blocked;

TCP Acknowledge State Values, tcp_ack_state_t

Ato is the calculated delayed acknowledge timeout interval, and timeout holds the actual timeout value for the delayed acknowledgment. Lrcvtime holds the time the last data packet was received.

__u32 ato;
unsigned long timeout;
__u32 lrcvtime;

Last_seg_size is the size of the most recent incoming segment. Rcv_mss holds the best guess of the received MSS of the peer machine.

__u16 last_seg_size;
__u16 rcv_mss;
} ack;

The ucopy structure holds the prequeue for fast copying of data from packets to application space. For more information, see the section on TCP prequeue processing in "Receiving the Data in the Transport Layer."

struct {
struct sk_buff_head prequeue;
struct task_struct *task;
struct iovec *iov;
int memory;
int len;
} ucopy;

Snd_wl1 holds the sequence number of the segment used for the last window update, the SND.WL1 variable [RFC 793]. Snd_wnd is the size of the window that the peer is willing to receive, the SND.WND variable [RFC 793]. In addition, max_window is the largest window received from the peer during the life of this connection.

__u32 snd_wl1;
__u32 snd_wnd;
__u32 max_window;

Pmtu_cookie is set to the last PMTU (Path Maximum Transmission Unit) for the connection referenced by this socket, mss_cache is the current sending MSS, and mss_clamp is the maximum MSS negotiated at connection setup.

__u32 pmtu_cookie;
__u16 mss_cache;
__u16 mss_cache_std;
__u16 mss_clamp;

Ext_header_len is the TCP/IP packet header length, including IP and IP options. Ca_state holds the fast retransmit machine states listed in Table.

__u16 ext_header_len;
__u16 ext2_header_len;
__u8 ca_state;

Tcp_ca_state, Fast Retransmit States

Retransmits contains the number of unrecovered RTOs (Retransmit Timeouts). Reordering is a packet reordering metric representing the maximum distance that a packet can be displaced in the stream. Queue_shrunk indicates that the write queue has been reduced in size; a packet has been removed from the wmem_queued socket buffer queue.

__u8 retransmits;
__u8 reordering;
__u8 queue_shrunk;

The following field is for the TCP_DEFER_ACCEPT socket option, which when set allows an application program to sleep until data arrives on a socket. The application program is awakened when data arrives. This option is expressed in seconds, but is converted into the number of retries and set in the defer_accept field.

__u8 defer_accept;

TCP must measure round-trip time (RTT) to support timeout and retransmission based on the Van Jacobson algorithm [JACOB88]; also refer to the Karn & Partridge algorithm [KARN91]. Stevens has a discussion of RTT measurement in Sections 21.3 and 21.4 [STEV94]. The next few fields in the TCP options structure are used for measuring RTT. Backoff is the exponential backoff count applied to the retransmit timeout, srtt is the smoothed RTT value, and mdev is the mean deviation. Mdev_max is the maximum mean deviation for the most recent RTT period, and rttvar is the smoothed value of mdev_max. Rtt_seq is the sequence number at which rttvar is next updated, and rto is the retransmit timeout.

 __u8 backoff;
__u32 srtt;
__u32 mdev;
__u32 mdev_max;
__u32 rttvar;
__u32 rtt_seq;
__u32 rto;
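
To show how a smoothed RTT, a deviation estimate, and a retransmit timeout relate to each other, here is a minimal sketch of the textbook Jacobson estimator in the form standardized by RFC 6298. It is not the kernel's code: the kernel keeps scaled fixed-point values and tracks mdev_max separately, which this sketch does not reproduce. The struct rtt_est and the function rtt_update are hypothetical names, and all times are in milliseconds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical estimator state; all values in milliseconds. */
struct rtt_est {
    int      has_sample;  /* 0 until the first RTT measurement arrives */
    uint32_t srtt;        /* smoothed round-trip time */
    uint32_t rttvar;      /* smoothed mean deviation */
    uint32_t rto;         /* resulting retransmit timeout */
};

/* Fold one RTT measurement into the estimate. */
static void rtt_update(struct rtt_est *e, uint32_t sample_ms)
{
    assert(e != NULL);
    if (!e->has_sample) {
        /* First measurement: seed the estimator. */
        e->srtt = sample_ms;
        e->rttvar = sample_ms / 2;
        e->has_sample = 1;
    } else {
        uint32_t err = e->srtt > sample_ms ? e->srtt - sample_ms
                                           : sample_ms - e->srtt;
        /* rttvar gets 1/4 of the new error, srtt 1/8 of the new sample. */
        e->rttvar = (3 * e->rttvar + err) / 4;
        e->srtt = (7 * e->srtt + sample_ms) / 8;
    }
    e->rto = e->srtt + 4 * e->rttvar;
}
```

The deviation term is what lets the timeout widen quickly when measurements become erratic, instead of following only the average.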

The next three fields are for calculating the number of packets that have been transmitted but not acknowledged, known as packets in flight. Packets_out is the amount of unacknowledged data that has been sent, expressed in packets: the number of bytes of data divided by the segment size. Left_out is the number of packets that have arrived out of order, plus the number of packets that have been lost. Retrans_out is the number of retransmitted segments.

__u32 packets_out;
__u32 left_out;
__u32 retrans_out;
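
As a rough illustration of how these three counters combine, the sketch below computes packets in flight as the packets sent, minus those that have left the network, plus those retransmitted. The struct flight_counters and both function names are hypothetical, and the comparison against a congestion window is an assumption added for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical copy of the three counters described above. */
struct flight_counters {
    uint32_t packets_out;   /* sent and not yet acknowledged */
    uint32_t left_out;      /* arrived out of order or presumed lost */
    uint32_t retrans_out;   /* retransmitted and still outstanding */
};

/* Estimate of the packets currently in the network. */
static uint32_t packets_in_flight(const struct flight_counters *c)
{
    assert(c != NULL);
    return c->packets_out - c->left_out + c->retrans_out;
}

/* Assumed policy for this sketch: sending is allowed while the number
 * of packets in flight is below the congestion window. */
static int may_send(const struct flight_counters *c, uint32_t cwnd)
{
    return packets_in_flight(c) < cwnd;
}
```

With 10 packets out, 3 that have left the network, and 2 retransmitted, the estimate is 9 packets in flight.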

The following six fields are used for slow start and congestion control as specified by the Nagle algorithm [RFC 896], the Karn and Partridge algorithm [KARN91], and the Van Jacobson algorithm [JACOB88]. In addition, contains suggested pseudo-code for reducing the congestion window. Snd_ssthresh is the slow start size threshold. Snd_cwnd is the sending side congestion window, and snd_cwnd_cnt is the linear increase counter. Snd_cwnd_clamp is the maximum allowable value for snd_cwnd. Snd_cwnd_used is the window used variable, and snd_cwnd_stamp is the send congestion window timestamp.

__u32 snd_ssthresh;
__u32 snd_cwnd;
__u16 snd_cwnd_cnt;
__u16 snd_cwnd_clamp;
__u32 snd_cwnd_used;
__u32 snd_cwnd_stamp;
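
The roles of the threshold, the window, the linear increase counter, and the clamp can be seen in a simplified sketch of slow start and congestion avoidance. This is a sketch under stated assumptions rather than the kernel's implementation: the struct cwnd_state and both functions are hypothetical, the window is counted in segments, and the timeout reaction shown (halve the threshold, restart from one segment) is the classic Jacobson behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the congestion fields, counted in segments. */
struct cwnd_state {
    uint32_t snd_cwnd;        /* congestion window */
    uint32_t snd_ssthresh;    /* slow start threshold */
    uint32_t snd_cwnd_cnt;    /* linear increase counter */
    uint32_t snd_cwnd_clamp;  /* upper bound for snd_cwnd */
};

/* Called once for each ACK that acknowledges new data. */
static void cwnd_on_ack(struct cwnd_state *s)
{
    assert(s != NULL);
    if (s->snd_cwnd <= s->snd_ssthresh) {
        /* Slow start: one more segment per ACK, doubling per RTT. */
        if (s->snd_cwnd < s->snd_cwnd_clamp)
            s->snd_cwnd++;
    } else if (s->snd_cwnd_cnt >= s->snd_cwnd) {
        /* Congestion avoidance: one more segment per window of ACKs. */
        if (s->snd_cwnd < s->snd_cwnd_clamp)
            s->snd_cwnd++;
        s->snd_cwnd_cnt = 0;
    } else {
        s->snd_cwnd_cnt++;
    }
}

/* Called when the retransmit timer expires. */
static void cwnd_on_timeout(struct cwnd_state *s)
{
    uint32_t half = s->snd_cwnd / 2;

    s->snd_ssthresh = half > 2 ? half : 2;
    s->snd_cwnd = 1;
    s->snd_cwnd_cnt = 0;
}
```

Starting from a window of 2 and a threshold of 4, three ACKs grow the window to 5; after that, a full window of ACKs is needed for each further increase.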

These three fields are for timers used widely on both send and receive sides. Timeout is a variable to hold the retransmission timer value, retransmit_timer is the actual timer for TCP retransmit, and delack_timer is the delayed acknowledge timer.

unsigned long timeout;
struct timer_list retransmit_timer;
struct timer_list delack_timer;

The socket buffer queue, out_of_order_queue, holds received out of order segments. Af_specific is a pointer to the AF_INET family-specific operations for TCP. Send_head points to the socket buffers queued for transmitting.

struct sk_buff_head out_of_order_queue;
struct tcp_func *af_specific;
struct sk_buff *send_head;

Rcv_wnd is the current receive window size, which is the RCV.WND variable in RFC 793. Rcv_wup, receive window update, holds the value of rcv_nxt at the time the last window update was sent. Write_seq is the send side sequence number, equal to the end of the data in the last send buffer, plus one byte. Pushed_seq is the last pushed sequence number. Copied_seq is the position of the start of received data that has not yet been read by the user process.

__u32 rcv_wnd;
__u32 rcv_wup;
__u32 write_seq;
__u32 pushed_seq;
__u32 copied_seq;

The following fields are used for received TCP options. The TCP options are listed in Table. Some options are received only in SYN packets, and others can be found in any data packet. The next field, tstamp_ok, indicates that a timestamp was received in the SYN packet. Wscale_ok indicates that the window scale option was received in a SYN packet. Window scaling increases TCP throughput on paths with a large bandwidth-delay product by allowing more unacknowledged data to be in flight. The window scale option specifies a shift count, which is the number of bits by which the window size is shifted left. This allows the 16-bit window field in the TCP header to represent a much larger window than 65535. See Stevens, Section 24.4 for a discussion of window scaling [STEV94]. Sack_ok indicates that the TCPOPT_SACK option was received in a SYN packet, and saw_tstamp indicates that a timestamp option, TCPOPT_TIMESTAMP, was seen and processed in the most recent packet.

char tstamp_ok,
wscale_ok,
sack_ok;
char saw_tstamp;
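
The shift count arithmetic behind window scaling is easy to show in isolation. In this sketch the function names scaled_window and window_to_wire are hypothetical; the limit of 14 on the shift count comes from RFC 1323.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WSCALE 14   /* largest shift count allowed by RFC 1323 */

/* Expand the 16-bit window field from a received header into the real
 * window size, using the shift count negotiated in the SYN exchange. */
static uint32_t scaled_window(uint16_t wire_window, uint8_t shift)
{
    assert(shift <= MAX_WSCALE);
    return (uint32_t)wire_window << shift;
}

/* Reduce a real window size to the 16-bit value placed in an outgoing
 * header, clamping to the largest value the field can hold. */
static uint16_t window_to_wire(uint32_t window, uint8_t shift)
{
    uint32_t v;

    assert(shift <= MAX_WSCALE);
    v = window >> shift;
    return v > 0xFFFFu ? 0xFFFFu : (uint16_t)v;
}
```

With a shift count of 7, the largest header value, 65535, stands for a window of 8,388,480 bytes. The low-order bits are lost when scaling down, so a window is always advertised in multiples of 2 to the power of the shift count.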

TCP Options

Snd_wscale holds the window scaling factor, which was received from the peer. Rcv_wscale is the window scaling factor that is sent to the peer. Both these fields are used with the window scaling option, TCPOPT_WINDOW. Nonagle is set to the number one when it holds the TCP_NODELAY socket option, or set to the number two when it holds the TCP_CORK option. These two socket options are mutually exclusive, and both have the effect of disabling the Nagle algorithm but in opposite ways. When the TCP_NODELAY option is set, data is sent out as soon as possible in small segments without waiting for enough data to fill a full-sized segment and without waiting for any outstanding small segments to be acknowledged. However, if the TCP_CORK option is set, TCP holds on to the small segments as long as the option remains set. The variable is named nonagle because it disables the Nagle algorithm, which allows only one small segment to be outstanding. No other segments can be sent until the first segment is acknowledged by the peer. For a complete discussion of the Nagle algorithm, refer to RFC 896 or review Section 19.4 in Stevens where the algorithm is discussed in detail [STEV94].

__u8 snd_wscale;
__u8 rcv_wscale;
__u8 nonagle;

Keepalive_probes holds the value of the TCP_KEEPCNT socket option. This option, unique to Linux, specifies the number of keepalive probes that are sent before the connection is dropped. This option is applicable when the SO_KEEPALIVE socket option is also set.

__u8 keepalive_probes;

The next four fields are to support the Protection Against Wrapped Sequence Numbers (PAWS) algorithm [RFC 1323] and Round Trip Time Measurement (RTTM). For a complete analysis of PAWS, see RFC 1323. In addition, Stevens has a complete discussion of RTTM in Section 21.4, and a discussion of the timestamp option and PAWS in Sections 24.5 and 24.6 [STEV94]. Rcv_tsval is the received time stamp value and rcv_tsecr is the value of the received time stamp echo. Ts_recent is the specific received time stamp that will be echoed back in the next timestamp option to be sent. Ts_recent_stamp is used for aging; it is the time that the received timestamp, ts_recent, was stored.

__u32 rcv_tsval;
__u32 rcv_tsecr;
__u32 ts_recent;
long ts_recent_stamp;

The following five fields are for Selective Acknowledgment (SACK) [RFC 2018]. User_mss is the MSS requested by the application program, and dsack indicates that a Duplicate SACK (D-SACK) is scheduled to be sent; see RFC 2018, Section 4. Eff_sacks is the size of the array of SACK blocks to send in the next packet. Duplicate_sack is the D-SACK block, and the selective_acks array contains the actual SACKs.

__u16 user_mss;
__u8 dsack;
__u8 eff_sacks;
struct tcp_sack_block duplicate_sack[1];
struct tcp_sack_block selective_acks[4];

Window_clamp is the maximum-sized TCP window to advertise, and rcv_ssthresh is the current window clamp, the maximum window size used during the slow start phase. Probes_out is the number of zero window probes that have been sent and are still unanswered. Stevens [STEV94] Chapter 22 discusses window probes as part of the TCP persist timer. Num_sacks is the number of SACK blocks. Advmss is the advertised MSS. Syn_retries holds the value of the TCP_SYNCNT option, the number of allowed SYN retries.

__u32 window_clamp;
__u32 rcv_ssthresh;
__u8 probes_out;
__u8 num_sacks;
__u16 advmss;
__u8 syn_retries;

Ecn_flags are for Explicit Congestion Notification (ECN). This field contains the ECN status bits, which correspond to the last two bits in byte 13 of the TCP header. RFC 3168 contains the specification for ECN. See Section 6 in RFC 3168 for a discussion of the status flags.

__u8 ecn_flags;

Prior_ssthresh holds the slow start threshold value saved from the start of the congestion recovery phase. It is the previous value of the ssthresh. Lost_out is the number of lost packets, the number of segments that were sent but not acknowledged. Sacked_out is the number of segments that were sent by this side, have arrived at the receiver, and have been acknowledged with SACKs. Fackets_out is the number of transmitted packets that have been Forward Acknowledged (FACKed). High_seq is set to the value of snd_nxt when congestion is detected or the loss state is entered. For more information on FACKs, see Mathis and Mahdavi’s papers on forward acknowledgement [MATH96] and [MATH97].

__u16 prior_ssthresh;
__u32 lost_out;
__u32 sacked_out;
__u32 fackets_out;
__u32 high_seq;

Retrans_stamp is set to the timestamp of the most recent retransmit. It is used in the SYN_SENT state to retrieve the time the last SYN packet was sent. Undo_marker indicates when tracking of retransmits has started; it is set to the snd_una variable. Undo_retrans is the number of retransmits that need to be undone if possible. Urg_seq is the sequence number of a received urgent pointer, and urg_data contains the saved byte of Out-of-Band (OoB) data, plus the associated control flags.

__u32 retrans_stamp;
__u32 undo_marker;
int undo_retrans;
__u32 urg_seq;
__u16 urg_data;

Urg_mode indicates that the connection is in urgent mode, which is entered when data is sent with the MSG_OOB flag set. Snd_up is the urgent pointer to be sent.

__u8 urg_mode;
__u32 snd_up;

The following four fields are used for listening for connections. The first two are for the SYN acknowledge hash table. The SYN table exists so that sockets in listening mode can efficiently match incoming connection requests. Listen_opt contains the SYN table, syn_table, a hash table of open_requests. Syn_wait_lock is a read-write lock protecting the SYN table. There is already a master lock at the socket level, but this additional lock is acquired in read mode from tcp_get_info and acquired in write mode wherever the syn_table is updated. The next two fields, accept_queue and accept_queue_tail, hold the list of established connections of sockets that are children of this open socket.

struct tcp_listen_opt *listen_opt;
rwlock_t syn_wait_lock;
struct open_request *accept_queue;
struct open_request *accept_queue_tail;

Write_pending indicates that a socket-level write request is pending.


Keepalive_time is the amount of time the connection is allowed to remain idle before keepalive probes are sent. It is initialized to two hours by the constant TCP_KEEPALIVE_TIME in file tcp.h. It can be changed with the TCP_KEEPIDLE socket option. Keepalive_intvl is the time between each transmission of a keepalive probe and is initialized to 75 seconds. It can be changed via the socket option, TCP_KEEPINTVL. Both of these TCP socket options are unique to Linux and therefore are not portable. Linger2 holds the value of another unique Linux TCP socket option, TCP_LINGER2. This value governs the lifetime of orphaned sockets in the FIN_WAIT2 state. It overrides the default value of one minute in the TCP_FIN_TIMEOUT constant in file tcp.h.

unsigned int keepalive_time;
unsigned int keepalive_intvl;
int linger2;

Finally, the last field in the TCP options structure, last_synq_overflow, is for the support of syncookies, a security feature in Linux TCP. Syncookies are a mechanism to protect TCP from a particular type of denial-of-service (DoS) attack in which an attacker sends out a flood of SYN packets. Syncookies are enabled via the tcp_syncookies sysctl value.

unsigned long last_synq_overflow;
} ; 

TCP Timers

This chapter is about sending data, the transmit side of the TCP/IP stack. As discussed earlier, TCP maintains the connection state internally and requires timers to keep track of events. TCP requires three timers to maintain the state on the transmit side of the protocol. These three timer functions could be implemented in one timer, but Linux uses three separate timers, built on the Linux timer facility discussed earlier. When timers are initialized, they are given an associated function that is called when the timer goes off. Of course, each timer is completely reentrant, and each timer function for TCP is passed a pointer to the sock structure. The timer function uses the sock to know which connection it is dealing with. Because the write timer serves both as the retransmit timer and as the window probe timer, there are four timer roles in all; the four timers are listed in Table.

TCP Transmit Timers

The timer functions can be found in the file linux/net/ipv4/tcp_timer.c. Both the data pointer and the timer function pointer for each of the retransmit timers are maintained in the TCP options part of the sock structure, described in the TCP Options Structure section. The keepalive timer is maintained directly in the sock structure. In addition, functions to manage the timers, also in the file linux/net/ipv4/tcp_timer.c, create, initialize, and delete all the timers. These functions are declared in file linux/include/net/tcp.h. The function tcp_init_xmit_timers initializes all the TCP timers.

void tcp_init_xmit_timers(struct sock *sk);

This function, tcp_clear_xmit_timers, clears all the TCP timers.

void tcp_clear_xmit_timers(struct sock *sk);

Linux TCP provides two functions to manage the individual retransmit, zero probe, and delayed acknowledgment timers. The first function, tcp_clear_xmit_timer, deletes the timer specified by the argument what, which specifies one of the timers from Table.

static inline void tcp_clear_xmit_timer
(struct sock *sk, int what);

The second function, tcp_reset_xmit_timer, resets the timer to expire at the time specified by the argument when.

static inline void tcp_reset_xmit_timer
(struct sock *sk, int what, unsigned long when);

In addition, there are two separate functions to delete and reset the keepalive timer, tcp_delete_keepalive_timer and tcp_reset_keepalive_timer.

extern void tcp_delete_keepalive_timer
(struct sock *);
extern void tcp_reset_keepalive_timer
(struct  sock *, unsigned long);

Tcp_set_keepalive turns the keepalive timer on or off according to the value val, which reflects the setting of the SO_KEEPALIVE socket option.

void tcp_set_keepalive(struct sock *sk, int val)

TCP Write Timer

The TCP write timer serves two send-side timer purposes. The first purpose is the retransmission timer, which is set to the maximum time to wait for an acknowledgment after sending data. The other purpose is the window probe timer. Window probes are periodically sent from the send side of the connection once a zero window size is received from the peer. The probes are sent periodically to see if the window size has been increased, and the window probe timer is set to the maximum time to wait for a response to the window probe. The retransmission and zero window conditions do not occur simultaneously; therefore, both functions can be implemented with the same timer. The retransmission timer is set after sending any segment containing data, but the window probe timer is set after receiving an acknowledgment from the receiver with a window size of zero. Stevens devotes considerable time to the explanations of these two timers in two chapters. “TCP Timeout and Retransmission” discusses the retransmission timer in detail, and “TCP Persist Timer” discusses the window probe timer [STEV94].

When the timer expires, the function tcp_write_timer defined in file linux/net/ipv4/tcp_timer.c is executed with the argument, data, which points to the socket containing the newly expired timer.

static void tcp_write_timer(unsigned long data)

Sk is set to the sock structure pointed to by data, and tp is set to the TCP options structure in the sock structure.

struct sock *sk = (struct sock*)data;
struct tcp_opt *tp = tcp_sk(sk);
int event;

For safety, the socket is locked while the timer function runs. If the socket is currently owned by a user process, we want to try again later, so we set the timer to expire in 1/20 of a second and return.

if (sock_owned_by_user(sk)) {
goto out_unlock;

The socket state, sk->state, and the pending timer event, tp->pending, are checked. If the socket is closed or no timer event is pending, there is nothing to do and we exit.

if (sk->state == TCP_CLOSE ||!tp->pending)
goto out;

If the timeout value is still in the future, the retransmit timer is reset to the value in the timeout field in the options structure. A zero return from mod_timer indicates that the timer was not pending before the call. In that case, the newly armed timer holds a reference to the socket, so the socket reference count is incremented before the function returns.

if (time_after(tp->timeout, jiffies)) {
if (!mod_timer(&tp->retransmit_timer, tp->timeout))
goto out;

As discussed earlier, the write timer serves two purposes. It is used as the retransmit timer, TCP_TIME_RETRANS, or the window probe timer, TCP_TIME_PROBE0. We determine which role we are performing by looking at the pending field in the TCP options structure. If the current timeout is a retransmission timer expiring, we call the function tcp_retransmit_timer, but if the current timeout is a window probe timer expiring, we call tcp_probe_timer. These timer functions are discussed in the next two sections.

event = tp->pending;
tp->pending = 0;
switch (event) {
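
The way one timer serves two roles can be sketched as a small dispatcher. Everything here is hypothetical and only mirrors the shape of the logic described above: the struct write_timer stands in for the relevant part of the options structure, the event values are chosen for the example, and the two handlers merely count how often each role was invoked.

```c
#include <assert.h>
#include <stddef.h>

/* Event codes for the two roles of the write timer (values assumed). */
enum { EVENT_NONE = 0, EVENT_RETRANS = 1, EVENT_PROBE0 = 3 };

/* Hypothetical stand-in for the timer state kept per connection. */
struct write_timer {
    int pending;            /* which role the running timer has */
    int retransmit_calls;   /* times the retransmit handler ran */
    int probe_calls;        /* times the window probe handler ran */
};

/* Expiry handler: read and clear the pending event, then dispatch. */
static void on_write_timer(struct write_timer *t)
{
    int event;

    assert(t != NULL);
    event = t->pending;
    t->pending = EVENT_NONE;   /* cleared before the handler runs */

    switch (event) {
    case EVENT_RETRANS:
        t->retransmit_calls++;   /* retransmit handler would run here */
        break;
    case EVENT_PROBE0:
        t->probe_calls++;        /* window probe handler would run here */
        break;
    default:
        break;                   /* nothing pending: ignore the expiry */
    }
}
```

Clearing the pending field before dispatching means that a handler which rearms the timer can set the field again without it being overwritten afterward.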

TCP Retransmit Timer

The tcp_retransmit_timer function is called when the retransmit timer expires, indicating that an expected acknowledgment was not received. The function is invoked from the general TCP write timer function, tcp_write_timer, which was discussed in the previous section.

static void tcp_retransmit_timer(struct sock*sk)
struct tcp_opt *tp = tcp_sk(sk);
if (tp->packets_out == 0)
goto out;

Next, we check to see if the connection should be timed out or if the peer has reduced the window size to zero with the socket still in the ESTABLISHED state. If a zero window has been received and the timestamp in the received packet indicates that the received packet is older than the maximum retransmit time, the next time the write timer is called it will become a zero window probe timer. The Congestion Avoidance (CA) processing state or loss state is entered by calling tcp_enter_loss.

if (tp->snd_wnd == 0 &&  !sk->dead &&
#ifdef TCP_DEBUG
if (net_ratelimit()) {
struct inet_opt *inet = inet_sk(sk);
printk(KERN_DEBUG "TCP: Treason uncloaked! "
"Peer %u.%u.%u.%u:%u/%u shrinks window %u:%u. Repaired.\n",
inet->num, tp->snd_una, tp->snd_nxt);

If the received timestamp has aged more than two minutes, we indicate a write error, time out the socket, and drop the socket and the connection.

if (tcp_time_stamp - tp->rcv_tstamp >TCP_RTO_MAX) {
goto out;

The second parameter of tcp_enter_loss is set to zero, indicating that the loss state is being entered from a retransmit timeout. Next, we try to retransmit the skb at the head of the write queue by calling tcp_retransmit_skb. At this point, the connection is in a dubious state if the peer hasn’t disappeared altogether, so __sk_dst_reset is called to reset the destination cache.

tcp_enter_loss(sk, 0);
goto out_reset_timer;

Next, we call tcp_write_timeout to see if there has been a sufficient number of retries and to complete the processing for the last retry attempt.

if (tcp_write_timeout(sk))
goto out;

In the following section of code, we increment the TCP statistics, which can be read through the /proc filesystem. We check the congestion avoidance state field, ca_state, in the TCP options structure to see what sort of failure triggered the retransmit timeout.

if (tp->retransmits == 0) {
if (tp->ca_state == TCP_CA_Disorder ||
tp->ca_state == TCP_CA_Recovery) {
if (tp->sack_ok) {
if (tp->ca_state == TCP_CA_Recovery)
} else {
if (tp->ca_state == TCP_CA_Recovery)
} else if (tp->ca_state == TCP_CA_Loss){
} else {

Now, the loss state is entered, congestion avoidance processing is initiated, and a retransmit is attempted.

if (tcp_use_frto(sk)) {
} else {
tcp_enter_loss(sk, 0);
if(tcp_retransmit_skb(sk,skb_peek(&sk->write_queue))> 0){

If tcp_retransmit_skb returned a value greater than zero, it is because the low-level IP transmit function failed due to a local transmission problem such as a busy device driver. The retransmit timer must be reset to try again later.

if (!tp->retransmits)
tcp_reset_xmit_timer(sk, TCP_TIME_RETRANS,min(tp->rto,
goto out;

The retransmit timeout, rto, in the TCP options structure is increased each time a retransmit occurs; it is actually doubled each time. The maximum value of the retransmit timeout is TCP_RTO_MAX, or 120 seconds, which is also the maximum value for RTT. The doubling of the retransmit timer is suggested by Van Jacobson in his paper on congestion avoidance [JACOB88]. The Round-Trip Time estimate (RTT) is not changed by the timeout. If the number of retransmits is greater than TCP_RETR1, which is 3, the destination cache is aged out.

tp->rto = min(tp->rto << 1,TCP_RTO_MAX);
tcp_reset_xmit_timer(sk, TCP_TIME_RETRANS,tp->rto);
if (tp->retransmits >  sysctl_tcp_retries1)
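
The doubling and capping of the retransmit timeout can be checked with a few lines of standalone code. This sketch works in milliseconds rather than in kernel timer ticks; the names RTO_MAX_MS and rto_backoff are hypothetical, and the cap of 120 seconds is the value given above for TCP_RTO_MAX.

```c
#include <assert.h>
#include <stdint.h>

#define RTO_MAX_MS 120000u   /* 120 seconds, the cap described above */

/* Double the retransmit timeout after an expiry, never exceeding the cap. */
static uint32_t rto_backoff(uint32_t rto_ms)
{
    if (rto_ms >= RTO_MAX_MS / 2)
        return RTO_MAX_MS;   /* doubling would reach or pass the cap */
    return rto_ms * 2;
}
```

Starting from a timeout of one second, six successive expirations raise it to 64 seconds, and the seventh pins it at the 120-second maximum, where it stays until a new RTT measurement recomputes it.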

Window Probe Timer

The window probe timer function, tcp_probe_timer, also in file linux/net/ipv4/tcp_timer.c, is called from the generic TCP write-side timer function, tcp_write_timer. As shown in the TCP Write Timer section, when the pending event is a window probe timeout, tcp_probe_timer is called. The zero window timer is set when this side of the connection sends out a zero window probe in response to a zero window advertisement from the peer. We arrive at this function because the timer expired before a response was received to the zero window probe.

Tp is set to point to the TCP options structure for the sock structure sk. First, tcp_probe_timer sees if there is a valid zero window event, which can occur only if the last packet sent was a zero window probe. The packets_out and the send_head fields in the TCP options structure are checked. If there is outstanding unacknowledged data, or if there is no segment waiting to be sent, we return because this can’t be a valid zero window probe event.

static void tcp_probe_timer(struct sock *sk)
struct tcp_opt *tp = tcp_sk(sk);
int max_probes;
if (tp->packets_out||!tp->send_head)  {
tp->probes_out = 0;

We set the maximum number of probes out, max_probes, to the value TCP_RETR2, which is defined as the number 15 in the file linux/include/net/tcp.h. We do a check to see if the socket is orphaned. Linux does not kill off the connection merely because a zero window size has been advertised for a long time, so we want to see if we need to keep the connection alive. Therefore, we check to see if the next retransmission timeout value, rto, will still be less than the maximum RTO, TCP_RTO_MAX. Max_probes is recalculated for the orphaned socket. A call to tcp_out_of_resources checks to make sure that the orphaned socket does not consume too much memory, as it should not be kept alive forever. If the socket is to be killed off, tcp_probe_timer returns here.

max_probes = sysctl_tcp_retries2;
if (sk->dead) {
int alive =((tp->rto<<tp->backoff) < TCP_RTO_MAX);
max_probes = tcp_orphan_retries(sk, alive);
if (tcp_out_of_resources
(sk, alive||tp->probes_out<= max_probes))

Next, we check the number of outstanding zero window probes to see if it has exceeded the maximum number of probes in max_probes, which was calculated previously. If so, an error condition is indicated and the function returns. Another window probe is sent only if all the previous checks have passed.

if (tp->probes_out > max_probes) {
} else {

Delayed Acknowledgment Timer

The purpose of delayed acknowledgment is to minimize the number of separate ACKs that are sent. The receiver does not send an ACK as soon as it can. Instead, it holds on to the ACK in the hope of piggybacking the ACK on an outgoing data packet. The delayed acknowledgment timer is set to the amount of time to hold the ACK waiting for outgoing data to be ready. The function tcp_delack_timer, defined in file linux/net/ipv4/tcp_timer.c, is called when the delayed acknowledgment timer expires, indicating that we have now given up finding an outgoing packet to carry our ACK. The timer value is maintained between a minimum value, TCP_DELACK_MIN, defined as 1/25 second, and a maximum value, TCP_DELACK_MAX, defined as 1/5 second. As in the other TCP timer functions, tcp_delack_timer is called with a pointer to the sock structure for the current open socket.

static void tcp_delack_timer(unsigned long  data)
struct sock *sk = (struct sock*)data;
struct tcp_opt *tp = tcp_sk(sk);

If the socket is locked, the timer is set ahead and an attempt is made later.

if (sock_owned_by_user(sk)) {
    tp->ack.blocked = 1;
    goto out_unlock;
}

Tcp_mem_reclaim accounts for reclaiming memory from any TCP pages allocated in queues. If the socket is in the TCP_CLOSE state and there is no pending acknowledge event, we exit the timer without sending an ACK. Next, we check to see if we got here somehow even though the timer has not yet expired, in which case we exit.

if (sk->state == TCP_CLOSE || !(tp->ack.pending & TCP_ACK_TIMER))
    goto out;
if (time_after(tp->ack.timeout, jiffies)) {
    /* Timer fired early: re-arm it for the real expiry time. */
    if (!mod_timer(&tp->delack_timer, tp->ack.timeout))
        sock_hold(sk);
    goto out;
}
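The "not yet expired" test relies on a comparison that stays correct when the tick counter wraps around. A minimal user-space sketch of that idea follows. The helper name ticks_after is hypothetical, and the sketch assumes the two values are less than half the counter range apart and that converting an out-of-range unsigned value to long wraps, as it does on common two's complement platforms.

```c
#include <assert.h>
#include <limits.h>

/* Returns nonzero if tick count a is later than tick count b.
 * The unsigned subtraction wraps, and the signed view of the
 * difference gives the direction even across counter wraparound. */
static int ticks_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* Ticks left until timeout; zero or negative once it has passed. */
static long ticks_remaining(unsigned long timeout, unsigned long now)
{
    return (long)(timeout - now);
}
```

A timeout of 5 set shortly before the counter wrapped, at ULONG_MAX - 5, is correctly seen as lying in the future, whereas a plain `timeout > now` comparison would report it as already expired.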

Since the delayed ACK timer has fired, the pending timer event can be removed from the acknowledgment structure to indicate we are done processing the event.

tp->ack.pending &= ~TCP_ACK_TIMER; 

The field prequeue points to incoming packets that have not been processed yet. Since this timer went off before we could acknowledge these packets, each one is taken off the prequeue and passed to the socket's backlog receive function, backlog_rcv, and failure statistics are incremented for each of these packets.

if (skb_queue_len(&tp->ucopy.prequeue)) {
    struct sk_buff *skb;

    while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
        sk->backlog_rcv(sk, skb);
    tp->ucopy.memory = 0;
}

Here, we check to see if there is a scheduled ACK.

if (tcp_ack_scheduled(tp)) {

If we have a scheduled ACK, it means that the timer expired before the delayed ACK could be sent out. We check the current acknowledgment mode in the pingpong field of the ack structure. If it is zero, we must be in quick acknowledgment mode, so we inflate the value of the acknowledgment timeout (ATO) in the ato field by doubling it, up to a ceiling of the current RTO. This increases the amount of time until the next ACK timeout expires. However, if we are in delayed acknowledgment mode, we decrease the ATO to the minimum amount, TCP_ATO_MIN. In addition, we switch to quick acknowledgment mode by turning off delayed acknowledgment in the pingpong field. This will force the next ACK to go out as soon as TCP_ATO_MIN time elapses without waiting for an outgoing data segment to carry the ACK.

if (!tp->ack.pingpong) {
    tp->ack.ato = min(tp->ack.ato << 1, tp->rto);
} else {
    tp->ack.pingpong = 0;
    tp->ack.ato = TCP_ATO_MIN;
}
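The ATO adjustment can be tried out in isolation. The sketch below is a user-space model under stated assumptions: HZ is taken to be 100, so TCP_ATO_MIN (HZ/25) is 4 ticks, and the names ack_state and delack_missed are hypothetical stand-ins for the kernel's ack structure and the branch shown above.

```c
#include <assert.h>
#include <stddef.h>

#define HZ          100        /* assumed tick rate for this sketch */
#define TCP_ATO_MIN (HZ / 25)  /* minimum ACK timeout, in ticks */

/* Hypothetical stand-in for the ack fields used by the timer. */
struct ack_state {
    int           pingpong;  /* nonzero: delayed acknowledgment mode */
    unsigned long ato;       /* acknowledgment timeout, in ticks */
};

/* Apply the adjustment made when a delayed ACK was missed. */
static void delack_missed(struct ack_state *ack, unsigned long rto)
{
    assert(ack != NULL);

    if (!ack->pingpong) {
        /* Quick mode: double the ATO, but never beyond the RTO. */
        unsigned long doubled = ack->ato << 1;

        ack->ato = doubled < rto ? doubled : rto;
    } else {
        /* Delayed mode: leave it and fall back to the minimum ATO. */
        ack->pingpong = 0;
        ack->ato = TCP_ATO_MIN;
    }
}
```

Starting from an ATO of 4 ticks with an RTO of 20 ticks, repeated misses in quick mode give 8, 16, and then 20, where the RTO ceiling takes over.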

Finally, we send the ACK and increase the delayed ACK counter. Next, we clean up and exit the timer.

if (tcp_memory_pressure)
    tcp_mem_reclaim(sk);

Keepalive Timer

This timer function is actually used for two separate purposes. In addition to providing the keepalive timeout function, it is also used as a SYN acknowledge timer by a socket in a listen state. The keepalive timer function can serve these two separate purposes because keepalives are only sent for a connection that has already been established, but the SYN acknowledge timer is only active for connections in the LISTEN state.

TCP normally does not perform any keepalive function; keepalive polling is not part of the TCP specification. The keepalive is added outside the TCP specification for the use of some TCP application layer servers for protocols that don’t do any connection polling themselves. For example, the telnet daemon sets the keepalive mode. Keepalive is enabled via the socket option SO_KEEPALIVE, which is recorded in the keepopen field of the sock structure. The idle time before the first probe can be set by using the system control value, sysctl_tcp_keepalive_time; its default is the constant TCP_KEEPALIVE_TIME, defined in linux/include/net/tcp.h as two hours. The interval between successive probes defaults to the constant TCP_KEEPALIVE_INTVL, defined in the same file as 75 seconds. The keepalive timer function, tcp_keepalive_timer in file linux/net/ipv4/tcp_timer.c, is called when the timeout value expires. As with the other TCP timer functions discussed in this section, a pointer to the sock structure is passed in via the parameter data. Tp is set to point to the tcp_opt structure in the sock.
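From the application side, keepalive is switched on with setsockopt. The following is a minimal sketch for Linux, where the TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT options allow per-socket overrides of the system-wide values. The helper name enable_keepalive and the values chosen in the usage note are assumptions for illustration only.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Enable keepalive on a TCP socket and override the timing values.
 * idle_secs:  idle time before the first probe
 * intvl_secs: time between successive probes
 * probes:     unanswered probes before the connection is dropped
 * Returns 0 on success, -1 if any setsockopt call fails. */
static int enable_keepalive(int fd, int idle_secs, int intvl_secs, int probes)
{
    int on = 1;

    assert(fd >= 0);

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_secs, sizeof(idle_secs)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                   &intvl_secs, sizeof(intvl_secs)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                   &probes, sizeof(probes)) < 0)
        return -1;
    return 0;
}
```

A server that wants dead peers detected within about two minutes might call enable_keepalive(fd, 60, 10, 5) on each accepted socket. A socket that only sets SO_KEEPALIVE uses the system-wide values.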

static void tcp_keepalive_timer(unsigned long data)
{
    struct sock *sk = (struct sock *)data;
    struct tcp_opt *tp = tcp_sk(sk);
    __u32 elapsed;

If the socket is currently in use, there is no need for keepalives, so we reset the timer value to 1/20 second (HZ/20) and leave.

if (sk->lock.users) {
    tcp_reset_keepalive_timer(sk, HZ/20);
    goto out;
}

Next, we must determine the state of the connection to see if the socket is in the LISTEN, FIN_WAIT2, or TCP_CLOSE connection state. This timer function is also used to maintain the SYNs and SYN_ACKs when a socket is in the LISTEN state. If the connection is in the LISTEN state, tcp_synack_timer is called to process SYN acknowledgments or the lack of them as the case may be.

if (sk->state == TCP_LISTEN) {
    tcp_synack_timer(sk);
    goto out;
}

If the socket is in the FIN_WAIT2 state and has been orphaned (its dead flag is set), we want to do an abortive release of the connection when the timer expires instead of letting the connection terminate in an orderly fashion. A value of zero or more in the linger2 field of the TCP options structure tells us that the connection may be kept in FIN_WAIT2 in an "undead" state before it is terminated. In that case, we calculate the time still remaining, tcp_fin_time minus TCP_TIMEWAIT_LEN, and if it is greater than zero, the socket is handed to tcp_time_wait for the remainder. If the linger2 field is less than zero, or no time remains, the connection is aborted by calling tcp_send_active_reset, which sends the peer a RST segment.

if(sk->state == TCP_FIN_WAIT2&&sock_flag
    if (tp->linger2 >= 0) {
        int tmo = tcp_fin_time(tp) - TCP_TIMEWAIT_LEN;

        if (tmo > 0) {
            tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
            goto out;
        }
    }
    tcp_send_active_reset(sk, GFP_ATOMIC);
    goto death;
}

Next, we check whether the keepalive mode is not set (keepopen field in the sock structure) or the connection is in the CLOSED state. In either case, we leave. Otherwise, we get the value of the keepalive timer.

if (!sock_flag(sk, SOCK_KEEPOPEN) || sk->state == TCP_CLOSE)
    goto out;
elapsed = keepalive_time_when(tp);

We check to see if the connection is actually alive and sending data. If it is, we just reset the keepalive timer without sending out the keepalive probe.

if (tp->packets_out || tp->send_head)
goto resched;

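We are ready to send the keepalive as long as the time elapsed since the last ACK was received from the peer (rcv_tstamp) is at least the keepalive time. If the number of unanswered probes in the probes_out field has already reached the limit, which is the per-socket keepalive_probes value or sysctl_tcp_keepalive_probes when none is set, the connection is reset instead. Otherwise, the probe is sent by calling tcp_write_wakeup, which will wake up the socket and send the queued keepalive probe.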

elapsed = tcp_time_stamp - tp->rcv_tstamp;
if (elapsed >= keepalive_time_when(tp)) {
    if ((!tp->keepalive_probes &&
         tp->probes_out >= sysctl_tcp_keepalive_probes) ||
        (tp->keepalive_probes &&
         tp->probes_out >= tp->keepalive_probes)) {
        tcp_send_active_reset(sk, GFP_ATOMIC);
        goto out;
    }

If tcp_write_wakeup sent the probe out successfully, the probes_out counter is incremented and the keepalive timer value is set to the configured probe interval, keepalive_intvl_when.

    if (tcp_write_wakeup(sk) <= 0) {
        tp->probes_out++;
        elapsed = keepalive_intvl_when(tp);
    } else {

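If the probe was not sent, the keepalive timer value is reset to 1/2 second. If the keepalive time has not yet elapsed, the timer is instead set to the time remaining.

        elapsed = TCP_RESOURCE_PROBE_INTERVAL;
    }
} else {
    elapsed = keepalive_time_when(tp) - elapsed;
}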

This is where the timer is reset to get ready for sending the next keepalive probe.

tcp_reset_keepalive_timer (sk, elapsed);
goto out;
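The way the keepalive timer picks its next action and timeout can be summarized in a small user-space model. This is a sketch under stated assumptions rather than the kernel code: the names ka_config, ka_action, and keepalive_decide are hypothetical, times are in seconds instead of ticks, and both the check for a connection with data in flight and the case of a probe that could not be sent are left out.

```c
#include <assert.h>
#include <stddef.h>

enum ka_action { KA_RESCHEDULE, KA_PROBE, KA_RESET };

/* Hypothetical configuration; the usage note uses 7200 s, 75 s, 9 probes. */
struct ka_config {
    unsigned long time;    /* idle time before the first probe */
    unsigned long intvl;   /* time between successive probes */
    unsigned      probes;  /* unanswered probes allowed before a reset */
};

/* Decide what to do when the keepalive timer fires.
 * idle:       time since the peer was last heard from
 * probes_out: probes already sent without an answer
 * next:       receives the next timer value (0 when resetting) */
static enum ka_action keepalive_decide(const struct ka_config *cfg,
                                       unsigned long idle,
                                       unsigned probes_out,
                                       unsigned long *next)
{
    assert(cfg != NULL && next != NULL);

    if (idle < cfg->time) {
        /* Not idle long enough: wait out the remaining time. */
        *next = cfg->time - idle;
        return KA_RESCHEDULE;
    }
    if (probes_out >= cfg->probes) {
        /* Too many unanswered probes: abort the connection. */
        *next = 0;
        return KA_RESET;
    }
    /* Send a probe and check again after one probe interval. */
    *next = cfg->intvl;
    return KA_PROBE;
}
```

With a configuration of 7200 seconds idle time, a 75-second interval, and 9 probes, a connection that has been idle for 7000 seconds is rescheduled for 200 seconds later, and one that has been idle for the full 7200 seconds is probed every 75 seconds until 9 probes go unanswered.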