Receiving Data in the Transport Layer, UDP, and TCP

In this chapter, we show what happens when a packet arrives in the transport layer. We cover the UDP protocol, which processes packets for SOCK_DGRAM type sockets, and the TCP protocol, which processes packets for SOCK_STREAM type sockets. As we saw in "Sending the Data from the Socket through UDP and TCP," UDP is far simpler than TCP because it doesn't maintain any state information, so the majority of this chapter is devoted to TCP. However, we discuss the packet handling for both protocols. Since TCP is far more complicated than UDP, the discussion includes handling of the queues and processing of input TCP segments in the various states. In general, we continue the discussion of earlier chapters by showing how each transport layer protocol works, following the data as it flows up from the network layer, IP, through the transport layer, and into the receiving socket. The flow of input packets is straightforward for UDP and can be discussed separately from the flow of output packets, because UDP is limited to individual packet processing and doesn't have to maintain as much internal state information. With TCP, however, the received packet flow depends on the internal state of the protocol, so it isn't possible to completely separate the discussion of the receive side from the send side.

As is the case with the send side, all the user-level read functions are converged into a single function at the socket layer. In UDP, the function doing the work is udp_recvmsg, and for TCP, it is tcp_recvmsg. The bulk of this chapter covers these two functions and other associated processing. UDP is the simpler case because packet reception in UDP consists of little more than checksum calculation and copying of the packet data from kernel space to user space; we cover this UDP-specific receive packet handling first. TCP is far more complex: the receive side must coordinate with the send-side processing, handle the TCP states, and keep track of acknowledgments. We then cover the processing of received packets in TCP.

Receive-Side Packet Handling

The transport layer protocols (TCP, UDP, and the other member protocols in the AF_INET protocol family) all receive packets from the network layer, IP. As we saw in "The Network Layer, IP," when IP is done processing an input packet, it dispatches the packet to a higher-layer protocol's receive function by decoding the 1-byte protocol field in the IP header. It uses a hash table to determine whether TCP, UDP, or some other protocol should receive the packet.

In "The Linux TCP/IP Stack," we discussed the hash tables, the inet_protocol structure, and how incoming IP packets are de-multiplexed, and we discussed the flow of incoming packets through the IP input routine. In this chapter, we will start following the packets as soon as they arrive in each of the two transport layer protocols' handler functions.

Receiving a Packet in UDP

We saw earlier how protocols register with the AF_INET family by calling the inet_add_protocol function in linux/net/ipv4/protocol.c. This initialization step happens when the AF_INET initialization function, inet_init, in the file linux/net/ipv4/af_inet.c, runs at kernel startup time. In the inet_protocol structure for UDP, the value in the protocol field is IPPROTO_UDP, and the function defined in the handler field is udp_rcv. This is the function that gets called for all incoming UDP packets passed up to us by IP.

UDP Receive Handler Function, udp_rcv

Udp_rcv, in file linux/net/ipv4/udp.c, is the first function in UDP that sees a UDP input packet after IP is done with it. This function is executed when the protocol field in the IP header is IPPROTO_UDP, the value 17.

int udp_rcv(struct sk_buff *skb)
{

The variable, sk, will point to the socket that gets this packet if there is a socket open on this port. Uh points to the UDP header. Rt gets the routing table entry from the destination cache.

struct sock *sk;
struct udphdr *uh;
unsigned short ulen;
struct rtable *rt = (struct rtable *)skb->dst;
u32 saddr = skb->nh.iph->saddr;
u32 daddr = skb->nh.iph->daddr;
int len = skb->len;
IP_INC_STATS_BH(IpInDelivers);

The function pskb_may_pull is called to confirm that there is sufficient space in the socket buffer to hold the UDP header. Uh points to the UDP header, and ulen is the value of the length field in the UDP header.

if (!pskb_may_pull(skb, sizeof(struct udphdr)))
goto no_header;
uh = skb->h.uh;
ulen = ntohs(uh->len);

We check to ensure that the skb is sufficiently long to contain a complete UDP header. Pskb_trim checks the packet length in the process of setting tail to point to the end of the UDP header.

if (ulen > len || ulen < sizeof(*uh))
goto short_packet;
if (pskb_trim(skb, ulen))
goto short_packet;

The UDP checksum is begun. The ip_summed field in skb is set depending on whether it is necessary to calculate the checksum.

if (udp_checksum_init(skb, uh, ulen, saddr, daddr) < 0)
goto csum_error;

The routing table entry for this incoming packet is checked to see if it was sent to a broadcast or multicast address. If so, we call udp_v4_mcast_deliver to complete the processing.

if(rt->rt_flags & (RTCF_BROADCAST|RTCF_MULTICAST))
return udp_v4_mcast_deliver(skb, uh, saddr, daddr);

Udp_v4_lookup determines if there is an open socket on the UDP port in this packet’s header by searching the UDP hash table. If there is an open socket, the packet is passed on to the socket’s receive queue and we are done.

sk = udp_v4_lookup(saddr, uh->source, daddr, uh->dest, skb->dev->ifindex);
if (sk != NULL) {
int ret = udp_queue_rcv_skb(sk, skb);
sock_put(sk);

If the return value is greater than zero, we must tell the caller to resubmit the input packet.

if (ret > 0)
return -ret;
return 0;
}
if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN,  skb))
goto drop;

If there is no socket for this packet, the checksum calculation is completed. The packet is dropped silently if there is a checksum error. If it is a valid packet but has been sent to a port for which there is no open socket, it is processed as a port-unreachable packet. The UdpNoPorts count is incremented, an ICMP port-unreachable message is sent to the sending machine, and the packet is freed. We are done.

if (udp_checksum_complete(skb))
goto csum_error;
UDP_INC_STATS_BH(UdpNoPorts);
icmp_send(skb, ICMP_DEST_UNREACH,ICMP_PORT_UNREACH, 0);
kfree_skb(skb);
return(0);

At this point, all that is left is to process the bad packet errors detected earlier.

short_packet:
NETDEBUG(if (net_ratelimit())
printk(KERN_DEBUG "UDP: short packet: %u.%u.%u.%u:%u %d/%d to %u.%u.%u.%u:%u\n",
NIPQUAD(saddr),
ntohs(uh->source),
ulen, len,
NIPQUAD(daddr),
ntohs(uh->dest)));
no_header:
UDP_INC_STATS_BH(UdpInErrors);
kfree_skb(skb);
return(0);
csum_error:

Even though the packet is discarded silently, we still increment the UDP error statistics.

NETDEBUG(if (net_ratelimit())
printk(KERN_DEBUG "UDP: bad checksum. From %d.%d.%d.%d:%d to %d.%d.%d.%d:%d ulen %d\n",
NIPQUAD(saddr),
ntohs(uh->source),
NIPQUAD(daddr),
ntohs(uh->dest), ulen));
drop:
UDP_INC_STATS_BH(UdpInErrors);
kfree_skb(skb);
return(0);
}

Receiving Multicast and Broadcast Packets in UDP

Multicast and broadcast packets are sent to multiple destinations. In fact, there may be multiple destinations on the same machine. When UDP receives a multicast or broadcast packet, it checks to see if there are multiple open sockets that should receive the packet. When the routing table entry for the incoming packet has the multicast or broadcast flags set, the UDP receive function, udp_rcv, calls the UDP multicast receive function, udp_v4_mcast_deliver, in file linux/net/ipv4/udp.c, to distribute the incoming packet to all valid listening sockets.

static int udp_v4_mcast_deliver
(struct sk_buff *skb, struct udphdr *uh,
u32 saddr, u32 daddr)
{
struct sock *sk;
int dif;

First, we lock the UDP hash table. Then, we select the first socket in the hash table that is open on a port matching the destination port in the header of the incoming packet.

read_lock(&udp_hash_lock);
sk = sk_head(&udp_hash[ntohs(uh->dest)& (UDP_HTABLE_SIZE - 1)]);
dif = skb->dev->ifindex;
sk = udp_v4_mcast_next
(sk, uh->dest, daddr, uh->source, saddr, dif);
if (sk) {
struct sock *sknext = NULL;

We loop through the hash table checking each potential matching entry. When an appropriate listening socket is found, the socket buffer, skb, is cloned and the new buffer is put on the receive queue of the listening socket by calling udp_queue_rcv_skb, which is the UDP backlog receive function. If there is no match, the packet is silently discarded. There is no statistics counter to increment for discarded multicast and broadcast received packets.

do {
struct sk_buff *skb1 = skb;
sknext = udp_v4_mcast_next(sk->next, uh->dest, daddr,
uh->source, saddr, dif);
if(sknext)
skb1 = skb_clone(skb, GFP_ATOMIC);
if(skb1) {
int ret = udp_queue_rcv_skb(sk, skb1);
if (ret > 0)

The comments in the source say that we should be reprocessing packets instead of dropping them here. However, as of this kernel revision, we are not.

kfree_skb(skb1);
}
sk = sknext;
} while(sknext);
} else
kfree_skb(skb);
read_unlock(&udp_hash_lock);
return 0;
}

UDP Hash Table

As we know from earlier chapters, an application opens a socket of type SOCK_DGRAM to receive UDP packets. This type of socket may listen on a variety of address and port combinations. For example, it may want to receive all packets sent from any address but with a particular destination port, or it may want to receive packets sent to multicast or broadcast addresses. A packet arriving in the udp_rcv function may be passed on to multiple listening sockets. It is not sufficient for UDP to determine which socket or sockets should receive the packet solely by matching the destination port. To determine which socket should get an incoming packet, Linux uses a UDP hash table, udp_hash, defined in the file udp.c.

struct sock *udp_hash[UDP_HTABLE_SIZE];

The hash table contains 128 slots, and each location in the hash table points to a list of sock structures. The hash value used for the table index is calculated from the lowest 7 bits of the UDP port number.

In linux/net/ipv4/udp.c, Udp_v4_lookup is a utility function that looks up a socket in the hash table based on matching source and destination ports, source and destination addresses, and network interface. It locks the hash table and calls udp_v4_lookup_longway to do the real work.

__inline__ struct sock *udp_v4_lookup
(u32 saddr, u16 sport, u32 daddr, u16 dport, int dif)
{
struct sock *sk;
read_lock(&udp_hash_lock);
sk = udp_v4_lookup_longway(saddr, sport, daddr, dport, dif);
if (sk)
sock_hold(sk);
read_unlock(&udp_hash_lock);
return sk;
}

The next function, udp_v4_lookup_longway, is called from udp_v4_lookup. It is defined in the file linux/net/ipv4/udp.c. It tries to find the socket as best it can. Minimally, it matches the destination port number. Then, it tries to refine the choice further by matching the source address, destination address, and the incoming network interface.

struct sock *udp_v4_lookup_longway
(u32 saddr, u16 sport, u32 daddr, u16
dport, int dif)
{
struct sock *sk, *result = NULL;
struct hlist_node *node;
unsigned short hnum = ntohs(dport);
int badness = -1;
sk_for_each(sk, node, &udp_hash[hnum &(UDP_HTABLE_SIZE - 1)]) {

Our idea here is to find the best match of the incoming packet with an open socket. First, we check to make sure that it is not an IPv6-only socket. As discussed in "Linux Sockets," IPv6 sockets may generally be used for IPv4. In this section of code, we score each candidate socket, and the best-matched socket is the one that gets the packet.

struct inet_opt *inet = inet_sk(sk);
if (inet->num == hnum &&  !ipv6_only_sock(sk)) {

First, we check that the address family matches.

int score = (sk->sk_family == PF_INET ? 1  : 0);
if (inet->rcv_saddr) {
if (inet->rcv_saddr != daddr)
continue;
score+=2;
}

Now we check the socket's peer address: if the socket is connected (daddr is set), the packet's source address must match the address the socket is connected to.

if (inet->daddr) {
if (inet->daddr != saddr)
continue;
score+=2;
}

We check the peer port in the same way: if the socket is connected to a particular remote port (dport is set), the packet's source port must match it.

if (inet->dport) {
if (inet->dport != sport)
continue;
score+=2;
}

Finally, we check to see if the input network interface matches (but only if it is bound).

if (sk->sk_bound_dev_if) {
if (sk->sk_bound_dev_if != dif)
continue;
score+=2;
}
if(score == 9) {
result = sk;
break;
} else if(score > badness) {
result = sk;
badness = score;
}
}
}
return result;
}

UDP Backlog Receive

The function, udp_queue_rcv_skb, in the file linux/net/ipv4/udp.c is the backlog receive function for UDP, called from the "bottom half" when the socket is held by the user and can’t accept any more data. This function is initialized at compile time into the backlog_rcv field of the UDP proto structure, udp_prot. As we saw, udp_queue_rcv_skb is called by udp_rcv to complete input packet processing, and its purpose is mainly to place incoming packets on the socket’s receive queue and increment the UDP input statistics.

static int udp_queue_rcv_skb(struct sock * sk, struct sk_buff *skb)
{
struct udp_opt *up = udp_sk(sk);
if (!xfrm4_policy_check(sk, XFRM_POLICY_IN,  skb)) {
kfree_skb(skb);
return -1;
}

First, we check to see if this is an encapsulated socket for IPSec.

if (up->encap_type) {

If we have an encapsulated socket, the incoming packet must be encapsulated. If it is, we transform the input packet. If not, we assume it is an ordinary UDP packet and we fall through.

int ret;
ret = udp_encap_rcv(sk, skb);
if (ret == 0) {
kfree_skb(skb);
return 0;
}
if (ret < 0) {

Here, we process the ESP packet.

ret = xfrm4_rcv_encap(skb, up->encap_type);
UDP_INC_STATS_BH(UdpInDatagrams);
return -ret;
}

If we fall through, the packet must be an ordinary UDP packet.

}

Now, if a socket filter is installed and the checksum has not already been verified, we finish calculating the checksum.

if (sk->sk_filter &&  skb->ip_summed != CHECKSUM_UNNECESSARY) {
if (__udp_checksum_complete(skb)) {
UDP_INC_STATS_BH(UdpInErrors);
kfree_skb(skb);
return -1;
}
skb->ip_summed = CHECKSUM_UNNECESSARY;
}

Next, we call the socket-level receive function, sock_queue_rcv_skb,  which places the packet on the socket’s receive queue and wakes up the socket so the application can read the data. If there is insufficient room on the receive queue, the error statistics are incremented and the packet is silently discarded.

if (sock_queue_rcv_skb(sk,skb)<0) {

If the socket returned an error, we increment the statistics and get out.

UDP_INC_STATS_BH(UdpInErrors);
IP_INC_STATS_BH(IpInDiscards);
ip_statistics[smp_processor_id()*2].IpInDelivers--;
kfree_skb(skb);
return -1;
}

Once we increment the counter, we are done. At this point, the socket-level processing will allow the user-level read to complete.

UDP_INC_STATS_BH(UdpInDatagrams);
return 0;
}

UDP Socket-Level Receive

We saw how all the socket send calls converge at the socket layer into one function at the transport layer; for UDP, this function is udp_sendmsg. As is the case with the send side, when the application calls any of the read functions on an open socket, the socket layer calls the function pointed to by the recvmsg field in the proto structure. At compile time, the recvmsg field is initialized to udp_recvmsg. Udp_recvmsg, in the file linux/net/ipv4/udp.c, is the receiving function that is executed for all SOCK_DGRAM type sockets.

int udp_recvmsg(struct sock *sk,
 struct msghdr *msg, int len,
int noblock, int flags, int *addr_len)
{
struct inet_opt *inet = inet_sk(sk);
struct sockaddr_in *sin =
 (struct sockaddr_in  *)msg->msg_name;
struct sk_buff *skb;
int copied, err;

Set the application’s address length argument. Check to see if there are any messages in the error queue of the socket, sk, and if so, process them by calling ip_recv_error and going out.

if (addr_len)
*addr_len=sizeof(*sin);
if (flags & MSG_ERRQUEUE)
return ip_recv_error(sk, msg, len);

De-queue packets from the socket sk’s receive queue by calling the generic datagram receive queue function, skb_recv_datagram, and if it returns without a packet, get out.

skb = skb_recv_datagram(sk, flags, noblock,  &err);
if (!skb)
goto out;

Check to see if the user is asking for more data than the payload in the packet, skb.

copied = skb->len - sizeof(struct udphdr);
if (copied > len) {
copied = len;
msg->msg_flags |= MSG_TRUNC;
}

The ip_summed field in the socket buffer determines if checksums are required. If ip_summed is not equal to CHECKSUM_ UNNECESSARY, checksum calculation must be done, and data is copied from kernel space to user space while calculating a checksum.

The copying and checksum calculation are done by skb_copy_and_csum_datagram_iovec. However, if a checksum is not required, skb_copy_datagram_iovec is the function called to do the copying. It is also possible that a partial checksum was calculated because the data spanned more than one buffer, and in this case, __udp_checksum_complete finishes the checksum calculation. The Internet checksum is the value that makes the one's complement sum of the covered data come out to zero, which is why checksums can be computed partially over separate buffers and finished later.

if (skb->ip_summed==CHECKSUM_UNNECESSARY)  {
err = skb_copy_datagram_iovec(skb,sizeof(struct udphdr),
msg->msg_iov, copied);
} else if (msg->msg_flags&MSG_TRUNC) {
if (__udp_checksum_complete(skb))
goto csum_copy_err;
err = skb_copy_datagram_iovec(skb,sizeof(struct udphdr),
msg->msg_iov,
copied);
} else {
err = skb_copy_and_csum_datagram_iovec
(skb,sizeof(struct udphdr),
msg->msg_iov);
if (err == -EINVAL)
goto csum_copy_err;
}
if (err)
goto out_free;

Next, the incoming packet is timestamped. If the user supplied a valid buffer, sin, to receive the packet’s source address and port, the information is copied from the packet header.

sock_recv_timestamp(msg, sk, skb);
if (sin)
{
sin->sin_family = AF_INET;
sin->sin_port = skb->h.uh->source;
sin->sin_addr.s_addr = skb->nh.iph->saddr;
memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
}

Before leaving, udp_recvmsg checks the control message flags field, cmsg_flags, to see if any IP socket options are set. For example, certain socket options such as IP_TOS require parts of the IP header to be copied into user space. If there are any flags set, ip_cmsg_recv retrieves the associated option values.

if (inet->cmsg_flags)
ip_cmsg_recv(msg, skb);
err = copied;

There are three error exits at the end of udp_recvmsg. The most interesting is csum_copy_err.

out_free:
skb_free_datagram(sk, skb);
out:
return err;
csum_copy_err:
UDP_INC_STATS_BH(UdpInErrors);

When the flags argument is set to MSG_PEEK, it means that the caller wants to "peek" at the incoming message without removing it from the receive queue. However, we know the packet has a bad checksum, so we must remove it from the receive queue before deleting it.

if (flags&MSG_PEEK) {
int clear = 0;
spin_lock_irq(&sk->sk_receive_queue.lock);
if (skb == skb_peek(&sk->sk_receive_queue)) {
__skb_unlink(skb, &sk->sk_receive_queue);
clear = 1;
}
spin_unlock_irq(&sk->sk_receive_queue.lock);
if (clear)
kfree_skb(skb);
}
skb_free_datagram(sk, skb);
return -EAGAIN;
}

Receiving Data in TCP

We showed how the IP protocol dispatches the UDP receive function based on the protocol field of the IP header. As with UDP, TCP also initializes an instance of the inet_protocol structure. This structure allows the network layer, IP, to dispatch transport layer handler functions based on the value in the protocol field of the IP header without needing to know anything about the internal structure of each transport layer protocol. It is helpful for a good understanding of TCP to be able to visualize the state management. The figure shows the TCP receive-side state machine.

TCP receive-side state diagram.

Some key flag definitions are used primarily on the receive side of TCP. These defines, found in file linux/net/ipv4/tcp_input.c, govern the state processing during the receive side of the TCP connection. The flags are shown in Table 10.1. They are also used as part of the implementation of slow start and congestion avoidance, selective acknowledgment, and fast acknowledgment, but should not be confused with the flags in the TCP control buffer.

Receive State Flags

TCP Receive Handler Function, tcp_v4_rcv

In this section, we will examine the TCP input segment handling and the registered handler function for the TCP protocol in the AF_INET protocol family. Figure shows the TCP receive packet flow.

TCP receive packet flow.

As is the case with all other member protocols in the AF_INET family, TCP associates a handler function with the protocol field value, IPPROTO_TCP (the value 6), by initializing an instance of the inet_protocol structure. This process is described in Chapter 6. The handler field is set to the function tcp_v4_rcv. Therefore, tcp_v4_rcv, defined in file linux/net/ipv4/tcp_ipv4.c, is called from IPv4 when the protocol type in the IP header contains the protocol number for TCP.

int tcp_v4_rcv(struct sk_buff *skb)
{
struct tcphdr *th;
struct sock *sk;
int ret;
if (skb->pkt_type!=PACKET_HOST)
goto discard_it;

TCP counters are incremented before TCP checksums are validated. The next section of code validates the TCP header, checking that it is complete and that the header length field is at least as big as the TCP header without TCP options. As is the case with UDP, we use the socket buffer utility function pskb_may_pull to ensure that the socket buffer, skb, contains a complete header.

TCP_INC_STATS_BH(TcpInSegs);
if (!pskb_may_pull(skb, sizeof(struct  tcphdr)))
goto discard_it;

Th is set to point to the TCP header in the skb. The doff field is the 4-bit header length.

th = skb->h.th;
if (th->doff < sizeof(struct tcphdr)/4)
goto bad_packet;
if (!pskb_may_pull(skb, th->doff*4))
goto discard_it;

Here, the checksum is initialized. The TCP header and the IP pseudo-header are used to initialize the checksum. The rest of the checksum calculation is put off until later.

if ((skb->ip_summed !=  CHECKSUM_UNNECESSARY &&
tcp_v4_checksum_init(skb) < 0))
goto bad_packet;

A few fields are extracted from the TCP header and updated in the TCP control buffer part of the skb. TCP will want quick access to these values in the TCP packet header to do header prediction, which selects incoming packets for fast path processing. The fields used for header prediction include the TCP sequence number, seq in the control buffer, and the TCP acknowledgment number, ack_seq, in the control buffer. The end_seq is the position just past the last byte in the incoming segment. At this point, it is set to the received sequence number, seq, plus the data length in the segment, plus one if this is a SYN or FIN packet. Two other fields in the control buffer, when and sacked, are set to zero. When is used for RTT calculation, and sacked is for selective acknowledgment. The macro TCP_SKB_CB is used to get the pointer to the TCP control buffer from the socket buffer.

th = skb->h.th;
TCP_SKB_CB(skb)->seq = ntohl(th->seq);
TCP_SKB_CB(skb)->end_seq =
 (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
skb->len - th->doff*4);
TCP_SKB_CB(skb)->ack_seq =  ntohl(th->ack_seq);
TCP_SKB_CB(skb)->when = 0;
TCP_SKB_CB(skb)->flags = skb->nh.iph->tos;
TCP_SKB_CB(skb)->sacked = 0;

Now we try to find an open socket for this incoming segment if there is one. As is the case with most functions in Linux TCP/IP, the double underscore "__" before tcp_v4_lookup means that the caller must acquire the lock before calling the function. __tcp_v4_lookup attempts to find the socket based on the incoming network interface, the source and destination IP addresses, and the source and destination TCP ports. If we have a socket, sk is set to point to the sock structure for the open socket and we continue to process the incoming segment.

sk = __tcp_v4_lookup(skb->nh.iph->saddr, th->source,
skb->nh.iph->daddr,
ntohs(th->dest), tcp_v4_iif(skb));
if (!sk)
goto no_tcp_socket;

We process the incoming packet. If IP security is installed, we do the security transformation. Once we have the socket, sk, we can check the state of the connection. If the connection is in the TIME-WAIT state, we must handle any incoming segments in a special way. Once we are in TIME-WAIT, delayed segments must be discarded and an incoming TCP packet may contain a delayed segment.

process:
if (sk->sk_state == TCP_TIME_WAIT)
goto do_time_wait;

Next, we check the IPSec security transformation policy and the socket filter.

if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb))
goto discard_and_relse;
if (sk_filter(sk, skb, 0))
goto discard_and_relse;
skb->dev = NULL;
bh_lock_sock(sk);
ret = 0;

If the socket is locked by the top-half process, it can't accept any more segments, so we must put the incoming segment on the backlog queue by calling sk_add_backlog. If the socket is not locked, we try to put the segment on the prequeue. The prequeue is in the user copy structure, ucopy, which is part of the TCP options structure. Once segments are put on the prequeue, they are processed in the application task's context rather than in the kernel context, which improves the efficiency of TCP by minimizing context switches between kernel and user. If tcp_prequeue returns zero, it means that there was no current user task associated with the socket, so tcp_v4_do_rcv is called to continue with normal "slow path" receive processing.

if (!sock_owned_by_user(sk)) {
if (!tcp_prequeue(sk, skb))
ret = tcp_v4_do_rcv(sk, skb);
} else
sk_add_backlog(sk, skb);

Now we can unlock the socket by calling bh_unlock_sock instead of unlock_sock, because in this function, we are executing in the "bottom half" context. Sock_put decrements the socket reference count indicating that the sock has been processed.

bh_unlock_sock(sk);
sock_put(sk);
return ret;

The label no_tcp_socket is where we end up if there is no current open TCP socket. We still must complete the checksum to see if the packet is bad. If the packet had a bad checksum, we increment the error counter. If the packet was good, we are here because the segment was sent to a socket that is not open, so we send out a reset request to bring down the connection. Next, we discard the packet and we are free to go.

no_tcp_socket:
if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
goto discard_it;
if (skb->len < (th->doff<<2) || tcp_checksum_complete(skb)) {
bad_packet:
TCP_INC_STATS_BH(TcpInErrs);
} else {
tcp_v4_send_reset(skb);
}
discard_it:
kfree_skb(skb);
return 0;

We arrived here because we need to discard the packet. We decrement the reference count and free the packet.

discard_and_relse:
sock_put(sk);
goto discard_it;

We jumped here, do_time_wait, because the socket, sk, is in the TIME_WAIT state. The TIME_WAIT state requires a little more discussion because arriving packets require special attention.

do_time_wait:
if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN,  skb))
goto discard_and_relse;

Next, we check to see if the header length is too short and we try to complete the checksum. If these tests fail, we get rid of the packet and go out. The function tcp_checksum_complete should return a zero if it is successful.

if (skb->len < (th->doff<<2) || tcp_checksum_complete(skb)) {
TCP_INC_STATS_BH(TcpInErrs);
goto discard_and_relse;
}

Tcp_timewait_state_process looks at the arriving packet, skb, to see whether it is a SYN, FIN, or a data segment and determines what to do. The states returned by tcp_timewait_state_process are shown in Table.

switch(tcp_timewait_state_process((struct  tcp_tw_bucket *)sk,
skb, th, skb->len)) {

TCP TIME_WAIT Status Values

According to RFC 1122, an arriving SYN can "wake up" and re-establish a connection in the TIME-WAIT state. In this case, tcp_v4_lookup_listener is called to find a listening socket against which a new connection can be established.

case TCP_TW_SYN:
{
struct sock *sk2 = tcp_v4_lookup_listener(skb->nh.iph->daddr,
ntohs(th->dest),
tcp_v4_iif(skb));
if (sk2) {
tcp_tw_deschedule((struct tcp_tw_bucket  *)sk);
tcp_timewait_kill((struct tcp_tw_bucket  *)sk);
tcp_tw_put((struct tcp_tw_bucket *)sk);
sk = sk2;
goto process;
}
}

We received the last ACK from the peer, so we must send the final ACK to close the connection.

case TCP_TW_ACK:
tcp_v4_timewait_ack(sk, skb);
break;

A FIN was received, so we reset the connection by sending an RST, done at the no_tcp_socket label.

case TCP_TW_RST:
goto no_tcp_socket;

A duplicate ACK or a delayed segment was received, so we discard it.

case TCP_TW_SUCCESS:;
}
goto discard_it;
}

The TCP Fast Path, Prequeue Processing

Linux TCP has two paths for input packet processing, a "slow" path and a "fast" path. The slow path is normal input processing. As discussed, every socket has two queues, a receive queue and a backlog queue; the backlog queue is used if the receive queue is full or the socket is busy.

Along the slow path, packets received by TCP are placed on the socket’s receive queue only after the packet is determined to be a valid data segment containing in-order segment data. This is a large amount of processing and it is all done in the context of the "bottom half" of Linux, in the context of the network interface receive functions. Once the packets are on the queue, the socket is woken up and the scheduler executes the user-level task, which reads the packets from the queue. As part of slow path processing, when the receive queue is full or the user-level task has the socket locked, the packets are placed on the backlog queue.

In addition to the two queues used in the slow path, Linux TCP has a third queue called the prequeue, which is used for fast path processing. Because of the high volume of data that TCP is expected to handle, Linux includes the fast path as one of several speedups for optimized performance. As we saw, one of the important performance-enhancing features of Linux TCP/IP is TCP header prediction. The header prediction algorithm determines whether a received packet is likely to be an in-order segment containing data received while the socket is in the ESTABLISHED state [JACOB93]. If these conditions are met, the packet is selected for processing by the fast path. In general, the Van Jacobson algorithm assumes that at least half of all packets arriving while the connection is established will consist of data segments rather than ACK packets. By using header prediction, TCP's low-level receive function tries to determine which packets meet the criteria, and those packets are immediately placed on the prequeue. When the user task is woken up, TCP processing of the packets on the prequeue is done by the user-level task, bypassing many of the processing steps of the slow path. Fast path processing occurs before the normal processing of any packets on the regular receive queue.

There are three functions and one data structure used with the TCP prequeue. The ucopy structure contains the queue itself. The tcp_prequeue_init function, in the file linux/net/ipv4/tcp_ipv4.c, initializes the prequeue, and the inline function tcp_prequeue in file linux/include/net/tcp.h puts packets on the queue. A third function, tcp_prequeue_process in file linux/net/ipv4/tcp.c, removes the data while running in the application task's context when the socket receive function is called.

The prequeue is within the ucopy structure, defined in file linux/include/linux/tcp.h, which is in the TCP options part of the sock structure. The field prequeue contains the list of socket buffers waiting for processing. Task is the user-level task to receive the data. The iov field points to the user's receive data array, and memory contains the sum of the actual data lengths of all of the socket buffers on the prequeue. Len is the number of buffers on the prequeue.

struct {
struct sk_buff_head prequeue;
struct task_struct *task;
struct iovec *iov;
int memory;
int len;
} ucopy;

Tcp_prequeue_init, defined in file linux/include/net/tcp.h, initializes the elements of the ucopy structure, including the list of socket buffers in the prequeue. This initialization occurs whenever a socket of type SOCK_STREAM is opened for the AF_INET address family, as part of the initialization of socket state information by the function tcp_v4_init_sock.

static __inline__ void tcp_prequeue_init(struct tcp_opt *tp)
{
tp->ucopy.task = NULL;
tp->ucopy.len = 0;
tp->ucopy.memory = 0;
skb_queue_head_init(&tp->ucopy.prequeue);
}

Tcp_prequeue, the function that puts socket buffers on the prequeue, is also defined in file linux /include /net /tcp.h. It queues up buffers only if there is a user task currently waiting on the socket. This is indicated by a non-NULL value in the task field of the ucopy structure. When the socket is woken up and a read is issued on the socket by the application, tcp_prequeue immediately processes the socket’s prequeue before "officially" calling the socket’s receive function through the system call interface.

static __inline__ int tcp_prequeue(struct
 sock *sk, struct sk_buff *skb)
{

Tp is set to the TCP options structure, which is retrieved from the sock structure.

struct tcp_opt *tp = &sk->tp_pinfo.af_tcp;

Here we check to see if a user task is currently waiting on this socket and the tcp_low_latency sysctl has not been set by the user. If so, we place the socket buffer, skb, at the end of the queue.

if (!sysctl_tcp_low_latency && tp->ucopy.task) {
__skb_queue_tail(&tp->ucopy.prequeue,  skb);
tp->ucopy.memory += skb->truesize;

If queuing the current TCP segment, skb, causes the prequeue to grow larger than the size of the socket’s receive buffer, the socket buffers on the prequeue are removed and the socket’s backlog receive function, backlog_rcv, is called to put each buffer on the backlog queue until all the buffers on the prequeue are handled. After this, the prequeue is reset to empty by setting the memory field in ucopy to zero.

if (tp->ucopy.memory > sk->rcvbuf) {
struct sk_buff *skb1;
if (sk->lock.users)
out_of_line_bug();
while ((skb1 = __skb_dequeue(&tp->ucopy.prequeue)) != NULL) {
sk->backlog_rcv(sk, skb1);
NET_INC_STATS_BH(TCPPrequeueDropped);
}
tp->ucopy.memory = 0;

If the skb just queued is the only one on the prequeue, that is, the queue was empty before this packet arrived, the socket is woken up. Recall from the discussion of the delayed acknowledgment timer that outgoing ACKs are held back so they can be piggybacked on data segments. Therefore, while queuing packets for prequeue processing, we reset the delayed acknowledgment timer to wait for a send-side segment to carry the ACK.

} else if (skb_queue_len(&tp->ucopy.prequeue)  == 1) {

This is where we wake up the socket so the prequeue will be processed.

wake_up_interruptible(sk->sleep);
if (!tcp_ack_scheduled(tp))
tcp_reset_xmit_timer(sk, TCP_TIME_DACK, (3*TCP_RTO_MIN)/4);
}

We return a one if we placed the packet on the prequeue, including the overflow case in which the queued buffers were pushed onto the backlog queue. A return of zero means the packet was not queued at all, either because no user task is waiting on the socket or because the tcp_low_latency sysctl is set, and the caller must process it through the normal path.

return 1;
}
return 0;
}
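The bookkeeping in tcp_prequeue can be sketched in ordinary user-space C. This is a simplified model, not kernel code; the structure and function names are invented for illustration, with RCVBUF standing in for sk->rcvbuf and task_waiting for a non-NULL ucopy.task. Buffers accumulate by truesize until the receive buffer limit is exceeded, then everything is drained to the backlog and the accounting is reset:

```c
#include <assert.h>

/* Hypothetical model of the prequeue accounting; none of these
 * names are the kernel's. */

enum { RCVBUF = 4096 };          /* stand-in for sk->rcvbuf */

struct prequeue_model {
    int memory;        /* sum of truesize of queued buffers */
    int queued;        /* packets currently on the prequeue */
    int backlogged;    /* packets drained to the backlog    */
    int task_waiting;  /* is a reader blocked in recvmsg?   */
};

/* Returns 1 if the packet was taken by the prequeue, 0 if the
 * caller must process it through the normal path. */
int model_prequeue(struct prequeue_model *m, int truesize)
{
    if (!m->task_waiting)
        return 0;
    m->queued++;
    m->memory += truesize;
    if (m->memory > RCVBUF) {
        /* overflow: push everything to the backlog and reset */
        m->backlogged += m->queued;
        m->queued = 0;
        m->memory = 0;
    }
    return 1;
}
```

For example, with a 4096-byte limit, two 2000-byte buffers stay queued, and queuing a third pushes all three onto the backlog.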

TCP Backlog Queue Processing

The function tcp_v4_do_rcv is the backlog receive function for TCP. It is set in the backlog_rcv field of the proto structure in file linux/net/ipv4/tcp_ipv4.c. Tcp_v4_do_rcv is called to process packets that were placed on the backlog queue because the socket was busy, and it is also called directly by the low-level receive handler when the socket can accept the packet immediately.

int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
{

First, we check to see if we are in the ESTABLISHED state. If so, the skb is a likely candidate for fast path processing.

Tcp_rcv_established is called to do the processing of the input packet in the ESTABLISHED state. It does the header prediction to see if the skb can be processed in the fast path. Tcp_rcv_established returns a one if we must send a reset to the other side of the connection, and on successful processing of the packet, it returns a zero.

if (sk->state == TCP_ESTABLISHED) {
TCP_CHECK_TIMER(sk);
if (tcp_rcv_established(sk, skb,skb->h.th, skb->len))
goto reset;
TCP_CHECK_TIMER(sk);
return 0;
}

Here we check to see whether the packet has a complete header, and we complete the calculation of the checksum.

Tcp_checksum_complete returns a zero if the checksum is OK or it is already done, and a nonzero value if the checksum failed.

 if (skb->len <  (skb->h.th->doff<<2) || tcp_checksum_complete(skb))
goto csum_err;

If we are in the listening state, we check to see if the incoming packet, skb, is a SYN packet, which would be a connection request. This processing is done by tcp_v4_hnd_req, which validates the connection request and returns a sock structure, nsk, for the connection, or NULL, in which case the packet is discarded.

if (sk->state == TCP_LISTEN) {
struct sock *nsk = tcp_v4_hnd_req(sk, skb);
if (!nsk)
goto discard;

Tcp_child_process continues with receive state processing on the child socket, nsk. It returns a zero if the skb was processed successfully, and returns a nonzero value if we must send a reset to the peer, generally because we received a reset during the brief amount of time the child socket was in the SYN received state. After all this has been successfully completed, the new child socket, nsk, will be in the ESTABLISHED state, ready to transfer data and we are done processing the incoming packet.

if (nsk != sk) {
if (tcp_child_process(sk, nsk, skb))
goto reset;
return 0;
}
}

If we are not in the LISTEN state, we proceed with normal state processing by calling tcp_rcv_state_process, which returns a zero if the skb was successfully processed and returns a one if we must send a reset request to the peer.

TCP_CHECK_TIMER(sk);
if (tcp_rcv_state_process(sk, skb,  skb->h.th, skb->len))
goto reset;
TCP_CHECK_TIMER(sk);
return 0;

These three exit labels—reset, discard, and csum_err—are for sending a reset to the peer, discarding the skb, and incrementing the TCP input error counter, respectively. The error counter is incremented if a bad packet was detected.

reset:
tcp_v4_send_reset(skb);
discard:
kfree_skb(skb);
return 0;
csum_err:
TCP_INC_STATS_BH(TcpInErrs);
goto discard;
}

TCP Receive State Processing

We discussed the 11 states of TCP and presented a general overview of how and where they are implemented in the Linux kernel. We know that as a TCP packet arrives, TCP must detect whether the packet contains a data segment or whether it carries a signal, which is one of the bits in the TCP header: SYN, FIN, RST, or ACK. This section explains the code that does the TCP receive processing. It covers all TCP state processing as mandated by RFC 793 except the ESTABLISHED and TIME_WAIT states. Tcp_rcv_state_process, in the file linux/net/ipv4/tcp_input.c, is the function that does most of the work of processing the TCP state of the open socket. The function returns a zero if the segment was successfully processed, and returns a one if the caller must send a reset.

int tcp_rcv_state_process
(struct sock *sk, struct sk_buff *skb,
struct tcphdr *th, unsigned len)
{
struct tcp_opt *tp =  &(sk->tp_pinfo.af_tcp);
int queued = 0;

Saw_tstamp is set later to a nonzero value if the packet, skb, contains the timestamp option.

tp->saw_tstamp = 0;

First, we handle the CLOSE, LISTEN, and SYN_SENT states. If the state is CLOSED, we discard the incoming segment.

switch (sk->state) {
case TCP_CLOSE:
goto discard;

If we are in the LISTEN state, this socket is acting as a server and is waiting for a connection. If the incoming segment includes an ACK, we must send a reset. If the incoming segment contains an RST, we discard it. If it contains a SYN, we process the connection request.

case TCP_LISTEN:
if(th->ack)
return 1;
if(th->rst)
goto discard;

This function is independent of both IPv4 and IPv6. Where the actions taken are specific to an address family, the function in the af_specific part of the tcp_opt structure is used. Recall that tcp_opt was initialized when the socket, sk, was opened for one of the address families, AF_INET or AF_INET6.

if(th->syn) {
if(tp->af_specific->conn_request(sk,  skb) < 0)
return 1;
goto discard;
}

In the LISTEN state, any incoming packets that do not contain a SYN or an ACK are discarded. There is discussion in the comments of this function about whether we should process any data in the incoming segment based on the behavior of TCP over non-IP network protocols; however, the action in the current kernel release is to discard the segment.

goto discard;

If the current state of this socket is SYN_SENT, we must check for the ACK or SYN flags in the incoming segment to see if we should advance the state to ESTABLISHED. The function tcp_rcv_synsent_state_process does most of the work for the SYN_SENT state.

case TCP_SYN_SENT:
queued = tcp_rcv_synsent_state_process(sk,  skb, th, len);
if (queued >= 0)
return queued;

The comments contain some discussion about how to handle data in received segments while in the SYN_SENT state, and tcp_rcv_synsent_state_process can return a negative one if there is data to process. However, other than checking the URG flag, we do nothing with the data. A nonzero return indicates that a reset must be sent.

tcp_urg(sk, skb, th);
__kfree_skb(skb);
tcp_data_snd_check(sk);
return 0;

Timestamps are checked here before the receive state processing is completed. This is part of the check for wrapped sequence numbers, called Protection Against Wrapped Sequence Numbers (PAWS). Sequence numbers can wrap when the sequence value reaches the maximum integer value and increments back to zero. On high-speed networks with a high window scaling factor, it is possible for an "old" retransmitted segment to reappear with the same sequence number as the current wrapped value. To prevent this, timestamps are used. The receiver checks the timestamps for ascending values, and any received segment with an earlier timestamp is thrown out. Refer to Stevens. Tcp_fast_parse_options updates three fields of the tcp_opt structure from the two values in the timestamp TCP option of the incoming packet. The rcv_tsval field is set to the received timestamp, and rcv_tsecr gets the value of the echo reply timestamp. In addition, saw_tstamp is set if the timestamp option was detected in the packet.

Tcp_fast_parse_options returns a zero if skb contains a short TCP header with no options at all. Tcp_paws_discard does the PAWS processing by checking the timestamp values and returns a one if it detects a segment that should be discarded.

if (tcp_fast_parse_options(skb, th, tp) && tp->saw_tstamp &&
tcp_paws_discard(tp, skb)) {

A reset packet is always accepted even if it fails the PAWS test; otherwise, tcp_send_dupack enters quickack mode and sends a duplicate acknowledgment for the offending segment, indicating to the sender that the packet, skb, is a duplicate segment.

if (!th->rst) {
NET_INC_STATS_BH(PAWSEstabRejected);
tcp_send_dupack(sk, skb);
goto discard;
}
}
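The wrap-safe comparison at the heart of the PAWS test can be sketched in ordinary user-space C. This is a simplified model, not kernel code; ts_is_older is an invented name mirroring the kernel's (s32)(tp->rcv_tsval - tp->ts_recent) < 0 test:

```c
#include <assert.h>
#include <stdint.h>

/* Timestamps are compared by taking a signed 32-bit difference,
 * so the comparison keeps working after the counter wraps past
 * 0xFFFFFFFF. Returns nonzero if rcv_tsval is older than
 * ts_recent and the segment should be rejected by PAWS. */
int ts_is_older(uint32_t rcv_tsval, uint32_t ts_recent)
{
    return (int32_t)(rcv_tsval - ts_recent) < 0;
}
```

Note that a timestamp numerically smaller than ts_recent can still be "newer" if the counter has wrapped; the signed difference handles that case.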

At this point in the tcp_rcv_state_process function, we process all the states other than the SYN_SENT, LISTEN, and CLOSE states, which were handled earlier in the function. The steps for processing the remaining eight states are specified starting on page 69 of RFC 793, and tcp_rcv_state_process follows these steps closely. Step one is to validate the sequence number. Tcp_sequence sees if the sequence number is within the window. It returns a zero if the sequence number is outside of the current window. If we received a bad sequence number, we negatively acknowledge it by calling tcp_send_dupack and discard the incoming segment.

if (!tcp_sequence
(tp, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq)) {
if (!th->rst)
tcp_send_dupack(sk, skb);
goto discard;
}
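The acceptability test can be sketched with the same wrap-safe arithmetic. This is a simplified user-space model under the assumption that the left window edge is rcv_wup and the right edge is rcv_nxt plus the receive window; seq_in_window is an invented name standing in for tcp_sequence, and the window bounds are passed in directly rather than read from tcp_opt:

```c
#include <assert.h>
#include <stdint.h>

/* Wrap-safe sequence comparisons, modeled on the kernel's
 * before()/after() macros. */
int seq_before(uint32_t a, uint32_t b) { return (int32_t)(a - b) < 0; }
int seq_after(uint32_t a, uint32_t b)  { return seq_before(b, a); }

/* Accept a segment [seq, end_seq] if it overlaps the window
 * [rcv_wup, rcv_nxt + rcv_wnd]. */
int seq_in_window(uint32_t seq, uint32_t end_seq,
                  uint32_t rcv_wup, uint32_t rcv_nxt, uint32_t rcv_wnd)
{
    return !seq_before(end_seq, rcv_wup) &&
           !seq_after(seq, rcv_nxt + rcv_wnd);
}
```

A segment that ends before the left edge has already been acknowledged, and one that starts beyond the right edge does not fit in the advertised window; both are rejected and answered with a duplicate ACK.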

Step two is to check to see if the incoming packet is a reset. The LISTEN state was already handled earlier in this function. Tcp_reset will process the received reset request for all the other states. It sets the appropriate errors in the sock structure and the socket is destroyed.

if(th->rst) {
tcp_reset(sk);
goto discard;
}

Tcp_replace_ts_recent stores the timestamp in the tcp_opt structure, tp.

tcp_replace_ts_recent(tp,  TCP_SKB_CB(skb)->seq);

Step three on page 71 in RFC 793 is to check security and precedence and is ignored. Step four in the RFC is to check for a received SYN in the incoming packet that is inside the current window. A received SYN in the window is an error condition so the connection is reset.

if (th->syn &&  !before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
NET_INC_STATS_BH(TCPAbortOnSyn);
tcp_reset(sk);
return 1;
}

Step five is to check for an ACK in the received packet, skb. This step is fairly complicated in the code, but essentially, if the acknowledge is acceptable, we proceed to the ESTABLISHED state.

if (th->ack) {

Tcp_ack processes incoming ACKs, and the FLAG_SLOWPATH argument tells tcp_ack to do comprehensive checking as well as update the window. It returns a one if the ACK is acceptable.

int acceptable = tcp_ack(sk, skb,  FLAG_SLOWPATH);
switch (sk->state) {

If the ACK was acceptable and the connection is in the SYN_RECV state, we most likely are a server in the act of doing a “passive open.” We should move to the ESTABLISHED state.

case TCP_SYN_RECV:
if (acceptable) {
tp->copied_seq = tp->rcv_nxt;
mb();
tcp_set_state(sk, TCP_ESTABLISHED);
sk->state_change(sk);

Sk_wake_async wakes up the socket. We reach this point on the client side of a connection created with an active open when the two SYNs have crossed and a SYN was received in the SYN_RECV state. A socket that is being established with a passive open will not be woken up because its socket field is NULL.

if (sk->socket) {
sk_wake_async(sk, 0, POLL_OUT);
}

Now we update SND_UNA as specified in RFC 793, page 72. We also update SND_WND with the advertised window shifted by the current scaling factor, snd_wscale, which holds the value in the most recent window scaling option received from the peer. Tcp_init_wl updates snd_wl1 to hold the sequence number in the incoming packet.

tp->snd_una = TCP_SKB_CB(skb)->ack_seq;
tp->snd_wnd = ntohs(th->window)  << tp->snd_wscale;
tcp_init_wl(tp, TCP_SKB_CB(skb)->ack_seq,
TCP_SKB_CB(skb)->seq);
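The window update just shown is a simple shift. As a user-space sketch (the function name is invented, and the ntohs conversion is assumed to have been done already):

```c
#include <assert.h>
#include <stdint.h>

/* SND.WND is the 16-bit window field from the TCP header shifted
 * left by the scale factor negotiated in the window scale option,
 * mirroring ntohs(th->window) << tp->snd_wscale. */
uint32_t scaled_window(uint16_t hdr_window, unsigned snd_wscale)
{
    return (uint32_t)hdr_window << snd_wscale;
}
```

For example, a scale factor of 7 lets the 16-bit window field advertise windows up to roughly 8 MB.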

Now that we are moving the socket to the ESTABLISHED state, we must do some housekeeping to make the socket ready to receive data segments. First, tcp_ack did not calculate the RTT, so the RTT is determined based on the timestamps if there was a timestamp option in the received ACK packet; the RTT value is kept in the srtt field of tcp_opt structure. The MSS is adjusted to allow for the size of the timestamp option. Tcp_init_metrics initializes some metrics and calculations for the socket. Next, the received MSS value is set to an initial guess based on the received window size, RCV.WND. Buffer space in the socket is reserved based on the received MSS and other factors. Last, tcp_fast_path_on calculates the pred_flags field in tcp_opt, which determines if the receive fast path is on, whether header prediction will be used.

if (tp->saw_tstamp &&  tp->rcv_tsecr && !tp->srtt)
tcp_ack_saw_tstamp(tp, 0);
if (tp->tstamp_ok)
tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;
tp->af_specific->rebuild_header(sk);
tcp_init_metrics(sk);

Prevent spurious congestion window restart on the first data packet.

tp->lsndtime = tcp_time_stamp;
tcp_initialize_rcv_mss(sk);
tcp_init_buffer_space(sk);

We turn on the fast path since we are going to the ESTABLISHED state.

tcp_fast_path_on(tp);
} else {

If the incoming acknowledge packet was not acceptable, we return a one, which tells the caller (tcp_v4_do_rcv) to send a reset.

return 1;
}
break;

If the connection is in the FIN_WAIT_1 state and we receive an ACK, we enter the FIN_WAIT_2 state.

case TCP_FIN_WAIT1:
if (tp->snd_una == tp->write_seq) {
tcp_set_state(sk, TCP_FIN_WAIT2);

Setting SEND_SHUTDOWN in the shutdown field records that the send side of the connection is shut down: our FIN has been sent, and no more data will follow it.

sk->shutdown |= SEND_SHUTDOWN;
dst_confirm(sk->dst_cache);
if (!sk->dead)

If the socket is not orphaned, we wake it up, which moves it into the FIN_WAIT_2 state.

sk->state_change(sk);
else {

Otherwise, we process the Linux TCP socket option, TCP_LINGER2. The value for this option is in the linger2 field of tcp_opt, which controls how long we remain in the FIN_WAIT_2 state before proceeding to the CLOSED state. However, if linger2 is negative, we proceed directly to CLOSED without passing through the FIN_WAIT_2 and the TIME_WAIT states.

int tmo;
if (tp->linger2 < 0 ||
(TCP_SKB_CB(skb)->end_seq !=  TCP_SKB_CB(skb)->seq &&
after(TCP_SKB_CB(skb)->end_seq - th->fin,
tp->rcv_nxt))) {
tcp_done(sk);
NET_INC_STATS_BH(TCPAbortOnData);
return 1;
}

Tcp_fin_time calculates how long to wait in the FIN_WAIT_2 state, based on the TCP_LINGER2 option value in linger2 (or the system default) and the retransmission timeout. The keepalive timer is then reset with the calculated timeout value.

tmo = tcp_fin_time(tp);
if (tmo > TCP_TIMEWAIT_LEN) {
tcp_reset_keepalive_timer(sk,
tmo - TCP_TIMEWAIT_LEN);
} else if (th->fin ||  sock_owned_by_user(sk)) {

If the incoming ACK included a FIN or the socket is locked, we reset the keepalive timer. The comment in the code states that if we don’t do this, we could lose the incoming FIN. We proceed as if the SO_LINGER option was selected so input state processing for the FIN state will resume in the keepalive timer function. Effectively, this should advance the state to TIME_WAIT when the keepalive timer expires.

tcp_reset_keepalive_timer(sk, tmo);
} else {
tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
goto discard;
}
}
}
break;

Now we continue our processing of the incoming ACK packet (step five, page 72 in RFC 793) by looking at the CLOSING and LAST_ACK states. If we are in the CLOSING state and we receive an ACK, we proceed directly to TIME_WAIT provided there is no outstanding data left for the send side of the connection to handle. If we are in the LAST_ACK state, we are in the process of doing a passive close, responding to a close call in the peer. An ACK received in this state means that we can close the socket, so we call tcp_done.

case TCP_CLOSING:
if (tp->snd_una == tp->write_seq) {
tcp_time_wait(sk, TCP_TIME_WAIT, 0);
goto discard;
}
break;
case TCP_LAST_ACK:
if (tp->snd_una == tp->write_seq) {
tcp_update_metrics(sk);
tcp_done(sk);
goto discard;
}
break;
}
} else
goto discard;

At this point, we are done with processing an incoming ACK. The sixth step is to process an urgent request, so we check the URG bit in the incoming segment. This check is done in function tcp_urg, which continues with the processing of the urgent data.

tcp_urg(sk, skb, th);

Step seven is to process segment text.

switch (sk->state) {
case TCP_CLOSE_WAIT:
case TCP_CLOSING:
case TCP_LAST_ACK:
if (!before(TCP_SKB_CB(skb)->seq,  tp->rcv_nxt))
break;
case TCP_FIN_WAIT1:
case TCP_FIN_WAIT2:

RFC 793 says we should queue up data received in one of these five states. However, if the receive side has been shut down, we should instead send an RST; this is what the BSD operating system does starting with version 4.4, and Linux does a reset, too.

if (sk->shutdown & RCV_SHUTDOWN) {
if (TCP_SKB_CB(skb)->end_seq !=  TCP_SKB_CB(skb)->seq &&
after(TCP_SKB_CB(skb)->end_seq -  th->fin, tp->rcv_nxt)) {
NET_INC_STATS_BH(TCPAbortOnData);
tcp_reset(sk);
return 1;
}
}

Here we have the normal case where a data segment is received in the ESTABLISHED state; tcp_data_queue is called to continue the processing and put the data segment on the socket’s input queue.

case TCP_ESTABLISHED:
tcp_data_queue(sk, skb);
queued = 1;
break;
}

The following two functions, tcp_data_snd_check and tcp_ack_snd_check, determine whether a data segment or an ACK, respectively, needs to be sent to the peer. The comment here states that "tcp_data could move socket to TIME_WAIT," but it is not clear how that can occur.

if (sk->state != TCP_CLOSE) {
tcp_data_snd_check(sk);
tcp_ack_snd_check(sk);
}
if (!queued) {
discard:
__kfree_skb(skb);
}
return 0;
}

TCP Processing Data Segments in Established State

Obviously, the purpose of TCP is to transfer data reliably and rapidly. Once the socket is in the ESTABLISHED state, the job of the connection is to transfer data between the two sides as fast as the network and the peer will permit. The tcp_v4_do_rcv function covered earlier in the text checks to see if the socket is in the ESTABLISHED state while processing incoming packets. If so, the function tcp_rcv_established in file linux/net/ipv4/tcp_input.c is called to complete the processing. The primary purpose of this function is to copy the data from the segments to user space as efficiently as possible. The figure shows the Linux TCP ESTABLISHED state processing.

TCP established state.

The Linux kernel provides a fast path to speed up TCP data transfer as much as possible under normal conditions where data is being copied through an open socket. It uses a method very similar to Van Jacobson's e-mail about how to do TCP in 30 instructions [JACOB93]. The Linux method differs in that it does the header prediction in advance, during the early part of the TCP receive path. Although somewhat different from Van Jacobson's method, Linux perhaps improves on it in that the completion of fast path processing happens in the application program's task rather than in the "bottom half" of kernel processing. Wherever possible, Linux uses header prediction to select the packets that are most likely to be "normal" data segments for fast path processing. If these packets are in-order data segments, they won't require any processing other than copying the data into the application program's receive buffer. Therefore, the fast path processing resumes in the context of the application program task while it is executing one of the socket read API functions.

If all incoming TCP packets were put through the fast path, it would no longer be fast. Therefore, the key to the fast path processing is to choose which packets have a high probability of being “normal” data segments that don’t require any special time-consuming handling. To accomplish this, we do header prediction to quickly mark these candidate packets early in the receive packet path. Prediction flags are calculated, which are later used to direct packets either to the fast path or the slow path. The header prediction value is calculated as follows:

prediction flags = (hlen << 26) ^ (ackw << 20) ^ SND.WND

where hlen is the TCP header length in bytes and ackw is the Boolean ACK flag. The preceding formula yields a prediction flags value that is equal to bytes 13 through 16 of the TCP header, the 32-bit word holding the header length, flags, and window fields. Thus, the prediction flags will be exactly equal to that word in the header of an input segment that consists of a data segment with an ACK but no TCP options. The prediction flags are stored in the TCP options structure in tp->pred_flags. The value is calculated by the inline function __tcp_fast_path_on in file linux/include/net/tcp.h.
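The packing can be checked with a small user-space model. Both function names here are invented for illustration: build_pred_flags packs the fields as the formula does, and header_word builds the corresponding 32-bit word of a TCP header from its doff, flags, and window fields.

```c
#include <assert.h>
#include <stdint.h>

/* pred_flags as the formula computes it. hlen_bytes << 26 equals
 * (hlen_bytes / 4) << 28, which puts doff in its place; 1 << 20 is
 * the ACK bit within the flags byte. */
uint32_t build_pred_flags(unsigned hlen_bytes, uint32_t snd_wnd)
{
    return ((uint32_t)hlen_bytes << 26) | (1u << 20) | snd_wnd;
}

/* The header word a receiver would see, following the TCP header
 * layout [doff:4][reserved:4][flags:8][window:16]. */
uint32_t header_word(unsigned doff_words, uint8_t flags, uint16_t window)
{
    return ((uint32_t)doff_words << 28) | ((uint32_t)flags << 16) | window;
}
```

For a 20-byte header carrying only the ACK flag (0x10), the two values match; any extra flag defeats the prediction, and a 12-byte timestamp option simply shows up as the larger expected header length. (In the kernel, the PSH bit is masked out with TCP_HP_BITS before the comparison, so PSH alone does not break prediction.)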

Even though header prediction flags were already calculated, while processing incoming packets in the ESTABLISHED state, there are a few checks that cause packets to be redirected to the slow path.

  • Our side of the connection announced a zero window. The processing of zero window probes is only handled properly in the slow path.
  • If this side of the connection receives any out-of-order segments, fast path processing is disabled.
  • If urgent data is encountered, fast path processing is disabled until the urgent data is copied to the user.
  • If there is no more receive buffer space, the fast path is disabled.
  • Any failure of header prediction will divert a particular segment into the slow path.
  • The fast path is only supported for unidirectional data transfers, so if we have to send data in the other direction, we default to the slow path processing of incoming segments.
  • If there are any options other than a timestamp in the incoming packet, we divert it to the slow path.

Tcp_rcv_established in file linux/net/ipv4/tcp_input.c is the function that does the heavy lifting of processing incoming data segments. This function is called only for packets received while the connection is already in the ESTABLISHED state. Therefore, it starts out with the assumption that the packets are to be processed in the fast path. Along the way, if it finds that the incoming packet needs a closer look, it is diverted to the slow path.

int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
struct tcphdr *th, unsigned len)
{
struct tcp_opt *tp =  &(sk->tp_pinfo.af_tcp);
tp->saw_tstamp = 0;

Here is where the prediction flags are compared to the incoming segment. We also check to see that the sequence number is in order. The PSH flag in the incoming packet is ignored.

if ((tcp_flag_word(th) & TCP_HP_BITS)== tp->pred_flags &&
TCP_SKB_CB(skb)->seq == tp->rcv_nxt) {
int tcp_header_len = tp->tcp_header_len;

Check to see if there are any options other than the timestamp and, if so, send this packet to the slow path. The TCP options field saw_tstamp indicates that the incoming packet contained a timestamp option.

if (tcp_header_len == sizeof(struct tcphdr) +
TCPOLEN_TSTAMP_ALIGNED) {
__u32 *ptr = (__u32 *)(th + 1);

If the packet contains TCP options, we send it to the slow path.

if (*ptr != ntohl((TCPOPT_NOP << 24) |(TCPOPT_NOP << 16)
| (TCPOPT_TIMESTAMP << 8) |TCPOLEN_TIMESTAMP))
goto slow_path;
tp->saw_tstamp = 1;
++ptr;
tp->rcv_tsval = ntohl(*ptr);
++ptr;
tp->rcv_tsecr = ntohl(*ptr);

Now we do a quick check for PAWS. If the check fails, the packet gets a closer look in the slow path.

if ((s32)(tp->rcv_tsval -tp->ts_recent) < 0)
goto slow_path;

Here we check whether the packet is no longer than its header, which means it carries no data (or is truncated).

if (len <= tcp_header_len) {

Here we check for the sending-side fast path. Essentially, we see if we are receiving packets with a valid header but no data, which could indicate that we are doing a one-way bulk data transfer in the outgoing direction. Therefore, we acknowledge the packet, free it, and check the send side. We don't complete the checksum on the incoming packet here, however, because that has already been done for header-only packets.

if (len == tcp_header_len) {

The predicted packet is in the window.

if (tcp_header_len == (sizeof(struct tcphdr)  +
TCPOLEN_TSTAMP_ALIGNED) &&tp->rcv_nxt == tp->rcv_wup)
tcp_store_ts_recent(tp);
tcp_ack(sk, skb, 0);
__kfree_skb(skb);
tcp_data_snd_check(sk);
return 0;

The header is too small, so throw the packet away.

} else {
TCP_INC_STATS_BH(TcpInErrs);
goto discard;
}
} else {
int eaten = 0;

The global current always points to the currently running task, which is the task at the head of the list of tasks in the TASK_RUNNING state. We check to see if we are running in the context of the application task by seeing if the task structure pointer in ucopy is the same as current. Current was saved in the ucopy structure by tcp_recvmsg, the socket-level receive function when it was called by the application program (through the Linux system call interface, of course).

if (tp->ucopy.task == current &&
tp->copied_seq == tp->rcv_nxt  &&
len - tcp_header_len <= tp->ucopy.len  &&
sock_owned_by_user(sk)) {
__set_current_state(TASK_RUNNING);

Here is where we actually copy the data. If the data was successfully copied, we update rcv_nxt, the next expected receive sequence number. While copying the data, tcp_copy_to_iovec also completes the checksum if it wasn't already done by the network interface hardware.

if (!tcp_copy_to_iovec(sk, skb,tcp_header_len)) {
if (tcp_header_len ==
(sizeof(struct tcphdr) +
TCPOLEN_TSTAMP_ALIGNED) &&
tp->rcv_nxt == tp->rcv_wup)
tcp_store_ts_recent(tp);
__skb_pull(skb, tcp_header_len);
tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
NET_INC_STATS_BH(TCPHPHitsToUser);
eaten = 1;
}
}

We are here either because there is no user task context or because the copy to user space failed. If the copy failed, it is probably because of a bad checksum. We complete the checksum if necessary, and if it is bad, we exit through the csum_error label. If there is no room left in the socket's allocation, we complete processing in the slow path at step5.

if (!eaten) {
if (tcp_checksum_complete_user(sk, skb))
goto csum_error;
if (tcp_header_len ==
(sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
tp->rcv_nxt == tp->rcv_wup)
tcp_store_ts_recent(tp);
if ((int)skb->truesize >sk->forward_alloc)
goto step5;
NET_INC_STATS_BH(TCPHPHits);

This is the receiver side of a bulk data transfer. We remove the header portion from the skb and put the data part of the segment on the receive queue.

__skb_pull(skb,tcp_header_len);
__skb_queue_tail(&sk->receive_queue, skb);
tcp_set_owner_r(skb, sk);
tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
}

Since we know we received a valid packet, the next few sections of code deal with the sending and receiving acknowledgments. Tcp_event_data_recv updates the delayed acknowledge timeout interval. Tcp_ack handles incoming ACKs.

tcp_event_data_recv(sk, tp, skb);
if (TCP_SKB_CB(skb)->ack_seq !=  tp->snd_una) {

If no ACK is scheduled after processing the incoming acknowledgment, we jump past the ACK-sending section to the no_ack label.

tcp_ack(sk, skb, FLAG_DATA);
tcp_data_snd_check(sk);
if (!tcp_ack_scheduled(tp))
goto no_ack;
}

We send an ACK if necessary.

if (eaten) {
if (tcp_in_quickack_mode(tp)) {
tcp_send_ack(sk);
} else {
tcp_send_delayed_ack(sk);
}
} else {
__tcp_ack_snd_check(sk, 0);
}

Since the data was already transferred, we delete the skb and call the data_ready callback to indicate that the socket is now ready for the next application read call.

no_ack:
if (eaten)
__kfree_skb(skb);
else
sk->data_ready(sk, 0);
return 0;
}
}

The following code section is the slow path processing of received data segments. This is where we end up if this function was called internally from the kernel's "bottom half" or if prequeue processing couldn't proceed for some reason.

slow_path:

if (len < (th->doff<<2)||tcp_checksum_complete_user(sk, skb))
goto csum_error;

We do the PAWS check for out-of-order segments by checking for the timestamp TCP option.

if (tcp_fast_parse_options(skb, th, tp) && tp->saw_tstamp &&
tcp_paws_discard(tp, skb)) {

Even if PAWS checking indicates that we have received an out-of-order segment, we must still check for an incoming RST.

if (!th->rst) {
NET_INC_STATS_BH(PAWSEstabRejected);
tcp_send_dupack(sk, skb);
goto discard;
}
}

We resume standard slow path processing of data segments as specified by RFC 793. We must check the sequence number of all incoming packets.

if (!tcp_sequence(tp, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq)) {

If the incoming segment is not acceptable, send an acknowledgment.

if (!th->rst)
tcp_send_dupack(sk, skb);
goto discard;
}

If we receive an RST, tcp_reset processes it, setting the appropriate errors on the socket, and we discard the segment.

if(th->rst) {
tcp_reset(sk);
goto discard;
}
tcp_replace_ts_recent(tp,TCP_SKB_CB(skb)->seq);
if (th->syn && !before(TCP_SKB_CB(skb)->seq,  tp->rcv_nxt)) {
TCP_INC_STATS_BH(TcpInErrs);
NET_INC_STATS_BH(TCPAbortOnSyn);
tcp_reset(sk);
return 1;
}

The fifth step for the ESTABLISHED state is to check the ACK field. The sixth step is to process the URG flag.

step5:
if(th->ack)
tcp_ack(sk, skb, FLAG_SLOWPATH);
tcp_urg(sk, skb, th);

We are in the slow path where we haven’t prequalified the segments, so we follow the steps in RFC 793. Tcp_data_queue queues up data in the socket’s normal receive queue. It puts segments that are out of order on the out_of_order_queue. For data in the socket’s normal receive queue, processing is continued when the application program executes a read system call on the open socket, which will cause tcp_recvmsg to be called.

tcp_data_queue(sk, skb);
tcp_data_snd_check(sk);
tcp_ack_snd_check(sk);
return 0;
csum_error:
TCP_INC_STATS_BH(TcpInErrs);
discard:
__kfree_skb(skb);
return 0;
}

TCP TIME_WAIT State

Once one end of a connection performs an active close, it must stay in the TIME_WAIT state for two times the maximum segment lifetime. This section describes the processing of incoming packets when the TCP connection is in this state.

At this point, it is helpful to differentiate an active close from a passive close. An active close is caused by an explicit request by this end of the connection, whereas a passive close is done when a FIN is received from the peer. The TIME_WAIT state serves several purposes. One is to prevent old segments from a closed connection that are still wandering the network from reappearing in time to be confused with segments from a new connection. The other purpose is to hold on to the connection for a time that is longer than the maximum retransmission time, long enough to allow the peer to resend a last ACK (or ACK and data segment combination) when our last ACK was lost.

One of the problems of TCP/IP when implemented in large servers is that there can be substantial memory requirements for maintaining many sockets in the TIME_WAIT state. Although this is not a feature of primary importance to embedded systems designers, it illustrates how Linux TCP/IP is designed in part to meet the needs of TCP servers that maintain hundreds or thousands of open connections. The reader will remember that the primary vehicle for holding TCP connections, receive buffers, and connection state is the sock structure. Each instance of a sock structure has memory requirements associated with it, including the sock structure itself and attached data buffers. In a very active server, it can get very expensive to hold many sockets active while waiting for numerous connections to shut down.

tcp_tw_bucket Structure

To reduce these requirements, the TIME_WAIT state processing does not use the sock structure. Instead, it uses a data structure that is smaller and has no attached receive buffers. This data structure is called the tcp_tw_bucket and it is defined in file linux/include/net/tcp.h. The tcp_tw_bucket shares the first 16 fields with the sock structure so it can use the same list maintenance pointers and functions.

struct tcp_tw_bucket {

The common part matches the sock structure.

struct sock_common __tw_common;
#define tw_family __tw_common.skc_family
#define tw_state __tw_common.skc_state
#define tw_reuse __tw_common.skc_reuse
#define tw_bound_dev_if  __tw_common.skc_bound_dev_if
#define tw_node __tw_common.skc_node
#define tw_bind_node  __tw_common.skc_bind_node
#define tw_refcnt __tw_common.skc_refcnt

Substate holds the states that are possible while processing an active close: FIN_WAIT_1, FIN_WAIT_2, CLOSING, and TIME_WAIT.

volatile unsigned char tw_substate;
unsigned char tw_rcv_wscale;

The following fields are for socket de-multiplexing of incoming packets. These five fields are in the inet_opt structure.

__u16 tw_sport;
__u32 tw_daddr __attribute__((aligned(TCP_ADDRCMP_ALIGN_BYTES)));
__u32 tw_rcv_saddr;
__u16 tw_dport;
__u16 tw_num;

The fields from here to the end are unique to tcp_tw_bucket.

int tw_hashent;
int tw_timeout;
__u32 tw_rcv_nxt;
__u32 tw_snd_nxt;
__u32 tw_rcv_wnd;
__u32 tw_ts_recent;
long tw_ts_recent_stamp;
unsigned long tw_ttd;
struct tcp_bind_bucket *tw_tb;
struct hlist_node *tw_death_node;
#if defined(CONFIG_IPV6) ||  defined(CONFIG_IPV6_MODULE)
struct in6_addr tw_v6_daddr;
struct in6_addr tw_v6_rcv_saddr;
int tw_v6_ipv6only;
#endif
} ;
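The memory argument can be illustrated with a user-space sketch. These structures are simplified stand-ins of our own invention, not the kernel definitions; the point is only that the TIME_WAIT mini-socket keeps just the demultiplexing and sequencing state, while a full sock also carries attached buffers and much more, so parking closed connections in the small structure saves considerable memory on a busy server.

```c
#include <stdint.h>

/* Illustration only - simplified stand-ins, not kernel structures. */
struct tw_mini {
    unsigned char substate;
    unsigned char rcv_wscale;
    uint16_t sport, dport, num;
    uint32_t daddr, rcv_saddr;
    uint32_t rcv_nxt, snd_nxt, rcv_wnd, ts_recent;
    long ts_recent_stamp;
    unsigned long ttd;
};

struct sock_like {
    struct tw_mini core;   /* same demultiplexing state ...        */
    char snd_buf[8192];    /* ... plus attached buffers; the real  */
    char rcv_buf[8192];    /* struct sock has many more fields.    */
};
```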

The tcp_timewait_state_process Function

Linux TCP supports fast time-wait recycling to prevent the number of connections in the TIME_WAIT state from using too many resources. Since the TIME_WAIT state can be maintained for several minutes, the number of connections in this state can grow very large. Therefore, the time-wait buckets, tcp_tw_bucket structures, are maintained in slots accessed through a hash function. In a busy TCP server, the buckets are recycled depending on the system control value, tcp_tw_recycle. The function tcp_timewait_state_process, in the file linux/net/ipv4/tcp_minisocks.c, does most of the work of processing incoming packets during TIME_WAIT. However, it also does the processing for the FIN_WAIT_2 state, which is part of the active close but comes before the TIME_WAIT state in the TCP state machine. Tcp_timewait_state_process returns an enum, tcp_tw_status, defined in file linux/include/net/tcp.h. The values of the enum are shown in Table. If you examine the function closely, you may observe that it actually repeats most of the steps done by the function tcp_rcv_state_process, described earlier in this chapter, but in abbreviated form.

enum tcp_tw_status
tcp_timewait_state_process
(struct tcp_tw_bucket *tw, struct sk_buff *skb,
struct tcphdr *th, unsigned len)
{
struct tcp_opt tp;
int paws_reject = 0;
tp.saw_tstamp = 0;

Here we check to see if the incoming packet contained a timestamp. If it does, we do the PAWS check for an out-of-order segment.

if (th->doff > (sizeof(struct tcphdr) >> 2) && tw->tw_ts_recent_stamp) {
tcp_parse_options(skb, &tp, 0);
if (tp.saw_tstamp) {
tp.ts_recent = tw->tw_ts_recent;
tp.ts_recent_stamp = tw->tw_ts_recent_stamp;
paws_reject = tcp_paws_check(&tp,  th->rst);
}
}

Similar to the tcp_rcv_state_process, we check to see if we are in the FIN_WAIT2 state. If so, we must check for an incoming FIN, ACK, or an incoming out-of-order segment. If the earlier PAWS check found that the incoming segment is out of order, we must send an ACK.

if (tw->tw_substate == TCP_FIN_WAIT2) {

An acknowledgment should be sent for an unacceptable segment, so we return TCP_TW_ACK to tell the caller to send an ACK.

if (paws_reject ||
!tcp_in_window(TCP_SKB_CB(skb)->seq,
TCP_SKB_CB(skb)->end_seq,  tw->tw_rcv_nxt,
tw->tw_rcv_nxt + tw->tw_rcv_wnd))
return TCP_TW_ACK;

If the incoming segment contains an RST, we can finally kill off the connection.

if (th->rst)
goto kill;

If we receive a new SYN, one whose sequence number is not before the next expected sequence number, we kill the connection and tell the caller to send an RST.

if (th->syn && !before(TCP_SKB_CB(skb)->seq, tw->tw_rcv_nxt))
goto kill_with_rst;

We check to see if the incoming segment has a duplicate ACK. If so, we will want to discard the segment.

if (!after(TCP_SKB_CB(skb)->end_seq,  tw->tw_rcv_nxt) ||
TCP_SKB_CB(skb)->end_seq ==  TCP_SKB_CB(skb)->seq) {
tcp_tw_put(tw);
return TCP_TW_SUCCESS;
}

If the arriving segment contains new data, we must send a reset to the peer to kill off the connection. We also stop the TIME_WAIT state timer.

if (!th->fin || TCP_SKB_CB(skb)->end_seq != tw->tw_rcv_nxt + 1) {
kill_with_rst:
tcp_tw_deschedule(tw);
tcp_tw_put(tw);
return TCP_TW_RST;
}

At this point, we know the arriving segment is a FIN and we are still in the FIN_WAIT_2 state, so we enter the actual TIME_WAIT state. We also process the received timestamp, saving the incoming timestamp value in ts_recent and marking when it was received.

tw->tw_substate = TCP_TIME_WAIT;
tw->tw_rcv_nxt =  TCP_SKB_CB(skb)->end_seq;
if (tp.saw_tstamp) {
tw->tw_ts_recent_stamp = xtime.tv_sec;
tw->tw_ts_recent = tp.rcv_tsval;
}
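The FIN_WAIT_2 decision ladder above can be condensed into a single user-space function. The names, the enum, and the flattened parameters are hypothetical, ours rather than the kernel’s; the ordering of the tests mirrors the code just shown.

```c
#include <stdint.h>

static int before32(uint32_t a, uint32_t b) { return (int32_t)(a - b) < 0; }
static int after32(uint32_t a, uint32_t b)  { return (int32_t)(b - a) < 0; }

enum fw2 { FW2_ACK, FW2_KILL, FW2_KILL_RST, FW2_DROP, FW2_TIME_WAIT };

/* Condensed model (hypothetical names) of the FIN_WAIT_2 ladder:
 * out-of-window -> ACK, RST -> kill, fresh SYN -> kill with RST,
 * duplicate ACK -> drop, unexpected data -> kill with RST, and the
 * expected FIN -> enter TIME_WAIT. */
static enum fw2 fin_wait2_classify(int paws_reject, int in_window,
                                   int rst, int syn, int fin,
                                   uint32_t seq, uint32_t end_seq,
                                   uint32_t rcv_nxt)
{
    if (paws_reject || !in_window)
        return FW2_ACK;
    if (rst)
        return FW2_KILL;
    if (syn && !before32(seq, rcv_nxt))
        return FW2_KILL_RST;
    if (!after32(end_seq, rcv_nxt) || end_seq == seq)
        return FW2_DROP;
    if (!fin || end_seq != rcv_nxt + 1)
        return FW2_KILL_RST;
    return FW2_TIME_WAIT;
}
```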

Tcp_tw_schedule is called to manage the timer, which determines the length of time the connection will remain in the TIME_WAIT state. RFC 1122 specifies that the timer should be 2MSL (two times the Maximum Segment Lifetime). If possible, Linux makes an attempt to reduce the time to an amount based on the RTO (Retransmission Timeout). The comments contain a note apologizing for the IPv4-specific code, but it is OK because the IPv6 implementation, unlike IPv4, doesn’t support fast time-wait recycling. If we are part of an IPv6 connection, we pass a constant value of one minute to tcp_tw_schedule; otherwise, we pass in the RTO value, which could be less.

if (tw->tw_family == AF_INET &&
sysctl_tcp_tw_recycle &&  tw->tw_ts_recent_stamp &&
tcp_v4_tw_remember_stamp(tw))
tcp_tw_schedule(tw, tw->tw_timeout);
else
tcp_tw_schedule(tw, TCP_TIMEWAIT_LEN);
return TCP_TW_ACK;
}
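The timeout choice can be modeled in user space as follows. The function and constant names are our own; the value 60 stands in for TCP_TIMEWAIT_LEN, the kernel’s practical substitute for a literal 2MSL wait, and the recycling branch assumes the conditions shown above (tcp_tw_recycle set and a usable per-peer timestamp) have already been checked.

```c
/* Simplified model (our names): with fast recycling in effect the wait
 * can shrink toward the retransmission timeout; otherwise the
 * connection sits for the full TIME_WAIT period. */
#define TIMEWAIT_SECS 60   /* stand-in for TCP_TIMEWAIT_LEN */

static int tw_timeout_secs(int recycle, int rto_secs)
{
    if (recycle && rto_secs > 0 && rto_secs < TIMEWAIT_SECS)
        return rto_secs;       /* fast recycling: roughly the RTO */
    return TIMEWAIT_SECS;      /* full 2MSL-style wait */
}
```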

Here we enter the “real” TIME_WAIT state. If we receive a SYN on a connection in the TIME_WAIT state, we may reopen the connection. However, we must assign the initial sequence number of the new connection a value larger than the maximum sequence number used in the previous connection. We must also remain in the TIME_WAIT state if the incoming SYN is a duplicate of an old one from the previous connection.

if (!paws_reject &&
(TCP_SKB_CB(skb)->seq == tw->tw_rcv_nxt  &&
(TCP_SKB_CB(skb)->seq ==TCP_SKB_CB(skb)->end_seq || th->rst))) {

The value of zero in paws_reject indicates that the incoming segment is inside the window; therefore, it is either a RST or an ACK with no data.

if (th->rst) {

It is possible that this incoming RST will result in TIME_WAIT Assassination (TWA). The system control, sysctl_tcp_rfc1337, can be set to prevent TWA. Linux TCP does not prevent TWA as its default behavior, so if the system control is not set, we kill off the connection.

if (sysctl_tcp_rfc1337 == 0) {
kill:
tcp_tw_deschedule(tw);
tcp_tw_put(tw);
return TCP_TW_SUCCESS;
}
}

The incoming segment must be a duplicate ACK, so we discard it. We also update the timer by calling tcp_tw_schedule.

tcp_tw_schedule(tw, TCP_TIMEWAIT_LEN);
if (tp.saw_tstamp) {
tw->tw_ts_recent = tp.rcv_tsval;
tw->tw_ts_recent_stamp = xtime.tv_sec;
}
tcp_tw_put(tw);
return TCP_TW_SUCCESS;
}

If we reached here, the PAWS test must have failed, so we have either an out-of-window segment or a new SYN. All out-of-window segments are acknowledged immediately. To accept a new SYN, it must not be an old duplicate. The check mandated by RFC 793 only works at slower network speeds, less than 40 Mbit/second. Although the PAWS check is sufficient, so we don’t really need to check the sequence numbers, we still perform the mandated sequence number check.

if (th->syn && !th->rst  && !th->ack && !paws_reject &&
(after(TCP_SKB_CB(skb)->seq,  tw->tw_rcv_nxt) ||
(tp.saw_tstamp &&  (s32)(tw->tw_ts_recent - tp.rcv_tsval) < 0))) {
u32 isn = tw->tw_snd_nxt+65535+2;
if (isn == 0)
isn++;
TCP_SKB_CB(skb)->when = isn;
return TCP_TW_SYN;
}
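The ISN computation just shown is simple enough to test in isolation. This is a sketch with a name of our own; it reproduces the arithmetic above, jumping 65537 past the old connection’s snd_nxt so every sequence number of the new incarnation is provably newer, and stepping past 0 since the code treats an ISN of 0 specially.

```c
#include <stdint.h>

/* Sketch (our name) of the ISN chosen when a TIME_WAIT connection is
 * reborn by a new SYN: old snd_nxt + 65535 + 2, avoiding 0. The
 * unsigned arithmetic wraps naturally at 2^32. */
static uint32_t reborn_isn(uint32_t old_snd_nxt)
{
    uint32_t isn = old_snd_nxt + 65535 + 2;
    if (isn == 0)
        isn++;
    return isn;
}
```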
if (paws_reject)
NET_INC_STATS_BH(PAWSEstabRejected);
if(!th->rst) {

We reset the TIME_WAIT state timer, but only if the incoming segment was an ACK or out of the window.

if (paws_reject || th->ack)
tcp_tw_schedule(tw, TCP_TIMEWAIT_LEN);

We tell the caller to acknowledge the bad segment.

return TCP_TW_ACK;
}

We must have received a RST, so we kill the connection.

tcp_tw_put(tw);
return TCP_TW_SUCCESS;
}

TCP Socket-Level Receive

When the user task is signaled that there is data waiting on an open socket, it calls one of the receive or read system calls on the open socket. These functions are translated at the socket layer into a call to tcp_recvmsg in file linux/net/ipv4/tcp.c. Tcp_recvmsg copies data from an open socket into a user buffer. Comments in the code state that starting with Linux kernel version 2.3, the socket is locked.

int tcp_recvmsg(struct sock *sk, struct msghdr *msg,
int len, int nonblock, int flags, int  *addr_len)
{
struct tcp_opt *tp = tcp_sk(sk);
int copied = 0;
u32 peek_seq;
u32 *seq;
unsigned long used;
int err;

Target is set to the minimum number of bytes that this function should return.

int target;
long timeo;
struct task_struct *user_recv = NULL;
lock_sock(sk);
TCP_CHECK_TIMER(sk);
err = -ENOTCONN;
if (sk->sk_state == TCP_LISTEN)
goto out;
timeo = sock_rcvtimeo(sk, nonblock);

If the MSG_OOB flag is set, urgent data (incoming segments with the URG flag) is handled specially. At entry, seq is initialized to the next byte to be read because the copied_seq field of the tcp_opt structure contains the last byte that has been processed.

if (flags & MSG_OOB)
goto recv_urg;
seq = &tp->copied_seq;
if (flags & MSG_PEEK) {
peek_seq = tp->copied_seq;
seq = &peek_seq;
}

The number of bytes to read, target, is set to the low-water mark for the socket, sk->rcvlowat or len, whichever is less. The MSG_ WAITALL flag indicates whether this call will block for target number of bytes.

target = sock_rcvlowat(sk, flags &  MSG_WAITALL, len);
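The computation of target can be modeled as follows. This is a user-space sketch with our own name for the helper; it mirrors the logic of sock_rcvlowat as described: MSG_WAITALL asks for the full request, otherwise the socket’s low-water mark caps it, and the result is never less than 1.

```c
/* Simplified model (our name) of sock_rcvlowat(): the minimum number
 * of bytes the read should return before waking the caller. */
static int recv_target(int waitall, int rcvlowat, int len)
{
    int t = waitall ? len : (rcvlowat < len ? rcvlowat : len);
    return t ? t : 1;   /* always wait for at least one byte */
}
```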

This do while loop is the main loop of tcp_recvmsg. In this loop, we will continue to copy bytes to the user until target bytes is reached or some other exception condition is detected while processing the incoming data segments.

do {
struct sk_buff * skb;
u32 offset;

If after copying some data, we encounter urgent data while processing segments, we stop processing.

if (copied && tp->urg_data  && tp->urg_seq == *seq)
break;

Here we check to see if there is a signal pending on this socket to ensure the correct handling of the SIGURG signal. The comment in the code states: “FIXME: Need to check this doesn’t impact 1003.1g and move it down to the bottom of the loop.” Next, we check to see if the socket has timed out, in which case we return an error.

if (signal_pending(current)) {
if (copied)
break;
copied = timeo ? sock_intr_errno(timeo) :  -EAGAIN;
break;
}

We get a pointer to the first buffer on the receive queue, and in the inner do while loop we walk through the receive queue until we find the first data segment. When we find the segment and know how many bytes to copy, we jump to found_ok_skb. Along the way, we calculate the number of bytes to copy (using the variable offset) from the skb.

skb = skb_peek(&sk->sk_receive_queue);

In this inner loop we keep examining packets until we find a valid data segment. Along the way, we check for FIN and SYN. If we see a SYN, we adjust the number of bytes to be copied by subtracting one from offset. A FIN drops us out of the loop.

do {
if (!skb)
break;

Now we check that the current byte to be processed is not before the first byte of the segment at the head of the queue, to catch the case where we somehow got out of synchronization while processing the queue of packets. This is actually a redundant check because socket locking and multiple queues should prevent us from getting lost.

if (before(*seq, TCP_SKB_CB(skb)->seq)) {
printk(KERN_INFO "recvmsg bug: copied %X seq %X\n",
*seq, TCP_SKB_CB(skb)->seq);
break;
}
offset = *seq - TCP_SKB_CB(skb)->seq;
if (skb->h.th->syn)
offset--;
if (offset < skb->len)
goto found_ok_skb;
if (skb->h.th->fin)
goto found_fin_ok;
BUG_TRAP(flags&MSG_PEEK);
skb = skb->next;
} while (skb != (struct sk_buff  *)&sk->sk_receive_queue);
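The offset calculation in the loop above can be isolated into a small sketch. The function name is ours; it reproduces the arithmetic shown: the read position is copied_seq minus the segment’s starting sequence number, less one for a SYN because a SYN occupies sequence space but carries no data.

```c
#include <stdint.h>

/* Sketch (our name) of locating the read position inside a queued
 * segment: copied_seq - skb_seq, minus one byte if the segment carried
 * a SYN, which consumes a sequence number without carrying data. */
static uint32_t data_offset(uint32_t copied_seq, uint32_t skb_seq, int syn)
{
    uint32_t offset = copied_seq - skb_seq;
    if (syn)
        offset--;
    return offset;
}
```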

If we get here, it means that we found nothing in the socket receive queue. If we have packets in the backlog queue, we try to process that, too.

if (copied >= target &&  sk->backlog.tail == NULL)
break;

Now we must make a few checks to see if we need to stop processing packets. We check for an error condition on the socket, whether the socket is closed, and whether we received a shutdown request from the peer.

if (copied) {
if (sk->sk_err ||
sk->sk_state == TCP_CLOSE ||
(sk->sk_shutdown & RCV_SHUTDOWN) ||
!timeo ||
(flags & MSG_PEEK))
break;
} else {
if (sock_flag(sk, SOCK_DONE))
break;
if (sk->sk_err) {
copied = sock_error(sk);
break;
}
if (sk->sk_shutdown & RCV_SHUTDOWN)
break;
if (sk->sk_state == TCP_CLOSE) {

Done is set on a socket when a user closes, so normally it is nonzero when the connection state is CLOSED. If done is zero on a CLOSED socket, it means that an application program is trying to read from a socket that has never been connected, which, of course, is an error condition.

if (!sock_flag(sk, SOCK_DONE)) {
copied = -ENOTCONN;
break;
}
break;
}
if (!timeo) {
copied = -EAGAIN;
break;
}
}
cleanup_rbuf(sk, copied);

We are here because there are no more segments on the receive queue left to process. We will process any packets on the prequeue that have been pre-qualified for fast path processing. Previously, header prediction has indicated that there are likely to be data segments received when the connection state is ESTABLISHED. The prequeue packets are processed by the user task instead of in the context of the "bottom half." Ucopy.task is set to current, which forces the prequeue segments to be copied from the prequeue later by the user task, current.

if (tp->ucopy.task == user_recv) {

Here we install a new reader task.

if (!user_recv&&  !(flags&(MSG_TRUNC|MSG_PEEK))) {
user_recv = current;
tp->ucopy.task = user_recv;
tp->ucopy.iov = msg->msg_iov;
}
tp->ucopy.len = len;
BUG_TRAP(tp->copied_seq == tp->rcv_nxt  ||
(flags&(MSG_PEEK|MSG_TRUNC)));

If the prequeue is not empty, it must be processed before releasing the socket. If this is not done, the segment order will be damaged at the next iteration through this loop. The packet processing order on the receive side can be thought of as consisting of a series of four pseudo-queues, packets in flight, backlog, prequeue, and the normal receive queue. Each of these queues can be processed only if the packets ahead of it have already been processed. The receive queue is now empty, but the prequeue could have had segments added to it when the socket was released in the last iteration through this loop. The prequeue is processed at the do_prequeue label.

if  (skb_queue_len(&tp->ucopy.prequeue))
goto do_prequeue;
}
if (copied >= target) {

Now, the backlog is processed to see if it is possible to do any direct copying of packets on that queue. At this point, we have processed the prequeue. Release_sock walks through all the packets on the backlog queue, sk->backlog, before waking up any tasks waiting on the socket.

release_sock(sk);
lock_sock(sk);
} else {

If there is no more data to copy and absolutely nothing more to do, we sit and wait for more data. The function tcp_data_wait puts the socket (and therefore the calling task) in the wait state, TASK_INTERRUPTIBLE. Daniel Bovet and Marco Cesati have an excellent discussion of Linux process scheduling policy.

timeo = tcp_data_wait(sk, timeo);
}
if (user_recv) {
int chunk;

We account for any data directly copied from the backlog queue in the previous step. We also return the scheduler to its normal state.

if ((chunk = len - tp->ucopy.len) != 0) {
NET_ADD_STATS_USER (TCPDirectCopyFromBacklog, chunk);
len -= chunk;
copied += chunk;
}
if (tp->rcv_nxt == tp->copied_seq  &&
skb_queue_len(&tp->ucopy.prequeue)) {
do_prequeue:

This is where we jumped to process any packets on the prequeue. The function tcp_prequeue_process does the work. After it returns, we adjust chunk to account for any data copied from the prequeue.

tcp_prequeue_process(sk);
if ((chunk = len - tp->ucopy.len) != 0) {
NET_ADD_STATS_USER  (TCPDirectCopyFromPrequeue, chunk);
len -= chunk;
copied += chunk;
}
}
}
if ((flags & MSG_PEEK) &&  peek_seq != tp->copied_seq) {
if (net_ratelimit())
printk(KERN_DEBUG "TCP(%s:%d): Application bug, race in MSG_PEEK.\n",
current->comm, current->pid);
peek_seq = tp->copied_seq;
}
continue;

This is where we jumped from the previous inner loop when we found a data segment on the receive queue. We figure out how much data we have to copy from the len field of skb and offset calculated earlier.

found_ok_skb:
used = skb->len - offset;
if (len < used)
used = len;

We must check for urgent data. Unless the socket option SO_OOBINLINE was set (indicated by the SOCK_URGINLINE flag of the sock structure), we skip over the urgent data because it is processed separately.

if (tp->urg_data) {
u32 urg_offset = tp->urg_seq - *seq;
if (urg_offset < used) {
if (!urg_offset) {
if (!sock_flag(sk, SOCK_URGINLINE)) {
++*seq;
offset++;
used--;
if (!used)
goto skip_copy;
}
} else
used = urg_offset;
}
}
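The trimming logic above can be modeled as a small pure function. The name and out-parameter are our own simplification; it mirrors the code just shown, stopping the copy just before the urgent byte, or stepping over it when it is the very next byte and SO_OOBINLINE is off.

```c
#include <stdint.h>

/* Simplified model (our names) of how tcp_recvmsg trims a copy around
 * the urgent byte when SO_OOBINLINE is off. *skip reports how many
 * bytes the read position must advance without copying. */
static uint32_t trim_for_urgent(uint32_t used, uint32_t urg_offset,
                                int inline_oob, uint32_t *skip)
{
    *skip = 0;
    if (urg_offset < used) {
        if (urg_offset == 0) {
            if (!inline_oob) {
                *skip = 1;         /* step over the urgent byte */
                used--;
            }
        } else {
            used = urg_offset;     /* copy only up to the urgent byte */
        }
    }
    return used;
}
```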

This is where we copy the data to user space. If we get an error while copying, we return an EFAULT.

if (!(flags&MSG_TRUNC)) {
err = skb_copy_datagram_iovec(skb, offset, msg->msg_iov, used);
if (err) {

This is an exception condition.

if (!copied)
copied = -EFAULT;
break;
}
}
*seq += used;
copied += used;
len -= used;
skip_copy:
if (tp->urg_data && after(tp->copied_seq,tp->urg_seq))  {
tp->urg_data = 0;

Now that we are done processing urgent data, we turn on the fast path (set TCP header prediction). Fast path processing was turned off if input processing encountered a segment with urgent data (URG flag on) and a valid urgent pointer field in the TCP header.

tcp_fast_path_check(sk, tp);
}
if (used + offset < skb->len)
continue;
if (skb->h.th->fin)
goto found_fin_ok;
if (!(flags & MSG_PEEK))
tcp_eat_skb(sk, skb);
continue;

This is where we jumped if we found a packet containing a FIN in the receive queue. RFC 793 says that we must count the FIN as one byte in the sequence and TCP window calculation.

found_fin_ok:
++*seq;
if (!(flags & MSG_PEEK))
tcp_eat_skb(sk, skb);
break;
} while (len > 0);

We are at the end of the outer while loop, processing socket buffers until we have copied the amount of data, len, requested by the caller in the application program. If we dumped out of the loop, leaving any data on the prequeue, it must be processed now before getting out.

if (user_recv) {
if  (skb_queue_len(&tp->ucopy.prequeue)) {
int chunk;
tp->ucopy.len = copied > 0 ? len : 0;
tcp_prequeue_process(sk);
if (copied > 0 && (chunk = len -  tp->ucopy.len) != 0) {
NET_ADD_STATS_USER  (TCPDirectCopyFromPrequeue, chunk);
len -= chunk;
copied += chunk;
}
}
tp->ucopy.task = NULL;
tp->ucopy.len = 0;
}

Cleanup_rbuf cleans up the TCP receive buffer. It will send an ACK if necessary.

cleanup_rbuf(sk, copied);
TCP_CHECK_TIMER(sk);
release_sock(sk);
return copied;

We are done, so release the socket and get out.

out:
TCP_CHECK_TIMER(sk);
release_sock(sk);
return err;

We jumped here if we encountered urgent data while processing segments. Tcp_recv_urg copies the urgent data to the user.

recv_urg:
err = tcp_recv_urg(sk, timeo, msg, len,  flags, addr_len);
goto out;
}

Receiving Urgent Data

Urgent data is data received in a segment that has the URG TCP flag and a valid urgent pointer. Urgent data, also known as Out-of-Band (OoB) data, is handled separately from the normal data in the data segments. Theoretically, it is handled as a higher priority and gets passed up to the socket as OoB data. The function tcp_recv_urg in file linux/net/ipv4/tcp.c is called from tcp_recvmsg when a segment containing urgent data is encountered while processing the stream of data segments.

static int tcp_recv_urg(struct sock * sk,long timeo,
struct msghdr *msg, int len, int flags,
int *addr_len)
{
struct tcp_opt *tp = tcp_sk(sk);

The SOCK_URGINLINE flag in the sock structure is set from the SO_OOBINLINE socket option, which states that urgent data should be handled as if it were ordinary segment data. If this socket option was set, it is an error because we are in the function that is supposed to process the urgent data specially.

if (sock_flag(sk, SOCK_URGINLINE)||!tp->urg_data ||
tp->urg_data == TCP_URG_READ)
return -EINVAL;
if (sk->sk_state == TCP_CLOSE && !sock_flag(sk, SOCK_DONE))
return -ENOTCONN;

Now that we have survived the initial steps, we copy the urgent data into the user’s buffer.

if (tp->urg_data & TCP_URG_VALID) {
int err = 0;
char c = tp->urg_data;
if (!(flags & MSG_PEEK))
tp->urg_data = TCP_URG_READ;

Setting the MSG_OOB flag tells the application that urgent data has been received. Memcpy_toiovec does the actual copying.

msg->msg_flags|=MSG_OOB;
if(len>0) {
if (!(flags & MSG_TRUNC))
err = memcpy_toiovec(msg->msg_iov, &c,  1);
len = 1;
} else
msg->msg_flags|=MSG_TRUNC;
return err ? -EFAULT : len;
}
if (sk->sk_state == TCP_CLOSE || (sk->sk_shutdown & RCV_SHUTDOWN))
return 0;

We should not block in this call, regardless of the blocking state of the socket. Other implementations, including BSD, behave the same way.

return -EAGAIN;
}
