Sending Data From a Socket via TCP - Linux

The best way to describe the implementation of a complex protocol like TCP is to follow the data as it flows through the protocol. The remainder of this chapter examines the send-side processing of TCP. Keep in mind that TCP is probably the most complex part of the TCP/IP protocol suite. Because TCP provides a connection-oriented service at the transport layer, it is far more complicated than UDP. TCP must manage the relationship between the local host and the remote host, including retransmitting corrupted and lost packets, managing the connection state machine, buffering the data, and setting up and tearing down the connection. As will be seen, it does much more than this. Linux TCP implements the enhancements for security, reliability, and performance that have been developed in the more than 20 years since TCP was first specified in RFC 793.

This section focuses primarily on the connection with the socket: how data is taken from the application and placed in queues for transmission. A later section discusses the actual packet transmission, the data structures used to keep control variables and configuration options, and the TCP timers used primarily on the send side of the connection. The figure below shows the TCP connection state for the sending side.

TCP send state diagram


Let's look at how the send side of the protocol handles write requests passed from the socket layer. Once an application layer program opens a socket of type SOCK_STREAM, the TCP protocol is invoked to process all write requests for data transfers through the open socket. The transmission of data through TCP is controlled primarily by the state of the connection with the peer and the availability of data in the send buffer. Because of the asynchronous nature of TCP, actual data transmission is independent of the application layer, which simply writes data into the socket layer buffers. In general, TCP gathers the data into segments and passes them to the peer machine when there is available bandwidth in the network. This is different from UDP, where the user-level process controls transmission: with UDP, a write of data to the socket results directly in the transmission of a datagram.

When any of the socket layer write functions is invoked by the application, TCP copies the data into a list of socket buffers, which are queued for later transmission. Most of the work associated with writing to a socket involves determining the type of socket buffer and managing the queue of buffers. TCP must determine whether the transmission interface has scatter-gather capability, otherwise known as chained DMA. If the interface has scatter-gather support, TCP sets up a transfer of a chain of buffers to the networking device. In this case, TCP uses the fragment list in the shared info structure of the skb. In addition, TCP maintains a queue of socket buffers, so the send function must also determine whether there is room for more data in the current socket buffer or whether a new one must be allocated.

The Tcp_sendmsg Function

The tcp_sendmsg function, defined in file linux/net/ipv4/tcp.c, is invoked when any user-level write or message sending function is called on an open SOCK_STREAM type socket. All the write functions are converted to calls to this function at the socket layer. The best way to show the operation of TCP on the send side is to examine this function, so we will follow it to see how it processes the data.
The socket layer is able to find tcp_sendmsg because it is referenced by the sendmsg field of the protocol block structure, prot, which is initialized at compile time in the file linux/net/ipv4/tcp_ipv4.c. Tcp_sendmsg collects all the TCP header information possible, copies the data into socket buffers, and queues the socket buffers for transmission. It makes heavy use of the TCP options structure. Tcp_sendmsg also sets many fields in the TCP control block structure, which is used to pass TCP header information to the transmission side of the TCP protocol.

The variable iov is used to retrieve the IO vector pointer from the message header structure passed in the argument msg. Tp points to the TCP options that will be retrieved from the sock structure sk. Skb is a pointer to a socket buffer that will be allocated to hold the data to be transmitted. Iovlen is set to the number of elements in the iovec.

struct iovec *iov;
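To make the msghdr and iovec relationship concrete, here is a small user-space sketch of the structure that a sendmsg call hands to the kernel. The helper names build_msg and total_iov_len are hypothetical and exist only for this illustration; struct msghdr and struct iovec are the standard definitions from sys/socket.h and sys/uio.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Hypothetical helper: describe two user buffers with a single msghdr. */
void build_msg(struct msghdr *msg, struct iovec iov[2],
               char *hdr, size_t hdr_len, char *body, size_t body_len)
{
    assert(msg != NULL && iov != NULL);
    memset(msg, 0, sizeof(*msg));
    iov[0].iov_base = hdr;
    iov[0].iov_len  = hdr_len;
    iov[1].iov_base = body;
    iov[1].iov_len  = body_len;
    msg->msg_iov    = iov;
    msg->msg_iovlen = 2;
}

/* Hypothetical helper: walk the iovec array one element at a time,
 * the same way the send loop consumes it, and add up the lengths. */
size_t total_iov_len(const struct msghdr *msg)
{
    const struct iovec *iov = msg->msg_iov;
    size_t iovlen = msg->msg_iovlen;
    size_t total = 0;

    while (iovlen-- > 0) {
        total += iov->iov_len;
        iov++;
    }
    return total;
}
```

An application would pass such a msghdr to sendmsg on a connected SOCK_STREAM socket; inside the kernel, iov and iovlen are taken from the same two fields.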

The TCP options are retrieved from the sock structure and the sock structure is locked.

Mss_now holds the current maximum segment size (MSS) for this open socket.

The value of the SO_SNDTIMEO option is put in timeo unless the MSG_DONTWAIT flag was set in flags, in which case the timeout is zero.

timeo = sock_sndtimeo(sk, flags&MSG_DONTWAIT);

The next thing tcp_sendmsg does is wait for a connection to be established before sending the packet. The state of the connection for this open socket is checked, and if the connection is not already in the TCPF_ESTABLISHED or TCPF_CLOSE_WAIT state, TCP is not ready to send data so it must wait for a connection to be established. The value of the timeout set with the SO_SNDTIMEO socket option is passed into the wait_for_tcp_connect function.
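The TCPF_ names are bit masks, one bit per connection state, so a single mask operation can test for several states at once. The sketch below is a stand-alone model of that test, not the kernel source. The state numbering is reproduced from memory of the Linux headers and should be checked against your kernel tree, and the function name must_wait_for_connect is hypothetical.

```c
#include <assert.h>

/* Connection states, numbered as we recall them from the Linux headers
 * (an assumption; verify against your kernel tree). */
enum tcp_state_model {
    TCP_ESTABLISHED = 1,
    TCP_SYN_SENT,
    TCP_SYN_RECV,
    TCP_FIN_WAIT1,
    TCP_FIN_WAIT2,
    TCP_TIME_WAIT,
    TCP_CLOSE,
    TCP_CLOSE_WAIT,
    TCP_LAST_ACK,
    TCP_LISTEN,
    TCP_CLOSING
};

/* Each TCPF_ flag is the bit whose position equals the state number. */
#define TCPF_ESTABLISHED (1 << TCP_ESTABLISHED)
#define TCPF_CLOSE_WAIT  (1 << TCP_CLOSE_WAIT)

/* Hypothetical helper: nonzero when data cannot be sent yet and the
 * caller must first wait for the connection to be established. */
int must_wait_for_connect(int state)
{
    assert(state >= TCP_ESTABLISHED && state <= TCP_CLOSING);
    return ((1 << state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) != 0;
}
```

CLOSE_WAIT is allowed because the peer has only closed its own sending direction; the local side may still transmit.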

The SOCK_ASYNC_NOSPACE bit in flags is cleared to indicate that this socket is not currently waiting for more memory.

clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);

Mss_now is set to the current MSS for this socket. The function tcp_current_mss takes into account MTU discovery, the negotiated MSS, and the TCP_MAXSEG option.

mss_now = tcp_current_mss(sk);
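As a rough illustration of how those three inputs could combine, the sketch below takes the smallest of the limits. It is a deliberately simplified model and not the kernel's tcp_current_mss: it assumes IPv4 and TCP headers without options (20 bytes each), and the function and parameter names are hypothetical.

```c
#include <assert.h>

#define IPV4_HDR_LEN 20u   /* assumption: IPv4 header without options */
#define TCP_HDR_LEN  20u   /* assumption: TCP header without options  */

/* Simplified model: the effective MSS is the smallest of
 *   - the path MTU minus the IP and TCP headers,
 *   - the MSS announced by the peer,
 *   - the user's TCP_MAXSEG cap (0 means "not set"). */
unsigned int model_current_mss(unsigned int path_mtu,
                               unsigned int peer_mss,
                               unsigned int user_mss)
{
    unsigned int mss;

    assert(path_mtu > IPV4_HDR_LEN + TCP_HDR_LEN);
    mss = path_mtu - IPV4_HDR_LEN - TCP_HDR_LEN;
    if (peer_mss < mss)
        mss = peer_mss;
    if (user_mss != 0 && user_mss < mss)
        mss = user_mss;
    return mss;
}
```

With an Ethernet path MTU of 1500 and no other limit this yields 1460; a path MTU of 576 lowers the result to 536 regardless of what the peer announced.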

Now, set up to send the data. Get iov and iovlen from the msghdr structure. Get ready to start copying the data into socket buffers.

If an error is pending on the socket, or if the application called shutdown with how set to one (SHUT_WR), which requests a shutdown of the send side of the socket, tcp_sendmsg gets out.

if (sk->err || (sk->shutdown&SEND_SHUTDOWN)) goto do_error;
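The link between a how value of one and the SEND_SHUTDOWN bit is easy to miss. As far as we recall, the socket layer adds one to how before storing it, so that bit 0 means receive shutdown and bit 1 means send shutdown. The following stand-alone sketch models that mapping and the test above; the helper names are hypothetical and the bit values are stated from memory of the kernel headers.

```c
#include <assert.h>

#define RCV_SHUTDOWN  1   /* assumed value: receive side shut down */
#define SEND_SHUTDOWN 2   /* assumed value: send side shut down    */

/* Hypothetical helper: adding one maps the shutdown(2) argument to bits:
 * SHUT_RD (0) -> 1, SHUT_WR (1) -> 2, SHUT_RDWR (2) -> 3. */
int shutdown_bits(int how)
{
    assert(how >= 0 && how <= 2);
    return how + 1;
}

/* Mirrors the test in tcp_sendmsg: give up if an error is pending
 * or the send side has been shut down. */
int send_must_fail(int sk_err, int sk_shutdown)
{
    return sk_err != 0 || (sk_shutdown & SEND_SHUTDOWN) != 0;
}
```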

The IO vector structure is the mechanism where data is retrieved from user space into kernel space. Iovlen is the number of iov buffers queued up by the socket send call.

while (--iovlen >= 0) {

Seglen is the length of each iov, and from points to the data to be copied.

If send_head is null, there are not yet any segments queued for sending. Mss_now is the current negotiated MSS.

if (tp->send_head == NULL || (copy = mss_now - skb->len) <= 0) {

If send_head is NULL, it is necessary to allocate a new segment. Tcp_alloc_pskb is used as a TCP-specific wrapper for alloc_skb. It allocates a paged socket buffer. Paged socket buffers include the socket buffer shared info and are used for efficient handling of TCP segments and IP fragments. They are particularly suitable for network interfaces that have scatter-gather DMA capability that can send a chain of buffers via DMA with minimum overhead. Tcp_sendmsg calls select_size to round off the requested size to the nearest TCP segment size or page size determined from the MSS. Skb_entail sets up the flags and sequence numbers in the TCP control block and puts the new skb on the write queue. Copy is the amount of data to copy to the segment. It is set to mss_now, the most recent negotiated value of MSS, which is the best indicator of segment size at this point.
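The interplay of the outer iov loop, seglen, copy, and mss_now can be seen in a toy model. The sketch below tracks only the fill level of each segment and ignores memory allocation, paged buffers, and sequence numbers; struct seg_queue and queue_iov are hypothetical names, and an empty queue stands in for send_head being NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

#define MAX_SEGS 16

/* Toy stand-in for the write queue: only the fill level of each segment. */
struct seg_queue {
    size_t len[MAX_SEGS];
    int    count;          /* 0 plays the role of send_head == NULL */
};

/* Pack the data described by iov[] into segments of at most mss bytes.
 * Returns the number of bytes queued. */
size_t queue_iov(struct seg_queue *q, const struct iovec *iov,
                 int iovlen, size_t mss)
{
    size_t copied = 0;

    assert(q != NULL && mss > 0);
    while (--iovlen >= 0) {
        size_t seglen = iov->iov_len;   /* bytes left in this iov */
        iov++;

        while (seglen > 0) {
            size_t copy;

            /* Nothing queued yet, or the last segment is full:
             * start a new segment. */
            if (q->count == 0 || q->len[q->count - 1] >= mss) {
                if (q->count == MAX_SEGS)
                    return copied;      /* toy queue is full */
                q->len[q->count++] = 0;
            }
            copy = mss - q->len[q->count - 1];  /* room left in segment */
            if (copy > seglen)
                copy = seglen;
            q->len[q->count - 1] += copy;
            seglen -= copy;
            copied += copy;
        }
    }
    return copied;
}
```

Two 1000-byte iovs with an MSS of 1460 end up as one full 1460-byte segment and one 540-byte segment, which shows that iov boundaries and segment boundaries are independent.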

Check to see whether we can use the hardware checksum capability.

Copying the Data from User Space to the Socket Buffer

If possible, tcp_sendmsg tries to squeeze the data into the header portion of the skb before allocating a new segment, so it tries to determine the actual address where the data will be copied.

. . .
if (copy > seglen)
    copy = seglen;
if (skb_tailroom(skb) > 0) {

Skb_tailroom returns the amount of room at the end of the skb, and copy is set to the amount of available space if there is any. Skb_add_data copies the data to the skb and calculates the checksum while it is doing the copy. The partial checksum in progress is placed in the csum field of the socket buffer, skb. The chapter "Linux Memory Allocation and Skbuffs" includes a table describing the socket buffer structure.

If there was no tailroom in the main part of the socket buffer, skb, we try to find room in the last fragment attached to skb. If there is no room in the fragment, we allocate a new page and attach it by putting a pointer to it at the end of the frags array in the shared info part of the skb. Merge is set to one if there is any room in the last page, and i is set to the number of frags already in the socket buffer. Page is declared to point to a memory-mapped page, and off is set to the offset to the start of the data to be copied from the msghdr structure. TCP_PAGE is used several times in the tcp_sendmsg function. It is a macro that updates the sndmsg_page field in the tcp_options structure to hold a pointer to the current page in the frags array. The macro TCP_OFF, also defined in linux/net/ipv4/tcp.c, updates the sndmsg_off field with an updated offset into the page.

Both these macros are defined in tcp.c.

Tcp_sendmsg checks to see if there is space in the last page in the socket buffer by calling can_coalesce, which returns a one if there is room. Next, the function determines if a new page is needed or a complete new socket buffer must be allocated. To do this, tcp_sendmsg checks if the frag slots in the skb are full or whether the outgoing interface has scatter-gather capability. SG means that the interface hardware can efficiently exploit socket buffers with attached pages by doing segmented DMA.

if (can_coalesce(skb, i, page, off) && off != PAGE_SIZE) {
    merge = 1;

We check to see if there is any reason why we can’t add a new fragment.

} else if (i == MAX_SKB_FRAGS || (!i && !(sk->route_caps&NETIF_F_SG))) {

Perhaps all the page slots are used or the interface is not capable of scatter-gather DMA. In either case, we know that we can't add more data to the current segment. Therefore, we call tcp_mark_push to set the TCPCB_FLAG_PSH flag in the control buffer of the current skb. As a result, the PSH flag will be set in the TCP header of the current skb, and tcp_sendmsg jumps to the new_segment label to allocate a new skb.

    tcp_mark_push(tp, skb);
    goto new_segment;
} else if (page) {

If the page has been allocated, off is aligned to the page boundary. Then, as an extra error check, off is checked for validity, and if it isn't valid, the page is freed and removed from the skb.
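The three-way decision among merging, starting a new segment, and attaching a page can be modeled outside the kernel. The sketch below is our reading of that logic with simplified stand-in types; model_skb, model_can_coalesce, choose_action, and MODEL_MAX_FRAGS are hypothetical names, and the coalescing rule (same page, and the new data starts exactly where the last fragment ends) is an assumption based on the description above.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_FRAGS 4          /* stand-in for MAX_SKB_FRAGS */

struct model_frag {
    const void *page;
    size_t      page_offset;
    size_t      size;
};

struct model_skb {
    struct model_frag frags[MODEL_MAX_FRAGS];
    int nr_frags;
};

enum frag_action { FRAG_MERGE, FRAG_NEW_SEGMENT, FRAG_ADD_PAGE };

/* New data can be merged only if it lands on the same page as the last
 * fragment and starts exactly where that fragment ends. */
int model_can_coalesce(const struct model_skb *skb,
                       const void *page, size_t off)
{
    int i = skb->nr_frags;

    if (i > 0) {
        const struct model_frag *frag = &skb->frags[i - 1];
        return page == frag->page &&
               off == frag->page_offset + frag->size;
    }
    return 0;
}

/* has_sg models the NETIF_F_SG bit in the route capabilities. */
enum frag_action choose_action(const struct model_skb *skb, const void *page,
                               size_t off, size_t page_size, int has_sg)
{
    int i = skb->nr_frags;

    assert(i >= 0 && i <= MODEL_MAX_FRAGS);
    if (model_can_coalesce(skb, page, off) && off != page_size)
        return FRAG_MERGE;             /* append to the last fragment */
    if (i == MODEL_MAX_FRAGS || (i == 0 && !has_sg))
        return FRAG_NEW_SEGMENT;       /* mark push, allocate a new skb */
    return FRAG_ADD_PAGE;              /* attach another page */
}
```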

A new page is allocated if necessary, and copy, the length of data, is corrected for the page size.

if (!page) {

Allocate the new cache page.

Finally, tcp_sendmsg is ready to copy the data from user space to kernel space. It calls tcp_copy_to_page to do the copying, which in turn calls csum_and_copy_from_user to copy the data efficiently by simultaneously calculating a partial checksum. If tcp_copy_to_page returns an error, the allocated but empty page is attached to the sock structure for this open socket, sk, in the sndmsg_page field so the page will be de-allocated when the socket is released.
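To show why copying and checksumming are combined, here is a portable user-space sketch of the idea: the data is touched once, and the 16-bit ones-complement sum is accumulated during the copy. It is not the kernel's csum_and_copy_from_user, which is architecture-specific and handles user-space faults; copy_and_csum and csum_fold16 are hypothetical names, and the word order follows the RFC 1071 description of the Internet checksum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy len bytes from src to dst while accumulating a 16-bit
 * ones-complement sum over big-endian words. The 32-bit accumulator is
 * adequate for segment-sized buffers; a running sum can be continued
 * across calls only if every earlier chunk had an even length. */
uint32_t copy_and_csum(uint8_t *dst, const uint8_t *src,
                       size_t len, uint32_t sum)
{
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i + 1 < len; i += 2) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
    }
    if (i < len) {                     /* odd trailing byte, zero padded */
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }
    return sum;
}

/* Fold the carries back into the low 16 bits. */
uint16_t csum_fold16(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}
```

The unfolded sum plays the role of the partial checksum stored with the buffer; it is folded and complemented only when the TCP header is finally built.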

Now that the copy has been done, the skb is updated to reflect the new data. If merge is nonzero, it means that the data was merged into the last frag, so the size field in that frag must be updated. Otherwise, a new page was allocated, so fill_page_desc is called, which updates the frag array in the skb with information about the new page. The sndmsg_page field in the tcp_options structure is updated to point to the next page, and the sndmsg_off field gets an updated offset into the page.

Tcp_sendmsg Completion

At this point, most of the work of tcp_sendmsg is done. The data has been copied from user space to the socket buffer. The socket buffer’s frags array has been updated with a list of pages ready for sending segments. A zero value in the variable copied indicates that this is the first time through the while loop for the initial segment so the PSH flag in the TCP header is set to zero. As we will see in this section, the TCP header is not set directly at this point. Instead, the intended values are saved in the TCP control block for later when the queued socket buffers are removed for transmission. Stevens has an excellent discussion in Section 20.5 of the use of the PSH flag in TCP [STEV94].

. . . if (!copied) TCP_SKB_CB(skb)->flags &= ~TCPCB_FLAG_PSH;

The write_seq field in the TCP options structure is updated with the amount of data processed in this trip through the loop. The end_seq field in the TCP control block is also updated with the amount of data that was processed. From, which points to the data source in the msg argument, and copied, which holds the total number of bytes processed so far, are updated for this iteration. Iovlen is the number of iovs in the msghdr structure, and seglen holds the number of bytes remaining in the current iov. Each pass through the outer while loop processes one iov.

At this point, we decide if small segments should be pushed out or held. We call forced_push to check if we must send now no matter what the segment size is. If so, the PSH flag is set and the segments are transmitted.

This is where tcp_sendmsg ends up if there is an insufficient number of buffers or pages. It waits until a sufficient number of buffers is available.

This is the label the code jumps to when it is time to push the remaining data and exit the function. The value of copied is returned, indicating to the application program how much data was accepted and queued for transmission.

These last three labels mean that we are at the end of tcp_sendmsg. This is where we finally end up if errors were detected in the process of copying and processing the data.


All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd
