This course contains the basics of Linux


Linux Sockets

The socket interface was originally developed as part of the BSD operating system. Sockets provide a standard protocol-independent interface between the application-level programs and the TCP/IP stack. As discussed, the OSI model serves us well as the framework to explain networking protocol stacks. It defines all seven layers, including the three layers above the transport layer: session, presentation, and application. As we know, TCP/IP does not define these three upper layers. As a matter of fact, it does not define anything above the transport layer. From the viewpoint of TCP/IP, everything above the transport layer is part of the application. Linux is similar to traditional Unix in the sense that the TCP/IP stack lives in the kernel, sharing memory space with the rest of the kernel. All the network functions performed above the transport layer are done in the user application space. Linux provides a socket API that is compatible with Unix and many other systems. Applications use this API to access the networking facilities in the kernel.

The original intent of Linux was to create an Operating System (OS) that is functionally compatible with the proprietary versions of Unix that were popular at the time Linux was originally developed. The socket API is the best known networking interface for Unix application and network programming. Linux has not disappointed us in that it has provided a socket layer that is functionally identical to traditional Unix. Almost every type of TCP/IP-based networking application has been successfully ported to Linux over many years and includes applications used in the smallest embedded system to the largest servers. Many excellent books on Unix network programming do a great job of explaining the socket API and Unix network layer programming; see [STEV98] for an excellent book. In this book we will not duplicate earlier efforts by providing elaborate explanations of application layer protocols. Instead, our intent is to explain the underlying structure of Linux TCP/IP. Therefore, although the socket API will be discussed, the emphasis is on the underlying infrastructure.

The Linux socket API conforms to all applicable standards, and application layer protocols are entirely portable to Linux, from other flavors of Unix and many other OSs as well. However, the underlying infrastructure of the socket layer implementation is unique to Linux. In this chapter we discuss netlink sockets and other ways that an application programmer can use sockets to interact with the protocols and layers in the TCP/IP stack. We discuss the definition of a socket, a detailed discussion of the sock structure, the socket API, and the socket call mapping for file operations. In addition, we cover how sockets can be used with a new protocol and how the netlink mechanism allows applications to control the operations of the internal protocols in TCP/IP.

What is a Socket?

One definition of the socket interface is that it is the interface between the transport layer protocols in the TCP/IP stack and all protocols above. However, the socket interface is also the interface between the kernel and the application layer for all network programming functions. All data and control functions for the TCP/IP stack pass through the socket interface. As we saw in the introductory chapters in this book, the TCP/IP stack itself does not include any protocols above the transport layer. Instead, Linux and most other operating systems provide a standard interface called sockets for all protocols above the transport layer. The socket interface is really TCP/IP’s window on the world. In most modern systems incorporating TCP/IP—and this includes Linux—the socket interface is the only way that applications make use of the TCP/IP suite of protocols.

Sockets have three fundamental purposes. They are used to transfer data, manage connections for TCP, and control or tune the operation of the TCP/IP stack. The socket interface is an elegant and simple design. This is probably the primary factor leading to the wide acceptance of the TCP/IP protocol stack by application programmers over many years. Sockets are generic: they have the capability of working with protocol suites other than TCP/IP, including Linux’s internal process-to-process communication, AF_UNIX. The protocol family types supported by Linux sockets are listed in Table. The complete list of official protocol families is part of the assigned numbers database that is currently maintained by the Internet Assigned Numbers Authority (IANA) [IAPROT03].

Supported Protocol and Address Families

The socket API actually has two parts. It consists of a set of functions specifically for a network. It also contains a way of mapping standard Unix I/O operations so application programmers using TCP/IP to send and receive data can use the same calls that are commonly used for file I/O.

Instead of open, the socket function is used to open a socket. However, once the socket is open, generic I/O calls such as read and write can be used to move data through the open socket.

Socket, sock, and Other Data Structures for Managing Sockets

Sometimes the choice of names for data structures and functions can be confusing, and this is definitely the case when we discuss sockets in Linux. As with other operating systems, Linux uses similar names or terms to describe functionally different data structures. For example, several data structures for sockets can be confused with each other. In Linux, three different data structures each have the letters "sock" in them. The first is the socket buffer defined in linux/include/linux/skbuff.h. Socket buffers are structures to hold packet data that may or may not be created at a socket interface. In the Linux source code, socket buffers are often referred to by the variable skb.

struct sk_buff *skb;

The next two data structures are covered in this chapter. The second data structure that we will discuss is the socket structure defined in linux/include/linux/net.h. The socket structure is not specific to TCP/IP. Instead, it is a generic structure used primarily within the socket layer to keep track of each open connection and as a vehicle to pass open sockets to and from the socket layer. Generally, each instance of a socket structure corresponds to an open socket that was opened with the socket call. Sockets are also implicitly referenced in the application code by the file descriptor returned by socket. This socket structure is usually referenced by a variable called sock.

struct socket *sock;

The third data structure is another one we will discuss in detail in this chapter. It is called the sock structure and is defined in the file linux/include/net/sock.h. It is a more complex structure used to keep state information about open connections. It is accessed throughout the TCP/IP protocol but mostly within the TCP protocol. It is usually referenced through a variable called sk.

struct sock *sk;

The Sock Structure

In earlier versions of the kernel, the sock structure was much more complex. However, in 2.6, it has been greatly simplified in two ways. The structure is preceded with a common part that is generic to all protocol families. The other way it is different is that in Linux 2.6, instances of the sock structure are allocated from protocol-specific slab caches instead of a generic cache. In addition, following the structure there is an IPv4 and IPv6 specific part that contains the prot_info structure for each of the member protocols in the protocol family. This part is covered later as part of the discussion of socket creation. The first part of the sock structure is kept in a structure called sock_common.

struct sock_common {

The first field, skc_family, contains the network address family, such as AF_INET for IPv4. Skc_state is the connection state, and skc_reuse holds the value of the SO_REUSEADDR socket option.

unsigned short skc_family;
volatile unsigned char skc_state;
unsigned char skc_reuse;

The next field, skc_bound_dev_if, holds the index for the bound network interface device.

int skc_bound_dev_if;

The next two fields hold hash linkage for the protocol lookup tables. Finally, skc_refcnt is the reference count for this socket.

struct hlist_node skc_node;
struct hlist_node skc_bind_node;
atomic_t skc_refcnt;
} ;

Now we will look at the sock structure itself.

struct sock {

The first thing in the sock structure must be sock_common, described previously. Sock_common is first because the tcp_tw_bucket structure and perhaps other structures also have sock_common as the first part. This is to allow the same list processing and queuing functions to be used on both kinds of structures.

struct sock_common __sk_common;

Here we have some defines to make it easier to find the fields in the common part.

#define sk_family __sk_common.skc_family
#define sk_state __sk_common.skc_state
#define sk_reuse __sk_common.skc_reuse
#define sk_bound_dev_if __sk_common.skc_bound_dev_if
#define sk_node __sk_common.skc_node
#define sk_bind_node  __sk_common.skc_bind_node
#define sk_refcnt __sk_common.skc_refcnt

This field is not used for TCP/IP.

volatile unsigned char sk_zapped;

The next field is used with the RCV_SHUTDOWN and SEND_SHUTDOWN socket options. As we will see in later chapters, when the SEND_SHUTDOWN option is set, in TCP, an RST will be sent when closing the socket.

unsigned char sk_shutdown;

The next field, when set, indicates that this socket has a valid write queue. This field is used with the write_space callback function. It indicates whether to call sk_write_space in the function sock_wfree. Sk_userlocks holds the SO_SNDBUF and SO_RCVBUF socket option settings.

unsigned char sk_use_write_queue;
unsigned char sk_userlocks;

Sk_lock is the individual socket lock used for socket synchronization. The next field, sk_rcvbuf, is the size of the receive buffer in bytes and is set from the SO_RCVBUF socket option.

socket_lock_t sk_lock;
int sk_rcvbuf;

Sk_sleep is the sock wait queue, and sk_dst_cache is the pointer to the destination cache entry.

wait_queue_head_t *sk_sleep;
struct dst_entry *sk_dst_cache;
rwlock_t sk_dst_lock;

This field is used with the Security Policy Database (SPD).

struct xfrm_policy *sk_policy[2];

The next four fields are for queuing. Sk_rmem_alloc is the number of committed bytes in the receive packet queue, and sk_receive_queue is the receive queue. Sk_wmem_alloc is the transmit queue committed length, and sk_write_queue is the transmit queue.

atomic_t sk_rmem_alloc;
struct sk_buff_head sk_receive_queue;
atomic_t sk_wmem_alloc;
struct sk_buff_head sk_write_queue;

The next field is the number of optional committed bytes. It is used by socket filters, and the field sk_wmem_queued is the persistent write queue size.

atomic_t sk_omem_alloc;
int sk_wmem_queued;

Sk_forward_alloc is the number of bytes in pre-allocated pages, and sk_allocation is the allocation mode.

int sk_forward_alloc;
unsigned int sk_allocation;

Sk_sndbuf is the size of the send buffer and is set with the SO_SNDBUF socket option.

int sk_sndbuf;

Sk_flags contains the values of the socket options SO_BROADCAST, SO_KEEPALIVE, and SO_OOBINLINE. Next, sk_no_check contains the value of the SO_NO_CHECK socket option and indicates whether to disable checksums. Sk_debug holds the SO_DEBUG socket option and sk_rcvtstamp holds the SO_TIMESTAMP socket option.

unsigned long sk_flags;
char sk_no_check;
unsigned char sk_debug;
unsigned char sk_rcvtstamp;

The value in the next field indicates whether to send large TCP segments. Sk_route_caps is the route capabilities, NETIF_F_TSO.

unsigned char sk_no_largesend;
int sk_route_caps;

The value of sk_lingertime is set to TRUE if the SO_LINGER socket option is set. Sk_hashent contains a hash entry for several tables. The next field, sk_pair, is for the socket pair call and is not used for TCP/IP.

unsigned long sk_lingertime;
int sk_hashent;
struct sock *sk_pair;

The next field, sk_backlog, is the socket backlog queue. It is always used with the individual socket lock held. It requires low latency access.

struct {
struct sk_buff *head;
struct sk_buff *tail;
} sk_backlog;

This lock is for the six callback functions at the bottom of the sock structure.

rwlock_t sk_callback_lock;
struct sk_buff_head sk_error_queue;

Sk_prot is a pointer to the protocol handler within the AF_INET family or another protocol family. Sk_err is the last error on this socket, and sk_err_soft is for errors that don’t cause complete socket failure, only a socket that has timed out.

struct proto *sk_prot;
int sk_err, sk_err_soft;

The next two fields are for the current listen backlog and the maximum backlog set in the listen call. The field sk_priority holds the value of the SO_PRIORITY socket option.

unsigned short sk_ack_backlog;
unsigned short sk_max_ack_backlog;
__u32 sk_priority;

The next field, sk_type, holds the socket type SOCK_STREAM or SOCK_DGRAM.
Sk_localroute says to route locally only. It is the value of the SO_DONTROUTE socket option.

unsigned short sk_type;
unsigned char sk_localroute;

The following field, sk_protocol, holds the protocol number for this socket within the AF_INET family. It is the same as the 1-byte protocol field in the IP header.

unsigned char sk_protocol;

The sk_peercred field is not used for TCP/IP. It is for passing file descriptors using Berkeley style credentials. The following three fields hold the value of the SO_RCVLOWAT, SO_RCVTIMEO, and the SO_SNDTIMEO socket options.

struct ucred sk_peercred;
int sk_rcvlowat;
long sk_rcvtimeo;
long sk_sndtimeo;

Sk_filter is for socket filtering.

struct sk_filter *sk_filter;

The next field points to private areas for various protocols.

void *sk_protinfo;

Sk_slab points to the slab cache from which this sock structure was allocated, and sk_timer is the sock cleanup timer.

kmem_cache_t *sk_slab;
struct timer_list sk_timer;

Sk_stamp is the timestamp of the most recent received packet. Sk_socket points to the socket structure for this socket.

struct timeval sk_stamp;
struct socket *sk_socket;
void *sk_user_data;

The next field, sk_owner, points to the module that owns this socket.

struct module *sk_owner;

The following five fields point to callback functions for this socket. The first field, sk_state_change, is called when the state of this sock is changed, and the next field, sk_data_ready, indicates that there is data available to be processed. Sk_write_space is called when there is buffer space available for sending. Sk_error_report is called when there are errors to report and is used with MSG_ERRQUEUE. Sk_backlog_rcv is called to process the socket backlog.

void (*sk_state_change)(struct sock *sk);
void (*sk_data_ready)(struct sock *sk, int  bytes);
void (*sk_write_space)(struct sock *sk);
void (*sk_error_report)(struct sock *sk);
int (*sk_backlog_rcv)(struct sock *sk,
struct sk_buff *skb);

Finally, the last field is the destructor function for this sock instance. It is called when all the references are gone and the refcnt becomes zero.

void (*sk_destruct)(struct sock *sk);
} ;

When a sock structure instance is allocated from the slab, following the sock structure is the inet_sock, which contains a protocol information part for IPv6 and IPv4.

struct inet_sock {
struct sock sk;
#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
struct ipv6_pinfo *pinet6;
#endif

The inet_opt structure for IPv4 is discussed in a later section.

struct inet_opt inet;
} ;

The Socket Structure

The socket structure is the general structure that holds control and state information for the socket layer. It supports the BSD type socket interface.

struct socket {

The first field contains the state of the socket, which is one of the socket state values shown later in Table. The socket flags are in the next field and hold the socket wait buffer state containing values such as SOCK_ASYNC_NOSPACE.

socket_state state;
unsigned long flags;

Ops points to the protocol-specific operations for the socket. This data structure is shown later.

struct proto_ops *ops;

The next field, fasync_list, points to the wake-up list for asynchronous file calls. For more information, see fsync(2). File points to the file structure for this socket. We need to keep a pointer here to facilitate garbage collection.

struct fasync_struct *fasync_list;
struct file *file;

Sk points to the sock structure for this socket. Wait is the socket wait queue.

struct sock *sk;
wait_queue_head_t wait;

Type is the socket type, and generally is SOCK_STREAM, SOCK_DGRAM, or SOCK_RAW. Passcred is not used for TCP/IP. It is for BSD style credential passing and holds the value of the SO_PASSCRED socket option.

short type;
unsigned char passcred;
} ;

The Proto_ops Structure
This structure contains the family type for this particular set of socket operations. For IPv4, it will be set to AF_INET.

struct proto_ops {

Family is the address family. It is set to AF_INET for IPv4. Owner is the module that owns this socket.

int family;
struct module *owner;

Each of the following fields corresponds to a socket call. They are all pointers to the function implementing the protocol-specific operation.

int (*release)(struct socket *sock);
int (*bind)(struct socket *sock, struct sockaddr *myaddr,
    int sockaddr_len);
int (*connect)(struct socket *sock, struct sockaddr *vaddr,
    int sockaddr_len, int flags);
int (*socketpair)(struct socket *sock1, struct socket *sock2);
int (*accept)(struct socket *sock, struct socket *newsock, int flags);
int (*getname)(struct socket *sock, struct sockaddr *addr,
    int *sockaddr_len, int peer);
unsigned int (*poll)(struct file *file, struct socket *sock,
    struct poll_table_struct *wait);
int (*ioctl)(struct socket *sock, unsigned int cmd, unsigned long arg);
int (*listen)(struct socket *sock, int len);
int (*shutdown)(struct socket *sock, int flags);
int (*setsockopt)(struct socket *sock, int level, int optname,
    char __user *optval, int optlen);
int (*getsockopt)(struct socket *sock, int level, int optname,
    char __user *optval, int __user *optlen);
int (*sendmsg)(struct kiocb *iocb, struct socket *sock,
    struct msghdr *m, int total_len);
int (*recvmsg)(struct kiocb *iocb, struct socket *sock,
    struct msghdr *m, int total_len, int flags);
int (*mmap)(struct file *file, struct socket *sock,
    struct vm_area_struct *vma);
ssize_t (*sendpage)(struct socket *sock, struct page *page,
    int offset, size_t size, int flags);
} ;

Socket Layer Initialization

The Linux networking infrastructure can support multiple protocol stacks or address families. Each supported protocol suite has an address family that is registered with the socket layer. This is how the common socket API can be used with the member protocols in diverse protocol suites. Each address family is listed in Table. In this book, we are primarily interested in TCP/IP and its address family AF_INET. AF_INET is registered during kernel initialization, and the internal hooks that connect the AF_INET family with the TCP/IP protocol suite are set up during socket initialization. However, it is possible to register new protocol families dynamically with the socket layer.

However, in this section, we are mainly concerned about the initialization of sockets for the AF_INET address family. We also cover how protocol registration is done for each of the member protocols in TCP/IP.

The socket layer, like any other Linux kernel facility, has an initialization function called during kernel initialization in file linux/net/socket.c. Note that initialization functions can be quickly found because they are all typed __init. Sock_init is called before Internet protocol registration because basic socket initialization must be done before each of the TCP/IP member protocols can register with the socket layer.

void __init sock_init(void)
int i;

The first thing sock_init does is initialize the net_families array, which is the basic mechanism of socket protocol registration. A later section describes in detail how the socket layer maps each of the user socket calls to the underlying protocol.

for (i = 0; i < NPROTO; i++)
net_families[i] = NULL;

Next, sk_init is called to initialize the slab cache for the sock data structure. This data structure, discussed earlier, contains all of the internal socket state information. Next, sock_init calls skb_init to set up the slab cache for socket buffers, or sk_buffs.

sk_init();
#ifdef SLAB_SKB
skb_init();
#endif

Now, we build the pseudo-file system for sockets, and the first step is to set up the socket inode cache. Linux, like other Unix operating systems, uses the inode as the basic unit for filesystem implementation.


Next, a pseudo filesystem is created for sockets called sock_fs_type by calling register_filesystem. Linux, like many operating systems, has a unified IO system, so that IO calls are transparent whether they are accessing devices, files, or sockets. We must register with the filesystem to use the IO system calls to read and write data through open sockets. A later section discusses the IO system call mapping in more detail. Now that the socket file system is built, we can mount it in the kernel.

sock_mnt = kern_mount(&sock_fs_type);

The remaining part of protocol initialization is done later when the function do_initcalls in main.c is executed. The last thing that sock_init does is initialize netfilters if they have been configured into the kernel.


Family Values and the Protocol Switch Table

As discussed earlier, the socket layer is used to interface with multiple protocol families and multiple protocols within a protocol family. After incoming packets are processed by the protocol stack, they eventually are passed up to the socket layer to be handed off to an application layer program. The socket layer must determine which socket should receive the packet, even though there may be multiple sockets open over different protocols. This is called socket de-multiplexing, and the protocol switch table is the core mechanism. This mechanism functions very much like the BSD operating system where the term protocol switch table is also often used.

However, socket layer registration is discussed here because it is closely related to socket layer de-multiplexing. Linux makes a distinction between permanent and nonpermanent protocols. For example, permanent protocols include protocols such as UDP and TCP, which are a fundamental part of any functioning TCP/IP implementation. Removal of permanent protocols is not allowed; therefore, UDP and TCP cannot be unregistered. However, other protocols can be added to the protocol switch table dynamically, and these protocols are considered nonpermanent. The following figure illustrates the registration process. It shows how the inet_protosw structure is initialized with proto and proto_ops structures for TCP/IP, the AF_INET family.

AF_INET protocol family and socket calls.

The protocol switch registration mechanism consists of two functions and a data structure for maintaining the registered protocols. One of the functions is for registering a protocol, and the other function is for unregistration. Each of the registered protocols is kept in a table called the protocol switch table. Each entry in the table is an instance of the inet_protosw. The registration function, inet_register_protosw, puts the protocol described by the argument p into the protocol switch table.

void inet_register_protosw(struct  inet_protosw *p);

The unregistration function, inet_unregister_protosw, removes a protocol described by the argument p from the protocol switch table.

void inet_unregister_protosw(struct  inet_protosw *p);

Each protocol instance in the protocol switch table is an instance of the inet_protosw structure, defined in file linux/include/net/protocol.h.

struct inet_protosw {

The first two fields in the structure, list and type, form the key to look up the protocol in the protocol switch table. Type is equivalent to the type argument of the socket call, and the values for this field are shown in Table.

struct list_head list;
unsigned short type;

Values for Socket Types

The next field, protocol, corresponds to the well-known protocol argument of the socket call. This field is set to the protocol number for TCP, UDP, another protocol number, or zero for raw. This is the protocol number for the protocol that is being registered.

int protocol;

The field prot points to the protocol block structure. This structure is used when a socket is created, and it serves to build an interface to any protocol that supports a socket interface. The next field, ops, points to a protocol-specific set of operation functions for this protocol. It has the same definition as the ops field in the socket structure, the proto_ops structure discussed earlier.

struct proto *prot;
struct proto_ops *ops;

Capability is used to determine if the application layer program has permission for a socket operation. The appropriate level of permission is required for some socket operations with raw sockets. This is to prevent security attacks because raw socket operations can get access to internal operations of TCP/IP. See the section on Linux capabilities later in this chapter for more information.

int capability;

The next field, no_check, tells the network interface driver not to perform a checksum.

char no_check;

Finally, the last field, flags, is defined as one of two values.


If flags is set to INET_PROTOSW_PERMANENT, the protocol is permanent and can’t be unregistered. In this case, the unregistration function prints an error message and returns. For example, UDP and TCP have flags set to INET_PROTOSW_PERMANENT, and for raw sockets (SOCK_RAW), the flags value is set to INET_PROTOSW_REUSE.

unsigned char flags;
} ;

Member Protocol Registration and Initialization in IPv4

Here we will only look at socket layer registration. In the previous section, we made a distinction between permanent protocols and other protocols. Now we will look at how the permanent protocols are registered with the protocol switch table. The permanent protocols in IPv4 are registered by the function inet_init, defined in file net/ipv4/af_inet.c.

static int __init inet_init(void)
  . . .

The code actually registers the protocols after they have been placed into an array.

for (r = &inetsw[0]; r < &inetsw[SOCK_MAX]; ++r)
    INIT_LIST_HEAD(r);
for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q)
    inet_register_protosw(q);
. . .

In the preceding code snippet, we can see that inet_init calls inet_register_protosw for each of the protocols in the array, inetsw_array. The protocols in the array are UDP, TCP, and raw. The values for each protocol are initialized into the inet_protosw structure at compile time as shown here.

static struct inet_protosw inetsw_array[] =
{

The first protocol is TCP, so type is SOCK_STREAM and flags is set to permanent.

{
type: SOCK_STREAM,
protocol: IPPROTO_TCP,
prot: &tcp_prot,
ops: &inet_stream_ops,
capability: -1,
no_check: 0,
flags: INET_PROTOSW_PERMANENT,
} ,

The second protocol is UDP, so type is SOCK_DGRAM and flags is also set to permanent.

{
type: SOCK_DGRAM,
protocol: IPPROTO_UDP,
prot: &udp_prot,
ops: &inet_dgram_ops,
capability: -1,
flags: INET_PROTOSW_PERMANENT,
} ,

The third protocol is “raw,” so type is SOCK_RAW and flags is set to reuse. Notice that the protocol value is IPPROTO_IP, which is zero, and indicates the "wild card," which means that a raw socket can actually be used to set options in any protocol in the AF_INET family. This corresponds to the fact that the protocol field is typically set to zero for a raw socket.

{
type: SOCK_RAW,
protocol: IPPROTO_IP,/* wild card */
prot: &raw_prot,
ops: &inet_dgram_ops,
capability: CAP_NET_RAW,
flags: INET_PROTOSW_REUSE,
}
} ;

Once this registration is complete, other protocols usually implemented as modules may still add themselves to the protocol switch table at any time by calling inet_register_protosw.

Registration of Protocols with the Socket Layer

This is about important family values; that is, the registration of protocol families such as the Internet Protocol (IP) family. In this section, we cover the registration of an entire protocol family, which includes an entire suite of protocols, and how it is associated with one of the protocol families listed in Table.

For TCP/IP, this family is AF_INET. The protocol family registration step is necessary so the application programmer can access all the protocols that are part of the TCP/IP protocol family through the socket API functions. The socket layer family registration facility provides two functions and one key data structure. The first function, sock_register, registers the protocol family with the socket layer.

int sock_register(struct net_proto_family  *fam);

The family is passed in as a pointer to the net_proto_family structure, which is shown later in this section. Sock_register checks the family field in the net_proto_family structure pointed to by fam to make sure it is one of the family types listed in Table. If family is within range, it copies the argument fam into a location in the global array net_families indexed by the family value.

static struct net_proto_family  *net_families[NPROTO];

Later in this chapter, we will see how the net_families array is used to look up the protocol family associated with the domain for the requested socket. The second function, sock_unregister, is the protocol family unregistration function.

int sock_unregister(int family);

In this function, family is the value of the protocol family, such as AF_INET. Sock_unregister reverses the registration process by setting the element in the array indexed by the parameter family to NULL. It only makes sense to use this function in a module that implements a nonpermanent protocol family. Protocol families can be written as modules, so Linux provides an unregistration function. However, since IPv4 is generally statically compiled into the kernel, sock_unregister will never be called for IPv4, the AF_INET family. IPv4 is almost always used with the Linux kernel, and it would be unusual to configure a Linux kernel without it. When protocol families are implemented as modules, the unregistration should be done from the exit function (typed __exit). The data structure net_proto_family is passed as a parameter to the sock_register function.

struct net_proto_family {

The first field, family, corresponds to one of the protocol families listed in Table, such as AF_INET.

int family;

The create field is a function pointer to the specific socket creation function for the protocol family specified by family.

int (*create)(struct socket *sock, int protocol);

It also contains a few counter fields for security support that aren’t widely used.

short authentication;
short encryption;
short encrypt_net;

Finally, owner points to the module that owns this protocol family.

struct module *owner;
} ;

The Socket Application Programming Interface

The socket Application Programming Interface (API) functions are described in this section. Sockets are the fundamental basis of client-server programming. At the risk of over-simplifying, we will define the server as the machine that accepts connections. In contrast, a client is the machine that initiates connections. This book won’t pretend to duplicate the work of other authors on network application programming; instead, refer to [STEV98]. However, in the interest of a complete description of how the socket interface functions in Linux, the socket API functions are provided with a description of the purpose of each call. Later in this chapter, you will see what happens under the covers in the Linux TCP/IP stack when each of the functions described in this section is invoked.

Before we proceed with listing the API functions, we will discuss how IP and other addresses are passed through sockets. In general, when using the socket API, network addresses are passed in a generic sockaddr structure, which can hold different forms of address information. For IPv4, the address is stored in the sockaddr_in structure defined in file linux/include/linux/in.h.

struct sockaddr_in {

Sin_port is the port number for the socket.

in_port_t sin_port;

Sin_addr is the IP address.

struct in_addr sin_addr;
} ;

The socket API functions take a pointer to the generic sockaddr structure, to which sockaddr_in is cast, because sockets are generic and intended to work with a variety of protocol families and a variety of address formats, not just IPv4 with its well-known but limited 32-bit address format. This is why the sockaddr structure can vary in length depending on the format of the address it contains.

It is important to note that the terms port and socket are not synonymous. The port, along with the IP address, identifies the destination address for a packet. However, the socket is the identifier that the application uses to access the connection to the peer machine.

Like most Unix operating systems before it, Linux provides a variety of conversion functions for manipulating Internet addresses and ports. Included are functions to convert addresses between character strings and binary numbers. An example is inet_ntoa, which converts a binary IP address to a string. The Linux functions are compatible with the traditional BSD functions for network programming that have been used for many years. These conversion functions are all but indispensable for any type of network application programming and are too numerous to list in this section. However, you can consult the Linux man page inet(3) for a detailed list.
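As a brief illustration of string-to-binary conversion, the sketch below uses inet_pton and inet_ntop, the newer counterparts of inet_ntoa that take caller-supplied buffers. The helper names parse_ipv4, format_ipv4, and ipv4_roundtrip_ok are our own and exist only for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Parse dotted-decimal text into addr; 0 on success, -1 on bad input. */
int parse_ipv4(const char *text, struct in_addr *addr)
{
    assert(text != NULL && addr != NULL);
    return inet_pton(AF_INET, text, addr) == 1 ? 0 : -1;
}

/* Format addr as dotted-decimal text in buf; returns buf, or NULL on failure. */
const char *format_ipv4(const struct in_addr *addr, char *buf, size_t len)
{
    assert(addr != NULL && buf != NULL);
    return inet_ntop(AF_INET, addr, buf, (socklen_t)len);
}

/* Returns 1 when text survives a parse/format round trip unchanged. */
int ipv4_roundtrip_ok(const char *text)
{
    struct in_addr addr;
    char buf[INET_ADDRSTRLEN];

    if (parse_ipv4(text, &addr) != 0)
        return 0;
    if (format_ipv4(&addr, buf, sizeof buf) == NULL)
        return 0;
    return strcmp(buf, text) == 0;
}
```

The binary form produced by parse_ipv4 is already in network byte order, so it can be stored directly in the sin_addr field of a sockaddr_in.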

The socket API deals with addresses and ports, and it is important to note that TCP/IP, like other network protocols, always considers ports and Internet addresses passed through the socket API to be in network byte order. Network byte order is the same as big endian or Motorola byte order.

Linux, like other Unix-compatible OSs, provides a set of conversion functions to convert integers of various lengths between host byte order and network byte order. For a list of these functions, see the Linux man page byteorder(3).
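To make network byte order concrete, the following sketch converts a port and an address with htons and htonl and copies out the resulting bytes in memory order. The helper names port_wire_bytes and addr_wire_bytes are hypothetical; only htons and htonl come from byteorder(3).

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Store the bytes of a 16-bit port as they appear in network byte order. */
void port_wire_bytes(uint16_t host_port, unsigned char out[2])
{
    uint16_t net_port = htons(host_port);   /* host order -> big endian */

    assert(out != NULL);
    memcpy(out, &net_port, sizeof net_port);
}

/* Store the bytes of a 32-bit IPv4 address in network byte order. */
void addr_wire_bytes(uint32_t host_addr, unsigned char out[4])
{
    uint32_t net_addr = htonl(host_addr);   /* host order -> big endian */

    assert(out != NULL);
    memcpy(out, &net_addr, sizeof net_addr);
}
```

Whatever the byte order of the host, the most significant byte comes first: port 8080 (0x1F90) becomes the bytes 0x1F, 0x90.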

One more thing should be mentioned before exploring the socket API functions. There are two types of sockets, one for individual datagrams and another for a streaming sequence of bytes. These two types of sockets reflect the difference between connectionless and connection-oriented service. The UDP protocol provides the connectionless or datagram service and is accessed through sockets of type SOCK_DGRAM. The TCP protocol is accessed through sockets of type SOCK_STREAM, which provide connection-oriented service. Most of the socket calls can be used with either socket type. However, one of the socket calls, connect, has slightly different semantics depending on the socket type and is more commonly used with TCP. Two of the socket calls, accept and listen, are not used for UDP at all. In addition, recv, recvfrom, recvmsg, send, sendto, and sendmsg are most often used with UDP. Generally, TCP servers and clients use write and read to move data to and from the sockets.

Now we will look at the socket API functions, the first and most important of which is socket, which opens a new socket.

int socket(int domain, int type, int  protocol);

Socket must be called before the application can use any of the networking functions of the operating system for any purpose. Socket returns an identifier also known as a socket. This identifier is essentially a file descriptor and can be used in the same way as the file descriptor returned by the open system call. In other words, read and write calls can be done by specifying the socket. The first argument to the socket call, domain, specifies which protocol family will be accessed through the socket returned by this call. It should be set to one of the protocol families used in Linux (shown in Table), and for us it is generally AF_INET. In Linux, we generally use the terms protocol family and address family interchangeably.

The BSD-derived socket implementations make a distinction between these two, but Linux does not. For compatibility with BSD, the protocol families are defined with names preceded by PF_, and the address families have names that begin with AF_, but the numerical values corresponding to each are identical. Type is generally set to one of three values for Linux TCP/IP. It is set to SOCK_STREAM if the caller wants the reliable connection-oriented service generally provided by the TCP transport protocol. Type is specified as SOCK_DGRAM for connectionless service via the UDP transport protocol, or SOCK_RAW for direct network access to underlying protocols below the transport layer. The access to lower layer protocols is generally referred to as "raw network protocol" access. The allowed values for type are shown in Table. Finally, the protocol argument to the socket call is typically set to zero when sockets are opened for conventional UDP or TCP packet transmission. In some cases, though, the protocol field is used internally by the socket layer code to determine which protocol the socket accesses if the type field is insufficient. For example, to get raw protocol access to ICMP, protocol would be set to IPPROTO_ICMP and type to SOCK_RAW.
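The socket call can be sketched as follows. The helper names open_inet_socket, is_socket_fd, and close_socket are ours, and the sketch assumes a system where unprivileged processes may create AF_INET stream and datagram sockets. It also shows that the returned identifier behaves like an ordinary file descriptor by examining it with fstat.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

/* Open an AF_INET socket of the given type. Protocol 0 selects the
 * default for the type: TCP for SOCK_STREAM, UDP for SOCK_DGRAM. */
int open_inet_socket(int type)
{
    assert(type == SOCK_STREAM || type == SOCK_DGRAM);
    return socket(AF_INET, type, 0);
}

/* Returns 1 if fd is a descriptor that refers to a socket, else 0. */
int is_socket_fd(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return 0;
    return S_ISSOCK(st.st_mode) ? 1 : 0;
}

/* Sockets are released with the ordinary close call, like any descriptor. */
void close_socket(int fd)
{
    close(fd);
}
```

Because the PF_ and AF_ constants have identical values on Linux, passing PF_INET instead of AF_INET to socket gives the same result.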

The next socket API call, bind, is called by applications that want to register a local address with the socket. The local address generally consists of the port number and is referred to as the name of the socket. Applications that are sending UDP packets or datagrams don’t have to call bind. If they want the peer to know where the packets came from, they should call bind.

int bind(int sockfd, struct sockaddr *my_addr, socklen_t addrlen);

Bind is usually used by application servers to associate an endpoint, or port and address combination, with a socket. Applications will call bind if they want the socket layer to know the port on which they will be receiving data. The port number and the local IP address are specified in my_addr in the form of the sockaddr structure. The listen API function is called by a SOCK_STREAM or TCP server to let the socket layer know that it is ready to receive connection requests on the socket.

int listen (int s, int backlog );

Backlog specifies the length of the queue of pending connection requests while the server is waiting for the accept call to complete. Accept is called by the application when it is ready to accept a connection request. It returns a new socket for the accepted connection. The address of the peer requesting the connection is placed in the sockaddr structure pointed to by addr. Addrlen should point to a variable containing the size of struct sockaddr before calling accept. After the function call returns, addrlen points to the length of the new address in addr.

int accept (int s, struct sockaddr *addr, socklen_t *addrlen);

The socket call, connect, is primarily used by a client application to establish a connection-oriented or SOCK_STREAM type connection with a server using the TCP protocol.

int connect (int s, const struct sockaddr *serv_addr, socklen_t addrlen);

Serv_addr specifies the address and port of the server with which the caller wants to make a connection, and addrlen is set to the length of struct sockaddr. When used with TCP, connect actually causes TCP to initiate the connection negotiation between peers. Connect may also be used with SOCK_DGRAM, but in this case it does not actually cause a connection; instead, it only specifies the peer address information so the send call can be used through the socket, s.

The socket call, socketpair, is not used for TCP/IP, so it will not be discussed here. It is used for AF_UNIX domain sockets for inter-process communication within a system. We will see later, as we discuss the internal structure of the socket layer, that quite a bit of the complexity of sockets is due to the fact that they must support AF_UNIX-based inter-process communication.

The next three socket calls (send, sendto, and sendmsg) each transmit data from socket s to a peer socket. Send is used only with a socket that has been connected, either as a server or as a client that has previously called connect. Sendto can be used with any open SOCK_DGRAM socket because it includes the arguments to and tolen, which specify the address of the destination.

int send(int s, const void *msg, size_t len,int flags);
int sendto (int s, const void *msg,size_t len,int flags,
const struct sockaddr *to,socklen_t tolen);
int sendmsg (int s, const struct msghdr *msg,int flags);

Normally, send, sendto, and sendmsg block until there is sufficient buffer space to hold the packet of the specified length. The flags argument can contain any of the socket flags, which are shown with their values in Table.

Socket API Function Values for Flags

The sendmsg call takes its destination address and data buffers from a msghdr structure.

struct msghdr {


Msg_name contains the "name" of the socket. This is the destination IP address and port number for this message. Msg_namelen is the length of the address pointed to by msg_name.

void * msg_name;
socklen_t msg_namelen;

Msg_iov is an array of buffers of data to be sent or received. It is often referred to as the "scatter gather array" but it is not used only for DMA operations. Msg_iovlen is the number of buffers in the array pointed to by msg_iov.

struct iovec * msg_iov;
size_t msg_iovlen;

The following two fields are used for Posix 1003.1g ancillary data object information. The ancillary data consists of a sequence of cmsghdr and cmsg_data pairs. Msg_control is used to support the cmsg API function to pass control information to the underlying protocols. See the man page cmsg(3) for more information.

void * msg_control;
socklen_t msg_controllen;

Msg_flags contains flags for the received message.

int msg_flags; /* flags on received message */
} ;

Refer to Section for details about rtnetlink and the netlink address family and how these are used to pass control information to underlying protocols. Struct iovec is used in the msghdr structure described earlier. It is defined in file

struct iovec {

Iov_base is a pointer to the buffer’s base address. In BSD implementations, the field is typed caddr_t.

void *iov_base;

Iov_len is the size of the buffer.

__kernel_size_t iov_len;
} ;
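The way msghdr and iovec work together can be sketched with a gather send: two separate buffers are handed to sendmsg and leave as one datagram. To keep the example self-contained, it runs over a connected AF_UNIX datagram pair created with socketpair instead of a TCP/IP socket, which is our own simplification. The function names send_two_parts and gather_demo are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send header and body as one message gathered from two buffers. */
ssize_t send_two_parts(int fd, const char *header, const char *body)
{
    struct iovec iov[2];
    struct msghdr msg;

    assert(header != NULL && body != NULL);
    iov[0].iov_base = (void *)header;    /* sendmsg only reads the buffers */
    iov[0].iov_len  = strlen(header);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = strlen(body);

    memset(&msg, 0, sizeof msg);         /* no name and no ancillary data */
    msg.msg_iov    = iov;
    msg.msg_iovlen = 2;                  /* number of buffers in the array */
    return sendmsg(fd, &msg, 0);
}

/* Gather-send over a socket pair and read the datagram back into out.
 * Returns the number of bytes received, or -1 on failure. */
int gather_demo(char *out, size_t outlen)
{
    int sv[2];
    ssize_t n = -1;

    assert(out != NULL && outlen > 0);
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) != 0)
        return -1;
    if (send_two_parts(sv[0], "HDR:", "payload") >= 0)
        n = read(sv[1], out, outlen - 1);
    close(sv[0]);
    close(sv[1]);
    if (n < 0)
        return -1;
    out[n] = '\0';
    return (int)n;
}
```

The receiver sees a single 11-byte message, "HDR:payload", even though the sender never copied the two parts into one buffer.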

The next three socket calls—recv, recvfrom, recvmsg—receive a message from a peer socket. As is the case with the socket call, send, recv is generally used with a connected socket.

int recv (int s, void *buf, size_t len, int  flags);
int recvfrom ( int s, void *buf, size_t len,
int flags, struct sockaddr *from,socklen_t *fromlen);
int recvmsg ( int s, struct msghdr *msg, int  flags);
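The calls covered so far can be combined into one small exchange over the loopback interface. This is only a sketch under several assumptions: it runs client and server in a single process, it relies on the kernel completing the TCP handshake for a listening socket before accept is called, and it uses getsockname (not covered in this section) to learn the port the kernel chose. The function names make_listener and loopback_exchange are ours.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP listener on 127.0.0.1 with a kernel-chosen port.
 * Returns the listening socket and stores the bound address in *addr. */
int make_listener(struct sockaddr_in *addr)
{
    socklen_t len = sizeof *addr;
    int fd;

    assert(addr != NULL);
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    memset(addr, 0, sizeof *addr);
    addr->sin_family = AF_INET;
    addr->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr->sin_port = htons(0);               /* 0 lets the kernel pick a port */
    if (bind(fd, (struct sockaddr *)addr, sizeof *addr) != 0 ||
        listen(fd, 1) != 0 ||
        getsockname(fd, (struct sockaddr *)addr, &len) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Connect a client, accept it, send text from the client, and receive it
 * on the accepted socket. Returns the bytes received, or -1 on failure. */
int loopback_exchange(const char *text, char *out, size_t outlen)
{
    struct sockaddr_in addr, peer;
    socklen_t peerlen = sizeof peer;
    int lfd, cfd, sfd = -1;
    ssize_t n = -1;

    assert(text != NULL && out != NULL && outlen > 0);
    lfd = make_listener(&addr);
    if (lfd < 0)
        return -1;
    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd >= 0 && connect(cfd, (struct sockaddr *)&addr, sizeof addr) == 0)
        sfd = accept(lfd, (struct sockaddr *)&peer, &peerlen);
    if (sfd >= 0 && send(cfd, text, strlen(text), 0) == (ssize_t)strlen(text))
        n = recv(sfd, out, outlen - 1, 0);   /* one recv suffices for a tiny message */
    if (n >= 0)
        out[n] = '\0';
    if (sfd >= 0)
        close(sfd);
    if (cfd >= 0)
        close(cfd);
    close(lfd);
    return (int)n;
}
```

A real server would loop on recv, since a stream socket may deliver a message in several pieces, and would normally run in a separate process from the client.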

Normally, recv, recvfrom, and recvmsg block until data is available. Each call returns the length of the data read from the socket. After recvfrom returns, the parameters from and fromlen contain the sender’s address and its length. Often these three calls are used with select to let the caller know when data is available. The argument flags may have one or more of the socket flag values shown earlier in Table.

The two socket calls getsockopt and setsockopt are provided so the caller can access options or settings in the underlying protocols.

int getsockopt(int s,int level, int optname,
void *optval,socklen_t *optlen );
int setsockopt(int s,int level,
int optname, const void *optval,socklen_t optlen );

Level should be set to one of the values from Table, each of which has the same value as the 1-byte protocol field in the IP header or the next header field of the IPv6 header. Generally, these values correspond with the 1-byte assigned numbers in the IANA database for IP protocol numbers. However, there are three exceptions: SOL_SOCKET, SOL_RAW, and SOL_IP. The value SOL_SOCKET indicates that the option settings refer to internal settings in the socket layer itself. SOL_RAW and SOL_IP indicate settings for IP internal protocols.

Values for Level Argument in Setsockopt System Call

The optname argument is set to one of the values shown in Table. These values are defined in file linux/include/asm-i386/socket.h. Before the socket layer allows certain option values to be set, it checks to ensure that the user process has the appropriate level of permissions. These permissions are called capabilities and are described later.

Values for optname in Getsockopt and Setsockopt Calls
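As an example of the option calls at the SOL_SOCKET level, the sketch below sets SO_REUSEADDR and reads integer options back. The helper names set_reuseaddr and get_int_sockopt are hypothetical, and the -1 failure return of get_int_sockopt is a simplification that works only for options whose values are never negative.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Turn SO_REUSEADDR on or off in the socket layer; 0 on success. */
int set_reuseaddr(int fd, int on)
{
    assert(fd >= 0);
    return setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on);
}

/* Read an integer SOL_SOCKET option; returns -1 if the call fails. */
int get_int_sockopt(int fd, int optname)
{
    int value = 0;
    socklen_t len = sizeof value;      /* in: buffer size, out: bytes written */

    assert(fd >= 0);
    if (getsockopt(fd, SOL_SOCKET, optname, &value, &len) != 0)
        return -1;
    return value;
}
```

A new socket reports SO_REUSEADDR as 0. After set_reuseaddr(fd, 1) it reports a nonzero value, and SO_TYPE returns the type that was passed to socket.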

Packet, Raw, Netlink, and Routing Sockets

To complete any discussion of Socket application layer programming, we must include information about Linux’s special sockets. These sockets are for internal message passing and raw protocol access. Netlink, routing, packet, and raw are all types of specialized sockets. Netlink provides a socket-based interface for communication of messages and settings between the user and the internal protocols. BSD-style routing sockets are supported by Linux netlink sockets. This is why, as shown in Table, AF_ROUTE and AF_NETLINK are identical. AF_ROUTE is provided for source code portability with BSD Unix. Rtnetlink includes extensions to the messages used in the regular netlink sockets. Rtnetlink is for application-level management of the neighbor tables and IP routing tables. Section includes more details about the internal implementation of netlink and rtnetlink sockets and how they interface to the protocols in the AF_INET family.

Packet sockets are accessed by the application when it sets AF_PACKET in the family field of the socket call.

ps = socket (PF_PACKET, int type,int protocol);

Type is set to either SOCK_RAW or SOCK_DGRAM. Protocol has the number of the protocol and is the same as the IP header protocol number or one of the valid protocol numbers.

Raw sockets allow user-level application code to receive and transmit network layer packets by intercepting them before they pass through the transport layer. This type of socket is generally not used for link or physical layer access because the link layer headers are stripped from received packets before they are delivered to the socket.

rs = socket ( PF_INET, SOCK_RAW,int protocol);

Protocol is set to the protocol number that the application wants to transmit or receive. A common example of the use of raw sockets is the ping command, ping(8). Ping is an application that accesses the ICMP protocol, which is internal to IP and does not register directly with the socket layer. The ping command sends ICMP echo request packets and listens for echo replies. When the ping application code opens the socket, it sets the protocol field in the socket call to IPPROTO_ICMP. Ping and other application programs for route and network maintenance make use of a Linux utility library call, getprotoent(3), to convert a protocol name into a protocol number.

Netlink sockets are accessed by calling socket with family set to AF_NETLINK.

ns = socket (AF_NETLINK, int type, int netlink_family);

The type parameter can be set to either SOCK_RAW or SOCK_DGRAM, but it doesn’t really matter because the protocol accessed is determined by netlink_family, and this parameter is set to one of the values in Table. The send and recv socket calls are generally used with netlink. The messages sent through these sockets have a particular format; see netlink(7) for more details. There is quite a bit more complexity to using netlink sockets than we are describing in this section. Later in this chapter, we will discuss the internal structure of netlink in more detail while we see how the netlink mechanism interfaces to the protocols in TCP/IP.

Values for the Netlink_family Argument

Routing sockets are specified by the AF_ROUTE address family. In Linux, routing sockets are identical to netlink sockets. Rtnetlink extends netlink sockets by appending some additional attributes to netlink-type messages. Rtnetlink sockets are most often used for application layer access to the routing tables. Refer to the man pages netlink(7), rtnetlink(7), packet(7), and raw(7) for more information about these special socket types.

Security and Linux Capabilities

If there were no security mechanism, the special sockets discussed in the previous section could be used by programmers to gain almost complete access to underlying kernel structures, including the internals of the TCP/IP stack. In the wrong hands, this rich set of functions could be a vehicle for security violations. We want to prevent unauthorized users from engaging in Denial-of-Service (DoS) attacks by deleting routes or rerouting sockets. Linux capabilities are the mechanism used in recent versions of the kernel for defining levels of access. Traditional Unix systems had two levels of access, either root or user. With root access, you could do anything. More recently, however, Linux has implemented the POSIX.1e draft capabilities, which divide the complete set of root-level privileges into subsets. Linux capabilities are the mechanism for holding and granting these permissions. We won’t go into a detailed discussion of all the POSIX capabilities because it isn’t relevant to TCP/IP; refer to the man page cap_init(3) for more details. We will cover how they are used to control access to sockets.

Capabilities are important because they store the user-level permissions that are checked by the socket layer before allowing raw or netlink socket access. To perform these checks, Linux provides an internal function, capable, defined in file linux/include/linux/sched.h for checking capabilities against the currently executing user-level process.

extern int capable(int cap);

Starting with version 2.4, Linux provides a set of data structures and macros in the file linux/include/linux/capability.h to hold and manipulate the capabilities. If security is configured into the kernel (CONFIG_SECURITY), the capable function points to an operation that is part of a plugin to the Linux 2.6 security framework. If not, the function capable is an inline that can be called from either inside the kernel or a user-level process. It checks the requested capability against the effective capabilities of the current process, the one that made the user-level socket call. This is how capable is implemented as an inline.

static inline int capable(int cap)
{
    /* Is the requested capability raised in the effective set? */
    if (cap_raised(current->cap_effective, cap)) {
        current->flags |= PF_SUPERPRIV;
        return 1;
    }
    return 0;
}

In Linux, the global variable current always points to the currently executing process. The kernel capabilities are stored in a 32-bit integer, kernel_cap_t. Current stores the following three types of capabilities.

kernel_cap_t cap_effective, cap_inheritable, cap_permitted;

However, in the socket layer code, we only actually check the cap_effective capabilities because we are interested in the capabilities of the current application process making the socket call. See the file linux/include/linux/capability.h for a list of all the POSIX capabilities and the numerical values associated with each.

A Note about the Socket API and IPv6

In this section, we will mention the address family used with IPv6 and a few changes that were made to the 2.6 kernel socket API to accommodate the protocol suite. IPv6 introduces a new address family, AF_INET6, which is defined in linux/include/linux/socket.h along with the other address families. In addition, there is a new socket address type for IPv6, sockaddr_in6, defined in file linux/include/linux/in6.h.

struct sockaddr_in6 {

The first two fields look very much like the sockaddr_in structure. The following field, sin6_family, is set to the value AF_INET6 or 10.

unsigned short int sin6_family;

This field, sin6_port, is the port number for either UDP or TCP. The next field, sin6_flowinfo, is the IPv6 flow information.

__u16 sin6_port;
__u32 sin6_flowinfo;

The next field, sin6_addr, is the actual IPv6 address defined as a union. The last field is the scope ID, which defines the scope of the address.

struct in6_addr sin6_addr;
__u32 sin6_scope_id;
} ;

IPv6 sockets are backward compatible with IPv4. IPv6 specifies that data can be sent either via IPv4 or IPv6 through any open IPv6 socket. The socket API is defined so that application code that transmits data over IPv6 will also be compatible with IPv4 without modification. Underneath the covers, the actual 128-bit IPv6 address has a subtype that includes the 32-bit IPv4 address. In addition, to aid programmers in writing code that is independent of either protocol type, Linux provides API library functions. These functions convert addresses between the IPv4 and IPv6 formats. The first function, getaddrinfo(3), is for network address and service translation.

int getaddrinfo (const char *node, const char *service,
const struct addrinfo *hints, struct addrinfo **res);

It is a generic function that combines the functionality of three other functions: getipnodebyaddr(3), getservbyname(3), and getipnodebyname(3). Getaddrinfo creates either IPv4 or IPv6 type address structures for use with the bind or connect socket calls. If not NULL, hints specifies the preferred socket type. It points to an instance of addrinfo, the fields of which determine the socket type. Either or both of the next two parameters, node or service, may be specified, but only one of them can be NULL. Node specifies an address in IPv4 format, an address in IPv6 format, or a hostname. Service specifies the port number. Getaddrinfo supports multiple addresses and multihoming; therefore, the result res is a linked list of addrinfo structures.

The data structure, addrinfo, is used in both the hints and the result for the getaddrinfo(3) library function. If used for hints, addrinfo specifies the preferred address family: AF_INET, AF_INET6, or AF_UNSPEC.

struct addrinfo {
int ai_flags;
int ai_family;
int ai_socktype;
int ai_protocol;
size_t ai_addrlen;
struct sockaddr *ai_addr;
char *ai_canonname;
struct addrinfo *ai_next;
} ;

The next function, freeaddrinfo(3), deletes the linked list of addrinfo structures, pointed to by res, which was created by getaddrinfo(3).

void freeaddrinfo (struct addrinfo *res );

The last function that we will discuss is getnameinfo(3).

int getnameinfo (const struct sockaddr *sa, socklen_t salen,
char *host, size_t hostlen, char *serv, size_t servlen, int flags);

The parameter sa points to a generic socket address structure and can be either sockaddr_in or sockaddr_in6. Getnameinfo is a generalized function for address-to-node-name translation that works with the address formats of both IPv4 and IPv6. It converts a numerical address to a text host name in a way that is independent of IPv4 and IPv6.
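The three functions can be exercised together without any name server by restricting them to numeric input. The sketch below is an illustration only: the function name numeric_roundtrip is ours, and the choice of the AI_NUMERICHOST, AI_NUMERICSERV, NI_NUMERICHOST, and NI_NUMERICSERV flags is an assumption made so that the example runs without DNS.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Resolve a numeric IPv4 or IPv6 literal and a numeric port, then turn the
 * first result back into text. Returns the address family, or -1 on failure. */
int numeric_roundtrip(const char *node, const char *service,
                      char *host, size_t hostlen, char *serv, size_t servlen)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    int family = -1;

    assert(node != NULL && service != NULL && host != NULL && serv != NULL);
    memset(&hints, 0, sizeof hints);
    hints.ai_family   = AF_UNSPEC;            /* accept IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags    = AI_NUMERICHOST | AI_NUMERICSERV;   /* no lookups */

    if (getaddrinfo(node, service, &hints, &res) != 0)
        return -1;
    /* res is a linked list; the first entry is enough for this sketch */
    if (getnameinfo(res->ai_addr, res->ai_addrlen,
                    host, (socklen_t)hostlen, serv, (socklen_t)servlen,
                    NI_NUMERICHOST | NI_NUMERICSERV) == 0)
        family = res->ai_family;
    freeaddrinfo(res);                        /* releases the whole list */
    return family;
}
```

The same code handles "127.0.0.1" and "::1" without any branch on the address family, which is the point of these protocol-independent functions.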

Implementation of the Socket API System Calls

Many operating systems designed for embedded systems are implemented in a flat memory space. These operating systems were originally designed for CPUs that don’t have a Memory Management Unit (MMU) that maps physical memory to virtual memory. In these systems, which are generally smaller, the operating system kernel functions can be called directly from application-level programs without doing any memory address translation. In contrast, Linux is a virtual memory operating system. It requires a processor with an MMU. In a virtual memory operating system, each user process runs in its own virtual address space. The socket API functions are included in the system calls. These system calls are different from ordinary library calls because they can be used in nonblocking mode so that the calling user process does not have to wait while the operating system completes the processing of the request. In addition, arguments in the function call are in the memory space of the user process and must be mapped into kernel space before they can be accessed by any kernel-level code. In our case, these arguments point to data to be sent and received through the TCP/IP protocol stack.

The socket API supports other protocol families besides AF_INET or TCP/IP, and all the protocol families supported by Linux are shown in Table. In addition, Linux provides the capability of defining a new module containing a previously unknown protocol family. Because of this, directing each application layer socket call to the specific protocol that must respond to the request is a complex process with several steps. First, any address referenced in the call’s arguments must be mapped from user space to kernel space.

However, we do discuss the Linux slab cache system of memory allocation as part of the examination of socket buffers. Next, the functions themselves must be translated from generic socket layer functions to the specific functions for the protocol family. Finally, the functions must be translated from the protocol family generic functions to the specific functions for the member protocol in the family. As we shall see in this section, each of the socket API calls is mapped to a corresponding call in the kernel with sys_ in front of the name, and each of these calls is defined in the file linux/net/socket.c. Most of these functions don’t do much other than call the address family-specific function through the ops field in the socket structure. Sys_socket is discussed separately in another section because it is quite complex: it creates a new socket and sets up the structures that allow the other socket API functions to work.

Earlier in this chapter, we covered each of the socket API functions; in this section, we will see what happens under the hood when they are called. Figure shows how the application layer socket calls are mapped to the corresponding protocol-specific kernel functions. When any of the socket API functions is called, it causes an interrupt to the kernel’s syscall facility, which in turn calls the sys_socketcall function in the file linux/net/socket.c. In many CPU architectures such as the Intel x86 family, pointers to the system call’s arguments are passed to the kernel in one or more CPU registers. The implementation of the socket call itself is discussed in Section. The implementation of the other socket functions is discussed in this section.

The Socket Multiplexor

The purpose of the socket multiplexor is to unravel the socket system calls. This mechanism is implemented in the file linux/net/socket.c and consists largely of the function sys_socketcall. This function maps the addresses in the arguments from the user-level socket function to kernel space and calls the correct kernel call for the specified protocol.

asmlinkage long sys_socketcall(int call, unsigned long __user *args);

The first thing it does is map each address from user space to kernel space. It does this by calling copy_from_user. Next, sys_socketcall invokes the system call function that corresponds to a user-level socket call. For example, when the user calls bind, sys_socketcall maps the user-level bind to the kernel function sys_bind, and listen is mapped to sys_listen. Each of these socket system call functions is also defined in the file linux/net/socket.c.

The socket and accept calls return a file descriptor (fd). The fd is also referred to as a socket. To support standard IO, sockets and file descriptors are treated as the same thing from the IO call’s point of view. The fd serves as a handle to reference the open socket. Each socket API function includes the file descriptor for the open socket as the first parameter. When a socket API function is called, in the kernel, we use fd to fetch a pointer to the socket structure that was originally created when the socket was opened. Once we have a pointer to the socket structure, we retrieve the function specific to the address family and protocol type through the open socket. To do this, we call the protocol-specific function through a pointer in the structure pointed to by the ops field of the socket structure. The socket structure is discussed in Section, and the proto_ops structure in Section.
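The multiplexing step can be modeled in user space as a table of argument counts plus a switch statement. This is a simplified sketch and not the kernel source: only four calls are included, memcpy stands in for copy_from_user, the handlers are stubs, and every model_ and MODEL_ name is hypothetical. The call numbers match the values that sys_socketcall switches on.

```c
#include <assert.h>
#include <string.h>

/* Call numbers follow the SYS_* values that sys_socketcall switches on. */
enum {
    MODEL_SYS_SOCKET  = 1,
    MODEL_SYS_BIND    = 2,
    MODEL_SYS_CONNECT = 3,
    MODEL_SYS_LISTEN  = 4
};

/* Number of argument words each call takes, indexed by call number. */
static const unsigned char model_nargs[] = { 0, 3, 3, 3, 2 };

/* Stub handlers standing in for sys_socket, sys_bind, and so on. Each
 * returns a value derived from its arguments so the dispatch is visible. */
static long model_sys_socket(int family, int type, int protocol)
{
    return family * 100L + type * 10L + protocol;
}

static long model_sys_bind(int fd, unsigned long addr, int addrlen)
{
    (void)fd; (void)addr;
    return addrlen;
}

static long model_sys_connect(int fd, unsigned long addr, int addrlen)
{
    (void)fd; (void)addr;
    return addrlen;
}

static long model_sys_listen(int fd, int backlog)
{
    (void)fd;
    return backlog;
}

/* Copy in the argument words for this call and dispatch to its handler. */
long model_socketcall(int call, const unsigned long *user_args)
{
    unsigned long a[3] = { 0, 0, 0 };

    if (call < MODEL_SYS_SOCKET || call > MODEL_SYS_LISTEN)
        return -1;                  /* the kernel reports -EINVAL here */
    assert(user_args != NULL);

    /* memcpy stands in for copy_from_user: copy only the words this call uses */
    memcpy(a, user_args, model_nargs[call] * sizeof a[0]);

    switch (call) {
    case MODEL_SYS_SOCKET:
        return model_sys_socket((int)a[0], (int)a[1], (int)a[2]);
    case MODEL_SYS_BIND:
        return model_sys_bind((int)a[0], a[1], (int)a[2]);
    case MODEL_SYS_CONNECT:
        return model_sys_connect((int)a[0], a[1], (int)a[2]);
    default:                        /* MODEL_SYS_LISTEN */
        return model_sys_listen((int)a[0], (int)a[1]);
    }
}
```

The per-call argument count matters because the arguments arrive as a single pointer into user memory, so the multiplexor must know how many words to copy before it can call the handler.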

For an example of how the socket calls are mapped, we will look at the specific socket API call, send. When the application calls send, the kernel translates this call to sys_send, which calls sock_sendmsg, which in turn calls the sendmsg function for the protocol by dereferencing the proto_ops structure in the socket.

return sock->ops->sendmsg(iocb, sock, msg, size);

Sys_bind, sys_listen, and sys_connect do little other than call the address family’s function. In addition, sys_getname and sys_getpeername both map to the address family’s function for the open socket, fd.

Sys_send does nothing more than call sys_sendto. Instead of calling directly into the protocol, sys_sendto calls the socket layer function sock_sendmsg, described in Section. Likewise, sys_recv calls sys_recvfrom directly, and sys_recvfrom calls the socket layer function sock_recvmsg.
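The chain from send down to the protocol can be sketched with a table of function pointers. This is a user-space model with hypothetical model_ names, not kernel code: the socket is passed in directly instead of being looked up from a file descriptor, and the message is a plain buffer instead of a msghdr.

```c
#include <assert.h>
#include <stddef.h>

struct model_sock;

/* Per-family operations table, standing in for struct proto_ops. */
struct model_proto_ops {
    long (*sendmsg)(struct model_sock *sock, const void *buf, size_t len);
};

/* Reduced stand-in for struct socket. */
struct model_sock {
    const struct model_proto_ops *ops;
    size_t bytes_sent;               /* bookkeeping for the demo only */
};

/* Socket layer entry point: dispatch through the ops pointer. */
long model_sock_sendmsg(struct model_sock *sock, const void *buf, size_t len)
{
    return sock->ops->sendmsg(sock, buf, len);
}

/* sendto enters the socket layer; destination handling is omitted here. */
long model_sys_sendto(struct model_sock *sock, const void *buf, size_t len,
                      const void *addr)
{
    (void)addr;
    assert(sock != NULL && sock->ops != NULL);
    return model_sock_sendmsg(sock, buf, len);
}

/* send is simply sendto without a destination address. */
long model_sys_send(struct model_sock *sock, const void *buf, size_t len)
{
    return model_sys_sendto(sock, buf, len, NULL);
}

/* A protocol-specific sendmsg, such as an address family would supply. */
long model_inet_sendmsg(struct model_sock *sock, const void *buf, size_t len)
{
    (void)buf;
    sock->bytes_sent += len;
    return (long)len;
}

const struct model_proto_ops model_inet_ops = { model_inet_sendmsg };
```

Both entry points end up in the same protocol function, so a different protocol family only has to supply a different ops table; the upper layers are unchanged.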

Sys_setsockopt and sys_getsockopt check the level argument. If level is set to SOL_SOCKET, one of the socket layer functions, sock_setsockopt or sock_getsockopt, is called. If level is anything else, the respective protocol-specific function is called through the proto_ops structure accessed by the ops field of the socket structure. Sys_shutdown also calls the corresponding protocol-specific function through the shutdown field in the proto_ops structure.

Sys_sendmsg and sys_recvmsg have a bit more work to do than the other socket functions. They must first verify that the iovec buffer array contains valid addresses. The data at each address is copied between user and kernel space later, when the data is actually transferred, but the addresses are validated now. After completing the validation of the iovec structure, the sock_sendmsg and sock_recvmsg functions are called, respectively.

Sys_accept is a bit more complicated because it has to establish a new socket for the new incoming connection. The first thing it does is call sock_alloc to allocate a new socket. Next, it has to get a name for the socket by calling the function pointed to by the getname field in the ops field in the socket structure. Remember that the "name" of a socket is the address and port number associated with the socket. Next, it calls sock_map_fd to map the new socket into the pseudo socket filesystem. Refer to Section to see how this works because it is very similar to what the socket call does.

Implementation of Socket Layer Internal Functions

In the previous sections, we showed how the application layer calls map to the socket system calls and how these calls are resolved to the internal socket layer functions. Now, we will look at some of these internal socket layer functions in more detail.

The first two functions in the socket layer, sock_sendmsg and sock_recvmsg, will be discussed briefly. They are called from sys_sendmsg and sys_recvmsg, respectively. These two functions are implemented in the socket layer to support the sending and receiving of "credentials." This is a Unix-compatible method of passing file descriptors among user-level applications. Sock_sendmsg calls scm_send before calling the protocol-specific sendmsg function, and likewise, sock_recvmsg calls scm_recv after calling the protocol-specific recvmsg function. The functions scm_send and scm_recv are defined in linux/net/core/scm.c and implement the "credentials."

Next, we will discuss the functions sock_read, sock_write, sock_fcntl, sock_close, and sock_ioctl. These functions are also defined in file linux/net/socket.c and are actually socket layer implementations of the IO system calls. Each of these functions is called with a file pointer in the argument file. They first call the socki_lookup function to get the socket structure, and then call the socket layer function with a pointer to the socket structure. In Section we saw how the file IO system calls are mapped to sockets. The functions, sock_read and sock_write set up an iovec type msghdr structure before calling sock_recvmsg and sock_sendmsg, respectively.

Sock_setsockopt and sock_getsockopt are called from the system call if level is set to SOL_SOCKET. The purpose of these functions is to set values in the sock structure according to the options that were passed as a parameter by the application layer. The system-level functions, sys_setsockopt and sys_getsockopt call these socket layer functions before any protocol-specific settings are altered.

Sock_setsockopt gets a pointer to the sock structure from the sk field of the socket structure, sock, which was passed as an argument. Next, it sets options in the sock structure, sk, based on the option named in optname and the value pointed to by optval. Refer to Section for a description of the fields in the sock structure. Most options map directly onto one field: SO_DEBUG sets debug, SO_REUSEADDR sets reuse, SO_DONTROUTE sets localroute, SO_NO_CHECK sets no_check, SO_BSDCOMPAT sets bsdism, SO_PASSCRED sets passcred, SO_TIMESTAMP sets rcvtstamp, and SO_RCVLOWAT sets rcvlowat. For SO_SNDBUF or SO_RCVBUF, sndbuf or rcvbuf is set to two times optval or to a minimum value, whichever is greater. For SO_PRIORITY, the priority field is set to optval, but the capabilities of the caller are checked first.
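From the application side, the options above are set with setsockopt at level SOL_SOCKET. The sketch below sets an integer option and reads it back; the helper name set_and_get_sockopt is an assumption made for this example. On Linux, reading back SO_RCVBUF or SO_SNDBUF normally shows a larger value than the one requested, which matches the doubling described above, although the exact result depends on system limits.

```c
#include <assert.h>
#include <sys/socket.h>

/* set_and_get_sockopt (hypothetical helper): set an integer SOL_SOCKET
 * option, then read it back. Returns the value the kernel reports,
 * or -1 if either call fails. */
int set_and_get_sockopt(int fd, int optname, int value)
{
    int got = 0;
    socklen_t len = sizeof(got);

    assert(fd >= 0);
    if (setsockopt(fd, SOL_SOCKET, optname, &value, sizeof(value)) != 0)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, optname, &got, &len) != 0)
        return -1;
    return got;
}
```

For example, set_and_get_sockopt(fd, SO_REUSEADDR, 1) should report a nonzero value, and set_and_get_sockopt(fd, SO_RCVBUF, 8192) should report at least the size that was requested.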

A few other fields in sk are handled specially. If SO_KEEPALIVE is set in optname, and the protocol field in sk is equal to IPPROTO_TCP, tcp_set_keepalive is called with the value in optval. Remember that the protocol field of the sock structure comes from the protocol argument in the socket function that created the socket. If the SO_LINGER option is present, the linger field is set and the value of the lingertime field is calculated from optval.

Sock_getsockopt reverses what sock_setsockopt does. It retrieves certain values from the sock structure for the requested socket option and returns them to the user. Refer to Section on the sock structure to see which fields hold values for the socket options. A few options deserve special attention because they don’t have a corresponding action in sock_setsockopt. For example, if SO_ERROR is set in optname, the current socket error is retrieved from the err field in the sock structure and the err field is cleared atomically. If the SO_TYPE option is set, the value of the type field is returned.
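The two read-only options mentioned above can be queried from user space with getsockopt. This is a small sketch; the helper names socket_type and take_socket_error are invented for the example.

```c
#include <assert.h>
#include <sys/socket.h>

/* socket_type (hypothetical helper): return the SO_TYPE value of a
 * socket, for example SOCK_STREAM or SOCK_DGRAM, or -1 on failure. */
int socket_type(int fd)
{
    int type = -1;
    socklen_t len = sizeof(type);

    assert(fd >= 0);
    if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) != 0)
        return -1;
    return type;
}

/* take_socket_error (hypothetical helper): fetch the pending error with
 * SO_ERROR. Reading the option also clears it, so a second call reports
 * zero unless a new error has occurred. Returns -1 if getsockopt fails. */
int take_socket_error(int fd)
{
    int err = 0;
    socklen_t len = sizeof(err);

    assert(fd >= 0);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) != 0)
        return -1;
    return err;
}
```

A common use of SO_ERROR is to learn the outcome of a non-blocking connect once the socket becomes writable.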

Implementation of Protocol Internal Socket Functions

Most of the behavior specific to the member protocols in the AF_INET family is described in later chapters. In this section, we will complete the discussion of how each member protocol communicates with the socket layer. As shown in the figure "Mapping of socket calls," the file descriptor fd is used to map each socket API call to a function specific to each protocol. The mechanism that sets up this mapping is described in detail in Section for the socket system calls, and in Section for the IO system calls. In addition, as we saw in Section, each of the protocols registers itself with the protocol switch table. When the socket structure is initialized, as described in Section, the ops field is set to the set of protocol-specific operations from the entry in the protocol switch table.

Mapping of socket calls.

Once all the complex initialization is done as described in other sections, the actual mapping is quite simple. In most cases, the "sys_" versions of the socket functions simply call sockfd_lookup to get a pointer to the socket structure and call the protocol’s function through the ops field. The function sys_getsockname in file linux/net/socket.c can provide us with a simple example. This function is called in the kernel when the user executes the getsockname socket API function to get the address (name) of a socket.

asmlinkage long sys_getsockname(int fd, struct sockaddr *usockaddr, int *usockaddr_len)

Fd is the open socket. After returning, usockaddr will point to the address for the socket. usockaddr_len points to the length of the socket address.

struct socket *sock;
char address[MAX_SOCK_ADDR];
int len, err;
sock = sockfd_lookup(fd, &err);
if (!sock)
    goto out;

This is where the protocol-specific function is called. The ops field contains the socket functions for each protocol. The protocols are UDP for SOCK_DGRAM and TCP for SOCK_STREAM.

err = sock->ops->getname(sock, (struct sockaddr *)address, &len, 0);
if (err)
    goto out_put;

Here the return values are mapped into user space.

err = move_addr_to_user(address, len, usockaddr, usockaddr_len);
out_put:

This drops the use count on the open "file" associated with the socket, fd.

sockfd_put(sock);
out:
return err;
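The user-level side of this call is short. The sketch below binds a UDP socket to an ephemeral port and then uses getsockname to learn which port the kernel chose. The function name bound_udp_port is an assumption made for this example.

```c
#include <assert.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* bound_udp_port (hypothetical helper): bind a UDP socket to port 0 so
 * the kernel picks a port, then ask for the socket's name.
 * Returns the port in host byte order, or -1 on failure. */
int bound_udp_port(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int port = -1;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = 0;                      /* 0 asks for an ephemeral port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
        getsockname(fd, (struct sockaddr *)&addr, &len) == 0) {
        assert(len == sizeof(addr));        /* AF_INET names are sockaddr_in */
        port = ntohs(addr.sin_port);
    }
    close(fd);
    return port;
}
```

The getsockname call here is the one that ends up in sys_getsockname, which in turn reaches the protocol through the ops field.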

Creation of a Socket

Before the user can perform any operations with the TCP/IP stack, she creates a new socket by calling the socket API function. Sockets are generic and not necessarily associated with TCP/IP, so quite a few things happen before any code in the TCP/IP stack itself gets called. Sockets can be created for many protocol types other than TCP/IP. This section will describe what happens under the hood in the Linux kernel when the function socket is called. Earlier in this chapter, we described how the socket API functions get mapped to the protocol family-specific functions. It is through this mapping that the sys_socket function executes an AF_INET specific function for the TCP/IP protocol family. First, the socket layer is responsible for activities that are not specific to a particular protocol family like AF_INET. After the generic initialization is complete, socket creation will call the socket creation function for the protocol-specific family. After we discuss the generic socket creation, we will discuss what happens during socket creation for the AF_INET protocol family.

Sock_create, defined in file linux/net/socket.c, is called from sys_socket. This function initiates the creation of a new socket.

int sock_create(int family,int type,int protocol,struct socket **res); 

First, sock_create verifies that family is one of the allowed family types shown in Table. Then, it allocates a socket by calling sock_alloc, which returns a new socket structure, sock. See Section for a discussion of the socket structure. The socket structure is actually part of an inode structure, created when sock_alloc calls new_inode. It is necessary to have an inode for Linux IO system calls to work with sockets. Section explores the IO system call mapping in more detail. Most of the fields in the inode structure are important only to "real" filesystems; however, a few of them are used by sockets. The fields in the inode structure that are used for sockets are listed in the table "Inode Structure Fields Used by Sockets."

Inode Structure Fields Used by Sockets

Once the inode is created, sock_alloc retrieves the socket structure from the inode. Then, it initializes a few fields in the socket structure. The inode field is set to point back to the inode structure containing this socket structure, and fasync_list is set to NULL. Fasync_list supports the fasync file operation, through which a process asks to be notified asynchronously, with the SIGIO signal, when the state of the socket changes. It should not be confused with the fsync system call, which synchronizes the in-memory portions of a file with permanent storage and whose behavior is defined by Posix.1b. See [GALL95] for more information about Posix.1b.

Sockets maintain a state that records whether an open socket represents a connection to a peer. This state is kept in the state field of the socket structure, which is initialized to SS_UNCONNECTED when the socket is created. See the table "Socket States" for a description of the socket states. These states are defined in the file linux/include/linux/net.h.

Socket States

It is important to remember that sockets are not just for TCP/IP. The states maintained in the socket structure do not contain all the same states as TCP, which is quite a bit more complicated. The socket state really only reflects whether there is an active connection. Sockets support other protocol families and must be generic. The internal logic to manage the protocol attached to a new socket will be maintained in the protocol itself, and TCP is no exception. However, since the socket layer must support several protocols, it requires some internal connection management logic in addition to what is in the internal TCP implementation.

For now, we set ops to NULL. Later, when the family member protocol-specific create function is called, ops will be set to the set of protocol-specific operations. Sock_alloc initializes a few more fields before returning. The flags field is initialized to zero because no flags are set yet. Later, the application will specify the flags by calling one of the send or receive socket API functions. Table contains a description of the flags used with the socket API. Finally, two other fields, sk and file, are initialized to NULL, but they both deserve a little attention. The first of these fields, sk, will be set later to point to the internal sock structure by the protocol-specific create function. File will be set to a file pointer allocated when sock_map_fd is called. The file pointer is used to maintain the state of the pseudo file associated with the open socket. After returning from the sock_alloc call, sock_create calls the create function for the protocol family. It accesses the array net_families to get the family’s create function. For TCP/IP, family will be set to AF_INET.
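The lookup through the net_families array can be pictured with a small user-space model. This is only a toy sketch under stated assumptions: every name beginning with toy_ or TOY_ is invented for the illustration, the structures are far smaller than the kernel's, and the numeric values are placeholders.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NPROTO 32            /* hypothetical size of the family table */
#define TOY_AF_INET 2            /* placeholder family number */
#define TOY_SS_UNCONNECTED 0     /* placeholder for the initial state */

struct toy_socket {
    int type;
    int state;
    const char *family_name;     /* filled in by the family's create function */
};

struct toy_proto_family {
    int family;
    int (*create)(struct toy_socket *sock, int protocol);
};

/* One slot per address family, indexed by family number. */
static const struct toy_proto_family *toy_families[TOY_NPROTO];

/* Register a family; fails if the slot is invalid or already taken. */
int toy_sock_register(const struct toy_proto_family *fam)
{
    if (fam == NULL || fam->family < 0 || fam->family >= TOY_NPROTO)
        return -1;
    if (toy_families[fam->family] != NULL)
        return -1;
    toy_families[fam->family] = fam;
    return 0;
}

/* Generic creation: do the family-independent setup first, then hand
 * the socket to the registered family's create function. */
int toy_sock_create(int family, int type, int protocol,
                    struct toy_socket *sock)
{
    assert(sock != NULL);
    if (family < 0 || family >= TOY_NPROTO || toy_families[family] == NULL)
        return -1;
    sock->type = type;
    sock->state = TOY_SS_UNCONNECTED;
    sock->family_name = NULL;
    return toy_families[family]->create(sock, protocol);
}

/* Family-specific part, standing in for a function like inet_create. */
static int toy_inet_create(struct toy_socket *sock, int protocol)
{
    (void)protocol;
    sock->family_name = "toy-inet";
    return 0;
}

const struct toy_proto_family toy_inet_family = {
    .family = TOY_AF_INET,
    .create = toy_inet_create,
};
```

In this model, toy_sock_create fails for a family until toy_sock_register has been called for it, which mirrors the ordering requirement between protocol registration and socket creation.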

Creation of an AF_INET Socket

The create function for the TCP/IP protocol family, AF_INET, is inet_create, which is defined in file af_inet.c.

static int inet_create(struct socket *sock,int protocol);

Inet_create is defined as static because it is not called directly. Instead, it is called through the create field in the net_proto_family structure for the AF_INET protocol family. Sock_create, which is reached from sys_socket, makes this call when family is set to AF_INET. In inet_create, we create a new sock structure called sk and initialize a few more fields. The new sock structure is allocated from the slab cache, inet_sk_slab. Linux has multiple inet_sk_slabs, one for each protocol. Linux slab caches are more efficient when each one is dedicated to a single purpose, because most fields can then be pre-initialized to common values. We call sk_alloc to allocate the sock structure from the slab cache that is specific to the protocol for this socket.

sk = sk_alloc(PF_INET, GFP_KERNEL, 

Sk points to the new sock structure allocated from the slab cache. Next, inet_create searches the protocol switch table for an entry that matches the protocol.

After getting the result from the search of the protocol switch table, the capability flags are checked against the capabilities of the current process, and if the caller doesn’t have permission to create this type of socket, the user level socket call will return the EPERM error.

Now, inet_create will set some fields in the new sock data structure; however, many fields are pre-initialized when allocation is done from the slab cache. The field sk_family is set to PF_INET. The prot field is set to the protocol’s protocol block structure that defines the specific function for each of the transport protocols. No_check and ops are set according to their respective values in the protocol switch table. If the type of the socket is SOCK_RAW, the num field is set to the protocol number. As will be shown in later chapters, this field is used by IP to route packets internally depending on whether there is a raw socket open. The sk_destruct field of sk is set to inet_sock_destruct, the sock structure destructor. The sk_backlog_rcv field is set to point to the protocol-specific backlog receive function. Next, some fields in the protocol family-specific part of the sock structure are initialized. As discussed in Section, the sock structure is followed by a protocol-specific portion for each of the two protocol families, IPv4 and IPv6, and this part is accessed through a macro, inet_sk, which is defined in the file linux/include/linux/ip.h.

#define inet_sk(__sk) (&((struct inet_sock *)__sk)->inet)

The inet_opt structure for IPv4 is also defined in the file linux/include/linux/ip.h.

struct inet_opt {

These first few fields are for socket layer de-multiplexing of incoming packets. Daddr is the peer IPv4 address, and rcv_saddr is the bound local IPv4 address. Dport is the destination port, num is the local port, and saddr is the source address.

__u32 daddr;
__u32 rcv_saddr;
__u16 dport;
__u16 num;
__u32 saddr;

The uc_ttl field is for setting the time-to-live field in the IP header.

int uc_ttl;

Tos is for setting the type of service field in the IP header. The cmsg_flags field is used by setsockopt and getsockopt to communicate network layer IP socket options to the IP protocol layer. In addition, opt points to the ip_options structure that holds the values associated with the options set in the cmsg_flags field.

int tos;
unsigned cmsg_flags;
struct ip_options *opt;

Sport is the source port number.

__u16 sport;

Hdrincl is for raw sockets. It states that the IP header is included in the packet delivered to the application level.

unsigned char hdrincl;

This field is the multicasting time-to-live.

__u8 mc_ttl;

The next field, mc_loop, indicates whether multicast packets should be looped back. Pmtudisc indicates whether path MTU discovery should be performed for this socket.

__u8 mc_loop; 
__u8 pmtudisc;

The next field, id, contains the counter for the identification field in the IP header. This field is used by the receiving machine for reassembling fragmented IP packets.

__u16 id;
unsigned recverr : 1,
freebind : 1;

Mc_index is the index for the output network interface used for transmission of multicast packets, and mc_addr is the source address used for outgoing packets sent to a multicast address.

int mc_index;
__u32 mc_addr;

This field, mc_list, points to the list of multicast address groups to which the interface has subscribed.

struct ip_mc_socklist *mc_list;

The next field is the cached page from the sendmsg socket function, and sndmsg_off is the offset into the page.

struct page *sndmsg_page;
u32 sndmsg_off;

The following structure keeps information about the IP options needed to build an IP header on each outgoing IP fragment. Since all the fragments have almost identical headers, the options are kept here to speed the process of building IP headers on consecutive fragments. It is called cork, because the socket is "corked," waiting for all fragments of the total IP datagram to be transmitted.

struct {
unsigned int flags;
unsigned int fragsize;
struct ip_options *opt;
struct rtable *rt;

This field, length, is the total length of all frames in the fragmented IP datagram.

int length;
u32 addr;
struct flowi fl;
} cork;
} ;

Some fields in the inet_opt structure are initialized by inet_create. Some of these fields are related to multicast transmission. In each case, inet points to the instance of the inet_opt structure shown previously.

inet->uc_ttl = -1;
inet->mc_loop = 1;
inet->mc_ttl = 1;
inet->mc_index = 0;
inet->mc_list = NULL;
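The initial multicast values assigned above are visible from user space through getsockopt at level IPPROTO_IP. The following sketch reads an option from a fresh UDP socket and changes the multicast time-to-live; the helper names get_ip_option and set_multicast_ttl are assumptions made for this example.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* get_ip_option (hypothetical helper): read an integer IPPROTO_IP
 * option such as IP_MULTICAST_TTL or IP_MULTICAST_LOOP.
 * Returns the value, or -1 on failure. */
int get_ip_option(int fd, int optname)
{
    int value = -1;
    socklen_t len = sizeof(value);

    assert(fd >= 0);
    if (getsockopt(fd, IPPROTO_IP, optname, &value, &len) != 0)
        return -1;
    return value;
}

/* set_multicast_ttl (hypothetical helper): override the default
 * multicast time-to-live for this socket. Returns 0 on success. */
int set_multicast_ttl(int fd, int ttl)
{
    assert(fd >= 0);
    return setsockopt(fd, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, sizeof(ttl));
}
```

On a newly created AF_INET datagram socket, get_ip_option(fd, IP_MULTICAST_TTL) and get_ip_option(fd, IP_MULTICAST_LOOP) should both report 1, matching the assignments to mc_ttl and mc_loop. After set_multicast_ttl(fd, 4), the first call should report 4.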

The default multicast time-to-live field, mc_ttl, is initialized because this value will be used in the time-to-live IP header field for multicast packets. Even though these values are initialized here, the application may change the values in these fields later through the setsockopt call. Finally, inet_create calls the protocol-specific initialization function through the init field in the protocol block structure, proto, defined in file linux/include/net/sock.h. Here is the proto structure.

struct proto {

Most of the fields in this structure point to the protocol-specific operations. We will explain a few of these functions in a little more detail.

void (*close)(struct sock *sk, long timeout);
int (*connect)(struct sock *sk, struct sockaddr *uaddr, int addr_len);
int (*disconnect)(struct sock *sk, int flags);
struct sock *(*accept)(struct sock *sk, int flags, int *err);
int (*ioctl)(struct sock *sk, int cmd, unsigned long arg);

Init points to the protocol-specific initialization function. This function is called when a socket is created for this protocol. Destroy points to the destructor function for this protocol. The destructor is executed when a socket for this protocol is closed.

int (*init)(struct sock *sk);
int (*destroy)(struct sock *sk);
void (*shutdown)(struct sock *sk, int how);
int (*setsockopt)(struct sock *sk, int level,
    int optname, char *optval, int optlen);
int (*getsockopt)(struct sock *sk, int level,
    int optname, char *optval,
    int *option);
int (*sendmsg)(struct kiocb *iocb, struct sock *sk,
    struct msghdr *msg, int len);
int (*recvmsg)(struct kiocb *iocb, struct sock *sk,
    struct msghdr *msg,
    int len, int noblock, int flags,
    int *addr_len);
int (*sendpage)(struct sock *sk, struct page *page,
    int offset, size_t size, int flags);
int (*bind)(struct sock *sk,
    struct sockaddr *uaddr, int addr_len);
int (*backlog_rcv)(struct sock *sk,
    struct sk_buff *skb);

The following three functions are for keeping track of sock structures. Hash adds a sock to the protocol’s lookup tables so that it can be found later, unhash removes it, and get_port gets the port number associated with the sock.

void (*hash)(struct sock *sk);
void (*unhash)(struct sock *sk);
int (*get_port)(struct sock *sk, unsigned short snum);
char name[32];

Inuse counts the sock structures of this protocol that are in use. For SMP implementations, there is one counter per CPU.

struct {
int inuse;
u8 __pad[SMP_CACHE_BYTES - sizeof(int)];
} stats[NR_CPUS];
} ;

The sk_prot field in the sock structure points to the protocol block structure. The init field in the proto structure is specific for each protocol and socket type within the AF_INET protocol family.

Socket Locks: Individual Socket Locks

Each open socket has a locking mechanism to eliminate contention problems between the kernel main thread and the "bottom half," or the various tasklets, timers, and interrupt handlers. As we know from earlier discussion, each open socket contains an instance of the sock structure. The individual socket lock is in the sk_lock field of the sock structure. The sk_lock field is defined as type socket_lock_t, in file linux/include/net/sock.h. The field slock is the actual spinlock. Users is set to one when the socket is locked and to zero when the socket is unlocked, and wq is the queue of waiting tasks.

typedef struct {
spinlock_t slock;
unsigned int users;
wait_queue_head_t wq;
} socket_lock_t;

Generally, the lock is activated with the macro lock_sock.


The socket lock is released by calling the macro release_sock.


Both macros are defined in linux/include/net/sock.h. When the socket is locked, incoming packets are blocked from being put on the receive queue. Instead, they are placed in the backlog queue for later processing.
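The interplay between the lock owner and the backlog queue can be modeled in a few lines of ordinary C. This is a single-threaded toy sketch, not kernel code: it has no real spinlock or wait queue, the queues are small fixed arrays, and every toy_ name is invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_QLEN 8                 /* hypothetical fixed queue capacity */

struct toy_queue {
    int pkt[TOY_QLEN];
    size_t count;
};

struct toy_sock {
    unsigned int users;            /* 1 while process context owns the socket */
    struct toy_queue receive_queue;
    struct toy_queue backlog;
};

/* Append a packet; returns -1 if the queue is full. */
static int toy_enqueue(struct toy_queue *q, int pkt)
{
    if (q->count == TOY_QLEN)
        return -1;
    q->pkt[q->count++] = pkt;
    return 0;
}

/* Models lock_sock: mark the socket as owned. */
void toy_lock_sock(struct toy_sock *sk)
{
    assert(sk->users == 0);
    sk->users = 1;
}

/* Models packet arrival from the "bottom half": if the socket is owned,
 * the packet is parked on the backlog instead of the receive queue. */
int toy_deliver(struct toy_sock *sk, int pkt)
{
    return toy_enqueue(sk->users ? &sk->backlog : &sk->receive_queue, pkt);
}

/* Models release_sock: process the backlog, then give up ownership.
 * Packets that do not fit on the receive queue are dropped in this toy. */
void toy_release_sock(struct toy_sock *sk)
{
    size_t i;

    for (i = 0; i < sk->backlog.count; i++)
        toy_enqueue(&sk->receive_queue, sk->backlog.pkt[i]);
    sk->backlog.count = 0;
    sk->users = 0;
}
```

Delivering a packet between toy_lock_sock and toy_release_sock leaves the receive queue untouched until the release, which is the behavior the paragraph above describes.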

IO System Calls and Sockets

Linux is a Unix-compatible operating system. Like other similar operating systems, Linux has a unified IO facility. All IO system calls are implementation independent; they work with devices, files, or sockets transparently. For most applications, no distinction is necessary. This has a major advantage in that the Linux application programmer does not have to remember three sets of API functions—one set for files, another for device IO, and a third for network protocols. This section discusses how file IO works with sockets and what is done in the socket layer itself to make this possible.

All IO devices, files, and other entities have a file descriptor. This file descriptor references an object in the file system called an inode, and all objects with file system-like behavior have inodes. The inode allows the sockets to be associated with a Virtual File System (VFS). The inode structure is accessed with each IO system call such as read, write, fcntl, ioctl, and close. When an IO call is performed on an open socket, a pointer to the socket structure is retrieved from the inode. Remember that the open system call can’t be used to create a socket; instead, sockets must be created with the socket API function. Once a socket is created, the IO calls work the same way with sockets as they do for files or devices.

The inode is created and initialized by the sock_alloc call defined in file linux/net/socket.c. As discussed earlier in Section, sock_alloc is called from sock_create. Sock_alloc actually creates both a socket structure and an inode structure from the socket inode slab cache. The inline functions SOCKET_I and SOCK_INODE are provided in linux/include/linux/socket.h to map an inode to a socket and a socket to an inode, respectively.

static inline struct socket *SOCKET_I(struct inode *inode);
static inline struct inode *SOCK_INODE(struct socket *socket);

Once the inode is created, the socket layer can map the IO system calls. In socket.c, a file_operations structure, socket_file_ops, is created and initialized with pointers to socket versions of each of the IO system calls.

struct file_operations socket_file_ops = {
llseek: no_llseek,

Llseek is set to the generic "no" version of the socket call because it is not supported for sockets, and the error return is OK.

aio_read: sock_aio_read,
aio_write: sock_aio_write,
poll: sock_poll,
ioctl: sock_ioctl,
mmap: sock_mmap,

The open call is not supported for sockets. Open is set to sock_no_open to disallow opening a socket via the /proc file system. It returns an ENXIO error.

open: sock_no_open,
release: sock_close,
fasync: sock_fasync,
readv: sock_readv,
writev: sock_writev,
sendpage: sock_sendpage
} ;
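The effect of the llseek entry in this table can be observed from an application: seeking on a socket fails. The sketch below reports the errno value that lseek sets; the function name lseek_errno_on_new_socket is an assumption made for the example, and the sketch treats a failure to create the socket as fatal.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* lseek_errno_on_new_socket (hypothetical helper): create a socket,
 * try to seek on it, and return the errno value reported by lseek.
 * Returns 0 if lseek unexpectedly succeeds. */
int lseek_errno_on_new_socket(void)
{
    int err = 0;
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    assert(fd >= 0);               /* sketch: no recovery if creation fails */
    errno = 0;
    if (lseek(fd, 0, SEEK_SET) == (off_t)-1)
        err = errno;               /* sockets are not seekable */
    close(fd);
    return err;
}
```

On Linux the value returned is ESPIPE, the same error that lseek reports for pipes.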

Netlink and Rtnetlink

Netlink is an internal communication protocol. It mainly exists to transmit and receive messages between the application layer and various protocols in the Linux kernel. Netlink is implemented as a protocol with its own address family, AF_NETLINK. It supports most of the socket API functions. Rtnetlink is a set of message extensions to the basic netlink protocol messages. The most common use of netlink is for applications to exchange routing information with the kernel’s internal routing table. Netlink sockets are accessed like any other sockets. Both socket calls and system IO calls will work with netlink sockets. For example, the sendmsg and recvmsg calls are generally used by user-level applications to add and delete routes. Both these calls pass a pointer to the nlmsghdr structure in the msg argument. This structure is defined in the file linux/include/linux/netlink.h.

struct nlmsghdr {

Nlmsg_len is the length of the message including the header. The field, nlmsg_type, indicates the message content.

__u32 nlmsg_len;
__u16 nlmsg_type;

The next field, nlmsg_flags, holds the flags for the request. The values for nlmsg_flags are defined in the table "Values for Netlink Message Flags." The next field, nlmsg_seq, is the sequence number for the message.

__u16 nlmsg_flags;
__u32 nlmsg_seq;

Values for Netlink Message Flags

Nlmsg_pid is the sending Process Identification (PID) if the process is a user-level process, and zero if not.

__u32 nlmsg_pid;
} ;
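To make the header fields concrete, the following sketch fills in an nlmsghdr for a request that asks the kernel to dump its IPv4 routes. It only builds the message in memory and does not send it. The structure name route_dump_request and the function build_route_dump_request are assumptions for this example; the constants and the NLMSG_LENGTH macro come from the Linux user-space headers.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Hypothetical request layout: the netlink header is immediately
 * followed by the rtmsg payload that route messages carry. */
struct route_dump_request {
    struct nlmsghdr nlh;
    struct rtmsg rtm;
};

/* build_route_dump_request (hypothetical helper): fill in a request
 * that asks for a dump of the IPv4 routing table. */
void build_route_dump_request(struct route_dump_request *req,
                              unsigned int seq)
{
    assert(req != NULL);
    memset(req, 0, sizeof(*req));
    req->nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct rtmsg)); /* header + payload */
    req->nlh.nlmsg_type = RTM_GETROUTE;                /* message content */
    req->nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP; /* request, whole table */
    req->nlh.nlmsg_seq = seq;                          /* echoed in replies */
    req->nlh.nlmsg_pid = (unsigned int)getpid();       /* sending process */
    req->rtm.rtm_family = AF_INET;
}
```

An application would pass the filled-in request to sendmsg or send on a socket created with socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE) and then read the replies with recvmsg.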

The netlink protocol is implemented in the file linux/net/netlink/af_netlink.c. It is like any other protocol in the TCP/IP protocol suite, except that it is for exchanging messages between user-level processes and internal kernel entities. It is similar to UDP or TCP in that it defines a proto_ops structure to bind internal calls with socket calls made through the AF_NETLINK address family sockets. The bindings are shown in the netlink_ops declaration, also in the file af_netlink.c.

struct proto_ops netlink_ops = {

The address family type is PF_NETLINK.

.family = PF_NETLINK,
.owner = THIS_MODULE,

The protocol defines release, bind, and connect functions. Most of the other functions are not defined.

.release = netlink_release,
.bind = netlink_bind,
.connect = netlink_connect,
.socketpair = sock_no_socketpair,
.accept = sock_no_accept,
.getname = netlink_getname,
.poll = datagram_poll,
.ioctl = sock_no_ioctl,
.listen = sock_no_listen,
.shutdown = sock_no_shutdown,
.setsockopt = sock_no_setsockopt,
.getsockopt = sock_no_getsockopt,

Sendmsg and recvmsg are the main functions used to send and receive messages through AF_NETLINK sockets.

.sendmsg = netlink_sendmsg,
.recvmsg = netlink_recvmsg,
.mmap = sock_no_mmap,
.sendpage = sock_no_sendpage,
} ;

Just like AF_INET, the family that contains UDP and TCP, the netlink address family registers with the socket layer. It declares a global instance of the net_proto_family structure in the file af_netlink.c.

struct net_proto_family netlink_family_ops = {
.family = PF_NETLINK,
.create = netlink_create,
.owner = THIS_MODULE,
} ;

The netlink module also provides an initialization function for the protocol, netlink_proto_init.

static int __init netlink_proto_init(void);

This function registers the netlink family operations with the socket layer by calling sock_register.

In other chapters, we discuss how the routing tables and the neighbor cache are structured. In this section, we will show how the sendmsg and recvmsg functions are used with rtnetlink to pass requests for updates to the routing and the neighbor tables. All the operations break down to one of two fundamental operations: either retrieve the content of the table or post an update to the table. To support this, rtnetlink provides a structure defined in the file linux/include/linux/rtnetlink.h, called rtnetlink_link. This structure contains only function pointers.

struct rtnetlink_link {
int (*doit)(struct sk_buff *, struct nlmsghdr *, void *attr);
int (*dumpit)(struct sk_buff *, struct netlink_callback *cb);
} ;

Defined in the same file, rtnetlink also defines a global array of pointers to instances of the preceding structure called rtnetlink_links. There are up to 32 entries, set by NPROTO, where each corresponds to a protocol type.

struct rtnetlink_link *rtnetlink_links[NPROTO];

To see how rtnetlink is used, let’s look at an example. If a utility running at the application layer wants to add a route to an internal routing table, it calls sendmsg with a pointer to an nlmsghdr passed as an argument and with the nlmsg_type field set to RTM_NEWROUTE. The socket layer will gather the message into an sk_buff structure and send it to the netlink protocol, which in turn will queue it on the receive queue of a PF_NETLINK socket. The function rtnetlink_rcv_skb in file rtnetlink.c will get the message when it is de-queued from the socket.

extern __inline__ int rtnetlink_rcv_skb(struct sk_buff *skb);

In this function, the first thing we do is to get the nlmsghdr from the data part of the sk_buff and call the function rtnetlink_rcv_msg.

static __inline__ int rtnetlink_rcv_msg(struct sk_buff
*skb, struct nlmsghdr *nlh,int *errp);

Rtnetlink_rcv_msg gets nlmsg_type from nlh, and this value is used to call through the doit function pointer to add the route to the table. There is actually a little more to this process because rtnetlink uses an attached rtmsg structure that includes more information about the routing protocol so we know which routing table to access.
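The dispatch through doit and dumpit can be sketched as a table of function pointers indexed by message type. This is a toy model under stated assumptions: it covers a single protocol family, routes are reduced to plain integers, and all names and numeric values beginning with toy_ or TOY_ are invented for the illustration rather than taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_RTM_BASE     16   /* hypothetical first message type */
#define TOY_RTM_NEWROUTE 24   /* hypothetical "add a route" type */
#define TOY_RTM_GETROUTE 26   /* hypothetical "dump routes" type */
#define TOY_RTM_MAX      32   /* hypothetical upper bound, exclusive */
#define TOY_MAX_ROUTES    8

struct toy_msg {              /* stands in for nlmsghdr plus payload */
    int type;
    int dest;
};

struct toy_link {             /* mirrors the doit and dumpit pair */
    int (*doit)(const struct toy_msg *msg);
    int (*dumpit)(int *out, size_t max);
};

static int toy_routes[TOY_MAX_ROUTES];
static size_t toy_route_count;

/* Update operation: post a new entry to the table. */
static int toy_newroute(const struct toy_msg *msg)
{
    if (toy_route_count == TOY_MAX_ROUTES)
        return -1;
    toy_routes[toy_route_count++] = msg->dest;
    return 0;
}

/* Retrieval operation: copy out the table contents. */
static int toy_dumproutes(int *out, size_t max)
{
    size_t i;

    for (i = 0; i < toy_route_count && i < max; i++)
        out[i] = toy_routes[i];
    return (int)i;
}

static const struct toy_link toy_links[TOY_RTM_MAX - TOY_RTM_BASE] = {
    [TOY_RTM_NEWROUTE - TOY_RTM_BASE] = { .doit = toy_newroute },
    [TOY_RTM_GETROUTE - TOY_RTM_BASE] = { .dumpit = toy_dumproutes },
};

/* Use the message type to pick a table entry, then call through doit. */
int toy_rcv_msg(const struct toy_msg *msg)
{
    const struct toy_link *link;

    assert(msg != NULL);
    if (msg->type < TOY_RTM_BASE || msg->type >= TOY_RTM_MAX)
        return -1;
    link = &toy_links[msg->type - TOY_RTM_BASE];
    return link->doit != NULL ? link->doit(msg) : -1;
}

/* Same lookup for dump requests; returns the number of entries copied. */
int toy_dump(int type, int *out, size_t max)
{
    const struct toy_link *link;

    if (type < TOY_RTM_BASE || type >= TOY_RTM_MAX)
        return -1;
    link = &toy_links[type - TOY_RTM_BASE];
    return link->dumpit != NULL ? link->dumpit(out, max) : -1;
}
```

Keeping update and retrieval as two separate pointers per message type reflects the two fundamental operations described earlier: a message type that only supports one of them simply leaves the other pointer empty.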