Creation of a Socket - Linux

Before the user can perform any operations with the TCP/IP stack, she creates a new socket by calling the socket API function. Sockets are generic: they can be created for many protocol types other than TCP/IP, so quite a few things happen before any code in the TCP/IP stack itself gets called. This section describes what happens under the hood in the Linux kernel when the function socket is called. Earlier in this chapter, we described how the socket API functions get mapped to the protocol family-specific functions. It is through this mapping that the sys_socket function executes an AF_INET-specific function for the TCP/IP protocol family. First, the socket layer performs the activities that are not specific to a particular protocol family such as AF_INET. After this generic initialization is complete, it calls the socket creation function for the specific protocol family. We discuss generic socket creation first, and then what happens during socket creation for the AF_INET protocol family.
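For orientation, here is what the user-level side of this sequence might look like. This is a minimal sketch, assuming a Linux or other POSIX host; the helper name open_inet_socket is hypothetical and exists only for illustration. The single socket call inside it is what enters the kernel path described in this section.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: ask the kernel for a new AF_INET socket of the
 * given type. Returns the new file descriptor, or -1 with errno set.
 */
int open_inet_socket(int type)
{
    /* This sketch only covers the two common TCP/IP socket types. */
    assert(type == SOCK_STREAM || type == SOCK_DGRAM);

    /* Protocol 0 asks the AF_INET family for the default protocol of
     * the type: TCP for SOCK_STREAM, UDP for SOCK_DGRAM. */
    return socket(AF_INET, type, 0);
}
```

A caller would check the return value for -1 and eventually release the descriptor with close.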

Sock_create, defined in file linux/net/socket.c, is called from sys_socket. This function initiates the creation of a new socket.

int sock_create(int family,int type,int protocol,struct socket **res);

First, sock_create verifies that family is one of the allowed family types shown in Table. Then, it allocates a socket by calling sock_alloc, which returns a new socket structure, sock. See Section for a discussion of the socket structure. The socket structure is actually part of an inode structure, created when sock_alloc calls new_inode. An inode is necessary for the Linux IO system calls to work with sockets. Section explores the IO system call mapping in more detail. Most of the fields in the inode structure are important only to "real" filesystems; however, a few of them are used by sockets. The fields in the inode structure that are used for sockets are listed in Table.

Inode Structure Fields Used by Sockets


Once the inode is created, sock_alloc retrieves the socket structure from the inode. Then, it initializes a few fields in the socket structure. The inode field is set to point back to the inode structure containing this socket structure, and fasync_list is set to NULL. Fasync_list supports asynchronous notification: it holds the list of owners that have asked, through the fasync file operation, to be signaled when IO becomes possible on the socket. Despite the similar name, it is unrelated to the fsync system call, which synchronizes the in-memory portions of a file with permanent storage and whose behavior is defined by Posix.1b. See [GALL95] for more information about Posix.1b.

Sockets maintain a state that records whether an open socket represents a connection to a peer. This state is kept in the state field of the socket structure, which is initialized to SS_UNCONNECTED when the socket is created. See Table for a description of the socket states. These states are defined in the file linux/include/linux/net.h.
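As a rough guide to what the state field can hold, the following sketch reproduces the state names as they are commonly found in linux/include/linux/net.h for kernels of this generation. Only SS_UNCONNECTED is named in the text above; treat the other names and the one-line comments as an assumption to be checked against your kernel source. The two model_ helpers are hypothetical and are not kernel functions.

```c
#include <assert.h>
#include <string.h>

/* Socket-layer connection states (assumed to match linux/net.h). */
typedef enum {
    SS_FREE = 0,        /* not allocated */
    SS_UNCONNECTED,     /* not connected to any peer */
    SS_CONNECTING,      /* connection in progress */
    SS_CONNECTED,       /* connected to a peer */
    SS_DISCONNECTING    /* disconnect in progress */
} socket_state;

/* Hypothetical helper: printable name of a state. */
const char *model_state_name(socket_state s)
{
    static const char *const names[] = {
        "SS_FREE", "SS_UNCONNECTED", "SS_CONNECTING",
        "SS_CONNECTED", "SS_DISCONNECTING"
    };
    int i = (int)s;

    assert(i >= 0 && i <= (int)SS_DISCONNECTING);
    return names[i];
}

/* Hypothetical helper: the only question the socket layer really asks. */
int model_is_connected(socket_state s)
{
    return s == SS_CONNECTED;
}
```

Note how small this set is compared with the TCP state machine; the socket layer tracks little more than whether a connection exists.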

Socket States


It is important to remember that sockets are not just for TCP/IP. Sockets support other protocol families and must be generic, so the states maintained in the socket structure are not the same as the TCP states, which are quite a bit more complicated. The socket state really only reflects whether there is an active connection. The internal logic to manage the protocol attached to a new socket is maintained in the protocol itself, and TCP is no exception. However, since the socket layer must support several protocols, it requires some connection management logic of its own in addition to what is in the internal TCP implementation.

For now, ops is set to NULL. Later, when the protocol family’s create function is called, ops will be set to the set of protocol-specific operations. Sock_alloc initializes a few more fields before returning. The flags field is initialized to zero because no flags are set yet. Later, the application will specify the flags by calling one of the send or receive socket API functions. Table contains a description of the flags used with the socket API. Finally, two other fields, sk and file, are initialized to NULL, but they both deserve a little attention. The first of these fields, sk, will be set later by the protocol-specific create function to point to the internal sock structure. File will be set to a file pointer allocated when sock_map_fd is called. The file pointer is used to maintain the state of the pseudo file associated with the open socket.

After returning from the sock_alloc call, sock_create calls the create function for the protocol family. It accesses the array net_families to get the family’s create function. For TCP/IP, family will be set to AF_INET.
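The generic half of socket creation can be pictured with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: every model_ name is hypothetical, calloc stands in for sock_alloc and its inode, and the error codes are chosen for illustration. It shows the three steps described above: validate family, allocate a socket with its fields cleared, and dispatch through a table indexed by family.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

enum { MODEL_AF_INET = 2, MODEL_NPROTO = 32, MODEL_SS_UNCONNECTED = 1 };

/* Reduced stand-in for struct socket. */
struct model_socket {
    int state;
    unsigned long flags;
    const void *ops;   /* NULL until the family's create function runs */
    void *sk;          /* NULL until the protocol-specific create runs */
    void *file;        /* NULL until a file is mapped to the socket */
    int type;
};

/* Reduced stand-in for struct net_proto_family. */
struct model_proto_family {
    int family;
    int (*create)(struct model_socket *sock, int protocol);
};

static const struct model_proto_family *model_net_families[MODEL_NPROTO];

/* Stand-in for inet_create: here it only installs an ops marker. */
static int model_inet_create(struct model_socket *sock, int protocol)
{
    static const int inet_ops_marker = 1;

    (void)protocol;
    sock->ops = &inet_ops_marker;
    return 0;
}

static const struct model_proto_family model_inet_family = {
    MODEL_AF_INET, model_inet_create
};

void model_register_inet(void)
{
    model_net_families[MODEL_AF_INET] = &model_inet_family;
}

int model_sock_create(int family, int type, int protocol,
                      struct model_socket **res)
{
    struct model_socket *sock;
    int err;

    assert(res != NULL);

    /* Step 1: reject families that are out of range or unregistered. */
    if (family < 0 || family >= MODEL_NPROTO ||
        model_net_families[family] == NULL)
        return -EAFNOSUPPORT;

    /* Step 2: generic allocation; calloc leaves flags 0 and ops, sk
     * and file NULL, as the text describes for sock_alloc. */
    sock = calloc(1, sizeof(*sock));
    if (sock == NULL)
        return -ENOMEM;
    sock->state = MODEL_SS_UNCONNECTED;
    sock->type = type;

    /* Step 3: hand over to the protocol family's create function. */
    err = model_net_families[family]->create(sock, protocol);
    if (err < 0) {
        free(sock);
        return err;
    }
    *res = sock;
    return 0;
}
```

The table lookup is the design point worth noticing: the generic layer never names a protocol family directly, so new families can be added by registering another entry.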

Creation of an AF_INET Socket

The create function for the TCP/IP protocol family, AF_INET, is inet_create, which is defined in file af_inet.c.

static int inet_create(struct socket *sock,int protocol);

Inet_create is defined as static because it is not called directly. Instead, it is called through the create field in the net_proto_family structure for the AF_INET protocol family. Sock_create makes this call on behalf of sys_socket when family is set to AF_INET. In inet_create, we create a new sock structure called sk and initialize a few more fields. The new sock structure is allocated from the slab cache, inet_sk_slab. Linux has multiple inet_sk_slabs, one for each protocol. Slab caches are more efficient when each one is dedicated to a single purpose, because most fields can then be pre-initialized to common values. We call sk_alloc to allocate the sock structure from the slab cache that is specific to the protocol for this socket.

sk = sk_alloc(PF_INET, GFP_KERNEL, inet_sk_size(protocol), inet_sk_slab(protocol));

Sk points to the new sock structure allocated from the slab cache. Next, inet_create searches the protocol switch table for an entry that matches the requested protocol.

After getting the result from the search of the protocol switch table, inet_create checks the capability flags of the matching entry against the capabilities of the current process. If the caller doesn’t have permission to create this type of socket, the user-level socket call returns the EPERM error.
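The lookup and permission check can be sketched in user space as follows. This is a simplified model under stated assumptions, not the kernel's code: the table contents, the model_ names, the matching rule, and the -ESOCKTNOSUPPORT result for a failed search are illustrative choices, and the caller's capabilities are reduced to a plain bit mask. The value 13 is the number assigned to CAP_NET_RAW in linux/capability.h.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define MODEL_CAP_NONE    (-1)
#define MODEL_CAP_NET_RAW 13

/* Reduced stand-in for one entry of the protocol switch table. */
struct model_protosw {
    int type;        /* SOCK_STREAM, SOCK_DGRAM, SOCK_RAW, ... */
    int protocol;    /* IPPROTO_IP (0) acts as a wildcard here */
    int capability;  /* capability needed, or MODEL_CAP_NONE */
};

static const struct model_protosw model_inetsw[] = {
    { SOCK_STREAM, IPPROTO_TCP, MODEL_CAP_NONE },
    { SOCK_DGRAM,  IPPROTO_UDP, MODEL_CAP_NONE },
    { SOCK_RAW,    IPPROTO_IP,  MODEL_CAP_NET_RAW },
};

/*
 * Returns 0 if a matching entry exists and the caller may use it,
 * -EPERM if the entry needs a capability the caller lacks, and
 * -ESOCKTNOSUPPORT if nothing in the table matches.
 */
int model_inet_lookup(int type, int protocol, unsigned long caller_caps)
{
    size_t i;

    assert(protocol >= 0);

    for (i = 0; i < sizeof(model_inetsw) / sizeof(model_inetsw[0]); i++) {
        const struct model_protosw *p = &model_inetsw[i];

        if (p->type != type)
            continue;
        /* A wildcard on either side matches; otherwise need equality. */
        if (p->protocol != IPPROTO_IP && protocol != IPPROTO_IP &&
            p->protocol != protocol)
            continue;
        if (p->capability != MODEL_CAP_NONE &&
            !(caller_caps & (1UL << p->capability)))
            return -EPERM;
        return 0;
    }
    return -ESOCKTNOSUPPORT;
}
```

In this model an unprivileged caller (mask 0) can open stream and datagram sockets but is refused a raw socket, which mirrors the EPERM behavior described above.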

Now, inet_create sets some fields in the new sock data structure; many other fields were already pre-initialized when the structure was allocated from the slab cache. The field sk_family is set to PF_INET. The prot field is set to the protocol’s protocol block structure, which defines the specific functions for each of the transport protocols. No_check and ops are set according to their respective values in the protocol switch table. If the type of the socket is SOCK_RAW, the num field is set to the protocol number. As will be shown in later chapters, this field is used by IP to route packets internally depending on whether there is a raw socket open. The sk_destruct field of sk is set to inet_sock_destruct, the sock structure destructor. The sk_backlog_rcv field is set to point to the protocol-specific backlog receive function. Next, some fields in the protocol family-specific part of the sock structure are initialized. As discussed in Section the sock structure is followed by a protocol-specific portion for each of the two protocol families, IPv4 and IPv6. This part is accessed through a macro, inet_sk, which is defined in the file linux/include/linux/ip.h.

#define inet_sk(__sk) (&((struct inet_sock *)__sk)->inet)
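The macro works because the generic sock structure is the first member of a larger, family-specific structure, so a pointer to one can be cast to a pointer to the other. The sketch below illustrates the idea in user space. All model_ names are hypothetical, the structures are cut down to a handful of fields, and the initial values chosen in model_alloc_inet_sock are assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Reduced stand-in for the generic struct sock. */
struct model_sock {
    int sk_family;
};

/* Reduced stand-in for the IPv4-specific part (inet_opt). */
struct model_inet_opt {
    unsigned int   daddr;
    unsigned int   rcv_saddr;
    unsigned short dport;
    unsigned short num;
    unsigned char  mc_ttl;
};

/* The generic part comes first, so both share one starting address. */
struct model_inet_sock {
    struct model_sock     sk;
    struct model_inet_opt inet;
};

/* Same shape as the inet_sk macro shown above. */
#define model_inet_sk(skp) (&((struct model_inet_sock *)(skp))->inet)

/* Allocate the combined structure but hand back the generic pointer. */
struct model_sock *model_alloc_inet_sock(void)
{
    struct model_inet_sock *isk = calloc(1, sizeof(*isk));

    if (isk == NULL)
        return NULL;
    isk->sk.sk_family = 2;   /* numeric value of PF_INET on Linux */
    isk->inet.mc_ttl = 1;    /* assumed default for this example */
    return &isk->sk;
}

/* Reach the IPv4 part through the generic pointer. */
void model_set_local_port(struct model_sock *sk, unsigned short port)
{
    assert(sk != NULL);
    model_inet_sk(sk)->num = port;
}
```

Code that only knows about the generic structure can pass the pointer around freely, while IPv4 code recovers its private fields with a single cast.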

The inet_opt structure for IPv4 is also defined in the file linux/include/linux/ip.h.

struct inet_opt {

These first few fields are for socket layer de-multiplexing of incoming packets. Daddr is the peer IPv4 address, and rcv_saddr is the bound local IPv4 address. Dport is the destination port, num is the local port, and saddr is the source address.

    __u32 daddr;
    __u32 rcv_saddr;
    __u16 dport;
    __u16 num;
    __u32 saddr;

The uc_ttl field is for setting the time-to-live field in the IP header.

int uc_ttl;

Tos is for setting the type of service field in the IP header. The cmsg_flags field is used by setsockopt and getsockopt to communicate network layer IP socket options to the IP protocol layer. In addition, opt points to an ip_options structure that holds the values associated with the options set in the cmsg_flags field.

    int tos;
    unsigned cmsg_flags;
    struct ip_options *opt;

Sport is the source port number.

__u16 sport;

Hdrincl is for raw sockets. It states that the application supplies its own IP header at the front of each packet it sends, so the IP layer does not build one.

unsigned char hdrincl;

This field is the multicasting time-to-live.

__u8 mc_ttl;

The next field, mc_loop, indicates whether multicast packets should be looped back. Pmtudisc indicates whether path MTU discovery should be performed for this socket.

    __u8 mc_loop;
    __u8 pmtudisc;

The next field, id, contains the counter for the identification field in the IP header. The receiving machine uses the identification field to reassemble fragmented IP packets.

    __u16 id;
    unsigned recverr : 1, freebind : 1;

Mc_index is the index for the output network interface used for transmission of multicast packets, and mc_addr is the source address used for outgoing packets sent to a multicast address.

    int mc_index;
    __u32 mc_addr;

This field, mc_list, points to the list of multicast address groups to which the socket has subscribed.

struct ip_mc_socklist *mc_list;

The next field is the cached page from the sendmsg socket function, and sndmsg_off is the offset into the page.

    struct page *sndmsg_page;
    u32 sndmsg_off;

The following structure keeps information about the IP options needed to build an IP header on each outgoing IP fragment. Since all the fragments have almost identical headers, the options are kept here to speed the process of building IP headers on consecutive fragments. It is called cork, because the socket is "corked," waiting for all fragments of the total IP datagram to be transmitted.

This field, length, is the total length of all frames in the fragmented IP datagram.

Some fields in the inet_opt structure are initialized by inet_create. Some of these fields are related to multicast transmission. In each case, inet points to the instance of the inet_opt structure shown previously.
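The multicast defaults are visible from user space. The following is a minimal sketch, assuming a Linux host: it reads the multicast time-to-live of a socket with getsockopt and overrides it with setsockopt, using the standard IP_MULTICAST_TTL option. The helper names get_mc_ttl and set_mc_ttl are hypothetical. On Linux a freshly created socket reports a multicast TTL of 1, as documented in the ip(7) manual page.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Returns the socket's current multicast TTL, or -1 on error. */
int get_mc_ttl(int fd)
{
    int ttl = 0;
    socklen_t len = sizeof(ttl);

    assert(fd >= 0);
    if (getsockopt(fd, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, &len) < 0)
        return -1;
    return ttl;
}

/* Sets the multicast TTL; returns 0 on success, -1 on error. */
int set_mc_ttl(int fd, int ttl)
{
    assert(fd >= 0);
    return setsockopt(fd, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, sizeof(ttl));
}
```

Calling get_mc_ttl on a new UDP socket, then set_mc_ttl, then get_mc_ttl again shows the kernel-side field changing in response to the application.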

The default multicast time-to-live field, mc_ttl, is initialized because its value will be used in the time-to-live IP header field for multicast packets. Even though these values are initialized here, the application may change the values in these fields later through the setsockopt call. Finally, inet_create calls the protocol-specific initialization function through the init field in the protocol block structure, proto, defined in file linux/include/net/sock.h. Here is the proto structure.

struct proto {

Most of the fields in this structure point to the protocol-specific operations. We will explain a few of these functions in a little more detail.

Init points to the protocol’s specific initialization function. This function is called when a socket is created for this protocol. Destroy points to the destructor function for this protocol. The destructor is executed when a socket for this protocol is closed.

The following three functions are for keeping track of sock structures, looking them up, and getting the port number associated with the sock, respectively.

Inuse indicates whether this sock structure is being used. For SMP implementations, there is one per CPU.

The sk_prot field in the sock structure points to the protocol block structure. The init field in the proto structure is specific for each protocol and socket type within the AF_INET protocol family.
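The final step of inet_create, calling through sk_prot to the protocol's own init function, can be modeled as follows. This is a user-space sketch with hypothetical model_ names and two made-up protocols; the check for a missing init function is an assumption about how a caller would guard the indirect call, not a quotation of the kernel code.

```c
#include <assert.h>
#include <stddef.h>

struct model_sock;

/* Reduced stand-in for struct proto: a name plus two operations. */
struct model_proto {
    const char *name;
    int (*init)(struct model_sock *sk);
    int (*destroy)(struct model_sock *sk);
};

/* Reduced stand-in for struct sock. */
struct model_sock {
    const struct model_proto *sk_prot;
    int initialized;
};

static int model_stream_init(struct model_sock *sk)
{
    sk->initialized = 1;   /* protocol-private setup would go here */
    return 0;
}

static int model_common_destroy(struct model_sock *sk)
{
    sk->initialized = 0;
    return 0;
}

/* A protocol with its own init function, and one without. */
const struct model_proto model_stream_prot = {
    "model-stream", model_stream_init, model_common_destroy
};
const struct model_proto model_plain_prot = {
    "model-plain", NULL, model_common_destroy
};

/* Attach the protocol block, then run its init function if any. */
int model_finish_create(struct model_sock *sk,
                        const struct model_proto *prot)
{
    assert(sk != NULL && prot != NULL);

    sk->sk_prot = prot;
    if (sk->sk_prot->init != NULL)
        return sk->sk_prot->init(sk);
    return 0;
}

/* Closing the socket runs the protocol's destructor. */
int model_close(struct model_sock *sk)
{
    assert(sk != NULL && sk->sk_prot != NULL);
    return sk->sk_prot->destroy(sk);
}
```

Because the sock structure carries a pointer to its protocol block, the same two generic functions serve every protocol in the family.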

Socket Lockets—Individual Socket Locks

Each open socket has a locking mechanism to eliminate contention problems between the kernel main thread and the “bottom half,” or the various tasklets, timers, and interrupt handlers. As we know from earlier discussion, each open socket contains an instance of the sock structure. The individual socket lock is in the sk_lock field of the sock structure. The lock field is defined as type socket_lock_t in file linux/include/net/sock.h. Within it, the field slock is the actual spinlock, users is set to one when the socket is locked and to zero when it is unlocked, and wq is the queue of waiting tasks.

Generally, the lock is activated with the macro lock_sock.


The socket lock is released by calling the macro release_sock.


Both macros are defined in linux/include/net/sock.h. While the socket is locked, incoming packets are not put on the receive queue. Instead, they are placed in the backlog queue for later processing.
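The interplay between the lock and the two queues can be shown with a small single-threaded model. This is a sketch under stated assumptions: the model_ names are hypothetical, the queues are reduced to counters, and the spinlock and the wait queue are left out entirely. It also assumes that releasing the lock is the point at which the backlog is processed.

```c
#include <assert.h>

/* Reduced stand-in for struct sock: lock owner flag and two queues. */
struct model_sock {
    int users;              /* 1 while the socket is locked */
    int receive_queue_len;  /* packets ready for the application */
    int backlog_len;        /* packets deferred while locked */
};

/* Stand-in for lock_sock: mark the socket as owned. */
void model_lock_sock(struct model_sock *sk)
{
    assert(sk->users == 0);
    sk->users = 1;
}

/* Stand-in for the bottom half delivering one incoming packet. */
void model_packet_arrives(struct model_sock *sk)
{
    if (sk->users)
        sk->backlog_len++;         /* locked: defer the packet */
    else
        sk->receive_queue_len++;   /* unlocked: deliver directly */
}

/* Stand-in for release_sock: drain the backlog, then unlock. */
void model_release_sock(struct model_sock *sk)
{
    assert(sk->users == 1);
    sk->receive_queue_len += sk->backlog_len;
    sk->backlog_len = 0;
    sk->users = 0;
}
```

With this arrangement the bottom half never has to wait for the lock holder; it simply parks packets on the backlog and moves on.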
