## Overall Architecture Overview

![image-20250310194742752](theWholeLand.png)

> Thanks to Zhang Yanfei for the diagram.

Once you have a basic understanding of network card drivers, hardware interrupts, software interrupts, and ksoftirqd threads, you can sketch the kernel's packet reception path as in the diagram above.

The general process is as follows:

1. When the network card receives data, it writes the received frames to memory using DMA, then sends an interrupt to the CPU to notify that data has arrived.
2. When the CPU receives the interrupt request, it calls the interrupt handler registered by the network device driver.
3. The network card's interrupt handler does very little work: it issues a software interrupt request and releases the CPU as quickly as possible.
4. The ksoftirqd kernel thread detects the software interrupt request, calls poll to start polling for packet reception, and forwards received packets to various protocol stack layers for processing. For TCP packets, they are placed in the user socket's receive queue.

## Foundation Work Before Everything Else

Before Linux drivers, kernel protocol stacks, and other modules can receive network card data packets, extensive preparation work must be done:

1. Pre-create ksoftirqd kernel threads;
2. Register processing functions for various protocols;
3. Pre-initialize the network device subsystem;
4. Start up the network card.

### Initialization Work

#### Creating ksoftirqd Kernel Threads

Linux software interrupts that are not handled immediately on return from a hardware interrupt are processed in dedicated kernel threads (ksoftirqd), so it's essential to understand how these threads are initialized.

First, there isn't just one thread, but N threads, where N equals the number of cores on your machine.

During system initialization, `spawn_ksoftirqd` (located in `kernel/softirq.c`) runs as an early initcall and calls `smpboot_register_percpu_thread` (located in `kernel/smpboot.c`) to create the ksoftirqd threads, as shown below:

Related code:

```c
static struct smp_hotplug_thread softirq_threads = {
	.store			= &ksoftirqd,
	.thread_should_run	= ksoftirqd_should_run,
	.thread_fn		= run_ksoftirqd,
	.thread_comm		= "ksoftirqd/%u",
};

static __init int spawn_ksoftirqd(void)
{
	cpuhp_setup_state_nocalls(CPUHP_SOFTIRQ_DEAD, "softirq:dead", NULL,
				  takeover_tasklets);
	BUG_ON(smpboot_register_percpu_thread(&softirq_threads));

	return 0;
}
early_initcall(spawn_ksoftirqd);
```

After a ksoftirqd thread is created, it enters its thread loop. The loop calls ksoftirqd_should_run to check whether any software interrupts are pending and, if so, calls run_ksoftirqd to process them.
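The check-then-run pattern can be modelled in user space. This is only a sketch under stated assumptions: every `ks_` name is hypothetical, the CPU count is fixed at 4, and the per-CPU pending state is reduced to one bitmask per CPU. The indices 2 and 3 mirror the positions of NET_TX_SOFTIRQ and NET_RX_SOFTIRQ in the kernel's softirq enum.

```c
#include <stdbool.h>
#include <stdint.h>

#define KS_NR_CPUS 4	/* hypothetical CPU count for this model */

/* Mirror the enum positions of NET_TX_SOFTIRQ and NET_RX_SOFTIRQ. */
enum { KS_NET_TX = 2, KS_NET_RX = 3 };

/* One pending bitmask per CPU; bit n set means softirq n is waiting. */
static uint32_t ks_pending[KS_NR_CPUS];

/* Model of raising a softirq on a given CPU. */
static void ks_raise(unsigned int cpu, unsigned int nr)
{
	ks_pending[cpu] |= (1u << nr);
}

/* Model of ksoftirqd_should_run: there is work only if some bit is set. */
static bool ks_should_run(unsigned int cpu)
{
	return ks_pending[cpu] != 0;
}

/* Model of run_ksoftirqd: take and clear the mask, return the handled bits. */
static uint32_t ks_run(unsigned int cpu)
{
	uint32_t handled = ks_pending[cpu];

	ks_pending[cpu] = 0;
	return handled;
}
```

Because each CPU has its own mask, raising NET_RX on CPU 1 wakes only that CPU's thread. This is the reason there is one ksoftirqd thread per core.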

Software interrupts include not only network software interrupts but other types as well. The Linux kernel defines all software interrupt types in interrupt.h:

```c
// file: include/linux/interrupt.h
enum
{
	HI_SOFTIRQ=0,
	TIMER_SOFTIRQ,
	NET_TX_SOFTIRQ,
	NET_RX_SOFTIRQ,
	BLOCK_SOFTIRQ,
	IRQ_POLL_SOFTIRQ,
	TASKLET_SOFTIRQ,
	SCHED_SOFTIRQ,
	HRTIMER_SOFTIRQ,
	RCU_SOFTIRQ,    /* Preferable RCU should always be the last softirq */

	NR_SOFTIRQS
};
```

#### Network Subsystem Initialization

During network subsystem initialization, a softnet_data structure is initialized for each CPU, and processing functions are registered for NET_RX_SOFTIRQ and NET_TX_SOFTIRQ, as shown in Figure 2.4.

The Linux kernel initializes various subsystems by calling subsys_initcall.

**Key Point!!! The network subsystem initialization mentioned here executes the net_dev_init function.**

net_dev_init is registered through subsys_initcall(net_dev_init). The code is as follows:

```c
static int __init net_dev_init(void)
{
	......

	/*
	 *	Initialise the packet receive queues.
	 */

	/*
	 * Allocate a softnet_data structure for each CPU. The poll_list in this structure
	 * is used to wait for driver programs to register their poll functions, which can
	 * be seen later when network card drivers initialize.
	 */
	for_each_possible_cpu(i) {
		struct softnet_data *sd = &per_cpu(softnet_data, i);

		memset(sd, 0, sizeof(*sd));
		skb_queue_head_init(&sd->input_pkt_queue);
		skb_queue_head_init(&sd->process_queue);
		sd->completion_queue = NULL;
		INIT_LIST_HEAD(&sd->poll_list);
		......
	}
	......
	/*
	 * open_softirq registers a processing function for each type of software interrupt.
	 * NET_TX_SOFTIRQ processing function is net_tx_action;
	 * NET_RX_SOFTIRQ processing function is net_rx_action;
	 */
	open_softirq(NET_TX_SOFTIRQ, net_tx_action);
	open_softirq(NET_RX_SOFTIRQ, net_rx_action);
}

subsys_initcall(net_dev_init);
```

Following open_softirq further reveals that the registration is recorded in the softirq_vec array. Later, when a ksoftirqd thread handles a pending software interrupt, it uses this array to find the processing function for that type of software interrupt.

```c
void open_softirq(int nr, void (*action)(struct softirq_action *))
{
	softirq_vec[nr].action = action;
}
```
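The register-then-look-up idea can be sketched as a small user-space model. All `sv_` names are hypothetical, and the handlers take no arguments here, whereas the kernel handlers receive a `struct softirq_action *`. The dispatch loop is a simplification of what the kernel does when it walks the pending bits.

```c
#include <stdint.h>

#define SV_NR_SOFTIRQS 10

typedef void (*sv_action_t)(void);

static sv_action_t sv_vec[SV_NR_SOFTIRQS];	/* stands in for softirq_vec */
static uint32_t sv_pending;			/* pending softirq bits */
static int sv_rx_calls, sv_tx_calls;		/* counters so the effect is visible */

static void sv_net_rx_action(void) { sv_rx_calls++; }
static void sv_net_tx_action(void) { sv_tx_calls++; }

/* Same idea as open_softirq: remember the handler under its index. */
static void sv_open_softirq(int nr, sv_action_t action)
{
	sv_vec[nr] = action;
}

/* Mark softirq nr as pending. */
static void sv_raise(int nr)
{
	sv_pending |= 1u << nr;
}

/* Walk the pending bits from lowest to highest and call each handler. */
static int sv_do_softirq(void)
{
	uint32_t bits = sv_pending;
	int ran = 0;

	sv_pending = 0;
	for (int nr = 0; nr < SV_NR_SOFTIRQS; nr++) {
		if ((bits & (1u << nr)) && sv_vec[nr]) {
			sv_vec[nr]();
			ran++;
		}
	}
	return ran;
}
```

Registering the two network handlers under indices 2 and 3 and then raising index 3 runs only the receive handler. The table lookup is what keeps the softirq core independent of the network code.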

#### Protocol Stack Registration

#### Network Card Driver Initialization

Each driver uses `module_init` to register an initialization function with the kernel. When the driver is loaded, the kernel calls this function.

After completion, the Linux kernel knows the driver's relevant information, such as the igb network card driver's igb_driver_name and igb_probe function address.

When a network card device is identified, the kernel calls its driver's probe method. Continuing with the igb example, the probe method of igb_driver is igb_probe.

The purpose of the igb_probe method is to get the device into a ready state as quickly as possible.

Additionally, there's a critical step: registering the poll function required by the NAPI mechanism, which for the igb network card driver is igb_poll.
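What "registering the poll function" amounts to can be shown with a minimal model. The `np_` names are hypothetical stand-ins, not kernel APIs, and the weight of 64 used in the usage note is an assumed per-poll limit. The point is that probe only records the callback; nothing is polled until the device is later scheduled.

```c
/* Hypothetical stand-in for struct napi_struct: only the fields the sketch needs. */
struct np_napi {
	int (*poll)(struct np_napi *napi, int budget);
	int weight;	/* maximum packets one poll call may process */
	int scheduled;	/* nonzero while queued for polling */
};

/* Stand-in for a driver poll routine such as igb_poll. */
static int np_driver_poll(struct np_napi *napi, int budget)
{
	(void)napi;
	return budget > 0 ? 1 : 0;	/* pretend one packet was received */
}

/* Model of the registration step: store the callback and its weight. */
static void np_napi_add(struct np_napi *napi,
			int (*poll)(struct np_napi *, int), int weight)
{
	napi->poll = poll;
	napi->weight = weight;
	napi->scheduled = 0;
}
```

After `np_napi_add(&n, np_driver_poll, 64)`, the softirq side can later call `n.poll(&n, n.weight)` without knowing which driver it belongs to.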

### After Initialization Completion

#### Starting the Network Card

After all the above initialization is complete, the network card can be started. The startup sequence is broadly similar across drivers. For igb it is implemented in __igb_open:

```c
static int __igb_open(struct net_device *netdev, bool resuming)
{
	// Allocate transmit descriptor arrays
	err = igb_setup_all_tx_resources(adapter);
	// Allocate receive descriptor arrays
	err = igb_setup_all_rx_resources(adapter);

	// Register interrupt handler
	err = igb_request_irq(adapter);
	if (err)
		goto err_req_irq;

	// Enable NAPI
	for (i = 0; i < adapter->num_q_vectors; i++)
		napi_enable(&(adapter->q_vector[i]->napi));
	......
}
```

The igb_open function calls igb_setup_all_tx_resources and igb_setup_all_rx_resources. In the igb_setup_all_rx_resources step, RingBuffer is allocated and the mapping relationship between memory and Rx queues is established.

```c
static int igb_setup_all_rx_resources(struct igb_adapter *adapter)
{
	......

	for (i = 0; i < adapter->num_rx_queues; i++) {
		err = igb_setup_rx_resources(adapter->rx_ring[i]);
		...
	}

	return err;
}
```

The for loop calls igb_setup_rx_resources once per receive queue, so one RingBuffer is created for each queue. The igb_setup_rx_resources function:

```c
int igb_setup_rx_resources(struct igb_ring *rx_ring)
{
	struct device *dev = rx_ring->dev;
	int size;

	// 1. Allocate igb_rx_buffer array memory
	size = sizeof(struct igb_rx_buffer) * rx_ring->count;

	rx_ring->rx_buffer_info = vzalloc(size);
	if (!rx_ring->rx_buffer_info)
		goto err;

	/* Round up to nearest 4K */
	// 2. Allocate e1000_adv_rx_desc DMA array memory
	rx_ring->size = rx_ring->count * sizeof(union e1000_adv_rx_desc);
	rx_ring->size = ALIGN(rx_ring->size, 4096);

	rx_ring->desc = dma_alloc_coherent(dev, rx_ring->size,
					   &rx_ring->dma, GFP_KERNEL);
	if (!rx_ring->desc)
		goto err;

	// 3. Initialize queue members
	rx_ring->next_to_alloc = 0;
	rx_ring->next_to_clean = 0;
	rx_ring->next_to_use = 0;

	return 0;

err:
	vfree(rx_ring->rx_buffer_info);
	rx_ring->rx_buffer_info = NULL;
	dev_err(dev, "Unable to allocate memory for the Rx descriptor ring\n");
	return -ENOMEM;
}
```

From the above source code, you can see that internally, a RingBuffer doesn't have just one circular queue array, but two:

1. igb_rx_buffer array: This array is used by the kernel, allocated via vzalloc;
2. e1000_adv_rx_desc array: This array is used by network card hardware, allocated via dma_alloc_coherent.
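The two-array layout and the 4K rounding can be reproduced in a user-space sketch. Everything prefixed `rb_` is hypothetical: `calloc` stands in for both vzalloc and dma_alloc_coherent, and the 16-byte descriptor is an assumption about the size of the hardware descriptor, not a copy of the real union.

```c
#include <stdint.h>
#include <stdlib.h>

/* Round x up to a multiple of a, where a is a power of two. */
#define RB_ALIGN(x, a) (((x) + ((a) - 1)) & ~((size_t)(a) - 1))

/* Hypothetical 16-byte descriptor, standing in for union e1000_adv_rx_desc. */
struct rb_rx_desc {
	uint64_t pkt_addr;
	uint64_t hdr_addr;
};

/* Hypothetical software bookkeeping entry, standing in for igb_rx_buffer. */
struct rb_rx_buffer {
	void *page;
	uint32_t page_offset;
};

struct rb_ring {
	struct rb_rx_buffer *buffer_info;	/* used only by software */
	struct rb_rx_desc *desc;		/* would be DMA-visible in a driver */
	size_t size;				/* descriptor bytes, 4K aligned */
	uint16_t count;
	uint16_t next_to_use, next_to_clean;
};

/* Returns 0 on success, -1 on allocation failure. */
static int rb_setup_ring(struct rb_ring *r, uint16_t count)
{
	r->buffer_info = calloc(count, sizeof(*r->buffer_info));
	if (!r->buffer_info)
		return -1;

	r->size = RB_ALIGN(count * sizeof(struct rb_rx_desc), 4096);
	r->desc = calloc(1, r->size);
	if (!r->desc) {
		free(r->buffer_info);
		r->buffer_info = NULL;
		return -1;
	}
	r->count = count;
	r->next_to_use = 0;
	r->next_to_clean = 0;
	return 0;
}

static void rb_free_ring(struct rb_ring *r)
{
	free(r->desc);
	free(r->buffer_info);
	r->desc = NULL;
	r->buffer_info = NULL;
}
```

Under the 16-byte assumption, a ring of 256 descriptors occupies exactly 4096 bytes, while a ring of 300 needs 4800 bytes and is rounded up to 8192.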

Then there's the final step of interrupt function registration, which can be seen in igb_request_irq.

OK, that's all the preparation work! Next comes receiving data packets.

## Starting to Receive Data Packets

This section follows a packet from hardware interrupt processing through software interrupt handling and into the protocol stack.

### Hardware Interrupt Processing

First, when data frames arrive at the network card from the network cable, the first stop is the network card's receive queue. The network card searches for available memory locations in its allocated RingBuffer, and when found, the DMA engine will DMA the data to the memory previously associated with the network card. At this point, the CPU is unaware.

When the DMA operation completes, the network card sends a hardware interrupt to the CPU, notifying it that data has arrived. The hardware interrupt processing is as follows:

The handler registered for the network card's hardware interrupt during the "Starting the Network Card" step (via igb_request_irq) is igb_msix_ring.

```c
// file: drivers/net/ethernet/intel/igb/igb_main.c
static irqreturn_t igb_msix_ring(int irq, void *data)
{
	struct igb_q_vector *q_vector = data;

	/* Write the ITR value calculated from the previous interrupt. */
	igb_write_itr(q_vector);

	napi_schedule(&q_vector->napi);

	return IRQ_HANDLED;
}
```

igb_write_itr only writes the register that controls the hardware interrupt rate, using the ITR value calculated during the previous interrupt. Following the napi_schedule call shows that Linux does only the minimum necessary work in the hardware interrupt and leaves most processing to software interrupts. The hardware interrupt handler is therefore very short: it writes a register, adds the NAPI instance to the current CPU's poll_list, and raises a software interrupt. With that, the hardware interrupt work is complete.
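The "modify poll_list, then raise a software interrupt" step can be modelled as follows. This is a simplified sketch with hypothetical `hi_` names: the per-CPU softnet_data is reduced to a singly linked list plus one flag, and no locking or interrupt masking is modelled.

```c
#include <stdbool.h>
#include <stddef.h>

struct hi_napi {
	struct hi_napi *next;	/* link in the per-CPU poll list */
	bool scheduled;		/* plays the role of the NAPI "scheduled" state */
};

struct hi_softnet {
	struct hi_napi *poll_head, *poll_tail;
	bool net_rx_pending;	/* stands in for the NET_RX_SOFTIRQ pending bit */
};

/* Returns true if the instance was newly queued, false if it already was. */
static bool hi_napi_schedule(struct hi_softnet *sd, struct hi_napi *n)
{
	if (n->scheduled)
		return false;	/* already waiting to be polled: nothing to do */

	n->scheduled = true;
	n->next = NULL;
	if (sd->poll_tail)
		sd->poll_tail->next = n;
	else
		sd->poll_head = n;
	sd->poll_tail = n;

	sd->net_rx_pending = true;	/* "raise" the receive softirq */
	return true;
}
```

A second interrupt for a device that is already queued returns early, which is why a burst of frames does not grow the list: the pending poll will pick up all of them.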

### ksoftirqd Kernel Thread Processing Software Interrupts

Network packet reception processing mainly occurs in the ksoftirqd kernel thread, where pending software interrupts are processed.

### Network Protocol Stack Processing

The netif_receive_skb function processes packets according to their protocol. A UDP packet, for example, is passed in turn to the protocol processing functions ip_rcv and then udp_rcv.
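The two-level dispatch, first by layer-3 protocol and then by layer-4 protocol, can be sketched with a handler table. The `pd_` names are hypothetical and the packet is reduced to two header fields. The constants are the standard EtherType for IPv4 (0x0800) and the IP protocol numbers for TCP (6) and UDP (17).

```c
#include <stdint.h>
#include <stddef.h>

#define PD_ETH_P_IP	0x0800	/* EtherType for IPv4 */
#define PD_IPPROTO_TCP	6
#define PD_IPPROTO_UDP	17

struct pd_packet {
	uint16_t ethertype;	/* layer-3 protocol, from the Ethernet header */
	uint8_t ip_proto;	/* layer-4 protocol, from the IP header */
};

enum pd_verdict { PD_DROPPED, PD_TO_UDP, PD_TO_TCP };

typedef enum pd_verdict (*pd_l4_handler)(const struct pd_packet *);

/* Layer-4 handlers indexed by IP protocol number. */
static pd_l4_handler pd_l4_table[256];

static enum pd_verdict pd_udp_rcv(const struct pd_packet *p)
{
	(void)p;
	return PD_TO_UDP;
}

static enum pd_verdict pd_tcp_rcv(const struct pd_packet *p)
{
	(void)p;
	return PD_TO_TCP;
}

/* Protocols register themselves once, before any packet arrives. */
static void pd_register_l4(uint8_t proto, pd_l4_handler h)
{
	pd_l4_table[proto] = h;
}

/* Model of the IP layer: hand the packet to the registered layer-4 handler. */
static enum pd_verdict pd_ip_rcv(const struct pd_packet *p)
{
	pd_l4_handler h = pd_l4_table[p->ip_proto];

	return h ? h(p) : PD_DROPPED;
}

/* Model of the entry point: only IPv4 is wired up in this sketch. */
static enum pd_verdict pd_receive(const struct pd_packet *p)
{
	if (p->ethertype != PD_ETH_P_IP)
		return PD_DROPPED;
	return pd_ip_rcv(p);
}
```

A packet whose protocol has no registered handler is dropped in this model. The kernel's handling of unknown protocols is more involved and is not shown.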

### IP Layer Processing

Linux IP layer operations are in the code file `net/ipv4/ip_input.c`.

## Summary

The network module is one of the most complex modules in the Linux kernel. Receiving a single packet involves interactions between many kernel components, such as network card drivers, the protocol stack, and the ksoftirqd kernel threads. It looks complex, but the overall picture is actually quite clear. A brief summary follows.

After a user executes a recvfrom call, the user process enters kernel mode through the system call. If the receive queue has no data, the process enters sleep state and is suspended by the operating system. This part is relatively simple. Next comes the work between various Linux kernel components.

First, before starting packet reception, Linux must do extensive preparation work:

- Create ksoftirqd kernel threads, set up their thread functions, and rely on them to handle software interrupts later;
- Protocol stack registration: Linux implements many protocols like ARP, ICMP, IP, UDP, and TCP. Each protocol registers its processing function, making it convenient to quickly find the corresponding processing function when packets arrive;
- Network card driver initialization: Each driver has an initialization function that the kernel calls. During this initialization, the driver prepares DMA and tells the kernel the address of its NAPI poll function;
- Start the network card: Allocate the RX and TX queues and register the interrupt handlers.

After preparation work is complete, data arrives. The first to greet it is the network card:

- The network card DMAs data frames to memory's RingBuffer, then sends an interrupt notification to the CPU;
- CPU responds to the interrupt request, calls the interrupt processing function registered when the network card started;
- The interrupt processing function only issues a software interrupt request and does nothing else;
- Kernel thread ksoftirqd discovers a software interrupt request has arrived, first disables hardware interrupts;
- ksoftirqd thread starts calling the driver's poll function to receive packets;
- The poll function sends received packets to the protocol stack's registered ip_rcv function;
- The ip_rcv function sends packets to the udp_rcv function (TCP packets are sent to tcp_v4_rcv).

## Some Conclusions

### Question 1: What exactly is RingBuffer, and why does RingBuffer drop packets?

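RingBuffer is a special area in memory organized as a circular queue. In practice, this data structure consists of the igb_rx_buffer circular array, the e1000_adv_rx_desc circular array, and the many skbs they refer to.

The two pointer arrays are pre-allocated when the network card starts. The skbs are allocated dynamically during packet reception.

The RingBuffer drops packets when frames arrive faster than the kernel consumes them: once every descriptor is in use, the network card has nowhere to place a new frame, so the frame is discarded.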
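A toy circular queue makes the drop condition concrete. This is a sketch with hypothetical `dr_` names and a deliberately tiny ring of four slots. Real rings hold hundreds of descriptors, and the hardware tracks occupancy through descriptor status bits rather than a counter.

```c
#include <stdbool.h>
#include <stdint.h>

#define DR_RING_SIZE 4	/* tiny on purpose */

struct dr_ring {
	int slots[DR_RING_SIZE];
	uint32_t head;		/* next slot the "NIC" writes */
	uint32_t tail;		/* next slot the "kernel" reads */
	uint32_t used;		/* slots currently holding a frame */
	uint32_t dropped;	/* frames discarded because the ring was full */
};

/* NIC side: store a frame, or count a drop when no slot is free. */
static bool dr_nic_receive(struct dr_ring *r, int frame)
{
	if (r->used == DR_RING_SIZE) {
		r->dropped++;
		return false;
	}
	r->slots[r->head] = frame;
	r->head = (r->head + 1) % DR_RING_SIZE;	/* wrap around */
	r->used++;
	return true;
}

/* Kernel side: consume up to budget frames, return how many were taken. */
static uint32_t dr_kernel_poll(struct dr_ring *r, uint32_t budget)
{
	uint32_t done = 0;

	while (done < budget && r->used > 0) {
		r->tail = (r->tail + 1) % DR_RING_SIZE;
		r->used--;
		done++;
	}
	return done;
}
```

Six frames arriving with no poll in between fill the four slots and drop two. After the kernel side consumes some frames, new arrivals are accepted again. In this model, a larger ring or faster polling are the two ways to reduce drops.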

### Question 2: What are software interrupts and hardware interrupts respectively?

Key flow of data packet reception in Linux network stack:

1. **Hardware Stage**: Network card places received data packets into RingBuffer
2. **Hardware Interrupt Trigger**: Network card generates hardware interrupt to notify CPU
3. **Hardware Interrupt Processing**: Add network card device to `poll_list` doubly linked list in `softnet_data` structure
4. **Software Interrupt Trigger**: Trigger `NET_RX_SOFTIRQ` software interrupt
5. **Software Interrupt Processing**: Traverse `poll_list`, execute network card driver's `poll` function to collect network packets
6. **Protocol Stack Processing**: Forward data packets to protocol processing functions like `ip_rcv`, `udp_rcv`, `tcp_v4_rcv`

This describes the Linux NAPI (New API) mechanism, an efficient method for handling network data packets.
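Step 5, traversing `poll_list` under a budget, can be modelled in a few lines. This is a sketch under stated assumptions: the `rx_` names are hypothetical, each device is reduced to a backlog counter, and the overall budget of 300 and per-device weight of 64 used in the usage note are assumed values.

```c
#include <stddef.h>

struct rx_napi {
	struct rx_napi *next;
	int backlog;	/* packets waiting in this device's ring */
	int weight;	/* per-poll limit */
	int delivered;	/* packets handed to the protocol stack so far */
};

/* Stand-in for a driver poll function: take at most budget packets. */
static int rx_poll(struct rx_napi *n, int budget)
{
	int work = n->backlog < budget ? n->backlog : budget;

	n->backlog -= work;
	n->delivered += work;
	return work;
}

/*
 * Model of the receive softirq handler: poll devices from the head of the
 * list until the list is empty or the budget is used up. A device that
 * still has packets is moved to the tail so other devices get a turn.
 * Returns the total number of packets processed.
 */
static int rx_action(struct rx_napi **list, int budget)
{
	int total = 0;

	while (*list && budget > 0) {
		struct rx_napi *n = *list;
		int limit = n->weight < budget ? n->weight : budget;
		int work = rx_poll(n, limit);

		total += work;
		budget -= work;
		*list = n->next;	/* unlink from the head */
		n->next = NULL;

		if (n->backlog > 0 && work > 0) {	/* more to do: re-queue */
			struct rx_napi **pp = list;

			while (*pp)
				pp = &(*pp)->next;
			*pp = n;
		}
	}
	return total;
}
```

With two devices holding 100 and 10 packets and a budget of 300, the first device is polled twice (64, then 36 packets) with the second device served in between, and the list ends up empty. With a budget of 70 and one device holding 100 packets, 30 packets remain and the device stays queued for the next run.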

#### Actual Application of RingBuffer in Network Stack

##### Network Card RX/TX Ring Buffers

```c
/* Simplified network card RX Ring structure */
struct e1000_rx_desc {
    __le64 buffer_addr;    /* Data buffer address */
    __le16 length;         /* Packet length */
    __le16 checksum;       /* Checksum */
    __u8  status;          /* Descriptor status */
    __u8  errors;          /* Error code */
    __le16 special;
};
```

In practice, an Intel network card's RX Ring might contain 256 such descriptors, forming a circular structure.

##### Actual Example of Hardware and Software Interrupt Cooperation

A simplified sketch based on the Intel 82599 (ixgbe) network card driver. The helpers ixgbe_disable_interrupt and ixgbe_enable_interrupt are illustrative placeholders rather than real driver functions, and the real ixgbe_clean_rx_irq takes additional arguments:

```c
/* Hardware interrupt handler for a queue vector (simplified) */
static irqreturn_t ixgbe_msix_clean_rings(int irq, void *data)
{
    struct ixgbe_q_vector *q_vector = data;

    /* Disable network card interrupt (illustrative placeholder) */
    ixgbe_disable_interrupt();

    /* Add this vector's NAPI instance to poll_list and raise NET_RX_SOFTIRQ */
    napi_schedule(&q_vector->napi);

    return IRQ_HANDLED;
}

/* NAPI poll function (simplified) */
static int ixgbe_poll(struct napi_struct *napi, int budget)
{
    struct ixgbe_q_vector *q_vector = container_of(napi, struct ixgbe_q_vector, napi);
    int work_done;

    /* Batch receive packets from RingBuffer, process at most budget packets */
    work_done = ixgbe_clean_rx_irq(q_vector, budget);

    /*
     * Fewer packets than budget means the ring is drained: leave polling
     * mode and re-enable the interrupt. Otherwise stay in poll_list.
     */
    if (work_done < budget) {
        napi_complete(napi);
        ixgbe_enable_interrupt();   /* illustrative placeholder */
    }

    return work_done;
}
```
