Consolidating Packet Forwarding Services with Data-Plane Development Software

Consolidating all three planes to a single ATCA blade is now possible.

By Jack Lin, Yunxia Guo, and Xiang Li, ADLINK

In recent years, there has been a market and technology trend towards the convergence of network infrastructure onto a common platform or modular components that support multiple network elements and functions, such as application processing, control processing, packet processing and signal processing. In addition to cost savings and reduced time-to-market, this approach provides the flexibility of modularity and the ability to upgrade system components independently, where and when needed, in shelf systems and networks of varying sizes. In traditional networks, four different module types were needed: switching modules to route traffic between in-band system modules and out-of-band systems; processor modules for applications and control-plane functions; packet processing modules for data-plane functions; and DSP modules for specialized signal-plane functions.

Enhancements to processor architecture and the availability of new software development tools are enabling developers to use a single blade architecture to consolidate all of their application, control and packet-processing workloads. The large performance gains achieved by this hardware/software combination are making the processor blade architecture increasingly viable as a packet-processing solution. To illustrate this evolution, we developed a series of tests to verify that an AdvancedTCA processor blade combined with the Intel® Data Plane Development Kit (Intel® DPDK) supplied by the CPU manufacturer can provide the required performance and consolidate IP forwarding services on a single platform. In summary, we compared the Layer3 forwarding performance of an ATCA blade using native Linux IP forwarding, without any additional software optimization, with that obtained using the Intel DPDK. We then analyzed the reasons behind the gains in IP forwarding performance achieved using the Intel DPDK.

AdvancedTCA Processor Blade
The ATCA blade used in this study is a highly integrated processor blade with dual x86 processors, each with 8 cores (16 threads) and supporting eight channels of DDR3-1600 VLP RDIMM for a maximum system memory capacity of 64GB per processor. Network I/O features include two 10Gigabit Ethernet ports (XAUI, 10GBase-KX4) compliant with PICMG 3.1 option 1/9, and up to six Gigabit Ethernet 10/100/1000BASE-T ports to the front panel. The detailed architecture of the ATCA blade is illustrated in the functional block diagram in Figure 1.

Figure 1: ADLINK aTCA-6200 functional block diagram


Intel® Data Plane Development Kit
The Intel DPDK provides a lightweight run-time environment for x86 architecture processors, offering low overhead and run-to-completion mode to maximize packet-processing performance. The environment provides a rich selection of optimized and efficient libraries, also known as the environment abstraction layer (EAL), which are responsible for initializing and allocating low-level resources, hiding the environment specifics from the applications and libraries, and gaining access to the low-level resources such as memory space, PCI devices, timers and consoles.

The EAL provides an optimized poll mode driver (PMD); memory and buffer management; and timer, debug and packet-handling APIs, some of which may also be provided by the Linux OS. To facilitate interaction with application layers, the EAL, together with the standard GNU C Library (GLIBC), provides full APIs for integration with higher-level applications. The software hierarchy is shown in Figure 2.

Figure 2: EAL and GLIBC in Linux application environment

Test Topology
To measure the speed at which the ATCA processor blade can process and forward IP packets at the Layer3 level, we used the test environment shown in Figure 3.

Figure 3: IP Forwarding Test Environment


Two ATCA switch blades with networking software provided non-blocking interconnection switches for the 10GbE Fabric and 1GbE Base Interface channels of all three processor blades in the ATCA shelf, which supports a full-mesh topology. Therefore, each switch blade can provide at least one Fabric and Base interface connection to each processor blade. A test system, compliant with RFC2544 for throughput benchmarking, was used as a packet simulator to send IP packets with different frame sizes and collect the final statistical data, such as frames per second and throughput.

As shown in the topology of the test environment in Figure 3, the ATCA processor blade (device under test: DUT) has four Gigabit Ethernet interfaces: two directly from the front panel (Flow1 and Flow2), and another two from the Base Interfaces (Flow3 and Flow4) via the DUT’s Base switches. In addition to these four 1GbE interfaces, the DUT has two 10GbE interfaces connected to the test system via the switch blade.

In our test environment, the DUT was responsible for receiving IPv4 packets from the test system, processing these packets at the Layer3 level (e.g., packet de-encapsulation, IPv4 header checksum validation, route table look-up and packet encapsulation), then finally sending the packets back to the test system according to the routing table look-up result. All six flows are bi-directional: for example, the test system sends frames from Interface 1/2/3/4/5/6 to the DUT and receives frames via Interface 2/1/4/3/6/5, respectively.
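The Layer3 steps listed above can be sketched in a few lines. This is only an illustration of the general technique, not the forwarding application used in the tests: the routing table contents, the function names and the TTL handling are assumptions introduced for the example, and Ethernet de-encapsulation and encapsulation are left out.

```python
import ipaddress
import struct

def ipv4_checksum(header: bytes) -> int:
    """One's-complement checksum over an IPv4 header (RFC 791)."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_header(src: str, dst: str, ttl: int = 64) -> bytearray:
    """Build a minimal 20-byte IPv4 header with a valid checksum."""
    hdr = bytearray(struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20, 0, 0, ttl, 17, 0,
        ipaddress.IPv4Address(src).packed,
        ipaddress.IPv4Address(dst).packed))
    struct.pack_into("!H", hdr, 10, ipv4_checksum(bytes(hdr)))
    return hdr

# Hypothetical routing table: destination prefix -> egress interface number.
ROUTES = {
    ipaddress.ip_network("10.1.0.0/16"): 1,
    ipaddress.ip_network("10.2.0.0/16"): 2,
    ipaddress.ip_network("10.2.5.0/24"): 3,
}

def lookup(dst):
    """Longest-prefix match; returns the egress interface or None."""
    matches = [net for net in ROUTES if dst in net]
    if not matches:
        return None
    return ROUTES[max(matches, key=lambda net: net.prefixlen)]

def forward(hdr: bytearray):
    """Validate, route and rewrite a header in place.

    Returns the egress interface, or None if the packet is dropped.
    """
    if ipv4_checksum(bytes(hdr)) != 0:       # a valid header checks to zero
        return None
    ttl = hdr[8]
    if ttl <= 1:                             # assumption: drop expired packets
        return None
    port = lookup(ipaddress.IPv4Address(bytes(hdr[16:20])))
    if port is None:
        return None
    hdr[8] = ttl - 1                         # decrement TTL
    hdr[10:12] = b"\x00\x00"                 # clear, then recompute checksum
    struct.pack_into("!H", hdr, 10, ipv4_checksum(bytes(hdr)))
    return port

packet = build_header("10.1.0.1", "10.2.5.9")
print(forward(packet))                       # most specific route wins
```

A linear scan over the prefixes is enough to show the idea; a production forwarder would more likely use a trie or a similar structure for the route look-up.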

Test Methodology
To evaluate how the Intel DPDK consolidates packet-forwarding services on the processor blade, IP forwarding performance was measured in the following two test cases, first with native Linux IP forwarding and then with an IP forwarding application based on the Intel DPDK:

Performance with native Linux
In this test, Ubuntu Server 11.10 64-bit was installed on the ATCA processor blade.

Performance with Intel DPDK
The Intel DPDK can be run in different modes, such as Bare Metal, Linux with Bare Metal Run-Time and Linux User Space. The Linux User Space mode is the easiest to use in the initial development stages. Details of how the Intel DPDK functions in Linux User Space Mode are shown in Figure 4.

Figure 4: Intel DPDK running in Linux User Space Mode

After compiling the Intel DPDK target environment, an IP forwarding application can be run as a Linux User Space application.

After testing the ATCA processor blade under native Linux and with the Intel DPDK provided by the CPU manufacturer, we compared the IP forwarding performance in these two configurations from the four 1GbE interfaces (2 from the front panel and 2 from the Base Interfaces) and two 10GbE Fabric Interfaces. In addition, we benchmarked the combined IPv4 forwarding performance of the processor blade using all six interfaces simultaneously (four 1GbE interfaces and two 10GbE interfaces).

Performance comparison using four 1GbE interfaces
When running IPv4 forwarding on the four 1GbE interfaces of the processor blade with native Linux IP forwarding enabled, a rate of 1 million frames per second can be sustained with a frame size of 64 bytes. As the frame size is increased to 1024 bytes, native Linux IP forwarding can approach 100% of the line rate. In real-world traffic, however, frame sizes are usually smaller than 1024 bytes, so native Linux cannot achieve 100% line-rate forwarding. By contrast, with the Intel DPDK running on only two CPU threads under the same Linux OS, the processor blade can forward frames at 100% of the line rate without any frame loss regardless of the frame size, as shown in Figure 5.



At the 64-byte frame size, the ATCA processor blade running the Intel DPDK provides almost 6 times the IP forwarding performance of native Linux IP forwarding.
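The "almost 6 times" figure can be checked against the theoretical Ethernet line rate. The short calculation below is a sketch that assumes the standard 20 bytes of per-frame wire overhead (preamble, start-of-frame delimiter and inter-frame gap) and uses the Ethernet frame sizes suggested by RFC 2544; the function and variable names are introduced only for this example.

```python
# Per-frame overhead on the wire: 7-byte preamble, 1-byte start-of-frame
# delimiter and 12-byte inter-frame gap.
WIRE_OVERHEAD_BYTES = 20

def line_rate_fps(link_bps: float, frame_bytes: int) -> float:
    """Maximum frames per second a link can carry at a given frame size."""
    return link_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8)

four_gbe_bps = 4 * 1e9   # four 1GbE interfaces

# Ethernet frame sizes suggested by RFC 2544.
for size in (64, 128, 256, 512, 1024, 1280, 1518):
    mfps = line_rate_fps(four_gbe_bps, size) / 1e6
    print(f"{size:>5} bytes: {mfps:.2f} Mfps")

# Line rate at 64 bytes relative to the 1 million frames per second
# reported for native Linux.
ratio = line_rate_fps(four_gbe_bps, 64) / 1e6
print(f"line rate vs. native Linux at 64 bytes: {ratio:.2f}x")
```

At 64 bytes, four 1GbE links carry about 5.95 million frames per second in total, which matches the reported gain over the 1 million frames per second sustained by native Linux.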

Performance comparison using two 10GbE interfaces
Running the IP forwarding test on the two 10GbE Fabric Interfaces shows an even greater performance gap between native Linux and Intel DPDK-based IP forwarding than the test using four 1GbE interfaces. As shown in Figure 6, the processor blade with the Intel DPDK running on only two threads provides more than 10 times the IP forwarding performance of native Linux using all available CPU threads.



Total IPv4 forwarding performance of the processor blade
When the combined IP forwarding performance of the processor blade is tested using all available interfaces (two 10GbE Fabric Interfaces, two 1GbE front panel interfaces and two 1GbE Base Interfaces), the processor blade with the Intel DPDK can forward up to 27 million frames per second at a frame size of 64 bytes. In other words, up to 18Gbps of the theoretical 24Gbps throughput can be forwarded (i.e., 75.3% of the line rate). Furthermore, the throughput rises to 92.3% and 99% of the line rate when the frame size is set to 128 bytes and 256 bytes, respectively.
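The relationship between the frame rate and the share of line rate quoted above can be reproduced with a short calculation. This is a sketch that assumes 20 bytes of per-frame wire overhead and takes the rounded figure of 27 million frames per second as input, so the result is only approximate.

```python
WIRE_OVERHEAD_BYTES = 20  # preamble, start-of-frame delimiter, inter-frame gap

def wire_gbps(frames_per_second: float, frame_bytes: int) -> float:
    """Bandwidth occupied on the wire in Gbps, including per-frame overhead."""
    return frames_per_second * (frame_bytes + WIRE_OVERHEAD_BYTES) * 8 / 1e9

aggregate_gbps = 2 * 10 + 4 * 1      # two 10GbE plus four 1GbE interfaces
used_gbps = wire_gbps(27e6, 64)      # rounded frame rate at 64-byte frames
share = used_gbps / aggregate_gbps

print(f"{used_gbps:.2f} Gbps of {aggregate_gbps} Gbps = {share:.1%}")
```

The rounded input gives roughly 18.1 Gbps, or about 75.6% of 24 Gbps. That is in line with the reported 75.3% if the measured frame rate was slightly below 27 million frames per second.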



The Intel DPDK delivers higher IP forwarding performance than native Linux mainly because of the design features described below.

Polling mode instead of interrupts
Generally, when packets come in, native Linux receives interrupts from the network interface controller (NIC), schedules the softIRQ, proceeds with context switching, and invokes system calls such as read() and write().

In contrast, the Intel DPDK uses an optimized poll mode driver (PMD) instead of the default Ethernet driver to poll continuously for incoming packets, avoiding software interrupts, context switches and system calls. This saves significant CPU resources and reduces latency.
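The polling pattern can be illustrated with a small simulation. The sketch below is not the Intel DPDK API: `FakeRxQueue`, `rx_burst` and `poll_loop` are hypothetical names, and the receive queue is an in-memory stand-in for a NIC ring. It shows only the shape of the loop, in which a dedicated thread repeatedly asks for a burst of frames and processes each one to completion without ever blocking.

```python
from collections import deque

class FakeRxQueue:
    """In-memory stand-in for a NIC receive ring (hypothetical)."""

    def __init__(self, frames):
        self._ring = deque(frames)

    def rx_burst(self, max_frames):
        """Return up to max_frames without blocking; empty list if idle."""
        burst = []
        while self._ring and len(burst) < max_frames:
            burst.append(self._ring.popleft())
        return burst

def poll_loop(rxq, handle, burst_size=32, max_idle_polls=3):
    """Busy-poll the queue and run every frame to completion.

    A real forwarding loop would spin forever on its core; this one stops
    after a few empty polls so that the sketch terminates.
    """
    processed = 0
    idle = 0
    while idle < max_idle_polls:
        burst = rxq.rx_burst(burst_size)
        if not burst:
            idle += 1
            continue
        idle = 0
        for frame in burst:
            handle(frame)        # parse, look up and transmit in one pass
            processed += 1
    return processed

forwarded = []
rxq = FakeRxQueue([b"frame"] * 100)
count = poll_loop(rxq, forwarded.append)
print(count, len(forwarded))
```

Fetching frames in bursts spreads the cost of each poll over several packets. The trade-off is that the polling thread keeps its core busy even when no traffic arrives.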

Hugepage instead of traditional pages
Compared to the 4 KB pages used by native Linux, larger pages reduce the time spent on page look-ups and lower the likelihood of a translation lookaside buffer (TLB) miss.

The Intel DPDK runs as a user-space application and allocates huge pages in its own memory zone to store frame buffers, rings and other related buffers, which are outside the control of other applications and even of the Linux kernel. In the tests described in this white paper, a total of 1024 huge pages of 2 MB each (2 GB in all) were reserved for running the IP forwarding application.
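A quick calculation shows why the page size matters for the reservation described above. The page sizes and page count come from the text; the TLB entry count is a purely hypothetical value chosen to make the comparison concrete, since actual TLB sizes vary by processor.

```python
KIB = 1024
MIB = 1024 * KIB

SMALL_PAGE = 4 * KIB
HUGE_PAGE = 2 * MIB
HUGE_PAGE_COUNT = 1024               # reservation used in the tests

reserved_bytes = HUGE_PAGE_COUNT * HUGE_PAGE
small_pages_needed = reserved_bytes // SMALL_PAGE

def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of memory that a TLB with this many entries can map at once."""
    return entries * page_size

TLB_ENTRIES = 64                     # hypothetical entry count, for illustration

print(f"reserved memory: {reserved_bytes // MIB} MiB")
print(f"4 KiB pages needed for the same memory: {small_pages_needed}")
print(f"TLB reach with 4 KiB pages: {tlb_reach(TLB_ENTRIES, SMALL_PAGE) // KIB} KiB")
print(f"TLB reach with 2 MiB pages: {tlb_reach(TLB_ENTRIES, HUGE_PAGE) // MIB} MiB")
```

Covering the same 2 GB with 4 KB pages would take 524,288 page mappings instead of 1024, so the same number of TLB entries maps 512 times more memory when huge pages are used.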

Zero-copy buffers
In traditional packet processing, native Linux decapsulates the packet header, and then copies the data to the user space buffer according to the socket ID. Once the user space application finishes processing the data, a write system call is invoked to send out data to the kernel, which takes charge of copying data from the user space buffer to the kernel buffer, encapsulates the packet header and finally sends it out via the relevant physical port. Obviously, the native Linux process sacrifices time and resources on buffer copies between kernel and user space buffers.

In comparison, the Intel DPDK receives packets at its reserved memory zone, which is located in the user-space buffer, and then classifies the packets to each flow according to configured rules without copying to the kernel buffer. After processing the decapsulated packets, it encapsulates the packets with the correct headers in the same user-space buffer, and finally sends them out to the relevant physical ports.
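The zero-copy idea can be sketched with a preallocated buffer pool and in-place header rewriting. This is an illustration only: `BufferPool` and `swap_mac_addresses` are hypothetical names, the pool stands in for a reserved memory zone, and swapping the MAC addresses is just a simple example of re-encapsulation. The point is that the frame is modified where it was received, through a `memoryview`, and its payload is never copied.

```python
class BufferPool:
    """Preallocated pool of fixed-size frame buffers (hypothetical)."""

    def __init__(self, count: int, size: int):
        self._backing = bytearray(count * size)   # allocated once, up front
        self._size = size
        self._free = list(range(count))

    def alloc(self):
        """Hand out a buffer index and a view into the shared backing store."""
        idx = self._free.pop()
        return idx, self.view(idx)

    def view(self, idx: int) -> memoryview:
        start = idx * self._size
        return memoryview(self._backing)[start:start + self._size]

    def free(self, idx: int) -> None:
        self._free.append(idx)

def swap_mac_addresses(frame: memoryview) -> None:
    """Rewrite the Ethernet header in place; only 12 header bytes are touched."""
    dst = bytes(frame[0:6])
    src = bytes(frame[6:12])
    frame[0:6] = src
    frame[6:12] = dst

pool = BufferPool(count=4, size=2048)
idx, frame = pool.alloc()

# Simulate reception: the "NIC" writes the frame straight into the pool.
frame[0:6] = bytes.fromhex("aabbccddeeff")       # destination MAC
frame[6:12] = bytes.fromhex("112233445566")      # source MAC
frame[12:14] = bytes.fromhex("0800")             # EtherType: IPv4

swap_mac_addresses(frame)

# A fresh view of the same buffer sees the rewrite, so no copy was made.
print(bytes(pool.view(idx)[0:6]).hex())
pool.free(idx)
```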

Run-to-completion and core affinity
Prior to running applications, the Intel DPDK initializes and allocates all low-level resources, such as memory space, PCI devices, timers and consoles, which are reserved for Intel DPDK-based applications only. After initialization, each core is launched to take over an execution unit; the execution units run the same or different workloads, depending on the actual application requirements.

Moreover, the Intel DPDK provides a way to pin each execution unit to a specific core, preserving core affinity and thus avoiding cache misses. In the tests described, the physical ports of the processor blade were bound to two different CPU threads according to affinity.
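Core affinity itself is an operating-system facility, and it can be demonstrated without the Intel DPDK. The sketch below uses the Linux-only `os.sched_setaffinity` call from the Python standard library to pin the current process to one logical CPU and then restores the original mask. The helper name `pin_to_cpu` is introduced for this example, and the sketch does not show how the Intel DPDK assigns its execution units.

```python
import os

def pin_to_cpu(cpu: int) -> set:
    """Pin the calling process to one logical CPU; return the new affinity."""
    os.sched_setaffinity(0, {cpu})       # pid 0 means the calling process
    return os.sched_getaffinity(0)

pinned = None
if hasattr(os, "sched_setaffinity"):     # available on Linux only
    allowed = os.sched_getaffinity(0)    # CPUs this process may run on
    target = min(allowed)
    pinned = pin_to_cpu(target)
    print(f"pinned to CPU {target}: {pinned}")
    os.sched_setaffinity(0, allowed)     # restore the original mask
```

Once a thread stays on one core, the data it touches repeatedly, such as routing entries and ring descriptors, tends to remain in that core's caches instead of being refetched after every migration.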

Lockless implementation and cache alignment
The libraries and APIs provided by the Intel DPDK are optimized to be lockless, which prevents deadlocks in multi-threaded applications. Buffers, rings and other data structures are also cache aligned to maximize cache-line efficiency and minimize cache-line contention.
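Both ideas can be shown in miniature. The sketch below is a generic single-producer/single-consumer ring, not the ring implementation shipped with the Intel DPDK, together with a helper that rounds a structure size up to a cache-line multiple. The 64-byte cache line is an assumption that holds for typical x86 processors. In Python the ring is illustrative only: a native implementation would also need atomic index updates and memory barriers.

```python
CACHE_LINE = 64  # bytes; assumed cache-line size

def align_up(size: int, alignment: int = CACHE_LINE) -> int:
    """Round size up to the next multiple of a power-of-two alignment."""
    return (size + alignment - 1) & ~(alignment - 1)

class SpscRing:
    """Single-producer/single-consumer ring buffer.

    Only the producer writes `_head` and only the consumer writes `_tail`,
    so the two sides never contend for the same index and need no lock.
    """

    def __init__(self, capacity: int):
        if capacity < 2 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self._slots = [None] * capacity
        self._mask = capacity - 1    # lets us wrap with a cheap bitwise AND
        self._head = 0               # next slot to write (producer side)
        self._tail = 0               # next slot to read (consumer side)

    def enqueue(self, item) -> bool:
        if self._head - self._tail > self._mask:     # ring is full
            return False
        self._slots[self._head & self._mask] = item
        self._head += 1
        return True

    def dequeue(self):
        if self._tail == self._head:                 # ring is empty
            return None
        item = self._slots[self._tail & self._mask]
        self._tail += 1
        return item

ring = SpscRing(4)
accepted = [ring.enqueue(n) for n in range(5)]   # fifth enqueue is refused
print(accepted, ring.dequeue())
print(align_up(100))                             # size padded to whole cache lines
```

Padding each structure to a whole number of cache lines keeps two cores from writing to different fields that share one line, which is the contention that cache alignment is meant to avoid.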

By analyzing the results of our tests using the ATCA processor blade’s four 1GbE interfaces and two 10GbE Fabric Interfaces with and without the Intel DPDK provided by the CPU manufacturer (Figures 5 and 6), we can conclude that running Linux with the Intel DPDK on only two CPU threads achieves from almost 6 times (four 1GbE interfaces) to more than 10 times (two 10GbE interfaces) the IP forwarding performance of native Linux using all CPU threads on the same hardware platform.

As is evident in Figure 7, the IPv4 forwarding performance achieved by the processor blade with the Intel DPDK makes it cost- and performance-effective for customers to migrate their packet processing applications from network processor-based hardware to x86-based platforms, and use a uniform platform to deploy different services, such as application processing, control processing and packet processing services.



Jack Lin is the team manager of Platform Integration and Validation, Embedded Computing Product Segment, which focuses on validating ADLINK building blocks and integrating application-ready platforms for end customers. He holds a B.S. and M.S. in information and communication engineering from Beijing JiaoTong University. Prior to joining ADLINK, he worked for Intel and Kasenna.





Yunxia Guo is a PIV software system engineer in ADLINK's Embedded Computing Product Segment and holds a B.S. in communication engineering from Hubei University of Technology and an M.S. in information and communication engineering from Wuhan University of Technology.





Xiang Li is a member of the platform integration and validation team in ADLINK's Embedded Computing Product Segment. He holds a B.S. in electronic and information engineering from Shanghai Tongji University.