
High performance packet handling in Linux

I'm working on a packet reshaping project in Linux using the BeagleBone Black. Basically, packets are received on one VLAN, modified, and then sent out on a different VLAN. This process is bidirectional: the VLANs are not designated as input-only or output-only. It's similar to a network bridge, but packets are altered (sometimes fairly significantly) in transit.

I've tried two different methods for accomplishing this:

  1. Creating a user space application that opens raw sockets on both interfaces. All packet processing (including bridging) is handled in the application.
  2. Setting up a software bridge (using the kernel bridge module) and adding a kernel module that installs a netfilter hook at post-routing (NF_BR_POST_ROUTING); a sketch of such a hook follows this list. All packet processing is handled in the kernel.
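
For reference, the hook registration in the second approach looks roughly like the sketch below. This is a minimal illustration, not the actual module in use, and it assumes a kernel recent enough to provide nf_register_net_hook(); older kernels use nf_register_hook() and a slightly different hook signature.

    #include <linux/module.h>
    #include <linux/netfilter.h>
    #include <linux/netfilter_bridge.h>
    #include <linux/skbuff.h>
    #include <net/net_namespace.h>

    /* Hook function: runs for every frame leaving the bridge. */
    static unsigned int reshape_hook(void *priv, struct sk_buff *skb,
                                     const struct nf_hook_state *state)
    {
            /* Packet modification would happen here, directly on the skb. */
            return NF_ACCEPT;
    }

    static struct nf_hook_ops reshape_ops = {
            .hook     = reshape_hook,
            .pf       = NFPROTO_BRIDGE,
            .hooknum  = NF_BR_POST_ROUTING,
            .priority = NF_BR_PRI_LAST,
    };

    static int __init reshape_init(void)
    {
            return nf_register_net_hook(&init_net, &reshape_ops);
    }

    static void __exit reshape_exit(void)
    {
            nf_unregister_net_hook(&init_net, &reshape_ops);
    }

    module_init(reshape_init);
    module_exit(reshape_exit);
    MODULE_LICENSE("GPL");

Because the hook runs inside the bridge's own forwarding path, a frame never leaves the kernel: it is received, modified, and queued for transmission without any copy to user space.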

The second option appears to be around 4 times faster than the first. I'd like to understand more about why this is. I've tried brainstorming a bit and wondered whether there is a substantial performance hit in rapidly switching between kernel and user space, or whether something about the socket interface is inherently slow.

I think the user application is fairly optimized (for example, I'm using PACKET_MMAP), but it's possible that it could be optimized further. I ran perf on the application and noticed that it spends a good deal of time (35%) in v7_flush_kern_dcache_area, so perhaps that is a likely candidate. If there are any other suggestions on common ways to optimize packet processing, I can give them a try.
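
For context, the PACKET_MMAP setup the user-space approach relies on looks roughly like this minimal sketch. It assumes TPACKET_V2; the ring dimensions and the interface name "eth0.100" are placeholders, and error handling is omitted.

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/mman.h>
    #include <linux/if_packet.h>
    #include <linux/if_ether.h>
    #include <net/if.h>
    #include <arpa/inet.h>
    #include <poll.h>

    int main(void)
    {
        /* Raw socket that sees every protocol on the bound interface. */
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

        /* Ask for a shared RX ring: 64 blocks of 4 KiB, two 2 KiB frames each. */
        struct tpacket_req req = {
            .tp_block_size = 4096,
            .tp_block_nr   = 64,
            .tp_frame_size = 2048,
            .tp_frame_nr   = 128,
        };
        int version = TPACKET_V2;
        setsockopt(fd, SOL_PACKET, PACKET_VERSION, &version, sizeof(version));
        setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

        /* Map the ring so frames are read without a per-packet copy. */
        size_t len = (size_t)req.tp_block_size * req.tp_block_nr;
        char *ring = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        /* Bind to one VLAN interface ("eth0.100" is a placeholder name). */
        struct sockaddr_ll addr = {
            .sll_family   = AF_PACKET,
            .sll_protocol = htons(ETH_P_ALL),
            .sll_ifindex  = if_nametoindex("eth0.100"),
        };
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        /* Wait for traffic, then inspect the first frame slot: each slot starts
           with a tpacket2_hdr whose tp_status says who currently owns it. */
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        poll(&pfd, 1, -1);

        struct tpacket2_hdr *hdr = (struct tpacket2_hdr *)ring;
        if (hdr->tp_status & TP_STATUS_USER) {
            printf("received %u bytes\n", hdr->tp_len);
            hdr->tp_status = TP_STATUS_KERNEL;   /* return the slot to the kernel */
        }

        munmap(ring, len);
        close(fd);
        return 0;
    }

The ring lets the kernel deposit frames into memory that is already mapped into the process, saving one copy and one syscall per packet; the packets still have to cross the kernel/user boundary on every receive and send, which is where the context-switch cost discussed in the answer below comes in.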

Context switches are expensive, and kernel-to-user-space switches imply a context switch. You can see this article for exact numbers, but the stated durations are all on the order of microseconds.

You can also use lmbench to benchmark the real cost of context switches on your particular CPU.

The performance of the user-space application also depends on the syscall used to monitor the sockets. The fastest option when you need to handle a lot of sockets is epoll(); select() performs very poorly with a large number of sockets.

See this post explaining it: Why is epoll faster than select?
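
With only two packet sockets the difference between the two calls is small, but for completeness, a minimal epoll() loop looks like the sketch below. It assumes fd_a and fd_b are the two already-bound packet sockets; the names are placeholders.

    #include <sys/epoll.h>
    #include <unistd.h>

    /* fd_a and fd_b are assumed to be the two already-bound packet sockets. */
    static void run_loop(int fd_a, int fd_b)
    {
        int ep = epoll_create1(0);

        struct epoll_event ev = { .events = EPOLLIN };
        ev.data.fd = fd_a;
        epoll_ctl(ep, EPOLL_CTL_ADD, fd_a, &ev);
        ev.data.fd = fd_b;
        epoll_ctl(ep, EPOLL_CTL_ADD, fd_b, &ev);

        for (;;) {
            struct epoll_event events[8];
            int n = epoll_wait(ep, events, 8, -1);   /* block until a socket is readable */
            for (int i = 0; i < n; i++) {
                int fd = events[i].data.fd;
                (void)fd;  /* read, modify, and forward the packet from fd here */
            }
        }
    }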
