Why does setting SO_SNDBUF and SO_RCVBUF destroy performance?

Running in Docker on macOS, I have a simple server and client set up to measure how fast I can allocate data on the client and send it to the server. The tests are done over loopback (within the same Docker container). The message size for my tests was 1000000 bytes.

When I set SO_RCVBUF and SO_SNDBUF to their respective defaults, the performance halves.

SO_RCVBUF defaults to 65536 and SO_SNDBUF defaults to 1313280 (retrieved by calling getsockopt and dividing by 2).
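For reference, a minimal sketch of reading those defaults back (the halving mirrors the doubled bookkeeping value that getsockopt reports, per socket(7); error handling elided):

#include <cstdio>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    int rcv = 0, snd = 0;
    socklen_t len = sizeof(rcv);
    getsockopt(s, SOL_SOCKET, SO_RCVBUF, &rcv, &len);  /* raw kernel value */
    len = sizeof(snd);
    getsockopt(s, SOL_SOCKET, SO_SNDBUF, &snd, &len);
    printf("SO_RCVBUF: %d (raw %d)\n", rcv / 2, rcv);
    printf("SO_SNDBUF: %d (raw %d)\n", snd / 2, snd);
    close(s);
    return 0;
}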

Tests:

  • When I set neither buffer size, I get about 7 Gb/s of throughput.
  • When I set one buffer or the other to the default (or higher), I get 3.5 Gb/s.
  • When I set both buffer sizes to the default, I get 2.5 Gb/s.

Server code: (cs is an accepted stream socket)

void tcp_rr(int cs, uint64_t& processed) {
    /* I remove this entire thing and performance improves */
    if (setsockopt(cs, SOL_SOCKET, SO_RCVBUF, &ENV.recv_buf, sizeof(ENV.recv_buf)) == -1) {
        perror("RCVBUF failure");
        return;
    }
    char *buf = (char *)malloc(ENV.msg_size);
    while (true) {
        /* Drain exactly one full message per outer iteration. */
        int recved = 0;
        while (recved < ENV.msg_size) {
            int recvret = recv(cs, buf + recved, ENV.msg_size - recved, 0);
            if (recvret <= 0) {
                if (recvret < 0) {
                    perror("Recv error");
                }
                free(buf);  /* peer closed or error: the only exit path */
                return;
            }
            processed += recvret;
            recved += recvret;
        }
    }
}

Client code: (s is a connected stream socket)

void tcp_rr(int s, uint64_t& processed, BenchStats& stats) {
    /* I remove this entire thing and performance improves */
    if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &ENV.send_buf, sizeof(ENV.send_buf)) == -1) {
        perror("SNDBUF failure");
        return;
    }
    char *buf = (char *)malloc(ENV.msg_size);  /* contents don't matter for the benchmark */
    while (stats.elapsed_millis() < TEST_TIME_MILLIS) {
        /* Push exactly one full message per outer iteration. */
        int sent = 0;
        while (sent < ENV.msg_size) {
            int sendret = send(s, buf + sent, ENV.msg_size - sent, 0);
            if (sendret <= 0) {
                if (sendret < 0) {
                    perror("Send error");
                }
                free(buf);  /* error path: clean up and exit */
                return;
            }
            processed += sendret;
            sent += sendret;
        }
    }
    free(buf);
}

Zeroing in on SO_SNDBUF:
The default appears to be: net.ipv4.tcp_wmem = 4096 16384 4194304

If I setsockopt to 4194304 and then getsockopt (to see what's currently set), it returns 425984 (roughly 10x less than I requested).
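A minimal sketch of that set-then-get round trip (assuming s is a connected TCP socket, as in the client code above):

int requested = 4194304;
if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &requested, sizeof(requested)) == -1)
    perror("SNDBUF set failure");

int actual = 0;
socklen_t len = sizeof(actual);
if (getsockopt(s, SOL_SOCKET, SO_SNDBUF, &actual, &len) == -1)
    perror("SNDBUF get failure");

/* Prints "requested 4194304, got back 425984" here:
 * min(4194304, wmem_max) = 212992, then doubled by the kernel. */
printf("requested %d, got back %d\n", requested, actual);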

Additionally, it appears that setsockopt places a lock on buffer expansion (for send, the lock's name is SOCK_SNDBUF_LOCK, and it prohibits adaptive expansion of the buffer). The question then is: why can't I request the full-size buffer?

Clues about what is going on come from the kernel source's handler for SO_SNDBUF (and SO_RCVBUF, but we'll focus on SO_SNDBUF below).

net/core/sock.c contains the implementations of the generic socket operations, including the SOL_SOCKET getsockopt and setsockopt handlers.

Examining what happens when we call setsockopt(s, SOL_SOCKET, SO_SNDBUF, ...):

        case SO_SNDBUF:
                /* Don't error on this BSD doesn't and if you think
                 * about it this is right. Otherwise apps have to
                 * play 'guess the biggest size' games. RCVBUF/SNDBUF
                 * are treated in BSD as hints
                 */
                val = min_t(u32, val, sysctl_wmem_max);
set_sndbuf:
                sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
                sk->sk_sndbuf = max_t(int, val * 2, SOCK_MIN_SNDBUF);
                /* Wake up sending tasks if we upped the value. */
                sk->sk_write_space(sk);
                break;

        case SO_SNDBUFFORCE:
                if (!capable(CAP_NET_ADMIN)) {
                        ret = -EPERM;
                        break;
                }
                goto set_sndbuf;

Some interesting things pop out.

First of all, we see that the maximum possible value is sysctl_wmem_max, a setting which is difficult to pin down within a Docker container. We know from the context above that it is likely 212992 (half of the 425984 you got back after trying to set 4194304).
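One way to confirm it from inside the container, assuming /proc is mounted as usual, is to read the sysctl directly:

#include <cstdio>

int main() {
    /* net.core.wmem_max caps what SO_SNDBUF may request; inside a container
     * it reflects the host's (or the network namespace's) setting. */
    FILE *f = fopen("/proc/sys/net/core/wmem_max", "r");
    long wmem_max = 0;
    if (f && fscanf(f, "%ld", &wmem_max) == 1)
        printf("net.core.wmem_max = %ld\n", wmem_max);
    else
        perror("reading /proc/sys/net/core/wmem_max");
    if (f)
        fclose(f);
    return 0;
}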

Secondly, we see SOCK_SNDBUF_LOCK being set. This flag is, in my opinion, not well documented in the man pages, but it appears to lock out dynamic adjustment of the buffer size.

For example, in the function tcp_should_expand_sndbuf we get:

static bool tcp_should_expand_sndbuf(const struct sock *sk)
{
        const struct tcp_sock *tp = tcp_sk(sk);

        /* If the user specified a specific send buffer setting, do
         * not modify it.
         */
        if (sk->sk_userlocks & SOCK_SNDBUF_LOCK)
                return false;
...

So what is happening in your code? You attempt to set the maximum value as you understand it, but it is truncated to something 10x smaller by the sysctl sysctl_wmem_max. This is then made far worse by the fact that setting the option locks the buffer to that smaller size. The strange part is that dynamic resizing apparently doesn't have this same restriction and can grow the buffer all the way to the maximum.

If you look at the first code snippet above, you will see the SO_SNDBUFFORCE option. It disregards sysctl_wmem_max and allows you to set essentially any buffer size, provided you have the right permissions.
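Usage follows the same pattern as SO_SNDBUF; a minimal sketch, again assuming the connected socket s:

/* Force the full 4 MiB send buffer, bypassing wmem_max.
 * Requires CAP_NET_ADMIN; without it, setsockopt fails with EPERM. */
int forced = 4194304;
if (setsockopt(s, SOL_SOCKET, SO_SNDBUFFORCE, &forced, sizeof(forced)) == -1) {
    perror("SNDBUFFORCE failure");  /* expect EPERM in an unprivileged container */
}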

It turns out that processes launched in generic Docker containers don't have CAP_NET_ADMIN, so in order to use this socket option you must run in --privileged mode. However, if you do, and you force the maximum size, you will see your benchmark return the same throughput as not setting the option at all and letting the buffer grow dynamically to the same size.
