Segmentation fault when using a non-NULL pointer

Question

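I ran into the strange problem described in the title while using DPDK.

When I call rte_pktmbuf_alloc(struct rte_mempool *), having already verified that the value returned by rte_pktmbuf_pool_create() is not NULL, the process receives a segmentation fault.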

The following is the gdb output, stopped inside the DPDK source code:

Thread 1 "osw" received signal SIGSEGV, Segmentation fault.
0x00000000005e9f41 in __mempool_generic_get (cache=0x1a7dfc000000000, n=1, obj_table=0x7fffffffdec8, mp=0x101a7df00) at /root/dpdk-20.05/x86_64-native-linuxapp-gcc/include/rte_mempool.h:1449
1449            if (unlikely(cache == NULL || n >= cache->size))
(gdb) p cache
$1 = (struct rte_mempool_cache *) 0x1a7dfc000000000
(gdb) bt
#0  0x00000000005e9f41 in __mempool_generic_get (cache=0x1a7dfc000000000, n=1, obj_table=0x7fffffffdeb8, mp=0x101a7df00)
    at /root/dpdk-20.05/x86_64-native-linuxapp-gcc/include/rte_mempool.h:1449
#1  rte_mempool_generic_get (cache=0x1a7dfc000000000, n=1, obj_table=0x7fffffffdeb8, mp=0x101a7df00)
    at /root/dpdk-20.05/x86_64-native-linuxapp-gcc/include/rte_mempool.h:1517
#2  rte_mempool_get_bulk (n=1, obj_table=0x7fffffffdeb8, mp=0x101a7df00)
    at /root/dpdk-20.05/x86_64-native-linuxapp-gcc/include/rte_mempool.h:1552
#3  rte_mempool_get (obj_p=0x7fffffffdeb8, mp=0x101a7df00) at /root/dpdk-20.05/x86_64-native-linuxapp-gcc/include/rte_mempool.h:1578
#4  rte_mbuf_raw_alloc (mp=0x101a7df00) at /root/dpdk-20.05/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:586
#5  rte_pktmbuf_alloc (mp=0x101a7df00) at /root/dpdk-20.05/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:896

I dug into rte_mempool.h and changed lines 1449-1450

1449  if (unlikely(cache == NULL || n >= cache->size))
1450         goto ring_dequeue;

to

1449  if (unlikely(cache == NULL))
1450          goto ring_dequeue;
1451  if (unlikely(n >= cache->size))
1452          goto ring_dequeue;

It still crashes, now at line 1451.

gdb output after the change:

Thread 1 "osw" received signal SIGSEGV, Segmentation fault.
__mempool_generic_get (cache=0x1a7dfc000000000, n=1, obj_table=0x7fffffffdeb8, mp=0x101a7df00)
    at /root/dpdk-20.05/x86_64-native-linuxapp-gcc/include/rte_mempool.h:1451
1451            if (unlikely(n >= cache->size))
(gdb) p cache
$1 = (struct rte_mempool_cache *) 0x1a7dfc000000000
(gdb) bt
#0  __mempool_generic_get (cache=0x1a7dfc000000000, n=1, obj_table=0x7fffffffdeb8, mp=0x101a7df00)
    at /root/dpdk-20.05/x86_64-native-linuxapp-gcc/include/rte_mempool.h:1451
#1  rte_mempool_generic_get (cache=0x1a7dfc000000000, n=1, obj_table=0x7fffffffdeb8, mp=0x101a7df00)
    at /root/dpdk-20.05/x86_64-native-linuxapp-gcc/include/rte_mempool.h:1519
#2  rte_mempool_get_bulk (n=1, obj_table=0x7fffffffdeb8, mp=0x101a7df00)
    at /root/dpdk-20.05/x86_64-native-linuxapp-gcc/include/rte_mempool.h:1554
#3  rte_mempool_get (obj_p=0x7fffffffdeb8, mp=0x101a7df00) at /root/dpdk-20.05/x86_64-native-linuxapp-gcc/include/rte_mempool.h:1580
#4  rte_mbuf_raw_alloc (mp=0x101a7df00) at /root/dpdk-20.05/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:586
#5  rte_pktmbuf_alloc (mp=0x101a7df00) at /root/dpdk-20.05/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:896
#6  main (argc=<optimized out>, argv=<optimized out>) at ofpd.c:150
(gdb) p cache->size
Cannot access memory at address 0x1a7dfc000000000

It looks like the address stored in the "cache" pointer is not NULL, yet it cannot be dereferenced, as if it were a NULL pointer.

I don't know why the "cache" pointer value is non-zero in its high 4 bytes and zero in its low 4 bytes.

The DPDK version is 20.05; I also tried 18.11 and 19.11.

The OS is CentOS 8.1 with kernel 4.18.0-147.el8.x86_64.

The CPU is an AMD EPYC 7401P.

#define                 RING_SIZE       16384
#define                 NUM_MBUFS       8191
#define                 MBUF_CACHE_SIZE 512

int main(int argc, char **argv)
{
    int             ret;
    uint16_t        portid;
    unsigned        cpu_id = 1;
    struct rte_mempool  *tmp;

    int arg = rte_eal_init(argc, argv);
    if (arg < 0) 
        rte_exit(EXIT_FAILURE, "Cannot init EAL: %s\n", rte_strerror(rte_errno));
    if (rte_lcore_count() < 10)
        rte_exit(EXIT_FAILURE, "We need at least 10 cores.\n");
    argc -= arg;
    argv += arg;

    /* Creates a new mempool in memory to hold the mbufs. */
    tmp = rte_pktmbuf_pool_create("TMP", NUM_MBUFS, MBUF_CACHE_SIZE, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (tmp == NULL)
        rte_exit(EXIT_FAILURE, "Cannot create mbuf pool, %s\n", rte_strerror(rte_errno));
    printf("tmp addr = %x\n", tmp);
    struct rte_mbuf *test = rte_pktmbuf_alloc(tmp);
    rte_exit(EXIT_FAILURE, "end\n");
}

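I have also run into the same problem with the pointers returned by getifaddrs(): they also cause a segmentation fault, and I had to shift the pointer value like this:

ifa->ifa_addr = (struct sockaddr *)((uintptr_t)(ifa->ifa_addr) >> 32);

After that it works.

So I don't think this is a DPDK-specific problem.

Has anyone seen this problem before?

Thanks.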

Answer 1

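I was able to run your code without any errors after making these changes:

included the headers
removed the unused variables
added a NULL check on the return value of rte_pktmbuf_alloc()

Tested on: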

CPU: Intel(R) Xeon(R) CPU E5-2699
OS: 4.15.0-101-generic
GCC: 7.5.0
DPDK version: 19.11.2, dpdk mainline
Library mode: static

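Code:

#include <stdio.h>
#include <stdlib.h>

#include <rte_debug.h>
#include <rte_eal.h>
#include <rte_errno.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

#define NUM_MBUFS       8191
#define MBUF_CACHE_SIZE 512

int main(int argc, char **argv)
{
    int             ret = 0;
    struct rte_mempool  *tmp;

    /* Initialise the EAL; it consumes its own command-line arguments. */
    int arg = rte_eal_init(argc, argv);
    if (arg < 0)
        rte_exit(EXIT_FAILURE, "Cannot init EAL: %s\n", rte_strerror(rte_errno));
    if (rte_lcore_count() < 10)
        rte_exit(EXIT_FAILURE, "We need at least 10 cores.\n");
    argc -= arg;
    argv += arg;

    /* Creates a new mempool in memory to hold the mbufs. */
    tmp = rte_pktmbuf_pool_create("TMP", NUM_MBUFS, MBUF_CACHE_SIZE, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (tmp == NULL)
        rte_exit(EXIT_FAILURE, "Cannot create mbuf pool, %s\n", rte_strerror(rte_errno));
    printf("tmp addr = %p\n", (void *)tmp);   /* %p, not %x, for pointers */

    struct rte_mbuf *test = rte_pktmbuf_alloc(tmp);
    if (test == NULL)
        rte_exit(EXIT_FAILURE, "Cannot allocate mbuf\n");

    /* Return the mbuf to its pool before exiting. */
    rte_pktmbuf_free(test);
    return ret;
}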

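[EDIT-1] Based on Brydon Gibson's comment

Notes:

Since I don't have access to your code base or a complete working code snippet, my suggestion is to take any example code from DPDK/examples/l2fwd or DPDK/examples/skeleton and copy its headers to compile.
I assume the author THE and Brydon are different people and may be facing similar problems on different code bases.
The question states that the error is reproduced with this code snippet on DPDK versions 20.05, 18.11 and 19.11.
This answer shows that the same code snippet works when linked against the static libraries.

@BrydonGibson, please open a separate ticket with the relevant information and environment details, since your setup may differ.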

Answer 1: score 0, accepted, 2020-06-15 06:24:17