
Pinning user space buffer for DMA from Linux kernel

I'm writing a driver for devices that produce around 1 GB of data per second. Because of that I decided to map the user buffer allocated by the application directly for DMA instead of copying through an intermediate kernel buffer.

The code works, more or less. But during long-run stress testing I see kernel oopses with "bad page state", triggered in unrelated applications (for instance updatedb), probably when the kernel wants to swap some pages out:

[21743.515404] BUG: Bad page state in process PmStabilityTest  pfn:357518
[21743.521992] page:ffffdf844d5d4600 count:19792158 mapcount:0 mapping:          (null) index:0x12b011e012d0132
[21743.531829] flags: 0x119012c01220124(referenced|lru|slab|reclaim|uncached|idle)
[21743.539138] raw: 0119012c01220124 0000000000000000 012b011e012d0132 012e011e011e0111
[21743.546899] raw: 0000000000000000 012101300131011c 0000000000000000 012101240123012b
[21743.554638] page dumped because: page still charged to cgroup
[21743.560383] page->mem_cgroup:012101240123012b
[21743.564745] bad because of flags: 0x120(lru|slab)
[21743.569555] BUG: Bad page state in process PmStabilityTest  pfn:357519
[21743.576098] page:ffffdf844d5d4640 count:18219302 mapcount:18940179 mapping:          (null) index:0x0
[21743.585318] flags: 0x0()
[21743.587859] raw: 0000000000000000 0000000000000000 0000000000000000 0116012601210112
[21743.595599] raw: 0000000000000000 011301310127012f 0000000000000000 012f011d010d011a
[21743.603336] page dumped because: page still charged to cgroup
[21743.609108] page->mem_cgroup:012f011d010d011a
...
Entering kdb (current=0xffff8948189b2d00, pid 6387) on processor 6 Oops: (null)
due to oops @ 0xffffffff9c87f469
CPU: 6 PID: 6387 Comm: updatedb.mlocat Tainted: G    B      OE   4.10.0-42-generic #46~16.04.1-Ubuntu
...

Details:

The user buffer consists of frames, and neither the buffer nor the frames are page-aligned. The frames in the buffer are used in a circular manner for "infinite" live data transfers. For each frame I get the memory pages via get_user_pages_fast, then convert them to a scatter-gather table with sg_alloc_table_from_pages, and finally map them for DMA using dma_map_sg.

I rely on sg_alloc_table_from_pages to bind consecutive pages into one DMA descriptor, to reduce the size of the S/G table sent to the device. The devices are custom built and utilize an FPGA. I took inspiration from many drivers doing similar mapping, especially the i915 and radeon video drivers, but none of them has all the pieces in one place, so I might have overlooked something.
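
As a sanity check (this helper is mine, not part of the driver), a minimal sketch of how one could verify how much coalescing sg_alloc_table_from_pages actually achieved, by walking the CPU-side entries with for_each_sg before dma_map_sg:

static void dump_sg_coalescing(struct sg_table *sgt)
{
        struct scatterlist *sg;
        int i;

        /* Each entry's length tells how many physically contiguous
         * pages were merged into a single S/G entry. */
        for_each_sg(sgt->sgl, sg, sgt->nents, i)
                pr_debug("entry %d: offset=%u length=%u (~%lu pages)\n",
                         i, sg->offset, sg->length,
                         DIV_ROUND_UP(sg->offset + sg->length, PAGE_SIZE));
}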

Related functions (pin_user_buffer and unpin_user_buffer are called upon separate IOCTLs; a sketch of that dispatch follows pin_user_buffer below):

static int pin_user_frame(struct my_dev *cam, struct udma_frame *frame)
{
        const unsigned long bytes = cam->acq_frame_bytes;
        const unsigned long first =
                ( frame->uaddr              &  PAGE_MASK) >> PAGE_SHIFT;
        const unsigned long last =
                ((frame->uaddr + bytes - 1) &  PAGE_MASK) >> PAGE_SHIFT;
        const unsigned long offset =
                  frame->uaddr              & ~PAGE_MASK;
        int nr_pages = last - first + 1;
        int err;
        int n;
        struct page **pages;
        struct sg_table *sgt;

        if (frame->uaddr + bytes < frame->uaddr) {
                pr_err("%s: attempted user buffer overflow!\n", __func__);
                return -EINVAL;
        }

        if (bytes == 0) {
                pr_err("%s: user buffer has zero bytes\n", __func__);
                return -EINVAL;
        }

        pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL); /* kcalloc already zeroes */
        if (!pages) {
                pr_err("%s: can't allocate udma_frame.pages\n", __func__);
                return -ENOMEM;
        }

        sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
        if (!sgt) {
                pr_err("%s: can't allocate udma_frame.sgt\n", __func__);
                err = -ENOMEM;
                goto err_alloc_sgt;
        }

        /* (rw == READ) means read from device, write into memory area */
        err = get_user_pages_fast(frame->uaddr, nr_pages, READ == READ, pages);
        if (err < nr_pages) {
                if (err > 0) {
                        pr_err("%s: can't pin all %d user pages, got %d\n",
                               __func__, nr_pages, err);
                        nr_pages = err; /* release only the pages we got */
                        err = -EFAULT;
                } else {
                        pr_err("%s: can't pin user pages\n", __func__);
                        nr_pages = 0;
                        if (err == 0) /* don't report success on failure */
                                err = -EFAULT;
                }
                goto err_get_pages;
        }

        for (n = 0; n < nr_pages; ++n)
                flush_dcache_page(pages[n]); //<--- Is this needed?

        err = sg_alloc_table_from_pages(sgt, pages, nr_pages, offset, bytes,
                                        GFP_KERNEL);
        if (err) {
                pr_err("%s: can't build sg_table for %d pages\n",
                       __func__, nr_pages);
                goto err_alloc_sgt2;
        }

        if (!dma_map_sg(&cam->pci_dev->dev, sgt->sgl, sgt->nents, DMA_FROM_DEVICE)) {
                pr_err("%s: can't map %u sg_table entries for DMA\n",
                       __func__, sgt->nents);
                err = -ENOMEM;
                goto err_dma_map;
        }

        frame->pages = pages;
        frame->nr_pages = nr_pages;
        frame->sgt = sgt;

        return 0;

err_dma_map:
        sg_free_table(sgt);

err_alloc_sgt2:
err_get_pages:
        for (n = 0; n < nr_pages; ++n)
                put_page(pages[n]);
        kfree(sgt);

err_alloc_sgt:
        kfree(pages);

        return err;
}

static void unpin_user_frame(struct my_dev *cam, struct udma_frame *frame)
{
        int n;

        dma_unmap_sg(&cam->pci_dev->dev, frame->sgt->sgl, frame->sgt->nents,
                     DMA_FROM_DEVICE);

        sg_free_table(frame->sgt);
        kfree(frame->sgt);
        frame->sgt = NULL;

        for (n = 0; n < frame->nr_pages; ++n) {
                struct page *page = frame->pages[n];
                set_page_dirty_lock(page);
                mark_page_accessed(page); //<--- Without this the Oops are more frequent
                put_page(page);
        }
        kfree(frame->pages);
        frame->pages = NULL;

        frame->nr_pages = 0;
}

static void unpin_user_buffer(struct my_dev *cam)
{
        if (cam->udma_frames) {
                int n;
                for (n = 0; n < cam->udma_frame_count; ++n)
                        unpin_user_frame(cam, &cam->udma_frames[n]);
                kfree(cam->udma_frames);
                cam->udma_frames = NULL;
        }
        cam->udma_frame_count = 0;
        cam->udma_buffer_bytes = 0;
        cam->udma_buffer = NULL;
        cam->udma_desc_count = 0;
}

static int pin_user_buffer(struct my_dev *cam)
{
        int err;
        int n;
        const u32 acq_frame_count = cam->acq_buffer_bytes / cam->acq_frame_bytes;
        struct udma_frame *udma_frames;
        u32 udma_desc_count = 0;

        if (!cam->acq_buffer) {
                pr_err("%s: user buffer is NULL!\n", __func__);
                return -EFAULT;
        }

        if (cam->udma_buffer == cam->acq_buffer
            && cam->udma_buffer_bytes == cam->acq_buffer_bytes
            && cam->udma_frame_count == acq_frame_count)
                return 0;

        if (cam->udma_buffer)
                unpin_user_buffer(cam);

        udma_frames = kcalloc(acq_frame_count, sizeof(*udma_frames),
                              GFP_KERNEL); /* kcalloc already zeroes */
        if (!udma_frames) {
                pr_err("%s: can't allocate udma_frame array for %u frames\n",
                       __func__, acq_frame_count);
                return -ENOMEM;
        }

        for (n = 0; n < acq_frame_count; ++n) {
                struct udma_frame *frame = &udma_frames[n];
                frame->uaddr =
                        (unsigned long)(cam->acq_buffer + n * cam->acq_frame_bytes);
                err = pin_user_frame(cam, frame);
                if (err) {
                        pr_err("%s: can't pin frame %d (out of %u)\n",
                               __func__, n + 1, acq_frame_count);
                        for (--n; n >= 0; --n)
                                unpin_user_frame(cam, &udma_frames[n]);
                        kfree(udma_frames);
                        return err;
                }
                udma_desc_count += frame->sgt->nents; /* Cannot overflow */
        }
        pr_debug("%s: total udma_desc_count=%u\n", __func__, udma_desc_count);

        cam->udma_buffer = cam->acq_buffer;
        cam->udma_buffer_bytes = cam->acq_buffer_bytes;
        cam->udma_frame_count = acq_frame_count;
        cam->udma_frames = udma_frames;
        cam->udma_desc_count = udma_desc_count;

        return 0;
}
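
For context, a minimal sketch of how these two entry points might be wired up; the handler name and IOCTL numbers (MY_IOC_PIN, MY_IOC_UNPIN) are hypothetical, not from the actual driver:

static long my_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
        struct my_dev *cam = filp->private_data;

        switch (cmd) {
        case MY_IOC_PIN:                /* hypothetical IOCTL number */
                return pin_user_buffer(cam);
        case MY_IOC_UNPIN:              /* hypothetical IOCTL number */
                unpin_user_buffer(cam);
                return 0;
        default:
                return -ENOTTY;
        }
}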

Related structures:

struct udma_frame {
        unsigned long   uaddr;      /* User address of the frame */
        int             nr_pages;   /* Nr. of pages covering the frame */
        struct page     **pages;    /* Actual pages covering the frame */
        struct sg_table *sgt;       /* S/G table describing the frame */
};

struct my_dev {
        ...
        u8 __user   *acq_buffer;   /* User-space buffer received via IOCTL */
        ...
        u8 __user   *udma_buffer;       /* User-space buffer for image */
        u32         udma_buffer_bytes;  /* Total image size in bytes */
        u32         udma_frame_count;   /* Nr. of items in udma_frames */
        struct udma_frame
                    *udma_frames;       /* DMA descriptors per frame */
        u32         udma_desc_count;    /* Total nr. of DMA descriptors */
        ...
};

Questions:

  1. How do I properly pin user buffer pages and mark them as not movable?
  2. If one frame ends and the next frame starts in the same page, is it correct to handle it as two independent pages, i.e. pin the page twice?
  3. The data comes from the device to the user buffer and the app is not supposed to write to its buffer, but I have no control over that. Can I use DMA_FROM_DEVICE, or should I rather use DMA_BIDIRECTIONAL just in case?
  4. Do I need to use something like SetPageReserved/ClearPageReserved or mark_page_reserved/free_reserved_page?
  5. Is IOMMU/swiotlb somehow involved? E.g. the i915 driver doesn't use sg_alloc_table_from_pages if swiotlb is active.
  6. What is the difference between the set_page_dirty, set_page_dirty_lock and SetPageDirty functions?

Thanks for any hint.

PS: I cannot change the way the application gets the data without breaking our library API, which has been maintained for many years. So please do not advise e.g. to mmap a kernel buffer...

Why do you pass "READ == READ" as the third parameter? You need to put a flag there.

err = get_user_pages_fast(frame->uaddr, nr_pages, READ == READ, pages);

You need to pass FOLL_LONGTERM there; better yet, use pin_user_pages_fast(), which sets FOLL_PIN internally. See https://www.kernel.org/doc/html/latest/core-api/pin_user_pages.html#case-2-rdma
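
A minimal sketch of the change, assuming a kernel recent enough to have pin_user_pages_fast() (v5.6+); FOLL_WRITE is set because the device writes into the pages:

        err = pin_user_pages_fast(frame->uaddr, nr_pages,
                                  FOLL_WRITE | FOLL_LONGTERM, pages);

On teardown, the set_page_dirty_lock()/put_page() loop in unpin_user_frame() would then collapse to:

        unpin_user_pages_dirty_lock(frame->pages, frame->nr_pages, true);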

In addition, you need to take care of CPU and device memory coherence. Just call dma_sync_sg_for_device(...) before the DMA transfer, and dma_sync_sg_for_cpu(...) after the DMA transfer.
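
A minimal sketch of that ownership hand-off around one frame (the function name is hypothetical; programming the FPGA and waiting for completion are elided):

static void my_transfer_frame(struct my_dev *cam, struct udma_frame *frame)
{
        struct device *dev = &cam->pci_dev->dev;

        /* Hand the buffer to the device before starting the DMA. */
        dma_sync_sg_for_device(dev, frame->sgt->sgl, frame->sgt->nents,
                               DMA_FROM_DEVICE);

        /* ... program the device and wait for the completion IRQ ... */

        /* Hand the buffer back to the CPU before user space reads it. */
        dma_sync_sg_for_cpu(dev, frame->sgt->sgl, frame->sgt->nents,
                            DMA_FROM_DEVICE);
}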
