Pinning a user-space buffer for DMA from the Linux kernel
I'm writing a driver for devices that produce around 1 GB of data per second. Because of that I decided to map the user buffer allocated by the application directly for DMA, instead of copying through an intermediate kernel buffer.
The code works, more or less. But during long-run stress testing I see kernel oopses with "bad page state", triggered by unrelated applications (for instance updatedb), probably when the kernel wants to swap some pages:
[21743.515404] BUG: Bad page state in process PmStabilityTest pfn:357518
[21743.521992] page:ffffdf844d5d4600 count:19792158 mapcount:0 mapping: (null) index:0x12b011e012d0132
[21743.531829] flags: 0x119012c01220124(referenced|lru|slab|reclaim|uncached|idle)
[21743.539138] raw: 0119012c01220124 0000000000000000 012b011e012d0132 012e011e011e0111
[21743.546899] raw: 0000000000000000 012101300131011c 0000000000000000 012101240123012b
[21743.554638] page dumped because: page still charged to cgroup
[21743.560383] page->mem_cgroup:012101240123012b
[21743.564745] bad because of flags: 0x120(lru|slab)
[21743.569555] BUG: Bad page state in process PmStabilityTest pfn:357519
[21743.576098] page:ffffdf844d5d4640 count:18219302 mapcount:18940179 mapping: (null) index:0x0
[21743.585318] flags: 0x0()
[21743.587859] raw: 0000000000000000 0000000000000000 0000000000000000 0116012601210112
[21743.595599] raw: 0000000000000000 011301310127012f 0000000000000000 012f011d010d011a
[21743.603336] page dumped because: page still charged to cgroup
[21743.609108] page->mem_cgroup:012f011d010d011a
...
Entering kdb (current=0xffff8948189b2d00, pid 6387) on processor 6 Oops: (null)
due to oops @ 0xffffffff9c87f469
CPU: 6 PID: 6387 Comm: updatedb.mlocat Tainted: G B OE 4.10.0-42-generic #46~16.04.1-Ubuntu
...
Details:
The user buffer consists of frames, and neither the buffer nor the frames are page-aligned. The frames in the buffer are used in a circular manner for "infinite" live data transfers. For each frame I get the memory pages via get_user_pages_fast, then convert them to a scatter-gather table with sg_alloc_table_from_pages, and finally map them for DMA using dma_map_sg.
I rely on sg_alloc_table_from_pages to coalesce consecutive pages into one DMA descriptor, to reduce the size of the S/G table sent to the device. The devices are custom built and utilize an FPGA. I took inspiration from many drivers doing similar mapping, especially the i915 and radeon video drivers, but no one has all the stuff in one place, so I might have overlooked something.
Related functions (pin_user_buffer and unpin_user_buffer are called from separate IOCTLs):
static int pin_user_frame(struct my_dev *cam, struct udma_frame *frame)
{
    const unsigned long bytes = cam->acq_frame_bytes;
    const unsigned long first =
        ( frame->uaddr              & PAGE_MASK) >> PAGE_SHIFT;
    const unsigned long last =
        ((frame->uaddr + bytes - 1) & PAGE_MASK) >> PAGE_SHIFT;
    const unsigned long offset =
        frame->uaddr & ~PAGE_MASK;
    int nr_pages = last - first + 1;
    int err;
    int n;
    struct page **pages;
    struct sg_table *sgt;

    if (frame->uaddr + bytes < frame->uaddr) {
        pr_err("%s: attempted user buffer overflow!\n", __func__);
        return -EINVAL;
    }

    if (bytes == 0) {
        pr_err("%s: user buffer has zero bytes\n", __func__);
        return -EINVAL;
    }

    pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL | __GFP_ZERO);
    if (!pages) {
        pr_err("%s: can't allocate udma_frame.pages\n", __func__);
        return -ENOMEM;
    }

    sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
    if (!sgt) {
        pr_err("%s: can't allocate udma_frame.sgt\n", __func__);
        err = -ENOMEM;
        goto err_alloc_sgt;
    }

    /* (rw == READ) means read from device, write into memory area */
    err = get_user_pages_fast(frame->uaddr, nr_pages, READ == READ, pages);
    if (err < nr_pages) {
        if (err > 0) {
            pr_err("%s: can't pin all %d user pages, got %d\n",
                   __func__, nr_pages, err);
            nr_pages = err; /* release only the pages actually pinned */
            err = -EFAULT;
        } else {
            pr_err("%s: can't pin user pages\n", __func__);
            nr_pages = 0;
            if (err == 0)
                err = -EFAULT;
        }
        goto err_get_pages;
    }

    for (n = 0; n < nr_pages; ++n)
        flush_dcache_page(pages[n]); //<--- Is this needed?

    err = sg_alloc_table_from_pages(sgt, pages, nr_pages, offset, bytes,
                                    GFP_KERNEL);
    if (err) {
        pr_err("%s: can't build sg_table for %d pages\n",
               __func__, nr_pages);
        goto err_alloc_sgt2;
    }

    if (!dma_map_sg(&cam->pci_dev->dev, sgt->sgl, sgt->nents,
                    DMA_FROM_DEVICE)) {
        pr_err("%s: can't map %u sg_table entries for DMA\n",
               __func__, sgt->nents);
        err = -ENOMEM;
        goto err_dma_map;
    }

    frame->pages = pages;
    frame->nr_pages = nr_pages;
    frame->sgt = sgt;
    return 0;

err_dma_map:
    sg_free_table(sgt);
err_alloc_sgt2:
err_get_pages:
    for (n = 0; n < nr_pages; ++n)
        put_page(pages[n]);
    kfree(sgt);
err_alloc_sgt:
    kfree(pages);
    return err;
}
static void unpin_user_frame(struct my_dev *cam, struct udma_frame *frame)
{
    int n;

    dma_unmap_sg(&cam->pci_dev->dev, frame->sgt->sgl, frame->sgt->nents,
                 DMA_FROM_DEVICE);
    sg_free_table(frame->sgt);
    kfree(frame->sgt);
    frame->sgt = NULL;

    for (n = 0; n < frame->nr_pages; ++n) {
        struct page *page = frame->pages[n];
        set_page_dirty_lock(page);
        mark_page_accessed(page); //<--- Without this the Oops are more frequent
        put_page(page);
    }

    kfree(frame->pages);
    frame->pages = NULL;
    frame->nr_pages = 0;
}
static void unpin_user_buffer(struct my_dev *cam)
{
    if (cam->udma_frames) {
        int n;
        for (n = 0; n < cam->udma_frame_count; ++n)
            unpin_user_frame(cam, &cam->udma_frames[n]);
        kfree(cam->udma_frames);
        cam->udma_frames = NULL;
    }
    cam->udma_frame_count = 0;
    cam->udma_buffer_bytes = 0;
    cam->udma_buffer = NULL;
    cam->udma_desc_count = 0;
}
static int pin_user_buffer(struct my_dev *cam)
{
    int err;
    int n;
    const u32 acq_frame_count = cam->acq_buffer_bytes / cam->acq_frame_bytes;
    struct udma_frame *udma_frames;
    u32 udma_desc_count = 0;

    if (!cam->acq_buffer) {
        pr_err("%s: user buffer is NULL!\n", __func__);
        return -EFAULT;
    }

    if (cam->udma_buffer == cam->acq_buffer
        && cam->udma_buffer_bytes == cam->acq_buffer_bytes
        && cam->udma_frame_count == acq_frame_count)
        return 0;

    if (cam->udma_buffer)
        unpin_user_buffer(cam);

    udma_frames = kcalloc(acq_frame_count, sizeof(*udma_frames),
                          GFP_KERNEL | __GFP_ZERO);
    if (!udma_frames) {
        pr_err("%s: can't allocate udma_frame array for %u frames\n",
               __func__, acq_frame_count);
        return -ENOMEM;
    }

    for (n = 0; n < acq_frame_count; ++n) {
        struct udma_frame *frame = &udma_frames[n];

        frame->uaddr =
            (unsigned long)(cam->acq_buffer + n * cam->acq_frame_bytes);
        err = pin_user_frame(cam, frame);
        if (err) {
            pr_err("%s: can't pin frame %d (out of %u)\n",
                   __func__, n + 1, acq_frame_count);
            for (--n; n >= 0; --n)
                unpin_user_frame(cam, &udma_frames[n]);
            kfree(udma_frames);
            return err;
        }
        udma_desc_count += frame->sgt->nents; /* Cannot overflow */
    }

    pr_debug("%s: total udma_desc_count=%u\n", __func__, udma_desc_count);

    cam->udma_buffer = cam->acq_buffer;
    cam->udma_buffer_bytes = cam->acq_buffer_bytes;
    cam->udma_frame_count = acq_frame_count;
    cam->udma_frames = udma_frames;
    cam->udma_desc_count = udma_desc_count;
    return 0;
}
Related structures:
struct udma_frame {
    unsigned long    uaddr;    /* User address of the frame */
    int              nr_pages; /* Nr. of pages covering the frame */
    struct page    **pages;    /* Actual pages covering the frame */
    struct sg_table *sgt;      /* S/G table describing the frame */
};

struct my_dev {
    ...
    u8 __user *acq_buffer;          /* User-space buffer received via IOCTL */
    ...
    u8 __user *udma_buffer;         /* User-space buffer for image */
    u32        udma_buffer_bytes;   /* Total image size in bytes */
    u32        udma_frame_count;    /* Nr. of items in udma_frames */
    struct udma_frame
              *udma_frames;         /* DMA descriptors per frame */
    u32        udma_desc_count;     /* Total nr. of DMA descriptors */
    ...
};
Questions:
1. Should I use DMA_FROM_DEVICE, or rather DMA_BIDIRECTIONAL just in case?
2. Do I need to use something like SetPageReserved/ClearPageReserved or mark_page_reserved/free_reserved_page on the pinned pages?
3. Why does e.g. the i915 driver not use sg_alloc_table_from_pages if swiotlb is active?
4. What is the difference between the set_page_dirty, set_page_dirty_lock and SetPageDirty functions?
Thanks for any hint.
PS: I cannot change the way the application gets the data without breaking our library API, which has been maintained for many years. So please do not advise e.g. to mmap a kernel buffer...
Why do you put "READ == READ" as the third parameter? You need to put the gup flags there:
err = get_user_pages_fast(frame->uaddr, nr_pages, READ == READ, pages);
You need to pass FOLL_LONGTERM here, and FOLL_PIN is set internally by the pin_user_pages_fast() variant. See https://www.kernel.org/doc/html/latest/core-api/pin_user_pages.html#case-2-rdma
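On recent kernels (5.6+) the pinning and releasing steps could look roughly like the following sketch. This uses the pin_user_pages API rather than the question's get_user_pages_fast/put_page pair; the helper names pin_frame_pages/unpin_frame_pages are made up for illustration, and error handling is minimal:

```c
#include <linux/mm.h>

/* Sketch: long-term pinning for DMA (kernel >= 5.6, assumed names).
 * FOLL_LONGTERM tells the kernel the pages stay pinned across syscalls;
 * FOLL_PIN is added internally by pin_user_pages_fast() itself. */
static int pin_frame_pages(unsigned long uaddr, int nr_pages,
                           struct page **pages)
{
    int pinned = pin_user_pages_fast(uaddr, nr_pages,
                                     FOLL_WRITE | FOLL_LONGTERM, pages);
    if (pinned < 0)
        return pinned;                    /* errno from the pinning path */
    if (pinned != nr_pages) {
        unpin_user_pages(pages, pinned);  /* partial pin: undo */
        return -EFAULT;
    }
    return 0;
}

/* Release: marks the pages dirty (the device wrote into them) and unpins
 * them, replacing the set_page_dirty_lock()/put_page() loop above. */
static void unpin_frame_pages(struct page **pages, int nr_pages)
{
    unpin_user_pages_dirty_lock(pages, nr_pages, true);
}
```

The FOLL_PIN accounting done by these helpers is exactly what distinguishes a long-lived DMA pin from an ordinary get_user_pages reference, which is the failure mode the oops points at.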
In addition, you need to take care of CPU/device memory coherence. Just call dma_sync_sg_for_device(...) before the DMA transfer and dma_sync_sg_for_cpu(...) after it.
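Applied to the question's structures, that hand-off could be sketched like this (cam and frame fields as defined above; the transfer itself is elided):

```c
/* Sketch: buffer ownership hand-off around one DMA_FROM_DEVICE transfer.
 * Give the buffer to the device before starting it... */
dma_sync_sg_for_device(&cam->pci_dev->dev, frame->sgt->sgl,
                       frame->sgt->nents, DMA_FROM_DEVICE);

/* ... start the transfer and wait for its completion ... */

/* ...then hand it back so the CPU sees the freshly DMA'd data. */
dma_sync_sg_for_cpu(&cam->pci_dev->dev, frame->sgt->sgl,
                    frame->sgt->nents, DMA_FROM_DEVICE);
```

On coherent platforms these calls may be cheap no-ops, but with swiotlb or non-coherent caches they perform the bounce-buffer copy or cache maintenance the transfer depends on.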