
How do I allocate a DMA buffer backed by 1GB HugePages in a Linux kernel module?

I'm trying to allocate a DMA buffer for an HPC workload. It requires 64GB of buffer space. In between computation, some data is offloaded to a PCIe card. Rather than copy data into a bunch of dinky 4MB buffers given by pci_alloc_consistent, I would like to just create 64 1GB buffers, backed by 1GB HugePages.

Some background info:

kernel version: CentOS 6.4 / 2.6.32-358.el6.x86_64
kernel boot options: hugepagesz=1g hugepages=64 default_hugepagesz=1g

Relevant portion of /proc/meminfo:

    AnonHugePages:         0 kB
    HugePages_Total:      64
    HugePages_Free:       64
    HugePages_Rsvd:        0
    HugePages_Surp:        0
    Hugepagesize:    1048576 kB
    DirectMap4k:         848 kB
    DirectMap2M:     2062336 kB
    DirectMap1G:   132120576 kB

I can mount -t hugetlbfs nodev /mnt/hugepages. CONFIG_HUGETLB_PAGE is true. MAP_HUGETLB is defined.

I have read some info on using libhugetlbfs to call get_huge_pages() in user space, but ideally this buffer would be allocated in kernel space. I tried calling do_mmap() with MAP_HUGETLB but it didn't seem to change the number of free hugepages, so I don't think it was actually backing the mmap with huge pages.

So I guess what I'm getting at is: is there any way I can map a buffer to a 1GB HugePage in kernel space, or does it have to be done in user space? Or does anyone know of any other way I can get an immense (1-64GB) amount of contiguous physical memory available as a kernel buffer?

This is not commonly done in kernel space, so there are not too many examples.

Just like any other page, huge pages are allocated with alloc_pages, along the lines of:

struct page *p = alloc_pages(GFP_TRANSHUGE, HPAGE_PMD_ORDER);

HPAGE_PMD_ORDER is a macro defining the order of a single huge page in terms of normal pages. The above implies that transparent huge pages are enabled in the kernel.

Then you can proceed mapping the obtained page pointer in the usual fashion with kmap().
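Putting the two calls together, a minimal kernel-side sketch might look like this. It is untested, and the helper name alloc_one_thp is my own invention, not a kernel API:

```c
#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/mm.h>

/* Hypothetical helper: allocate one transparent huge page (order
 * HPAGE_PMD_ORDER, i.e. 2MB on x86_64) and return a kernel virtual
 * address for it. */
static void *alloc_one_thp(struct page **page_out)
{
	struct page *p = alloc_pages(GFP_TRANSHUGE, HPAGE_PMD_ORDER);

	if (!p)
		return NULL;

	*page_out = p;
	/* Caveat: kmap() maps only the first 4KB page on 32-bit HIGHMEM
	 * kernels; on x86_64, page_address(p) gives a linear mapping of
	 * the whole compound page via the direct map. */
	return page_address(p);
}

/* To release: __free_pages(p, HPAGE_PMD_ORDER); */
```

Note this allocates a 2MB transparent huge page, not one of the boot-reserved 1GB hugetlb pages.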

Disclaimer: I never tried it myself, so you may have to do some experimenting. One thing to check for is this: HPAGE_PMD_SHIFT represents the order of a smaller "huge" page. If you want to use those giant 1GB pages, you will probably need to try a different order, probably PUD_SHIFT - PAGE_SHIFT.

PROBLEM

  1. Normally if you want to allocate a DMA buffer, or get a physical address, this is done in kernel space, as user code should never have to muck around with physical addresses.
  2. Hugetlbfs only provides user-space mappings to allocate 1GB huge pages, and get user-space virtual addresses.
  3. No function exists to map a user hugepage virtual address to a physical address.

EUREKA

But the function does exist! Buried deep in the 2.6 kernel source code lies this function to get a struct page from a virtual address, marked as "just for testing" and blocked out with #if 0:

/* This is just for testing -- the kernel ships it disabled with #if 0.
 * The in-tree copy uses undeclared identifiers (addr, pte, vpfn);
 * they are fixed below so the function actually compiles. */
struct page *
follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
{
    pte_t *pte;
    struct page *page;
    struct vm_area_struct *vma;

    vma = find_vma(mm, address);
    if (!vma || !is_vm_hugetlb_page(vma))
        return ERR_PTR(-EINVAL);

    pte = huge_pte_offset(mm, address);

    /* hugetlb should be locked, and hence, prefaulted */
    WARN_ON(!pte || pte_none(*pte));

    /* index into the compound page by the 4KB-page offset
     * within the huge page */
    page = &pte_page(*pte)[(address / PAGE_SIZE) % (HPAGE_SIZE / PAGE_SIZE)];

    WARN_ON(!PageHead(page));

    return page;
}

SOLUTION: Since the function above isn't actually compiled into the kernel, you will need to add it to your driver source.

USER SIDE WORKFLOW

  1. Allocate 1GB hugepages at boot with kernel boot options
  2. Call get_huge_pages() (from libhugetlbfs) to get a user-space pointer (virtual address)
  3. Pass the user virtual address (normal pointer cast to unsigned long) to the driver ioctl

KERNEL DRIVER WORKFLOW

  1. Accept user virtual address via ioctl
  2. Call follow_huge_addr to get the struct page*
  3. Call page_to_phys on the struct page* to get the physical address
  4. Provide physical address to device for DMA
  5. Call kmap on the struct page* if you also want a kernel virtual pointer
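Gluing those driver-side steps together might look roughly like this. The ioctl plumbing and the name mydev_ioctl are hypothetical, follow_huge_addr is the resurrected function above, and mmap_sem matches the 2.6-era API (it later became mmap_lock):

```c
#include <linux/fs.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/sched.h>
#include <asm/io.h>

/* Hypothetical ioctl handler: user passes the huge-page virtual address */
static long mydev_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
	unsigned long uaddr = arg;
	struct page *page;
	phys_addr_t phys;
	void *kvaddr;

	down_read(&current->mm->mmap_sem);
	page = follow_huge_addr(current->mm, uaddr, 1); /* steps 1-2 */
	up_read(&current->mm->mmap_sem);

	if (IS_ERR_OR_NULL(page))
		return -EINVAL;

	phys = page_to_phys(page); /* step 3: hand this to the device */
	kvaddr = kmap(page);       /* step 5: optional kernel pointer  */

	/* step 4: ... program the device's DMA engine with phys ... */

	kunmap(page);
	return 0;
}
```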

DISCLAIMER

  • The above steps are being recollected several years later. I have lost access to the original source code. Do your due diligence and make sure I'm not forgetting a step.
  • The only reason this works is because 1GB huge pages are allocated at boot time and their physical addresses are permanently locked. Don't try to map a user virtual address that is not backed by a 1GB hugepage into a DMA physical address! You're going to have a bad time!
  • Test carefully on your system to confirm that your 1GB huge pages are in fact locked in physical memory and that everything is working exactly as expected. This code worked flawlessly on my setup, but there is great danger here if something goes wrong.
  • This code is only guaranteed to work on x86/x64 architecture (where physical address == bus address), and on kernel version 2.6.XX. There may be an easier way to do this on later kernel versions, or it may be completely impossible now.

This function returns the correct kernel-space virtual address when given the physical address of a user-space buffer allocated from huge pages:

static inline void * phys_to_virt(unsigned long address)

Look for the function in the kernel code; it has been tested with DPDK and a kernel module.
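For example, inside a module, assuming phys is the physical address of a hugetlbfs-backed buffer obtained in user space (e.g. with DPDK's rte_mem_virt2phy) and handed in via ioctl. Note that phys_to_virt() is only valid for memory covered by the kernel's direct map, which boot-reserved huge pages on x86_64 are:

```c
#include <asm/io.h> /* phys_to_virt() */

/* 'phys' is assumed to come from user space via ioctl */
void *kvaddr = phys_to_virt(phys); /* kernel virtual address of the buffer */
```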
