Bypassing 4KB block size limitation on block layer/device

We are developing an SSD-type storage hardware device that can take read/write requests for block sizes larger than 4KB at a time (even in the MB range). My understanding is that Linux and its filesystems will "chop" files into 4KB blocks that are passed to the block device driver, which then has to physically fill each block with data on the device (e.g., for a write).

I am also aware that the kernel page size plays a role in this limitation, as it is set to 4KB.

As an experiment, I want to find out whether there is a way to actually increase this block size, so that we save some time (instead of doing multiple 4KB writes, we could do one write with a bigger block size).

Is there any FS or existing project I can look at for this? If not, what would this experiment require: which parts of Linux would need to be modified? I am trying to gauge the level of difficulty and the resources needed. Or is it simply impossible, and/or is there a reason we do not even need to do so? Any comment is appreciated.

Thanks.

The 4k limitation is due to the page cache. The main issue is this: if you have a 4k page size but a 32k block size, what happens if the file is only 2000 bytes long, so you only allocate a 4k page to cover the first 4k of the block? Now someone seeks to offset 20000 and writes a single byte. Now suppose the system is under a lot of memory pressure, and the 4k page for the first 2000 bytes, which is clean, gets pushed out of memory. How do you track which parts of the 32k block contain valid data, and what happens when the system needs to write out the dirty page at offset 20000?

Also, let's assume that the system is under a huge amount of memory pressure and we need to write out that last page; what if there isn't enough memory available to instantiate the other 28k of the 32k block, so that we can do the read-modify-write cycle just to update that one dirty 4k page at offset 20000?

These problems can all be solved, but it would require a lot of surgery in the VM layer. The VM layer would need to know that for this file system, pages have to be instantiated in chunks of 8 pages at a time; and if there is memory pressure to push out a particular page, you need to write out all 8 pages at the same time if the block is dirty, and then drop all 8 pages from the page cache at the same time. All of this implies that you want to track page usage and dirtiness not at the 4k page level, but at the compound 32k page/"block" level. It basically will involve changes to almost every single part of the VM subsystem, from the page cleaner, to the page fault handler, the page scanner, the writeback algorithms, etc., etc., etc.
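To make the bookkeeping concrete, here is a tiny, purely illustrative user-space sketch (all names are made up; this is nothing like real kernel code) of the per-block state the VM layer would have to carry for a 32k block built out of 4k pages, and of why writing back one dirty sub-page can force a read-modify-write of the whole block:

    /* Illustrative sketch only: per-32k-block bitmaps of which 4k
     * sub-pages are present in memory and which are dirty. */
    #include <stdint.h>
    #include <stdio.h>

    #define SUBPAGES_PER_BLOCK 8            /* 32k block / 4k page */

    struct block_state {
        uint8_t uptodate;   /* bit i set: sub-page i holds valid data */
        uint8_t dirty;      /* bit i set: sub-page i has unwritten changes */
    };

    /* A one-byte write at byte offset 20000 dirties sub-page 4 only. */
    static void write_byte(struct block_state *b, unsigned offset)
    {
        unsigned idx = offset / 4096;
        b->uptodate |= 1u << idx;
        b->dirty    |= 1u << idx;
    }

    /* Writeback must emit the whole 32k block, so any sub-page that is
     * not up to date first has to be read back in: a read-modify-write
     * cycle that needs extra memory exactly when memory is scarce. */
    static void writeback(struct block_state *b)
    {
        uint8_t missing = (uint8_t)~b->uptodate;
        if (missing)
            printf("must read back sub-pages 0x%02x before writing 32k block\n",
                   missing);
        b->dirty = 0;
    }

    int main(void)
    {
        struct block_state b = { .uptodate = 0x01 }; /* only first 4k cached */
        write_byte(&b, 20000);
        writeback(&b);      /* forces a read of the other sub-pages */
        return 0;
    }

The hard part is not this little structure; it is that this state, and every code path that touches it, would have to live in the generic page cache and writeback machinery rather than in any one driver or filesystem.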

Also consider that even if you did hire a Linux VM expert to do this work (which the HDD vendors would deeply love you for, since they also want to be able to deploy HDDs with a 32k or 64k physical sector size), it will be 5-7 years before such a modified VM layer would make its appearance in a Red Hat Enterprise Linux kernel, or the equivalent enterprise or LTS kernel for SuSE or Ubuntu. So if you are working at a startup that is hoping to sell your SSD product into the enterprise market, you might as well give up on this approach now. It's just not going to work before you run out of money.

Now, if you happen to be working for a large cloud company that makes its own hardware (a la Facebook, Amazon, Google, etc.), maybe you could go down this particular path, since they don't use enterprise kernels that add new features at a glacial pace; but for that very reason, they want to stay relatively close to the upstream kernel to minimize their maintenance cost.

If you do work for one of these large cloud companies, I'd strongly recommend that you contact other companies who are in this same space; maybe you could collaborate with them to see if together you could do this kind of development work and together try to get this kind of change upstream. It really, really is not a trivial change, though, especially since the upstream Linux kernel developers will demand that it not negatively impact performance in the common case, which will not involve >4k block devices any time in the near future. And if you work at a Facebook, Google, Amazon, etc., this is not the sort of change that you would want to maintain as a private change to your kernel, but something that you would want to get upstream, since otherwise it would be such a massive, invasive change that supporting it as an out-of-tree patch would be a huge headache.

Although I've never written a device driver for Linux, I find it very unlikely that this is a real limitation of the driver interface. I guess it's possible that you would want to break I/O into scatter-gather lists where each entry in the list is one page long (to improve memory allocation performance and decrease memory fragmentation), but most device types can handle those directly nowadays, and I don't think anything in the driver interface actually requires it. In fact, the simplest way that requests are issued to block devices (described on page 13 -- marked as page 476 -- of that text) looks like it receives the following (a rough sketch of a request handler follows the list):

  • a sector start number
  • a number of sectors to transfer (no limit is mentioned, let alone a limit of 8 512-byte sectors)
  • a pointer to write the data into / read the data from (not a scatter-gather list in this simple case, I guess)
  • whether this is a read or a write
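As a rough, untested sketch of what a request handler can look like with the modern blk-mq interface (mydev_queue_rq and mydev_transfer are hypothetical names; a real driver needs error handling, DMA mapping, and so on), note that nothing in it caps a request at 4KB; the request size is bounded only by the limits the driver itself advertises on its queue:

    #include <linux/blkdev.h>
    #include <linux/blk-mq.h>
    #include <linux/mm.h>

    /* Placeholder: this is where the hardware would be programmed to
     * move `len` bytes starting at `sector`; dir is READ or WRITE. */
    static void mydev_transfer(sector_t sector, void *buf,
                               unsigned int len, int dir)
    {
    }

    /* A blk-mq .queue_rq handler.  One request may span many segments
     * and many sectors; its maximum size comes from the queue limits
     * the driver sets (e.g. blk_queue_max_hw_sectors()), not from a
     * hard 4KB ceiling in the interface. */
    static blk_status_t mydev_queue_rq(struct blk_mq_hw_ctx *hctx,
                                       const struct blk_mq_queue_data *bd)
    {
        struct request *rq = bd->rq;
        struct req_iterator iter;
        struct bio_vec bvec;
        sector_t pos = blk_rq_pos(rq);          /* starting sector */

        blk_mq_start_request(rq);

        /* Walk every segment of the (possibly very large) request. */
        rq_for_each_segment(bvec, rq, iter) {
            /* page_address() assumes the page is directly mapped
             * (no highmem), which holds on 64-bit kernels. */
            void *buf = page_address(bvec.bv_page) + bvec.bv_offset;

            mydev_transfer(pos, buf, bvec.bv_len, rq_data_dir(rq));
            pos += bvec.bv_len >> SECTOR_SHIFT;
        }

        blk_mq_end_request(rq, BLK_STS_OK);
        return BLK_STS_OK;
    }

So if your driver advertises a large maximum request size and still only ever sees 4KB requests, the splitting is happening above the driver, which is the point of the next paragraph.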

I suspect that if you're seeing exclusively 4K accesses, it's probably a result of the caller not requesting more than 4K at a time: if the filesystem you're running on top of your device only issues 4K reads, or whatever is using the filesystem only accesses one block at a time, there is nothing your device driver can do to change that on its own!

Using one block at a time is common for random access patterns like database read workloads, but database log or FS journal writes, or large serial file reads on a traditional (not copy-on-write) filesystem, would issue large I/Os more like what you're expecting. If you want to try issuing large reads against your device directly, to see if it's possible through whatever driver you have now, you could use dd if=/dev/rdiskN of=/dev/null bs=N and see whether increasing the bs parameter from 4K to 1M shows a significant throughput increase.
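One caveat with that test: a plain buffered dd goes through the page cache, so the request sizes that actually reach the driver are shaped by the kernel's readahead rather than directly by bs. If you want more confidence that a large read really arrives at the device as large requests, a sketch along these lines, using O_DIRECT to bypass the page cache, may be a better probe (the device path is a placeholder, and the kernel can still split the I/O according to the queue limits the driver advertises):

    /* Issue a single 1 MiB read directly against a block device,
     * bypassing the page cache with O_DIRECT. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 1 << 20;          /* 1 MiB in one read() */
        void *buf;

        /* O_DIRECT requires an aligned buffer; 4096 satisfies
         * typical devices. */
        if (posix_memalign(&buf, 4096, len))
            return 1;

        int fd = open("/dev/yourdev", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        ssize_t n = read(fd, buf, len);
        if (n < 0) perror("read");
        else printf("read %zd bytes in one call\n", n);

        close(fd);
        free(buf);
        return 0;
    }

Either way, you can check what actually reached the device by watching the average request size reported by iostat -x while the test runs.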
