Multi-threaded file reading with mmap in C

Question

I'm trying to read a large .txt file in C. I've done a version with fgets(), but the performance is limited by I/O. So I need something that performs better than fgets(), and I found that mmap() won't be limited by I/O. So my question is: is it possible to do that with mmap() and multiple threads (POSIX threads)? Here is what I need:

Different threads read (with mmap() or something else) different parts of the file simultaneously.

I can't find any resources about mmap() with multithreading online. Could someone please help me with some example code and an explanation? I would be very grateful for your help, thanks.

Answer 1

Your idea itself is not bad. If we assume a newline-delimited file (that is, you can cut between lines without a problem), you can find the limits of the blocks with something like this (ripped out of another program of mine, so please check it first):

// just in case
#define _LARGEFILE_SOURCE
#define _BSD_SOURCE
#define _DEFAULT_SOURCE   // glibc >= 2.20 deprecates _BSD_SOURCE in favor of this
#define _POSIX_C_SOURCE 200112L

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>

// TODO: should be calculated
#define FILE_PARTS 100   
// TODO: should not be global
off_t positions[FILE_PARTS + 1];

int slice_file(FILE * fp)
{
  off_t curr_pos = 0;
  off_t filesize = 0;
  off_t chunk_size = 0;
  int fd;
  int i, res;
  int c;   // int, not char: fgetc() returns an int so EOF can be told apart from data

  struct stat sb;

  // get size of file
  fd = fileno(fp);
  if (fd == -1) {
    fprintf(stderr, "EBADF in slice_file() for data-file pointer\n");
    return 0;
  }

  if (fstat(fd, &sb) == -1) {
    fprintf(stderr, "fstat() failed\n");
    return 0;
  }
  // check if it is a regular file
  if ((sb.st_mode & S_IFMT) != S_IFREG) {
    fprintf(stderr, "Not a regular file\n");
    return 0;
  }
  // TODO: check if filesize and chunksize >> 1
  filesize = sb.st_size;
  chunk_size = filesize / ((off_t) FILE_PARTS);

  positions[0] = 0;
  curr_pos = 0;

  for (i = 1; i < FILE_PARTS; i++) {
    // jump to the nominal end of this chunk, but never past the end of the file
    curr_pos += chunk_size;
    if (curr_pos > filesize) {
      curr_pos = filesize;
    }
    res = fseeko(fp, curr_pos, SEEK_SET);
    if (res == -1) {
      fprintf(stderr, "Error in fseeko(): %s\n",
              strerror(errno));
      return 0;
    }
    // look for the end of the line to cut at useful places
    while ((c = fgetc(fp)) != EOF) {
      curr_pos++;
      // TODO: add code to honor Apple's special needs
      if (c == '\n') {
        break;
      }
    }
    // first byte after the newline, or the file size if we ran into EOF
    positions[i] = curr_pos;
  }
  // Position of the end of the file
  positions[i] = filesize;
  // Is that even needed?
  rewind(fp);
  return 1;
}

Now you can start a thread, give it the start and end of the block it shall work on (which you may or may not have calculated with the function above), and do the (m)mapping inside the individual threads without worry. If the output is of the same size as the block, you can even work in place.

EDIT

The declaration of mmap is:

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

If you don't care for a specific address, set addr to NULL.
length is the number of bytes you want the map to be initialized with, in this case filled with content from the file descriptor fd.
The start of that filling is set by offset, with one uncomfortable caveat: it needs to be a multiple of the page size (ask sysconf(_SC_PAGE_SIZE) for the exact number). Not much of a problem: just set it to the page boundary before the start of your block and begin the work at the actual start; all the necessary information exists. You can (and have to!) ignore the part of that page that lies before your block.
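The page-alignment trick described above could look roughly like the following sketch. The names block_map, map_block and unmap_block are made up for this illustration and are not part of any library; the sketch assumes a read-only private mapping of a regular file and a non-empty byte range [start, end).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper type: one mapped block of a file. */
struct block_map {
  void *base;      /* address returned by mmap(), needed for munmap() */
  size_t map_len;  /* length passed to mmap(), needed for munmap() */
  char *data;      /* first byte of the requested range */
  size_t len;      /* number of requested bytes */
};

/* Map the bytes [start, end) of fd read-only. Returns 0 on success, -1 on error. */
int map_block(int fd, off_t start, off_t end, struct block_map *bm)
{
  long page = sysconf(_SC_PAGE_SIZE);
  assert(page > 0 && start >= 0 && end > start);

  /* round the start down to the page boundary before it */
  off_t aligned = start - (start % (off_t) page);
  size_t slack = (size_t) (start - aligned);

  bm->len = (size_t) (end - start);
  bm->map_len = bm->len + slack;
  bm->base = mmap(NULL, bm->map_len, PROT_READ, MAP_PRIVATE, fd, aligned);
  if (bm->base == MAP_FAILED) {
    return -1;
  }
  /* skip the leading bytes of the page that belong to the previous block */
  bm->data = (char *) bm->base + slack;
  return 0;
}

/* Release a mapping created by map_block(). */
void unmap_block(struct block_map *bm)
{
  munmap(bm->base, bm->map_len);
}
```

A thread would call map_block(fd, positions[i], positions[i + 1], &bm), work on bm.data[0] to bm.data[bm.len - 1], and call unmap_block(&bm) when it is done.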

Or you map the whole file and use it as you would use a file on the drive: give every thread a block of that map (the necessary information is in positions) and work from there.
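A minimal sketch of the whole-file variant, under a few assumptions: the per-thread work is just counting newlines (a stand-in for whatever you really need to do), the block limits arrive in an array laid out like positions above, and there are at most MAX_PARTS blocks. The names slice, count_newlines, count_lines_mapped and MAX_PARTS are invented for this example. Depending on your toolchain you may need to compile with -pthread.

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_PARTS 64  /* assumption: upper bound on threads in this sketch */

/* Hypothetical per-thread work item: a slice of the shared read-only map. */
struct slice {
  const char *data;
  size_t len;
  long newlines;  /* result, written only by the owning thread */
};

/* Thread body: stand-in for real work, counts '\n' bytes in its slice. */
static void *count_newlines(void *arg)
{
  struct slice *s = arg;
  s->newlines = 0;
  for (size_t i = 0; i < s->len; i++) {
    if (s->data[i] == '\n') {
      s->newlines++;
    }
  }
  return NULL;
}

/*
 * Map the whole file once and let one thread per block scan
 * [positions[i], positions[i + 1]). positions must hold parts + 1
 * ascending offsets within the file. Returns the total newline count,
 * or -1 on error (an empty file counts as an error here because
 * mmap() rejects a length of zero).
 */
long count_lines_mapped(const char *path, const off_t *positions, int parts)
{
  pthread_t tid[MAX_PARTS];
  struct slice sl[MAX_PARTS];
  struct stat sb;
  long total = 0;
  int started = 0, failed = 0;

  assert(parts > 0 && parts <= MAX_PARTS);

  int fd = open(path, O_RDONLY);
  if (fd == -1) {
    return -1;
  }
  if (fstat(fd, &sb) == -1 || sb.st_size == 0) {
    close(fd);
    return -1;
  }
  char *map = mmap(NULL, (size_t) sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);  /* the mapping stays valid after close() */
  if (map == MAP_FAILED) {
    return -1;
  }

  for (int i = 0; i < parts; i++) {
    sl[i].data = map + positions[i];
    sl[i].len = (size_t) (positions[i + 1] - positions[i]);
    if (pthread_create(&tid[i], NULL, count_newlines, &sl[i]) != 0) {
      failed = 1;
      break;
    }
    started++;
  }
  /* wait for every thread that was started, then add up the results */
  for (int i = 0; i < started; i++) {
    pthread_join(tid[i], NULL);
    total += sl[i].newlines;
  }
  munmap(map, (size_t) sb.st_size);
  return failed ? -1 : total;
}
```

The threads only read from the map and each one writes only to its own slice, so no locking is needed in this sketch.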

Advantage of the first: you have several blocks of memory which can be shoved around more easily by the OS, and you may or may not get fewer cache misses with multiple CPUs. It is even a must if you run a cluster or any other architecture where every CPU or group of CPUs has its very own RAM, or at least a very large cache.

Advantage of the latter: it is simpler to implement, but you have one large clump of a map. That may or may not influence the run time.

Hint: my experience with modern, fast SSDs is that reading speeds are so high these days that you can easily start with direct file access instead of mapping. Even with a rather slow, "normal" HDD you get reasonable speeds. The program from which I ripped the snippet above had to search a CSV file of over 120 GB, with not enough RAM to load it fully and not even enough space on the drive to load it into some DB (yes, that was a couple of years ago). It was a key->"lot, of, different, values" file and, thankfully, already sorted. So I made a small index file for it (as big as I could fit on the drive) with the method above (KEY->position), although with many more blocks than the 100 in my example. The keys in the index file were also sorted, so you had found the right block once the key you were searching for was bigger than the index entry (the data was sorted in ascending order), which means that the key, if it exists, is in the block before that position. The blocks were small enough to keep some of them in RAM as a cache, but that did not gain much because the incoming requests were quite uniformly random.
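The exact layout of that index is not shown here, so the following is only a sketch under an assumed layout: each entry holds the first key of a block plus the block's file offset, kept in memory as a sorted array. The names index_entry and find_block are hypothetical, and plain strcmp stands in for whatever ordering the data really uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical index entry: first key of a block and the block's file offset. */
struct index_entry {
  const char *key;
  off_t pos;
};

/*
 * Return the index of the last entry whose key is <= target, i.e. the only
 * block that can contain target, or -1 if target sorts before every block.
 * idx must be sorted in ascending order by key.
 */
long find_block(const struct index_entry *idx, size_t n, const char *target)
{
  assert(idx != NULL && target != NULL);
  size_t lo = 0, hi = n;  /* the search window is [lo, hi) */
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (strcmp(idx[mid].key, target) <= 0) {
      lo = mid + 1;       /* entry is not after target: look to the right */
    } else {
      hi = mid;           /* entry is after target: look to the left */
    }
  }
  return (long) lo - 1;
}
```

With the block number in hand you seek to (or map from) idx[block].pos and scan only that one block for the key.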

A poor man's DB, so to say, and fast enough to do the job without complaints from the users.

A funny side note: the keys were alphanumeric and the sort algorithm had sorted them as "aAbBcC...", which means that you can't use strcmp directly. That made me scratch my head for a while, but the solution is rather simple: compare ignoring case (e.g. with strcasecmp, if available) and, if the result is not equal, return it; otherwise return the opposite of the result of a normal strcmp (e.g. just return -strcmp(a,b);).
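Written out, that comparison could look like the sketch below. The function name key_cmp is made up for this example, and it assumes POSIX strcasecmp from <strings.h> and plain ASCII keys.

```c
#include <assert.h>
#include <string.h>
#include <strings.h>  /* strcasecmp() (POSIX) */

/*
 * Compare two keys in "aAbBcC..." order: letters are ordered
 * case-insensitively first, and keys that differ only in case are
 * ordered with lowercase before uppercase.
 */
int key_cmp(const char *a, const char *b)
{
  assert(a != NULL && b != NULL);
  int res = strcasecmp(a, b);
  if (res != 0) {
    return res;
  }
  /* equal when ignoring case: invert the usual "uppercase first" order */
  return -strcmp(a, b);
}
```

For example, key_cmp("a", "A") is negative and key_cmp("A", "b") is negative as well, which matches the order a, A, b.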

You were quite mute about the kind of data you need to work on, so the above might not be of the slightest interest to you.

Answer 2

The Linux manual page for mmap states:

mmap - map files or devices into memory

#include <sys/mman.h>
void *mmap(void *addr, size_t len, int prot, int flags, int fildes, off_t off);

The description for mmap says:

mmap() creates a new mapping in the virtual address space of the calling process. The starting address for the new mapping is specified in addr. The length argument specifies the length of the mapping.

And here's a code example from the man pages.

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#define handle_error(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)
int main(int argc, char *argv[])
{
    char *addr;
    int fd;
    struct stat sb;
    off_t offset, pa_offset;
    size_t length;
    ssize_t s;
    if (argc < 3 || argc > 4) {
        fprintf(stderr, "%s file offset [length]\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    fd = open(argv[1], O_RDONLY);
    if (fd == -1)
        handle_error("open");
    if (fstat(fd, &sb) == -1)           /* To obtain file size */
        handle_error("fstat");
    offset = atoi(argv[2]);
    pa_offset = offset & ~(sysconf(_SC_PAGE_SIZE) - 1);
        /* offset for mmap() must be page aligned */
    if (offset >= sb.st_size) {
        fprintf(stderr, "offset is past end of file\n");
        exit(EXIT_FAILURE);
    }
    if (argc == 4) {
        length = atoi(argv[3]);
        if (offset + length > sb.st_size)
            length = sb.st_size - offset;
    } else {    /* No length arg ==> display to end of file */
        length = sb.st_size - offset;
    }
    addr = mmap(NULL, length + offset - pa_offset, PROT_READ,
                MAP_PRIVATE, fd, pa_offset);
    if (addr == MAP_FAILED)
        handle_error("mmap");
    s = write(STDOUT_FILENO, addr + offset - pa_offset, length);
    if (s != length) {
        if (s == -1)
            handle_error("write");
        fprintf(stderr, "partial write");
        exit(EXIT_FAILURE);
    }
    exit(EXIT_SUCCESS);
}

None of this is my work; it's all from the Linux manual pages.
