简体   繁体   中英

read file with mmap multiple threading in C

I'm trying to read a large .txt file in C. I've done a version with fgets() but the performance is limitted by I/O. So I need something else could do better performance than fgets(), and I found that mmap() wont be limmited by I/O. So my question is, is it possible to do that with mmap() and multi threaded(POSIX Thread)? And here is what I need:

Different threads to read(mmap() or something else) different parts of the file simultaneously

I can't found any resource about mmap() with multi threading online , could someone please help me with some example code and explain? I would be very grateful to your help , thanks

Your idea itself is not bad. If we assume a newline delimited file (that is: you can cut between lines without a porblem) you can find teh limtis of the blocks with something like that (ripped out from another program of mine, so please check first)

// just in case
#define _LARGEFILE_SOURCE
#define _BSD_SOURCE
#define _POSIX_C_SOURCE 200112L

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>

// TODO: should be calculated
#define FILE_PARTS 100   
// TODO: should not be global
off_t positions[FILE_PARTS + 1];

int slice_file(FILE * fp)
{
  off_t curr_pos = 0;
  off_t filesize = 0;
  off_t chunk_size = 0;
  int fd;
  int i, res;
  char c;

  struct stat sb;

  // get size of file
  fd = fileno(fp);
  if (fd == -1) {
    fprintf(stderr, "EBADF in prepare_and_backup() for data-file pointer\n");
    return 0;
  }

  if (fstat(fd, &sb) == -1) {
    fprintf(stderr, "fstat() failed\n");
    return 0;
  }
  // check if it is a regular file
  if ((sb.st_mode & S_IFMT) != S_IFREG) {
    fprintf(stderr, "Not a regular file\n");
    return 0;
  }
  // TODO: check if filesize and chunksize >> 1
  filesize = sb.st_size;
  chunk_size = filesize / ((off_t) FILE_PARTS);

  positions[0] = 0;
  curr_pos = 0;

  for (i = 1; i < FILE_PARTS; i++) {
    res = fseeko(fp, curr_pos, SEEK_SET);
    if (res == -1) {
      fprintf(stderr, "Error in fseeko(): %s\n",
              strerror(errno));
      return 0;
    }
    curr_pos += chunk_size;
    // look for the end of the line to cut at useful places
    while ((c = fgetc(fp)) != EOF) {
      curr_pos++;
      // TODO: add code to honor Apple's special needs
      if (c == '\n') {
        c = fgetc(fp);
        if (c == EOF) {
          break;
        }
        curr_pos++;
        break;
      }
    }
    positions[i] = curr_pos - 1;
  }
  // Position of the end of the file
  positions[i] = filesize;
  // Is that even needed?
  rewind(fp);
  return 1;
}

Now you can start a thread, give it start and end of the block it shall work at (which you may or may not have calculated with the function above) and do the (m)mapping inside the individual threads without worry. If the output is of the same size as the block you can even work in-place.

EDIT

The declaration of mmap is

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

If you don't care for a specific address you set it to NULL .
length is the number of bytes you want the map to get initialized to, that is in this case: filled with content from the file-descriptor fd .
The start of that filling is set by offset with one, uncomfortable caveat: it needs to be a multiple of the page-size (ask sysconf(_SC_PAGE_SIZE) for the exact number). Not much of a problem, just set it to the page before the start and start the work at the actual start, all necessary information exists. You can (and have to!) ignore the rest of that page.

Or you take the whole file and map it and use it as you would use a file on the drive: give every thread a block of that map (necessary information in positions ) and work from there.

Advantage of the first: You have several blocks of memory which can be shoved around more easily by the OS and you may or may not have less cache misses with multiple CPUs. It is a must even, if you run a cluster or any other architecture where every CPU/group of CPUs has it's very own RAM or at least a very large cache.

Advantage of the latter: simpler to implement but you have one large clump of a map. That may or may not influence the run-time.

Hint: my experiences with the modern, fast SSDs: the reading speeds are so high these days you can easily start with direct file access instead of mapping. Even with a rather slow, "normal" HDD you get reasonable speeds. The program from which I ripped that snippet above had to search an over 120 GB large CSV file, with not enough RAM to load it fully, not even enough space on the drive to load it into some DB (yes, that was a couple of years ago). It was a key->"lot, of, different, values" file, and thankfully already sorted. So I made a small (as big as I could fit on the drive) index file for it with the method above (KEY->position) although much more blocks than the 100 in my example. The keys in the index-file were also sorted, so you found the right block if the key your were searching for was bigger (data was sorted in ascending order) than the index-entry which means that the key is in the block before that position if it exists. The blocks were small enough to keep some of them in RAM to work as a cache but that gained not much, the incoming requests were quite uniformly random.

A poor-man's DB so to say, and fast enough to do the job without complaints from the users.

A funny side-note: the keys were alphanumerical and the sort algorithm sorted them "aAbBcC...", that means that you can't use strcmp directly. Made me scratch my head for a while but the solution is rather simple: compare ignoring case (eg: strcasecmp if available) and if it is not equal return that result, otherwise return the opposite of the result of a normal strncmp (eg just return -strcmp(a,b); ).

You were quite mute about the kind of data you need to work at, so the above might not have been of the slightest interest to you.

The linux manual page for mmap states:

mmap - map files or devices into memory

#include <sys/mman.h>
void *mmap(void *addr, size_t len, int prot, int flags, int fildes, off_t off);

The description for mmap says:

mmap() creates a new mapping in the virtual address space of the calling process. The starting address for the new mapping is specified in addr. The length argument specifies the length of the mapping.

And here's a code example from the man pages.

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#define handle_error(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)
int main(int argc, char *argv[])
{
    char *addr;
    int fd;
    struct stat sb;
    off_t offset, pa_offset;
    size_t length;
    ssize_t s;
    if (argc < 3 || argc > 4) {
        fprintf(stderr, "%s file offset [length]\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    fd = open(argv[1], O_RDONLY);
    if (fd == -1)
        handle_error("open");
    if (fstat(fd, &sb) == -1)           /* To obtain file size */
        handle_error("fstat");
    offset = atoi(argv[2]);
    pa_offset = offset & ~(sysconf(_SC_PAGE_SIZE) - 1);
        /* offset for mmap() must be page aligned */
    if (offset >= sb.st_size) {
        fprintf(stderr, "offset is past end of file\n");
        exit(EXIT_FAILURE);
    }
    if (argc == 4) {
        length = atoi(argv[3]);
        if (offset + length > sb.st_size)
            length = sb.st_size - offset;
    } else {    /* No length arg ==> display to end of file */
        length = sb.st_size - offset;
    }
    addr = mmap(NULL, length + offset - pa_offset, PROT_READ,
                MAP_PRIVATE, fd, pa_offset);
    if (addr == MAP_FAILED)
        handle_error("mmap");
    s = write(STDOUT_FILENO, addr + offset - pa_offset, length);
    if (s != length) {
        if (s == -1)
            handle_error("write");
        fprintf(stderr, "partial write");
        exit(EXIT_FAILURE);
    }
    exit(EXIT_SUCCESS);
}

None of this is my work, it's all from the Linux manual pages.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM