
How to parse files that cannot fit entirely in RAM

I have created a framework to parse text files of reasonable size that can fit in RAM, and for now things are going well. I have no complaints, but what if I encounter a situation where I have to deal with large files, say, greater than 8 GB (which is the size of mine)? What would be an efficient approach to dealing with such large files?

My framework:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <limits.h> /* INT_MAX, used in dump_file_to_array() */

int Parse(const char *filename,
    const char *outputfile);

int main(void)
{
    clock_t t1 = clock();
    /* ............................................................................................................................. */
    Parse("file.txt", NULL);
    /* ............................................................................................................................. */
    clock_t t2 = clock();
    fprintf(stderr, "time elapsed: %.4f\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    fprintf(stderr, "Press any key to continue . . . ");
    getchar();
    return 0;
}

long GetFileSize(FILE * fp)
{
    long f_size;
    fseek(fp, 0L, SEEK_END);
    f_size = ftell(fp);
    fseek(fp, 0L, SEEK_SET);
    return f_size;
}

char *dump_file_to_array(FILE *fp,
    size_t f_size)
{
    char *buf = (char *)calloc(f_size + 1, 1);
    if (buf) {
        size_t n = 0;
        while (fgets(buf + n, INT_MAX, fp)) {
            n += strlen(buf + n);
        }
    }
    return buf;
}

int Parse(const char *filename,
    const char *outputfile)
{
    /* open file for reading in text mode */
    FILE *fp = fopen(filename, "r");
    if (!fp) {
        perror(filename);
        return 1;
    }
    /* store file in dynamic memory and close file */
    size_t f_size = GetFileSize(fp);
    char *buf = dump_file_to_array(fp, f_size);
    fclose(fp);
    if (!buf) {
        fputs("error: memory allocation failed.\n", stderr);
        return 2;
    }
    /* state machine variables */
    // ........

    /* array index variables */
    size_t x = 0;
    size_t y = 0;
    /* main loop */
    while (buf[x]) {
        switch (buf[x]) {
            /* ... */
        }
        x++;
    }
    /* NUL-terminate array at y */
    buf[y] = '\0';
    /* write buffer to file and clean up */
    outputfile ? fp = fopen(outputfile, "w") :
                 fp = fopen(filename, "w");
    if (!fp) {
        outputfile ? perror(outputfile) :
                     perror(filename);
    }
    else {
        fputs(buf, fp);
        fclose(fp);
    }
    free(buf);
    return 0;
}

Pattern deletion function based on the framework:

int delete_pattern_in_file(const char *filename,
    const char *pattern, const char *outputfile)
{
    /* open file for reading in text mode */
    FILE *fp = fopen(filename, "r");
    if (!fp) {
        perror(filename);
        return 1;
    }
    /* copy file contents to buffer and close file */
    size_t f_size = GetFileSize(fp);
    char *buf = dump_file_to_array(fp, f_size);
    fclose(fp);
    if (!buf) {
        fputs("error - memory allocation failed", stderr);
        return 2;
    }
    /* delete first match */
    size_t n = 0, pattern_len = strlen(pattern);
    char *tmp, *ptr = strstr(buf, pattern);
    if (!ptr) {
        fputs("No match found.\n", stderr);
        free(buf);
        return -1;
    }
    else {
        n = ptr - buf;
        ptr += pattern_len;
        tmp = ptr;
    }
    /* delete the rest */
    while (ptr = strstr(ptr, pattern)) {
        while (tmp < ptr) {
            buf[n++] = *tmp++;
        }
        ptr += pattern_len;
        tmp = ptr;
    }
    /* copy the rest of the buffer */
    strcpy(buf + n, tmp);
    /* open file for writing and print the processed buffer to it */
    outputfile ? fp = fopen(outputfile, "w") :
                 fp = fopen(filename, "w");
    if (!fp) {
        outputfile ? perror(outputfile) :
                     perror(filename);
    }
    else {
        fputs(buf, fp);
        fclose(fp);
    }
    free(buf);
    return 0;
}

If you wish to stick with your current design, an option might be to mmap() the file instead of reading it into a memory buffer.

You could change the function dump_file_to_array to the following (linux-specific):

char *dump_file_to_array(FILE *fp, size_t f_size)
{
    /* needs <sys/mman.h>; fileno() is POSIX */
    char *buf = mmap(NULL, f_size, PROT_READ, MAP_SHARED, fileno(fp), 0);
    if (buf == MAP_FAILED)
        return NULL;
    return buf;
}

Now you can read through the file, and the virtual memory manager will automatically take care of keeping only the relevant portions of the file in memory. Two caveats if you drop this into the framework above: the caller must release the buffer with munmap() rather than free(), and with PROT_READ the buffer cannot be modified in place (map it with MAP_PRIVATE and PROT_READ | PROT_WRITE if you need to edit it). For Windows, similar mechanisms exist (CreateFileMapping()/MapViewOfFile()).

Chances are you are parsing the file line by line. So read in a large block (4K or 16K) and parse all the complete lines in it. Copy the small remainder to the beginning of the buffer and read in the rest of the buffer. Rinse and repeat.
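
Here is a minimal sketch of that loop in C. parse_line() is a hypothetical callback standing in for the actual parsing work, and the 16K block size is illustrative:

#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 16384

/* hypothetical parser callback: here it just prints each line */
static void parse_line(const char *line, size_t len)
{
    printf("%.*s\n", (int)len, line);
}

int parse_in_blocks(FILE *fp)
{
    char buf[BLOCK_SIZE];
    size_t fill = 0;  /* number of bytes currently in buf */

    for (;;) {
        size_t got = fread(buf + fill, 1, sizeof(buf) - fill, fp);
        fill += got;

        /* parse every complete line in the block */
        char *start = buf, *nl;
        while ((nl = memchr(start, '\n', fill - (size_t)(start - buf)))) {
            parse_line(start, (size_t)(nl - start));
            start = nl + 1;
        }

        /* copy the small remainder to the beginning of the buffer;
           memmove() because source and destination may overlap */
        fill -= (size_t)(start - buf);
        memmove(buf, start, fill);

        if (fill == sizeof(buf)) {  /* line longer than the buffer: */
            parse_line(buf, fill);  /* hand it over in pieces */
            fill = 0;
        }
        if (got == 0) {             /* end of file */
            if (fill)
                parse_line(buf, fill);  /* final line without '\n' */
            break;
        }
    }
    return ferror(fp) ? -1 : 0;
}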

For JSON or XML you will need an event-based parser that can accept multiple blocks of input.

There are multiple issues with your approach.

The concepts of maximum and available memory are not so evident: technically, you are not limited by the RAM size, but by the quantity of memory your environment will let you allocate and use for your program. This depends on various factors:

  • What ABI you compile for: the maximum memory size accessible to your program is limited to less than 4 GB if you compile for 32-bit code, even if your system has more RAM than that.
  • What quota the system is configured to let your program use. This may be less than the available memory (on POSIX systems you can query it, as sketched after this list).
  • What strategy the system uses when more memory is requested than is physically available: most modern systems use virtual memory and share physical memory between processes and system tasks (such as the disk cache) using very advanced algorithms that cannot be described in a few lines. It is possible on some systems for your program to allocate and use more memory than is physically installed on the motherboard, swapping memory pages to disk as more memory is accessed, at a huge cost in lag time.
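
On POSIX systems, the quota in the second point can be queried with getrlimit(). A minimal sketch (RLIMIT_AS is the address-space limit; RLIM_INFINITY means no quota is set):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_AS, &rl) == 0) {
        if (rl.rlim_cur == RLIM_INFINITY)
            puts("address space: unlimited");
        else
            printf("address space limit: %llu bytes\n",
                   (unsigned long long)rl.rlim_cur);
    }
    return 0;
}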

There are further issues in your code:

  • The type long might be too small to hold the size of the file: on Windows systems, long is 32-bit even on 64-bit versions, where memory can be allocated in chunks larger than 2 GB. You must use a different API to request the file size from the system (see the sketch after this list).
  • You read the file with a series of calls to fgets(). This is inefficient: a single call to fread() would suffice. Furthermore, if the file contains embedded null bytes ('\0' characters), chunks of the file will be missing in memory. Then again, you could not deal with embedded null bytes anyway if you use string functions such as strstr() and strcpy() to handle your string deletion task.
  • The condition in while (ptr = strstr(ptr, pattern)) is an assignment. While not strictly incorrect, it is poor style: it confuses readers of your code and prevents life-saving warnings from the compiler in the cases where such assignment-conditions are coding errors. You might think that could never happen, but anyone can make a typo, and a missing = in a test is difficult to spot and has dire consequences.
  • Your short-hand use of the ternary operator in place of if statements is quite confusing too: outputfile ? fp = fopen(outputfile, "w") : fp = fopen(filename, "w"); (a clearer alternative is shown after this list).
  • Rewriting the input file in place is risky too: if anything goes wrong, the input file will be lost.
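
Putting the first two points into practice, here is a minimal sketch of a safer loader. It is POSIX-specific: fseeko() and ftello() use a 64-bit off_t when the program is compiled with _FILE_OFFSET_BITS=64 (on Windows the equivalents are _fseeki64()/_ftelli64()), and the whole file is read with a single fread():

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>  /* off_t */

char *load_file(FILE *fp, size_t *sizep)
{
    if (fseeko(fp, 0, SEEK_END))
        return NULL;
    off_t size = ftello(fp);
    if (size < 0 || fseeko(fp, 0, SEEK_SET))
        return NULL;

    char *buf = malloc((size_t)size + 1);
    if (buf) {
        /* one call instead of a loop of fgets(); embedded null
           bytes are preserved in the buffer */
        size_t n = fread(buf, 1, (size_t)size, fp);
        buf[n] = '\0';
        *sizep = n;
    }
    return buf;
}

As for the ternary assignments, a clearer equivalent is simply:

fp = fopen(outputfile ? outputfile : filename, "w");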

Note that you can implement the filtering on the fly, without a buffer, albeit inefficiently:

#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: delpat PATTERN < inputfile > outputfile\n");
        return 1;
    }
    unsigned char *pattern = (unsigned char*)argv[1];
    size_t i, j, n = strlen(argv[1]);
    size_t skip[n + 1];
    int c;

    /* Precompute the shift table: skip[i] is the smallest shift j such
       that pattern[j..i-1] is also a prefix of the pattern, i.e. how
       far the match can slide after i characters have been matched. */
    skip[0] = 0;
    for (i = j = 1; i < n; i++) {
        while (memcmp(pattern, pattern + j, i - j)) {
            j++;
        }
        skip[i] = j;
    }

    i = 0; /* number of pattern characters matched so far */
    while ((c = getchar()) != EOF) {
        for (;;) {
            if (i < n && c == pattern[i]) {
                if (++i == n) {
                    i = 0; /* match found, consumed */
                }
                break;
            }
            if (i == 0) {
                putchar(c);
                break;
            }
            for (j = 0; j < skip[i]; j++) {
                putchar(pattern[j]);
            }
            i -= skip[i];
        }
    }
    for (j = 0; j < i; j++) {
        putchar(pattern[j]);
    }
    return 0;
}
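
For example, to delete every occurrence of the string ABC from a file:

delpat ABC < input.txt > output.txt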

First of all, I wouldn't suggest holding such big files in RAM; use streams instead. This is because buffering is usually done by the library as well as by the kernel.
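
For instance, with C stdio you can enlarge that library-level buffer yourself; a minimal sketch (the 1 MB size is arbitrary):

#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("file.txt", "r");
    if (fp) {
        /* must be called before any other I/O on the stream */
        setvbuf(fp, NULL, _IOFBF, 1 << 20);
        /* ... stream the file through the parser ... */
        fclose(fp);
    }
    return 0;
}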

If you are accessing the file sequentially, which seems to be the case, then you probably know that all modern systems implement read-ahead algorithms, so just reading the whole file into RAM ahead of time will in most cases just waste time.

You didn't specify the use cases you have to cover, so I'm going to assume that using streams like std::ifstream and doing the parsing on the fly will suit your needs. As a side note, also make sure that operations on files that are expected to be large are done in separate threads.

An alternative solution: if you're on a Linux system and you have a decent amount of swap space, just open the whole bad boy up. It will consume your RAM and also consume hard drive space (swap). Thus you can have the entire thing open at once, just not all of it will be in RAM.

Pros

  • If an unexpected shutdown occurred, the data in the swap space is recoverable.
  • RAM is expensive, HDDs are cheap, so the application would put less strain on your expensive equipment
  • Viruses could not harm your computer because there would be no room in RAM for them to run
  • You'll be taking full advantage of the Linux operating system by using the swap space. Normally the swap space module is not used; all it does is clog up precious RAM.
  • The additional energy needed to utilize the entirety of the RAM can warm the immediate area. Useful during winter time
  • You can add "Complex and Special Memory Allocation Engineering" to your resume.

Cons

  • None

Consider treating the file as an external array of lines.

Code can use an array of line indexes. This index array can be kept in memory at a fraction of the size of the large file. Access to any line is accomplished quickly via this lookup: a seek with fsetpos() and an fread()/fgets(). As lines are edited, the new versions can be saved, in any order, in a temporary text file. Saving the file reads both the original file and the temp one in sequence to form and write the new file.

typedef struct {
  int attributes; // not_yet_read, line_offset/length_determined, 
                  // line_changed/in_other_file, deleted, etc.
  fpos_t line_offset;   // use with  fgetpos() fsetpos()
  unsigned line_length; // optional field as code could re-compute as needed.
} line_index;

size_t line_count;
// read some lines
line_index *index = malloc(sizeof *index * line_count);
// read more lines
index = realloc(index, sizeof *index * line_count);
// edit lines, save changes to appended temporary file.
// ...
// Save file: weave the contents of the source file and temp file into the new output file.

Additionally, with enormous files, the line_index[] array itself can be kept on disk too; access to it is easily computed. In an extreme case, only one line of the file needs to be in memory at any time.
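
For illustration, a minimal sketch of the lookup described above, assuming each index[i].line_offset was recorded with fgetpos() during the initial scan (error handling abbreviated):

int fetch_line(FILE *fp, const line_index *index, size_t i,
               char *dest, size_t dest_size)
{
    /* seek straight to the recorded position and read one line */
    if (fsetpos(fp, &index[i].line_offset))
        return -1;
    return fgets(dest, (int)dest_size, fp) ? 0 : -1;
}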

You mentioned a state machine. Every finite-state automaton can be optimized to have minimal (or no) lookahead.

Would it be possible to do this in Lex? It will generate an output C file which you can compile.

If you don't want to use Lex, you can always do the following (sketched in C after the list):

  1. Read n chars into a (ring?) buffer, where n is the size of the pattern.
  2. Try to match the buffer against the pattern.
  3. If it matches, go to 1.
  4. Print buffer[0], read one char, go to 2.
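
Here is a minimal sketch of those four steps in C, using a naive comparison over a ring buffer (no skip table, so the worst case is O(n) work per input character; the function name and the 256-byte window limit are illustrative):

#include <stdio.h>
#include <string.h>

void delete_pattern_stream(FILE *in, FILE *out, const char *pattern)
{
    size_t n = strlen(pattern);
    char window[256];            /* ring buffer, assumes n <= 256 */
    size_t fill = 0, start = 0;  /* k-th byte is window[(start + k) % n] */
    int c;

    if (n == 0 || n > sizeof(window))
        return;
    while ((c = getc(in)) != EOF) {
        window[(start + fill) % n] = (char)c;
        if (++fill < n)
            continue;                    /* step 1: fill the buffer */
        size_t k = 0;                    /* step 2: compare with pattern */
        while (k < n && window[(start + k) % n] == pattern[k])
            k++;
        if (k == n) {
            fill = 0;                    /* step 3: match, drop it */
        } else {
            putc(window[start], out);    /* step 4: emit oldest char */
            start = (start + 1) % n;
            fill--;
        }
    }
    while (fill--) {                     /* flush the remaining bytes */
        putc(window[start], out);
        start = (start + 1) % n;
    }
}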

Also, for very long patterns and degenerate inputs, strstr() can be slow. In that case you might want to look into more advanced string-matching algorithms such as Knuth-Morris-Pratt or Boyer-Moore.

mmap() is a pretty good way of working on files with large sizes. It provides you with a lot of flexibility, but you need to be cautious with the page size. Here is a good article which talks about more of the specifics.
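
A sketch of the page-size caution: the offset argument of mmap() must be a multiple of the page size, so to map a window of a huge file at an arbitrary offset, round down to a page boundary and compensate in the returned pointer (map_window() is an illustrative helper, not a standard API):

#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

char *map_window(int fd, off_t offset, size_t length,
                 void **base, size_t *maplen)
{
    long page = sysconf(_SC_PAGESIZE);
    off_t aligned = offset - (offset % page);  /* round down to page */
    size_t slack = (size_t)(offset - aligned);

    *maplen = length + slack;
    *base = mmap(NULL, *maplen, PROT_READ, MAP_PRIVATE, fd, aligned);
    if (*base == MAP_FAILED)
        return NULL;
    return (char *)*base + slack;  /* pointer to the requested offset */
}

/* when done: munmap(*base, *maplen); */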
