
Is there a faster way than fscanf to read large data?

I am working with a large amount of structured data. I have a file that contains names and door numbers. I am using fscanf to read each name and number, and then I store them in smaller files using fprintf.

while ( fscanf(file, "%s %d", people[i].name, &people[i].doorNum) > 0 ) {  /* name must point to allocated storage; a width limit (eg "%31s") would be safer */
      ...
}

people is an array of structs:

typedef struct {
    char* name;
    int doorNum;
} person;

The file I'm trying to read is 15 GB. My goal is to read it and split it into 1 GB files. It works correctly, but it takes more than 10 minutes. How can I improve the reading and writing process?

You don't tell us what you mean by "splitting".

It could be that parsing the fields as a character string and an integer is unnecessary (maybe a single string, or two separate strings, would be enough if you are only splitting the file).

Write your own scanning function with built-in knowledge of the pattern to be matched; this will certainly be more efficient. Even writing your own conversion to integer should be better.
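
For illustration, a hand-rolled scanner for one "name number" line might look like the sketch below. It is a minimal sketch, not code from the question: parse_line, its buffer handling, and the error policy are all assumptions made for the example.

    #include <ctype.h>
    #include <stddef.h>

    /* Parse "name doorNum\n" starting at *p and advance *p past the line.
       Returns 1 on success, 0 on a malformed line. */
    static int parse_line(const char **p, char *name, size_t name_max, int *doorNum)
    {
        const char *s = *p;
        size_t n = 0;

        /* copy the name up to the separating space */
        while (*s && *s != ' ' && *s != '\n' && n + 1 < name_max)
            name[n++] = *s++;
        name[n] = '\0';
        if (*s != ' ')
            return 0;
        s++;

        /* hand-rolled decimal conversion: the whole point of the exercise */
        if (!isdigit((unsigned char)*s))
            return 0;
        int value = 0;
        while (isdigit((unsigned char)*s))
            value = value * 10 + (*s++ - '0');

        if (*s == '\n')
            s++;

        *doorNum = value;
        *p = s;
        return 1;
    }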

fscanf() has a lot of features which are unlikely to be used at the same time, and which make it slower. I suggest that you code your own function using fread(). As your function will have only one specific task, it should be faster.
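
As a rough sketch of that idea (the 1 MiB chunk size and the carry-over handling are assumptions; a real version would also handle a final line with no trailing newline):

    #include <stdio.h>
    #include <string.h>

    #define CHUNK (1u << 20)                   /* read 1 MiB at a time */

    static void scan_file(FILE *file)
    {
        static char buffer[CHUNK + 1];
        size_t kept = 0;                       /* partial line carried over */

        for (;;) {
            size_t got = fread(buffer + kept, 1, CHUNK - kept, file);
            if (got == 0)
                break;
            buffer[kept + got] = '\0';

            char *end = strrchr(buffer, '\n'); /* last complete line */
            if (end == NULL) {                 /* line longer than the chunk */
                kept += got;
                continue;
            }
            *end = '\0';

            /* parse all complete lines in buffer here, eg with a
               hand-written scanner such as parse_line() above */

            kept = kept + got - (size_t)(end + 1 - buffer);
            memmove(buffer, end + 1, kept);    /* keep the partial tail */
        }
    }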

All files contain binary data. Some formats are efficient, and some aren't.

For example, to store the number 0x1234 you could store it as the 2-byte sequence 0x34, 0x12 so that it can be reconstructed with a small number of simple/fast operations (eg value = buffer[pos] | (buffer[pos+1] << 8); ). This would be relatively efficient.

Alternatively, you could store it as the 5-byte sequence 0x34, 0x36, 0x36, 0x30, 0x00, where each byte represents an ASCII character in the decimal string "4660" (with a zero terminator at the end); then you could scan the bytes and convert them from decimal to integer using an expensive loop like this:

    int value = 0;                      /* accumulates the converted number */
    int c;

    while( (c = buffer[pos++]) != 0) {
        if( (c < '0') || (c > '9') ) {
            // Error condition(!)
        }
        value = value * 10 + c - '0';
    }

Then you could make it worse by wrapping it in "convenience" (eg fscanf() ), where the code has to scan a format string at run time just to figure out that it needs to do something like that expensive loop.

Basically, if you care about performance and/or efficiency (including file size) you need to stop using "plain text" and design a file format to suit the data; especially when you're looking at huge 15 GB files.
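
As a sketch of what that could look like here (the 32-byte name field and the record layout are assumptions made for the example, and dumping raw structs like this also assumes the reader and writer agree on endianness and padding):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* A fixed-size on-disk record: no parsing needed when reading it back. */
    typedef struct {
        char     name[32];
        uint32_t doorNum;
    } record;

    static int write_record(FILE *out, const char *name, uint32_t doorNum)
    {
        record r;
        memset(&r, 0, sizeof r);               /* zero-pad the name field */
        strncpy(r.name, name, sizeof r.name - 1);
        r.doorNum = doorNum;
        return fwrite(&r, sizeof r, 1, out) == 1;
    }

    static int read_record(FILE *in, record *r)
    {
        return fread(r, sizeof *r, 1, in) == 1;
    }

With records of a known fixed size, splitting into 1 GB pieces also becomes a matter of copying a whole number of records per output file, with no scanning at all.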

EDIT: Added everything below!

If you're stuck with "plain text", then you can get a little more performance by doing more of the parsing yourself (eg using functions like atoi(), etc). The next step beyond that is to use your own (more specialised) routines instead of functions like atoi().
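
For example, that first step might look like this sketch (the buffer sizes are assumptions; it uses only standard library calls):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* fgets() plus manual splitting avoids fscanf()'s format-string machinery. */
    static int read_person(FILE *file, char *name, size_t name_max, int *doorNum)
    {
        char line[256];
        if (fgets(line, sizeof line, file) == NULL)
            return 0;

        char *space = strchr(line, ' ');
        if (space == NULL)
            return 0;

        size_t len = (size_t)(space - line);
        if (len >= name_max)
            len = name_max - 1;
        memcpy(name, line, len);
        name[len] = '\0';

        *doorNum = atoi(space + 1);            /* the next step: replace atoi() too */
        return 1;
    }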

The next step beyond that is to use a deterministic finite state machine. The general idea might go something like:

    /* The state constants must live above the 8-bit character range so that
       "state | character" forms a unique case label; for this sketch assume: */
    enum { START_OF_LINE = 0x000, GETTING_NAME = 0x100, GETTING_NUMBER = 0x200 };

    switch( state | buffer[pos++] ) {
        case START_OF_LINE | 'A':
        case START_OF_LINE | 'B':
        case START_OF_LINE | 'C':
            string_start = pos - 1;
            string_length = 1;
            state = GETTING_NAME;
            break;
        case GETTING_NAME | 'A':
        case GETTING_NAME | 'B':
        case GETTING_NAME | 'C':
            string_length++;
            break;
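        /* ... more cases here for the other characters a name may contain ... */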
        case GETTING_NAME | ' ':
            number = 0;
            state = GETTING_NUMBER;
            break;
        case GETTING_NUMBER | '0':
            number = number * 10;
            break;
        case GETTING_NUMBER | '1':
            number = number * 10 + 1;
            break;
        case GETTING_NUMBER | '2':
            number = number * 10 + 2;
            break;
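        /* ... cases '3' to '9' follow the same pattern ... */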
        case GETTING_NUMBER | '\n':
            create_structure(&buffer[string_start], string_length, number);
            line++;
            state = START_OF_LINE;
            break;
        default:
            // Invalid character
            printf("Parse error at line %u!\n", line);
            break;
    }

Hopefully the compiler takes the huge switch() that you end up with and optimises it into a fast jump table. Of course constructing something like this by hand is painful and error-prone; and you'd probably be able to find a "parser generator" that does it for you (based on rules).

The next step beyond that is multi-threading. For example, you can have a thread that scans through the file searching for '\n' characters, and when it finds one it hands the line off to a worker thread (where the worker thread can use any of the methods above to parse the line). That way you can have multiple CPUs all parsing in parallel.
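
A minimal sketch of that hand-off, assuming POSIX threads; the bounded queue of line spans and the sentinel convention are inventions for the example:

    #include <pthread.h>
    #include <string.h>

    #define QUEUE_SIZE 1024

    /* A line is a pointer into the shared buffer plus a length. */
    typedef struct { const char *start; size_t len; } line_span;

    static line_span queue[QUEUE_SIZE];
    static size_t q_head, q_tail, q_count;
    static pthread_mutex_t q_lock      = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  q_not_empty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  q_not_full  = PTHREAD_COND_INITIALIZER;

    static void enqueue(line_span s)
    {
        pthread_mutex_lock(&q_lock);
        while (q_count == QUEUE_SIZE)
            pthread_cond_wait(&q_not_full, &q_lock);
        queue[q_tail] = s;
        q_tail = (q_tail + 1) % QUEUE_SIZE;
        q_count++;
        pthread_cond_signal(&q_not_empty);
        pthread_mutex_unlock(&q_lock);
    }

    static line_span dequeue(void)
    {
        pthread_mutex_lock(&q_lock);
        while (q_count == 0)
            pthread_cond_wait(&q_not_empty, &q_lock);
        line_span s = queue[q_head];
        q_head = (q_head + 1) % QUEUE_SIZE;
        q_count--;
        pthread_cond_signal(&q_not_full);
        pthread_mutex_unlock(&q_lock);
        return s;
    }

    /* Worker: parse line spans with any of the methods above. */
    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            line_span s = dequeue();
            if (s.start == NULL)               /* sentinel: no more lines */
                break;
            /* parse s.start .. s.start + s.len here */
        }
        return NULL;
    }

    /* Scanner: split a buffer into lines and hand them to the workers. */
    static void scan_buffer(const char *buf, size_t len)
    {
        const char *p = buf, *end = buf + len, *nl;
        while ((nl = memchr(p, '\n', (size_t)(end - p))) != NULL) {
            line_span s = { p, (size_t)(nl - p) };
            enqueue(s);
            p = nl + 1;
        }
    }

The workers would be started once with pthread_create(), and stopped by enqueueing one NULL-start sentinel per worker when the scanner reaches the end of the input.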

In addition to all of that, you want to be loading data from disk while you're parsing it. For example, while you're processing the first MiB of data you want to be loading the second MiB in parallel; you don't want to load 1 MiB, parse 1 MiB, load the next MiB, parse the next MiB, etc. To do this you need something like the POSIX asynchronous IO functions; or alternatively (on a 64-bit OS that supports pre-fetching) memory mapped files.
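
A sketch of the memory-mapped variant on POSIX (this assumes a 64-bit build, since a 15 GB file won't fit in a 32-bit address space; madvise() hints the kernel to read ahead):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map the whole file read-only; unmap with munmap(addr, *size) when done. */
    static const char *map_file(const char *path, size_t *size)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return NULL;

        struct stat st;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return NULL;
        }

        void *addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                             /* the mapping keeps the file alive */
        if (addr == MAP_FAILED)
            return NULL;

        /* tell the kernel we'll read front to back, so it prefetches */
        madvise(addr, (size_t)st.st_size, MADV_SEQUENTIAL);

        *size = (size_t)st.st_size;
        return (const char *)addr;
    }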
