
Use fscanf to read in a variable number of integers

I have over 100,000 CSV files in the format below:

1,1,5,1,1,1,0,0,6,6,1,1,1,0,1,0,13,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,1,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,2,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,3,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,4,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,5,6,4,1,0,1,0,1,0,4,8,18,20,,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,6,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,7,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,8,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,2,0,0,12,12,1,2,4,1,1,0,13,4,7,8,18,20,21,25,27,29,31,32,,,,,,,,,,,,,,,,

All I need is field 10 and fields 17 onward. Field 10 is a counter indicating how many integers are stored starting from field 17, i.e. what I need is:

6,13,4,7,8,18,20
5,4,7,8,18,20
5,4,7,8,18,20
5,13,4,7,8,20
5,13,4,7,8,20
4,4,8,18,20
5,4,7,8,18,20
5,13,4,7,8,20
5,13,4,7,8,20
12,13,4,7,8,18,20,21,25,27,29,31,32

The maximum number of integers I need to read is 28. I could easily achieve this with getline in C++; however, from my previous experience, since I need to handle over 100,000 such files and each file may have 300,000~400,000 such lines, using getline to read in the data and build a vector of vectors may cause a serious performance issue for me. I tried to use fscanf to achieve this:

while (!feof(stream)) {
    fscanf(stream, "%*d,%*d,%*d,%*d,%*d,%*d,%*d,%*d,%*d,%d,", &MyCounter);
    fscanf(stream, "%*d,%*d,%*d,%*d,%*d,%*d,");  // skip to column 17
    for (int i = 0; i < MyCounter; i++) {
        fscanf(stream, "%d,", &MyIntArr[i]);
    }
    fscanf(stream, "%*s"); // to finish the line
}

However, this calls fscanf multiple times and may also create a performance issue. Is there any way to read in a variable number of integers in one call with fscanf? Or do I need to read into a string and then strsep/stoi it? Compared to fscanf, which is better from a performance point of view?
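
For reference, the "read each line into a string and then split it" approach mentioned above might look roughly like the sketch below (illustrative names only; std::getline and std::stoi are used in place of strsep):

#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Rough sketch only: read each line, split on ',', keep field 10 and the
// variable-length block starting at field 17. Names are illustrative.
std::vector<std::vector<int>> ReadOneFile(const std::string &FileName)
{
    std::vector<std::vector<int>> Rows;
    std::ifstream In(FileName);
    std::string Line, Field;
    while (std::getline(In, Line)) {
        std::istringstream Ss(Line);
        std::vector<std::string> Fields;
        while (std::getline(Ss, Field, ','))
            Fields.push_back(Field);
        if (Fields.size() < 17)
            continue;                                 // malformed line, skipped in this sketch
        int MyCounter = std::stoi(Fields[9]);         // field 10
        std::vector<int> Row;
        Row.push_back(MyCounter);
        for (int i = 0; i < MyCounter && 16 + i < (int)Fields.size(); ++i)
            Row.push_back(std::stoi(Fields[16 + i])); // fields 17 onward
        Rows.push_back(std::move(Row));
    }
    return Rows;
}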

So, there are at most 43 numbers per line. Even at 64 bits, each number is limited to 21 digits, so with a comma after each one a line can occupy at most 43 × 22 = 946 bytes, and 1024 bytes is plenty (so long as there is no extra whitespace).

char line[1024];

while (fgets(line, sizeof(line), stdin) != NULL) {
    //...
}

A helper function to skip to the desired column.

const char *find_nth_comma(const char *s, int n) {
    const char *p = s;
    if (p && n) while (*p) {
        if (*p == ',') {
            if (--n == 0) break;
        }
        ++p;
    }
    return p;
}

So, inside your loop, skip to column 10 to find the first number of interest, and then skip to column 17 to start reading in the rest of the numbers. The completed loop looks like:

while (fgets(line, sizeof(line), stdin) != NULL) {
    const char *p = find_nth_comma(line, 9);
    char *end;
    assert(p && *p);
    MyCounter = strtol(p+1, &end, 10);
    assert(*end == ',');
    p = find_nth_comma(end+1, 6);
    assert(p && *p);
    for (int i = 0; i < MyCounter; ++i, p = end) {
        MyIntArray[i] = strtol(p+1, &end, 10);
        assert((*end == ',') ||
               ((i == MyCounter-1) &&
                (*end == '\0' || isspace(*end & 0xFF))));
    }
}

This approach will work with a mmap solution as well. The fgets would be replaced with a function that points to the next line to be processed in the file. The find_nth_comma would need a modification to detect end of line/end of file rather than rely on a NUL terminated string. strtol would be replaced with a custom function that again detects end of line or end of file. (The purpose of such changes is to remove any code that would require copying the data, which is the motivation for a mmap approach.)
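
For example, a bounded variant of the helper might look like this (a sketch only, assuming the mapped buffer is delimited by an end pointer rather than a NUL terminator):

// Sketch of an mmap-friendly helper: stops at `end` or at the end of the
// current line instead of relying on a NUL-terminated string.
const char *find_nth_comma_bounded(const char *s, const char *end, int n) {
    const char *p = s;
    if (p && n) while (p < end && *p != '\n') {
        if (*p == ',') {
            if (--n == 0) break;
        }
        ++p;
    }
    return p;
}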

With parallel processing, it is possible to parse multiple parts of the file simultaneously. But, it is probably sufficient to have different threads process different files, and then collate the results after all files have been processed.
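
A minimal sketch of that file-per-task scheme, assuming a parse_file callback (not shown here) that parses one CSV file and returns its rows:

#include <future>
#include <iterator>
#include <string>
#include <vector>

using Rows = std::vector<std::vector<int>>;

// Launch one asynchronous task per file and collate the results afterwards.
// parse_file is a hypothetical parser supplied by the caller.
Rows parse_all(const std::vector<std::string> &files,
               Rows (*parse_file)(const std::string &))
{
    std::vector<std::future<Rows>> tasks;
    tasks.reserve(files.size());
    for (const auto &f : files)
        tasks.push_back(std::async(std::launch::async, parse_file, f));

    Rows all;
    for (auto &t : tasks) {
        Rows r = t.get();
        all.insert(all.end(), std::make_move_iterator(r.begin()),
                              std::make_move_iterator(r.end()));
    }
    return all;
}

With 100,000+ files one would cap the number of in-flight tasks (for example with a small thread pool) rather than launching them all at once.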

Eventually I used a memory-mapped file to solve my problem (this solution is a side product of my previous problem about performance issues when reading a big CSV file: read in large CSV file performance issue in C++).

Since I work on MS Windows, I use Stephan Brumme's "Portable Memory Mapping C++ Class" (http://create.stephan-brumme.com/portable-memory-mapping/). Since I don't need to deal with files larger than 2 GB, my implementation is simpler. For files over 2 GB, visit that page to see how to handle them.

Please find my piece of code below:

// may tried RandomAccess/SequentialScan
MemoryMapped MemFile(FilterBase.BaseFileName, MemoryMapped::WholeFile, MemoryMapped::RandomAccess);

// point to start of memory file
char* start = (char*)MemFile.getData();
// dummy in my case
char* tmpBuffer = start;

// looping counter
uint64_t i = 0;

// pre-allocate result vector
MyVector.resize(300000);

// Line counter
int LnCnt = 0;

// no. of fields
const int NumOfField = 43;
// delimiter count: number of fields + 2, since the leading and trailing delimiters are virtual
const int DelimCnt = NumOfField + 2;
// Delimiter positions. May use new to allocate at run time,
// or even use a vector of integers.
// This stores the delimiter position of each field in the current line;
// since the positions are relative to the start of the file, if the file is extremely
// large, you may need to change from int to unsigned, long or even unsigned long long
int DelimPos[DelimCnt];

// Max number of fields that need to be read; usually equal to NumOfField, but can be smaller.
// E.g. in my case I only need 4 fields from the first 15 fields, so I could assign 15 to MaxFieldNeed
int MaxFieldNeed = NumOfField;
// keep track of how many commas have been read in the current line
int DelimCounter = 0;
// define the field and line separators
char FieldDelim = ',';
char LineSep = '\n';

// 1st field, "virtual delimiter" position
DelimPos[DelimCounter] = -1;
DelimCounter++;

// loop through the whole memory field, 1 and only once
for (i = 0; i < MemFile.size();i++)
{
  // grab all position of delimiter in each line
  if ((MemFile[i] == FieldDelim) && (DelimCounter<=MaxFieldNeed)){
    DelimPos[DelimCounter] = i;
    DelimCounter++;
  };

  // grab all values when end of line hit
  if (MemFile[i] == LineSep) {
    // no need to check if (DelimCounter == NumOfField), just assign anyway; wastes a
    // little memory in the integer array but gains performance
    DelimPos[DelimCounter] = i;
    // I know exactly what the format is and what field(s) I want
    // a more general approach (as a CSV reader) may put all fields
    // into a vector of vector of string
    // With *EFFORT* one may modify this piece of code so that it can parse
    // different formats at run time, eg similar to:
    // fscanf(fstream,"%d,%f....
    // also, this piece of code cannot handle complex CSV, e.g.
    // Peter,28,157CM
    // John,26,167CM
    // "Mary,Brown",25,150CM
    MyVector[LnCnt].StrField   = string(start + DelimPos[0] + 1, start + DelimPos[1]);
    MyVector[LnCnt].IntField   = strtol(start + DelimPos[3] + 1, &tmpBuffer, 10);
    MyVector[LnCnt].IntField2  = strtol(start + DelimPos[8] + 1, &tmpBuffer, 10);
    MyVector[LnCnt].FloatField = strtof(start + DelimPos[14] + 1, &tmpBuffer);
    // reset delimiter counter each line
    DelimCounter = 0;
    // previous line separator treated as the first delimiter of the next line
    DelimPos[DelimCounter] = i;
    DelimCounter++;
    LnCnt++;
  }
}
MyVector.resize(LnCnt);
MyVector.shrink_to_fit();
MemFile.close();
};

I can code whatever I want inside:

  if (MemFile[i] == LineSep) {
  }

e.g. handle empty fields, perform calculations, etc. With this piece of code, I handled 2,100 files (6.3 GB) in 57 seconds!!! (I hard-coded the CSV format and only grabbed 4 values in my previous case.) I will later change this code to handle this question's format. Thanks to all who helped me with this issue.
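
For example, the line-separator block above might be adapted to this question's format roughly as follows (a sketch only; it assumes a hypothetical row struct with Counter and Values members, which is not part of the code above, and that MaxFieldNeed is large enough to cover field 17 + Counter):

  // Sketch only: field 10 starts at start + DelimPos[9] + 1, field 17 at
  // start + DelimPos[16] + 1, so field (17 + f) starts at start + DelimPos[16 + f] + 1.
  if (MemFile[i] == LineSep) {
    DelimPos[DelimCounter] = i;
    MyVector[LnCnt].Counter = strtol(start + DelimPos[9] + 1, &tmpBuffer, 10);  // field 10
    MyVector[LnCnt].Values.clear();
    for (int f = 0; f < MyVector[LnCnt].Counter; ++f) {
      MyVector[LnCnt].Values.push_back(
          strtol(start + DelimPos[16 + f] + 1, &tmpBuffer, 10));                // fields 17 onward
    }
    // reset for the next line, exactly as in the original block
    DelimCounter = 0;
    DelimPos[DelimCounter] = i;
    DelimCounter++;
    LnCnt++;
  }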

In order to maximize performance, you should map the files in memory with mmap or equivalent and parse the file with ad hoc code, typically scanning one character at a time with a pointer, checking for '\n' and/or '\r' for end of record and converting the numbers on the fly for storage into your arrays (a sketch follows the list below). The tricky parts are:

  • how do you allocate or otherwise handle the destination arrays.
  • are the fields all numeric? integral?
  • is the last record terminated by a newline? You can easily check this condition after the mmap call. The advantage is that you only need to check for end of file when you encounter a newline sequence.
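
As an illustration only (not code from the answer above), such an ad hoc scanner over a mapped buffer might look like this sketch:

#include <vector>

struct Record {
    int count = 0;               // field 10
    std::vector<int> values;     // fields 17 onward
};

// Sketch of the ad hoc scan described above: walk a mapped buffer [p, end)
// one character at a time, build each integer on the fly, and keep only
// field 10 and the values from field 17 onward.
std::vector<Record> parse_buffer(const char *p, const char *end)
{
    std::vector<Record> out;
    Record rec;
    int field = 1, value = 0;
    bool have_digit = false;

    auto end_field = [&] {
        if (field == 10)
            rec.count = value;
        else if (field >= 17 && have_digit)
            rec.values.push_back(value);
        value = 0;
        have_digit = false;
        ++field;
    };

    for (; p != end; ++p) {
        if (*p >= '0' && *p <= '9') {
            value = value * 10 + (*p - '0');
            have_digit = true;
        } else if (*p == ',') {
            end_field();
        } else if (*p == '\n') {
            end_field();
            out.push_back(std::move(rec));
            rec = Record{};
            field = 1;
        }
        // '\r' and any other characters are simply ignored in this sketch;
        // a final record not terminated by a newline is dropped (see the last bullet).
    }
    return out;
}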

Probably the easiest way to read a run-time determined number of integers is to point into the right part of a longer format string. In other words, we can have a format string with 28 %d, specifiers, but point to the nth one before the end of the string and pass that pointer as the format string for scanf().

As a simple example, consider accepting 3 integers from a maximum of 6:

"%d,%d,%d,%d,%d,%d,"
          ^

The arrow shows the string pointer to use as the pattern argument.
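
To make that concrete, here is a minimal sketch (the input string "7,8,9," is made up for illustration):

#include <stdio.h>

int main(void)
{
    static const char fmt[] = "%d,%d,%d,%d,%d,%d,";  /* each "%d," is 3 characters */
    int v[6] = {0};
    int want = 3;                                    /* read only 3 of the 6 */
    const char *p = fmt + (6 - want) * 3;            /* p points at the 4th "%d," */

    /* extra pointer arguments beyond the conversions actually performed are ignored */
    int got = sscanf("7,8,9,", p, &v[0], &v[1], &v[2], &v[3], &v[4], &v[5]);
    printf("got=%d: %d %d %d\n", got, v[0], v[1], v[2]);  /* prints: got=3: 7 8 9 */
    return 0;
}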


Here's a full worked example; its runtime is about 8 seconds for 1 million iterations (10 million lines) when built with gcc -O3. It's slightly complicated by the mechanics to update the input string pointer, which is obviously not necessary when reading from a file stream. I've skipped the check that nfields <= 28, but that's easily added.

char const *const input =
    "1,1,5,1,1,1,0,0,6,6,1,1,1,0,1,0,13,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,1,0,1,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,1,0,2,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,1,0,3,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,1,0,4,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,1,0,5,6,4,1,0,1,0,1,0,4,8,18,20,,,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,1,0,6,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,1,0,7,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,1,0,8,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,2,0,0,12,12,1,2,4,1,1,0,13,4,7,8,18,20,21,25,27,29,31,32,,,,,,,,,,,,,,,,\n";

#include <stdio.h>

#define SKIP_FIELD "%*[^,],"
#define DECIMAL_FIELD "%d,"

int read()
{
    int n;             /* bytes read - not needed for file or stdin */
    int sum = 0;       /* just to make sure results are used */

    for (char const *s = input;  *s;  ) {
        int nfields;
        int array[28];
        int m = sscanf(s,
                       /* field 0 is missing */
                       SKIP_FIELD SKIP_FIELD SKIP_FIELD
                       SKIP_FIELD SKIP_FIELD SKIP_FIELD
                       SKIP_FIELD SKIP_FIELD SKIP_FIELD
                       DECIMAL_FIELD /* field 10 */
                       SKIP_FIELD SKIP_FIELD SKIP_FIELD
                       SKIP_FIELD SKIP_FIELD SKIP_FIELD
                       "%n",
                       &nfields,
                       &n);
        if (m != 1) {
            return -1;
        }

        s += n;

        static const char fieldchars[] = DECIMAL_FIELD;
        static const size_t fieldsize = sizeof fieldchars - 1; /* ignore terminating null */

        static const char *const parse_entries =
            DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
            DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
            DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
            DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
            DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
            DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
            DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
            "[^\n] ";
        const char *const line_parse = parse_entries + (28-nfields) * fieldsize;

        /* now read nfields (max 28) */
        m = sscanf(s,
                   line_parse,
                   &array[0], &array[1], &array[2], &array[3],
                   &array[4], &array[5], &array[6], &array[7],
                   &array[8], &array[9], &array[10], &array[11],
                   &array[12], &array[13], &array[14], &array[15],
                   &array[16], &array[17], &array[18], &array[19],
                   &array[20], &array[21], &array[22], &array[23],
                   &array[24], &array[25], &array[26], &array[27]);
        if (m != nfields) {
            return -1;
        }

        /* advance stream position */
        sscanf(s, "%*[^\n] %n", &n);  s += n;

        /* use the results */
        for (int i = 0;  i < nfields;  ++i) {
            sum += array[i];
        }
    }
    return sum;
}

#undef SKIP_FIELD
#undef DECIMAL_FIELD

int main()
{
    int sum = 0;
    for (int i = 0;  i < 1000000;  ++i) {
        sum += read() * (i&1 ? 1 : - 1); /* alternate add and subtract */
    }
    return sum != 0;
}
