
reading data from large file into struct in C

I am a beginner to C programming. I need to efficiently read millions of lines from a file into an array of structs. Below is an example of the input file.

2,33.1609992980957,26.59000015258789,8.003999710083008
5,15.85200023651123,13.036999702453613,31.801000595092773
8,10.907999992370605,32.000999450683594,1.8459999561309814
11,28.3700008392334,31.650999069213867,13.107999801635742

My current code is shown below. It prints "Error file", suggesting the file pointer is NULL, but the file exists and has data.

#include<stdio.h>
#include<stdlib.h>

struct O_DATA
{
    int index;
    float x;
    float y;
    float z;
};

int main ()
{
    FILE *infile ;
    struct O_DATA input;
    infile = fopen("input.dat", "r");
    if (infile == NULL);
    {
            fprintf(stderr,"\nError file\n");
            exit(1);
    }
    while(fread(&input, sizeof(struct O_DATA), 1, infile))
            printf("Index = %d X= %f Y=%f Z=%f", input.index , input.x ,   input.y , input.z);
    fclose(infile);
    return 0;
}

I need to efficiently read and store data from the input file so I can process it further. Any help would be really appreciated. Thanks in advance.

if (infile == NULL);
{ /* floating block */ }

The above if is a complete statement that does nothing, regardless of the value of infile . The "floating" block is then executed no matter what infile contains.
Remove the semicolon to attach the "floating" block to the if :

if (infile == NULL)
{ /* if block */ }

You've got an incorrect ; after your if (infile == NULL) test - try removing that...

[Edit: 2nd by 9 secs! :-)]

First, figure out how to convert one line of text to data:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct my_data
{
  unsigned int index;
  float x;
  float y;
  float z;
};



struct my_data *
deserialize_data(struct my_data *data, const char *input, const char *separators)
{
  (void)separators;   /* not used in this sscanf version */

  /* sscanf returns the number of fields successfully converted: expect 4 */
  if(sscanf(input, "%u,%f,%f,%f", &data->index, &data->x, &data->y, &data->z) != 4)
    return NULL;
  return data;
}

A more robust version using strtok (this one actually uses the separators argument):

struct my_data *
deserialize_data(struct my_data *data, const char *input, const char *separators)
{
  char *p;
  struct my_data tmp;
  char *str = strdup(input);        /* make a copy of the input line because we modify it */

  if (!str) {                       /* I couldn't make a copy so I'll die */
    return NULL;
  }

  p = strtok(str, separators);      /* use line for first call to strtok */
  if (!p) goto err;
  tmp.index = strtoul(p, NULL, 0);  /* convert text to integer */

  p = strtok(NULL, separators);     /* strtok remembers line */
  if (!p) goto err;
  tmp.x = atof(p);

  p = strtok(NULL, separators);
  if (!p) goto err;
  tmp.y = atof(p);

  p = strtok(NULL, separators);
  if (!p) goto err;
  tmp.z = atof(p);

  memcpy(data, &tmp, sizeof(tmp));  /* copy values out */
  goto out;

err:
  data = NULL;
out:
  free(str);
  return data;
}

int main() {
    struct my_data somedata;
    if (deserialize_data(&somedata, "1,2.5,3.12,7.955", ",") == NULL)
        return 1;
    printf("index: %u, x: %.2f, y: %.2f, z: %.2f\n", somedata.index, somedata.x, somedata.y, somedata.z);
    return 0;
}

Combine it with reading lines from a file:

just the main function here (insert deserialize_data from the previous example). Note that getline is POSIX, not standard C; on some platforms you may need to define _POSIX_C_SOURCE 200809L before the includes.

   int
   main(int argc, char *argv[])
   {
       FILE *stream;
       char *line = NULL;
       size_t len = 0;
       ssize_t nread;
       struct my_data somedata;

       if (argc != 2) {
           fprintf(stderr, "Usage: %s <file>\n", argv[0]);
           exit(EXIT_FAILURE);
       }

       stream = fopen(argv[1], "r");
       if (stream == NULL) {
           perror("fopen");
           exit(EXIT_FAILURE);
       }

       while ((nread = getline(&line, &len, stream)) != -1) {
           if (deserialize_data(&somedata, line, ",") == NULL)
               continue;   /* skip lines that fail to parse */
           printf("index: %u, x: %.2f, y: %.2f, z: %.2f\n", somedata.index, somedata.x, somedata.y, somedata.z);
       }

       free(line);
       fclose(stream);
       exit(EXIT_SUCCESS);
   }

You already have solid responses in regard to syntax/structs/etc, but I will offer another method for reading the data in the file itself: I like Martin York's CSVIterator solution. This is my go-to approach for CSV processing because it requires less code to implement and has the added benefit of being easily modifiable (i.e., you can edit the CSVRow and CSVIterator definitions depending on your needs).

Here's a mostly complete example using Martin's unedited code without structs or classes. In my opinion, and especially so as a beginner, it is easier to start developing your code with simpler techniques. As your code begins to take shape, it is much clearer why and where you need to implement more abstract/advanced devices.

Note this would technically need to be compiled with C++11 or greater because of my use of std::stod (and maybe some other stuff too I am forgetting), so take that into consideration:

//your includes
#include <cstdio>
#include <string>
#include <vector>
#include <fstream>
#include"wherever_CSVIterator_is.h"

int main (int argc, char* argv[]) 
{
  int index;
  double tmp[3]; //since we know the shape of your input data
  std::vector<double*> saved;
  std::vector<int> indices;

  std::ifstream file(argv[1]);
  for (CSVIterator loop(file); loop != CSVIterator(); ++loop) { //loop over rows
    index = std::stoi((*loop)[0]);  //CSVRow gives us strings, so convert
    indices.push_back(index);       //store int index first, always col 0
    for (int k=1; k < (int)(*loop).size(); k++) {               //loop across columns
       tmp[k-1] = std::stod((*loop)[k]); //save double values now
    }
    saved.push_back(new double[3]{tmp[0], tmp[1], tmp[2]}); //copy the triplet; pushing tmp itself would make every row point at the same buffer
  }

 /*now we have two vectors of the same 'size'
  (let's pretend I wrote a check here to confirm this is true), 
  so we loop through them together and access with something like:*/

  for (int j=0; j < (int)indices.size(); j++) {
    double* saved_ptr = saved.at(j); //get pointer to first elem of each triplet
    printf("\nindex: %d |", indices.at(j));
    for (int k=0; k < 3; k++) {
      printf(" %4.3f ", saved_ptr[k]);
    }
    printf("\n");
  }
}

Less fuss to write, but more dangerous (the raw pointers stored in saved have to be managed by hand). Also some unnecessary copying is present, but we benefit from using std::vector containers in lieu of knowing exactly how much memory we need to allocate.

Don't just give an example of the input file. Specify your input file format -at least on paper or in comments- e.g. in EBNF notation (since your example is textual, it is not a binary file). Decide if the numbers have to be on different lines (or if you would accept a file with a single huge line of a million bytes; read about the Comma Separated Values format). Then code some parser for that format. In your case, it is likely that some very simple recursive descent parsing is enough (and your particular parser won't even use recursion).

Read more about <stdio.h> and its routines. Take time to carefully read that documentation. Since your input is textual, not binary, you don't need fread. Notice that input routines can fail, and you should handle the failure case.

Of course, fopen can fail (e.g. because your working directory is not what you believe it is). You'd better use perror or errno to find out more about the failure cause. So at least code:

infile = fopen("input.dat", "r");
if (infile == NULL) {
  perror("fopen input.dat");
  exit(EXIT_FAILURE);
}

Notice that semicolons (or their absence) are very important in C (no semicolon after the condition of an if). Read again the basic syntax of the C language. Read about How to debug small programs. Enable all warnings and debug info when compiling (with GCC, compile with gcc -Wall -g at least). The compiler warnings are very useful!

Remember that fscanf doesn't handle the end of line (newline) differently from a space character. So if the input has to have separate lines, you need to read every line separately.

You'll probably read every line using fgets (or getline) and parse each line individually. You could do that parsing with the help of sscanf (perhaps %n could be useful) - and you should check the return count of sscanf. You could also use strtok and/or strtod for such parsing.

Make sure that your parsing and your entire program is correct . With current computers (they are very fast, and most of the time your input file sits in the page cache ) it is very likely that it would be fast enough. A million lines can be read pretty quickly (if on Linux, you could compare your parsing time with the time used by wc to count the lines of your file). On my computer (a powerful Linux desktop with AMD2970WX processor -it has lots of cores, but your program uses only one-, 64Gbytes of RAM, and SSD disk) a million lines can be read (by wc ) in less than 30 milliseconds, so I am guessing your entire program should run in less than half a second, if given a million lines of input, and if the further processing is simple (in linear time).

You are likely to fill a large array of struct O_DATA, and that array should probably be dynamically allocated, and reallocated when needed. Read more about C dynamic memory allocation. Read carefully about the C memory management routines. They can fail, and you need to handle that failure (even if it is very unlikely to happen). You certainly don't want to reallocate that array on every loop iteration. You could instead grow it in a geometric progression (e.g. if the size of the array is size, call realloc or a new malloc for some int newsize = 4*size/3 + 10; only when the old size is too small). Of course, your array will generally be a bit larger than what is really needed, but memory is quite cheap and you are allowed to "lose" some of it.

But StackOverflow is not a "do my homework" site. I gave some advice above, but you should do your homework.
