Reading data from a large file into a struct in C

I am a beginner to C programming. I need to efficiently read millions of lines from a file into a struct. Below is an example of the input file.

2,33.1609992980957,26.59000015258789,8.003999710083008
5,15.85200023651123,13.036999702453613,31.801000595092773
8,10.907999992370605,32.000999450683594,1.8459999561309814
11,28.3700008392334,31.650999069213867,13.107999801635742

My current code is shown below. It fails with its "Error file" message, which suggests that infile is NULL, yet the file has data.

#include<stdio.h>
#include<stdlib.h>

struct O_DATA
{
    int index;
    float x;
    float y;
    float z;
};

int main ()
{
    FILE *infile ;
    struct O_DATA input;
    infile = fopen("input.dat", "r");
    if (infile == NULL);
    {
            fprintf(stderr,"\nError file\n");
            exit(1);
    }
    while(fread(&input, sizeof(struct O_DATA), 1, infile))
            printf("Index = %d X= %f Y=%f Z=%f", input.index , input.x ,   input.y , input.z);
    fclose(infile);
    return 0;
}

I need to efficiently read and store data from an input file to process it further. Any help would be really appreciated. Thanks in advance.

if (infile == NULL);
{ /* floating block */ }

The above if is a complete statement that does nothing, regardless of the value of infile. The "floating" block is then executed no matter what infile contains.
Remove the semicolon to attach the "floating" block to the if:

if (infile == NULL)
{ /* if block */ }

You've got an incorrect ; after your if (infile == NULL) test. Try removing it.

[Edit: 2nd by 9 secs! :-)]

First, figure out how to convert one line of text to data:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct my_data
{
  unsigned int index;
  float x;
  float y;
  float z;
};



/* Parse one "index,x,y,z" line with sscanf.
   Returns data on success, NULL on failure.
   The separators argument is unused here: the commas are part of the format. */
struct my_data *
deserialize_data(struct my_data *data, const char *input, const char *separators)
{
  (void)separators;

  /* sscanf returns the number of converted fields; all four are required */
  if(sscanf(input, "%u,%f,%f,%f", &data->index, &data->x, &data->y, &data->z) != 4)
    return NULL;
  return data;
}

/* Alternative version based on strtok (strdup is POSIX).
   It is renamed here only so that both versions can live in one file. */
struct my_data *
deserialize_data_strtok(struct my_data *data, const char *input, const char *separators)
{
  char *p;
  struct my_data tmp;
  char *str = strdup(input);      /* make a copy of the input line because we modify it */

  if (!str) {                     /* I couldn't make a copy so I'll die */
    return NULL;
  }

  p = strtok (str, separators);   /* use line for first call to strtok */
  if (!p) goto err;
  tmp.index = strtoul (p, NULL, 0);   /* convert text to integer */

  p = strtok (NULL, separators);  /* strtok remembers line */
  if (!p) goto err;
  tmp.x = atof(p);

  p = strtok (NULL, separators);
  if (!p) goto err;
  tmp.y = atof(p);

  p = strtok (NULL, separators);
  if (!p) goto err;
  tmp.z = atof(p);

  memcpy(data, &tmp, sizeof(tmp));    /* copy values out */
  goto out;

err:
  data = NULL;
out:
  free (str);
  return data;
}
 
 
 

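int main() {
    struct my_data somedata;
    /* deserialize_data returns NULL when the line cannot be parsed */
    if (deserialize_data(&somedata, "1,2.5,3.12,7.955", ",") == NULL) {
        fprintf(stderr, "could not parse input\n");
        return EXIT_FAILURE;
    }
    printf("index: %u, x: %2f, y: %2f, z: %2f\n", somedata.index, somedata.x, somedata.y, somedata.z);
    return EXIT_SUCCESS;
}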

Combine it with reading lines from a file.

Just the main function is shown here (insert the rest from the previous example). Note that getline is a POSIX function, not part of standard C:

   int
   main(int argc, char *argv[])
   {
       FILE *stream;
       char *line = NULL;
       size_t len = 0;
       ssize_t nread;
       struct my_data somedata;

       if (argc != 2) {
           fprintf(stderr, "Usage: %s <file>\n", argv[0]);
           exit(EXIT_FAILURE);
       }

       stream = fopen(argv[1], "r");
       if (stream == NULL) {
           perror("fopen");
           exit(EXIT_FAILURE);
       }

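       while ((nread = getline(&line, &len, stream)) != -1) {
           /* skip lines that do not parse instead of printing stale data */
           if (deserialize_data(&somedata, line, ",") == NULL) {
               fprintf(stderr, "skipping malformed line: %s", line);
               continue;
           }
           printf("index: %u, x: %2f, y: %2f, z: %2f\n", somedata.index, somedata.x, somedata.y, somedata.z);
       }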

       free(line);
       fclose(stream);
       exit(EXIT_SUCCESS);
   }

You already have solid responses regarding syntax, structs, and so on, but I will offer another method for reading the data in the file itself: I like Martin York's CSVIterator solution. This is my go-to approach for CSV processing because it requires less code to implement and has the added benefit of being easily modifiable (i.e., you can edit the CSVRow and CSVIterator definitions depending on your needs).

Here's a mostly complete example using Martin's unedited code without structs or classes. In my opinion, and especially for a beginner, it is easier to start developing your code with simpler techniques. As your code begins to take shape, it becomes much clearer why and where you need to implement more abstract or advanced devices.

Note that this is C++, not C, and it technically needs to be compiled as C++11 or later because of my use of std::stod (and maybe some other things I am forgetting), so take that into consideration:

//your includes
//...
#include <array>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>
#include "wherever_CSVIterator_is.h"

int main (int argc, char* argv[])
{
  if (argc != 2) {
    std::fprintf(stderr, "Usage: %s <file>\n", argv[0]);
    return 1;
  }

  std::vector<std::array<double, 3>> saved; //one triplet per row, stored by value
  std::vector<int> indices;

  std::ifstream file(argv[1]);
  for (CSVIterator loop(file); loop != CSVIterator(); ++loop) { //loop over rows
    std::array<double, 3> tmp{}; //since we know the shape of your input data
    indices.push_back(std::stoi(std::string((*loop)[0]))); //store int index first, always col 0
    for (std::size_t k=1; k < (*loop).size() && k <= 3; k++) { //loop across columns
       tmp[k-1] = std::stod(std::string((*loop)[k])); //save double values now
    }
    saved.push_back(tmp); //copies the triplet, so later rows cannot overwrite it
  }

 /*now we have two vectors of the same 'size'
  (let's pretend I wrote a check here to confirm this is true),
  so we loop through them together and access with something like:*/

  for (std::size_t j=0; j < indices.size(); j++) {
    const std::array<double, 3>& row = saved.at(j); //triplet for this row
    std::printf("\nindex: %d |", indices.at(j));
    for (int k=0; k < 3; k++) {
      std::printf(" %4.3f ", row[k]);
    }
    std::printf("\n");
  }
}

Less fuss to write, but watch object lifetimes: storing raw pointers to a local buffer such as tmp in the vector is dangerous, because every entry then refers to the same buffer, and the pointers dangle once it goes out of scope. Some unnecessary copying is also present, but we benefit from using std::vector containers instead of having to know exactly how much memory to allocate.

Don't just give an example of the input file. Specify your input file format, at least on paper or in comments, e.g. in EBNF notation (your example is textual; it is not a binary file). Decide whether the numbers have to be on different lines (or whether you would accept a file consisting of a single huge line of a million bytes; read about the Comma Separated Values format). Then code a parser for that format. In your case, some very simple recursive descent parsing is likely to be enough (and your particular parser won't even use recursion).

Read more about <stdio.h> and its routines. Take time to read that documentation carefully. Since your input is textual, not binary, you don't need fread. Notice that input routines can fail, and you should handle the failure case.

Of course, fopen can fail (e.g. because your working directory is not what you believe it is). You had better use perror or errno to find out more about the cause of the failure. So at least code:

infile = fopen("input.dat", "r");
if (infile == NULL) {
  perror("fopen input.dat");
  exit(EXIT_FAILURE);
}

Notice that semicolons (or their absence) are very important in C (no semicolon after the condition of an if). Read the basic syntax of the C language again. Read about how to debug small programs. Enable all warnings and debug info when compiling (with GCC, compile with gcc -Wall -g at least). The compiler warnings are very useful!

Remember that fscanf doesn't treat the end of line (newline) differently from a space character. So if the input has to be on separate lines, you need to read every line separately.

You'll probably read every line using fgets (or getline) and parse every line individually. You could do that parsing with the help of sscanf (perhaps %n could be useful), and you want to use the return count of sscanf. You could also use strtok and/or strtod to do such parsing.

Make sure that your parsing and your entire program are correct. With current computers (they are very fast, and most of the time your input file sits in the page cache), it is very likely to be fast enough. A million lines can be read pretty quickly (on Linux, you could compare your parsing time with the time used by wc to count the lines of your file). On my computer (a powerful Linux desktop with an AMD 2970WX processor, 64 GB of RAM, and an SSD; it has lots of cores, but your program uses only one), a million lines can be read (by wc) in less than 30 milliseconds, so I am guessing your entire program should run in less than half a second when given a million lines of input, provided the further processing is simple (linear time).

You are likely to fill a large array of struct O_DATA, and that array should probably be dynamically allocated and reallocated when needed. Read more about C dynamic memory allocation. Read carefully about the C memory management routines. They could fail, and you need to handle that failure (even if it is very unlikely to happen). You certainly don't want to reallocate that array on every loop iteration. You could probably grow it in some geometric progression (e.g. if the size of that array is size, you'll call realloc or a new malloc for some int newsize = 4*size/3 + 10; only when the old size is too small). Of course, your array will generally be a bit larger than what is really needed, but memory is quite cheap and you are allowed to "lose" some of it.

But StackOverflow is not a "do my homework" site. I gave some advice above, but you should do your homework.

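Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost them, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.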

 