How to read a 10 GB txt file of tab-separated doubles line by line in C

Question

I have a txt file consisting of tab-separated data of type double. The data file is over 10 GB, so I just want to read the data line by line and then do some processing. In particular, the data is laid out as a matrix with, say, 1001 columns and millions of rows. Below is just a fake sample to show the layout.

10.2  30.4  42.9 ... 3232.000 23232.45
...
...
7.234  824.23232 ... 4009.23  230.01
...

For each line I'd like to store the first 1000 values in an array, and the last value in a separate variable. I am new to C, so it would be nice if you could kindly point out the major steps.

Update:

Thanks for all the valuable suggestions and solutions. I just worked out a simple example in which I read a 3-by-4 matrix row by row from a txt file. For each row, the first 3 elements are stored in x, and the last element is stored in the vector y. So x is an n-by-p matrix with n=p=3, and y is a 1-by-3 vector.

Below is my data file and my code.

Data file:

1.112272    -0.345324   0.608056    0.641006
-0.358203   0.300349    -1.113812   -0.321359
0.155588    2.081781    0.038588    -0.562489

My code:

#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define n 3
#define p 3

int main(void) {

    FILE *fpt;
    fpt = fopen("./data_temp.txt", "r");
    if (fpt == NULL) { // stop early if the file cannot be opened
        perror("fopen");
        return 1;
    }

    char line[n*(p+1)*sizeof(double)];
    char *token;
    double *x;
    x = malloc(n*p*sizeof(double));
    double y[n];

    int index = 0;
    int xind = 0;
    int yind = 0;

    while(fgets(line, sizeof(line), fpt)) {
        //printf("%d\n", sizeof(line));
        //printf("%s\n", line);

        token = strtok(line, "\t");
        while(token != NULL) {
            printf("%s\n", token);

            if((index+1) % (p+1) == 0) { // the last element in each line;
                yind = (index + 1) / (p+1) - 1; // get index for y vector;
                sscanf(token, "%lf", &(y[yind]));
            } else {
                sscanf(token, "%lf", &(x[xind]));
                xind++;
            }
            //sscanf(token, "%lf", &(x[index]));
            index++;
            token = strtok(NULL, "\t");
        } 
    }

    int i = 0;
    int j = 0;
    puts("Print x matrix:");
    for(i = 0; i < n*p; i++) {
        printf("%f\n", x[i]);
    }
    printf("\n");

    puts("Print y vector:");
    for(j = 0; j < n; j++) {
        printf("%f\t", y[j]);
    }
    printf("\n");
    free(x);
    fclose(fpt);
}

With the above, hopefully things will work if I replace data_temp.txt with my raw 10 GB data file (changing the values of n and p, and some other code wherever necessary, of course).

I have some additional questions that I hope you can help me with.

I first declared char line[] as char line[(p+1)*sizeof(double)] (note: not multiplied by n). But the line could not be read completely. How can I allocate memory for JUST one single line? What's the length? I assumed it's (p+1)*sizeof(double), since there are (p+1) doubles in each line. Should I also allocate memory for \t and \n? If so, how?
Does the code look reasonable to you? How could I make it more efficient, since this code will be executed over millions of rows?
If I don't know the number of columns or rows in the raw 10 GB file, how could I quickly count the rows and columns?

Again, I am new to C, so any comments are much appreciated. Thanks a lot!

Answer 1

1st way

Read the file in chunks into a preallocated buffer using fread.
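As a rough illustration of chunked reading, the sketch below uses fread with a fixed 1 MiB buffer to count the rows of the file and the tab-separated columns of its first row, which also touches on the "how do I count rows and columns" part of the question. The function name count_rows_cols, the CHUNK_SIZE value, and the assumption that every line is non-empty are mine, not the answerer's.

```c
#include <stdio.h>
#include <stdlib.h>

#define CHUNK_SIZE (1 << 20) /* 1 MiB per fread call; an arbitrary choice */

/* Hypothetical helper: counts lines (including a final line without a
 * trailing newline) and the tab-separated columns of the first line.
 * Returns 0 on success, -1 on open, allocation, or read failure. */
int count_rows_cols(const char *path, long long *rows, long long *cols) {
    FILE *fp = fopen(path, "rb");
    if (fp == NULL) return -1;

    char *buf = malloc(CHUNK_SIZE);
    if (buf == NULL) { fclose(fp); return -1; }

    long long nrows = 0, ncols = 0;
    int in_first_row = 1;
    char last = '\n'; /* last byte seen; '\n' means "at the start of a line" */
    size_t got;

    /* Each fread refills the same preallocated buffer. */
    while ((got = fread(buf, 1, CHUNK_SIZE, fp)) > 0) {
        for (size_t i = 0; i < got; i++) {
            char c = buf[i];
            if (in_first_row) {
                if (c == '\t') ncols++;
                else if (c == '\n') { ncols++; in_first_row = 0; }
            }
            if (c == '\n') nrows++;
            last = c;
        }
    }
    if (last != '\n') { /* final line has no trailing newline */
        nrows++;
        if (in_first_row) ncols++;
    }

    int ok = !ferror(fp);
    free(buf);
    fclose(fp);
    if (!ok) return -1;
    *rows = nrows;
    *cols = ncols;
    return 0;
}
```

Calling count_rows_cols("data_temp.txt", &rows, &cols) on the 3-by-4 sample file from the question should report 3 rows and 4 columns, provided the fields really are separated by single tabs.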

2nd way

Map the file into your process's memory space using mmap, then move a pointer through the file.
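A minimal sketch of the mmap approach, assuming a POSIX system and a 64-bit process (a 10 GB mapping will not fit in a 32-bit address space). It walks the mapping line by line with memchr and, purely as an example of per-line processing, adds up the last field of every line. The function name sum_last_column and the 64-byte field limit are assumptions made for illustration.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: maps the file read-only and sums the last
 * tab-separated field of each non-empty line.
 * Returns 0 on success, -1 on error. */
int sum_last_column(const char *path, double *sum) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    *sum = 0.0;
    if (st.st_size == 0) { close(fd); return 0; } /* mmap rejects length 0 */

    size_t size = (size_t)st.st_size;
    char *base = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (base == MAP_FAILED) return -1;

    const char *p = base;
    const char *end = base + size;
    while (p < end) {
        const char *nl = memchr(p, '\n', (size_t)(end - p));
        const char *line_end = nl ? nl : end;

        /* Scan back from the end of the line to the previous tab. */
        const char *field = line_end;
        while (field > p && field[-1] != '\t') field--;

        /* The mapping is not NUL-terminated, so copy the field before
         * handing it to strtod. Fields of 64 bytes or more are skipped. */
        char tmp[64];
        size_t len = (size_t)(line_end - field);
        if (len > 0 && len < sizeof tmp) {
            memcpy(tmp, field, len);
            tmp[len] = '\0';
            *sum += strtod(tmp, NULL);
        }
        p = nl ? nl + 1 : end;
    }

    munmap(base, size);
    return 0;
}
```

The copy into tmp is the main design point: strtod expects a NUL-terminated string, and a mapped file whose size is an exact multiple of the page size has no guaranteed terminator after its last byte.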

Answer 2

3rd way

Since your file is delimited by lines, open the file with fopen, use setvbuf or similar to set a buffer size greater than about 10 lines or so, then read the file line by line using fgets.
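The fopen, setvbuf, and fgets steps might look roughly like the sketch below, which also stores the first 1000 values of each row in an array and the last value in a separate variable, as the question asks. Note that the line buffer holds characters, not doubles, so its size depends on how the numbers are printed rather than on sizeof(double). The 32 bytes per field in LINE_CAP, the 1 MiB IO_BUF_SIZE, and the names parse_row and read_rows are all assumptions for illustration; adjust them to the real data.

```c
#include <stdio.h>
#include <stdlib.h>

#define NUM_X 1000                  /* values per row stored in the array */
#define LINE_CAP ((NUM_X + 1) * 32) /* assumed: at most 32 bytes per field */
#define IO_BUF_SIZE (1 << 20)       /* assumed stream buffer size, 1 MiB */

/* Hypothetical helper: parses one text line into NUM_X values plus a final
 * value. Returns 0 on success, -1 on too few or too many fields. */
int parse_row(const char *line, double x[NUM_X], double *y) {
    const char *p = line;
    char *next;
    for (int i = 0; i <= NUM_X; i++) {
        double v = strtod(p, &next); /* skips leading whitespace such as tabs */
        if (next == p) return -1;    /* no number found */
        if (i < NUM_X) x[i] = v;
        else *y = v;
        p = next;
    }
    /* Only whitespace may remain after the last value. */
    while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n') p++;
    return *p == '\0' ? 0 : -1;
}

/* Hypothetical driver: reads the file row by row.
 * Returns the number of rows parsed, or -1 on any error. */
long long read_rows(const char *path) {
    FILE *fp = fopen(path, "r");
    if (fp == NULL) return -1;

    char *iobuf = malloc(IO_BUF_SIZE);
    char *line = malloc(LINE_CAP);
    double *x = malloc(NUM_X * sizeof *x);
    long long rows = -1;

    /* setvbuf must be called before any other operation on the stream. */
    if (iobuf && line && x && setvbuf(fp, iobuf, _IOFBF, IO_BUF_SIZE) == 0) {
        double y;
        rows = 0;
        while (fgets(line, LINE_CAP, fp) != NULL) {
            if (parse_row(line, x, &y) != 0) { rows = -1; break; }
            /* ... process x[0..NUM_X-1] and y here ... */
            rows++;
        }
    }

    fclose(fp); /* close before freeing the buffer the stream still uses */
    free(iobuf);
    free(line);
    free(x);
    return rows;
}
```

A line longer than LINE_CAP - 1 bytes is split by fgets across two calls, so in this sketch it shows up as a parse error rather than being silently truncated.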

To potentially read the file even faster, use open with O_DIRECT (assuming Linux), then use fdopen to get a FILE * for the open file, then use setvbuf to set a page-aligned buffer. Doing that will allow you to bypass the kernel page cache, if your system's implementation works successfully using direct IO that way. (There can be many restrictions on direct IO.)

Answer 3

Something to get you started: reading 1 line

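#include <stdio.h>

#define COLUMN (1000+1)

// Reads one line of COLUMN tab-separated doubles from in_stream into data.
// Returns COLUMN on success, 0 at end of file, -1 on malformed input.
int read_line(FILE *in_stream, double data[COLUMN]) {
  for (int i = 0; i < COLUMN; i++) {
    char delim = '\n'; // default covers a last line with no trailing newline
    int cnt = fscanf(in_stream, "%lf%c", &data[i], &delim);
    if (cnt < 1) {
      if (cnt == EOF && i == 0) return 0; // None read, OK as end of file
      puts("Missing or bad data");
      return -1; // problem
    }
    if (delim != '\t') {
      // If tab not found, should be at end of line
      if (delim == '\n' && i == COLUMN-1) {
        return COLUMN;  // Success
      }
      puts("Bad delimiter");
      return -1;
    }
  }
  // A tab followed the last expected value, so the line has too many fields
  puts("Extra data");
  return -1;
}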

3 answers

Answer 1
1 2015-09-10 21:37:22

Answer 2
0 2015-09-10 22:43:36

Answer 3
0 2015-09-11 02:35:29
