How to read a 10 GB txt file consisting of tab-separated double data line by line in C
I have a txt file consisting of tab-separated data of type double. The data file is over 10 GB, so I want to read the data line by line and then do some processing. Specifically, the data is laid out as a matrix with, say, 1001 columns and millions of rows. Below is just a fake sample to show the layout.
10.2 30.4 42.9 ... 3232.000 23232.45
...
...
7.234 824.23232 ... 4009.23 230.01
...
For each line I'd like to store the first 1000 values in an array and the last value in a separate variable. I am new to C, so it would be nice if you could point out the major steps.
Update:
Thanks for all the valuable suggestions and solutions. I just worked out a simple example in which I read a 3-by-4 matrix row by row from a txt file. For each row, the first 3 elements are stored in x, and the last element is stored in the vector y. So x is an n-by-p matrix with n=p=3, and y is a 1-by-3 vector.
Below is my data file and my code.
Data file:
1.112272 -0.345324 0.608056 0.641006
-0.358203 0.300349 -1.113812 -0.321359
0.155588 2.081781 0.038588 -0.562489
My code:
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define n 3
#define p 3

int main(void) {
    FILE *fpt;
    fpt = fopen("./data_temp.txt", "r");
    if (fpt == NULL) { // stop early if the file cannot be opened
        perror("fopen");
        return 1;
    }
    char line[n*(p+1)*sizeof(double)];
    char *token;
    double *x;
    x = malloc(n*p*sizeof(double));
    if (x == NULL) { // stop early if the allocation failed
        fclose(fpt);
        return 1;
    }
    double y[n];
    int index = 0;
    int xind = 0;
    int yind = 0;
    while(fgets(line, sizeof(line), fpt)) {
        //printf("%zu\n", sizeof(line));
        //printf("%s\n", line);
        token = strtok(line, "\t");
        while(token != NULL) {
            printf("%s\n", token);
            if((index+1) % (p+1) == 0) { // the last element in each line;
                yind = (index + 1) / (p+1) - 1; // get index for y vector;
                sscanf(token, "%lf", &(y[yind]));
            } else {
                sscanf(token, "%lf", &(x[xind]));
                xind++;
            }
            //sscanf(token, "%lf", &(x[index]));
            index++;
            token = strtok(NULL, "\t");
        }
    }
    int i = 0;
    int j = 0;
    puts("Print x matrix:");
    for(i = 0; i < n*p; i++) {
        printf("%f\n", x[i]);
    }
    printf("\n");
    puts("Print y vector:");
    for(j = 0; j < n; j++) {
        printf("%f\t", y[j]);
    }
    printf("\n");
    free(x);
    fclose(fpt);
    return 0;
}
With the above, hopefully things will work if I replace data_temp.txt with my raw 10 GB data file (changing the values of n and p, and some other code wherever necessary, of course).
I have some additional questions that I hope you can help me with. I initialized char line[] as char line[(p+1)*sizeof(double)] (note: not multiplied by n). But then the line cannot be read completely. I assigned (p+1)*sizeof(double) bytes since there are (p+1) doubles in each line. Should I also assign memory for \t and \n? If so, how? Again, I am new to C, so any comments are much appreciated. Thanks a lot!
1st way
Read the file in chunks into a preallocated buffer using fread.
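A minimal sketch of the chunked approach, using only standard C. The names read_in_chunks, line_fn and sum_line are hypothetical and not from the answer. The one subtle point is that a chunk usually ends in the middle of a line, so the incomplete tail is moved to the front of the buffer before the next fread. The sketch assumes every line fits inside one chunk and reports an error otherwise.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical callback type: receives one complete, NUL-terminated line. */
typedef void (*line_fn)(char *line, void *ctx);

/* Example callback: adds every number found on the line to *(double *)ctx. */
void sum_line(char *line, void *ctx)
{
    double *total = ctx;
    char *p = line, *next;
    for (;;) {
        double v = strtod(p, &next);   /* strtod skips leading tabs/spaces */
        if (next == p) break;          /* no more numbers on this line */
        *total += v;
        p = next;
    }
}

/* Read `path` in chunks of `chunk` bytes and pass each complete line to `fn`.
   Returns the number of lines, or -1 on error (open/allocation/read failure,
   or a line that does not fit into one chunk). */
long read_in_chunks(const char *path, size_t chunk, line_fn fn, void *ctx)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL) return -1;

    char *buf = malloc(chunk + 1);     /* +1 for a terminating NUL */
    if (buf == NULL) { fclose(fp); return -1; }

    size_t have = 0;                   /* bytes carried over from the last chunk */
    size_t got;
    long lines = 0;
    while ((got = fread(buf + have, 1, chunk - have, fp)) > 0) {
        have += got;
        char *start = buf, *end = buf + have, *nl;
        while ((nl = memchr(start, '\n', (size_t)(end - start))) != NULL) {
            *nl = '\0';                /* terminate the line in place */
            fn(start, ctx);
            lines++;
            start = nl + 1;
        }
        have = (size_t)(end - start);  /* incomplete tail of this chunk */
        if (have == chunk) {           /* a full buffer without any newline */
            free(buf); fclose(fp);
            return -1;
        }
        memmove(buf, start, have);     /* keep the tail for the next round */
    }
    if (ferror(fp)) {
        lines = -1;
    } else if (have > 0) {             /* last line without trailing newline */
        buf[have] = '\0';
        fn(buf, ctx);
        lines++;
    }
    free(buf);
    fclose(fp);
    return lines;
}
```

For the file in the question, a chunk of a few megabytes, for example read_in_chunks("data.txt", 4u << 20, sum_line, &total), would hold many 1001-column lines at once; the right size is something to measure rather than assume.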
2nd way
Map the file into your process's memory space using mmap, then move a pointer over the file.
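A rough sketch of the mmap variant, assuming a POSIX system and a 64-bit process (a 10 GB mapping does not fit in a 32-bit address space). The function name sum_last_column_mmap and the 64 KiB line limit are assumptions made for illustration. Because the mapped bytes are not NUL-terminated, each line is copied into a small local buffer before strtod is called, so the parser can never run past the end of the mapping.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: sums the last column of a tab-separated file by
   walking a read-only mapping. Returns 0 on success, -1 on error. */
int sum_last_column_mmap(const char *path, double *sum_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return -1; }
    size_t size = (size_t)st.st_size;

    char *base = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                          /* the mapping stays valid after close */
    if (base == MAP_FAILED) return -1;

    char line[65536];                   /* assumed upper bound on line length */
    const char *p = base, *end = base + size;
    double sum = 0.0;
    int rc = 0;
    while (p < end) {
        const char *nl = memchr(p, '\n', (size_t)(end - p));
        size_t len = (size_t)((nl ? nl : end) - p);
        if (len >= sizeof line) { rc = -1; break; }   /* line too long */
        memcpy(line, p, len);           /* copy so strtod sees a terminator */
        line[len] = '\0';

        char *tab = strrchr(line, '\t');              /* start of last field */
        char *field = tab ? tab + 1 : line;
        char *stop;
        double v = strtod(field, &stop);
        if (stop != field) sum += v;    /* ignore empty or non-numeric fields */

        p = nl ? nl + 1 : end;          /* move the pointer to the next line */
    }
    munmap(base, size);
    if (rc == 0) *sum_out = sum;
    return rc;
}
```

The same loop could parse all 1001 fields of the copied line instead of only the last one; the per-line copy costs a little, but it keeps the example safe when the file size is an exact multiple of the page size.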
3rd way
Since your file is delimited by lines, open the file with fopen, use setvbuf or similar to set a buffer size greater than about 10 lines or so, then read the file line by line using fgets.
To potentially read the file even faster, use open with O_DIRECT (assuming Linux), then use fdopen to get a FILE * for the open file, then use setvbuf to set a page-aligned buffer. Doing that will allow you to bypass the kernel page cache, if your system's implementation works successfully using direct IO that way. (There can be many restrictions on direct IO.)
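The plain buffered variant of the 3rd way (fopen, setvbuf, fgets, without O_DIRECT) might look like the sketch below. The names parse_row and read_rows, the 1 MiB stdio buffer, and the linecap parameter are assumptions for illustration. Note that the line buffer is sized in characters, not in sizeof(double): a double written as text can take far more than 8 characters, plus one separator per field and the terminating NUL, so something like 32 characters per column is a generous guess rather than a rule.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse one text line that should hold exactly `ncols` numbers (ncols >= 2).
   The first ncols-1 values go to x[], the last one to *y.
   Returns 0 on success, -1 if the line has too few or too many fields. */
int parse_row(const char *line, size_t ncols, double *x, double *y)
{
    const char *p = line;
    char *next;
    for (size_t i = 0; i < ncols; i++) {
        double v = strtod(p, &next);        /* skips leading tabs/spaces */
        if (next == p) return -1;           /* too few numbers */
        if (i + 1 < ncols) x[i] = v; else *y = v;
        p = next;
    }
    p += strspn(p, " \t\r\n");              /* skip trailing whitespace */
    return *p == '\0' ? 0 : -1;             /* anything left is extra data */
}

/* Read `path` line by line. Returns the number of rows, or -1 on error.
   `linecap` is the line buffer size in characters. */
long read_rows(const char *path, size_t ncols, size_t linecap)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL) return -1;

    /* A larger stdio buffer; setvbuf must be called before the first read. */
    size_t iosize = 1u << 20;
    char *iobuf = malloc(iosize);
    if (iobuf != NULL) setvbuf(fp, iobuf, _IOFBF, iosize);

    char *line = malloc(linecap);
    double *x = malloc(ncols * sizeof *x);
    long rows = -1;
    if (line != NULL && x != NULL) {
        double y;
        rows = 0;
        while (fgets(line, (int)linecap, fp) != NULL) {
            /* No newline and not at end of file: the line was truncated. */
            if (strchr(line, '\n') == NULL && !feof(fp)) { rows = -1; break; }
            if (parse_row(line, ncols, x, &y) != 0) { rows = -1; break; }
            /* x[0..ncols-2] and y are ready to be processed here */
            rows++;
        }
    }
    free(x);
    free(line);
    fclose(fp);
    free(iobuf);                            /* only after fclose */
    return rows;
}
```

For the layout in the question this could be called as read_rows("data.txt", 1001, 1001 * 32). The truncation check is deliberately conservative: a final line without a newline that fills the buffer exactly is also reported as an error.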
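Something to get you started: reading 1 line

#include <stdio.h>

#define COLUMNS (1000+1)

// Reads one line of COLUMNS tab-separated doubles from in_stream into data.
// Returns COLUMNS on success, 0 at end of file, -1 on malformed input.
// (The function name is illustrative.)
int read_line(FILE *in_stream, double data[COLUMNS]) {
  for (int i = 0; i < COLUMNS; i++) {
    char delim = '\n';  // stays '\n' if the last line has no trailing newline
    int cnt = fscanf(in_stream, "%lf%c", &data[i], &delim);
    if (cnt < 1) {
      if (cnt == EOF && i == 0) return 0;  // None read, OK as end of file
      puts("Missing or bad data");
      return -1;  // problem
    }
    if (delim != '\t') {
      // If tab not found, should be at end of line
      if (delim == '\n' && i == COLUMNS-1) {
        return COLUMNS;  // Success
      }
      puts("Bad delimiter");
      return -1;
    }
  }
  puts("Extra data");
  return -1;
}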
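To show how a row reader of this kind could be driven over a whole file, here is a small self-contained sketch. It repeats the fscanf idea in a compact hypothetical function read_row, with NCOLS set to 4 instead of 1000+1 so it is easy to try, and adds a hypothetical loop sum_targets that treats the last value of each row separately.

```c
#include <stdio.h>

#define NCOLS 4   /* small for illustration; the question would use 1000 + 1 */

/* Reads one row of NCOLS tab-separated doubles.
   Returns NCOLS on success, 0 at a clean end of file, -1 on malformed input. */
int read_row(FILE *in, double data[NCOLS])
{
    for (int i = 0; i < NCOLS; i++) {
        char delim = '\n';   /* default covers a last line without newline */
        int cnt = fscanf(in, "%lf%c", &data[i], &delim);
        if (cnt < 1) return (cnt == EOF && i == 0) ? 0 : -1;
        if (delim == '\t') continue;          /* more fields expected */
        return (delim == '\n' && i == NCOLS - 1) ? NCOLS : -1;
    }
    return -1;   /* a tab after the last expected column: extra data */
}

/* Sums the last column over all rows of `path`.
   Returns the number of rows read, or -1 on error. */
long sum_targets(const char *path, double *sum)
{
    FILE *in = fopen(path, "r");
    if (in == NULL) return -1;

    double row[NCOLS];
    long rows = 0;
    int rc;
    *sum = 0.0;
    while ((rc = read_row(in, row)) == NCOLS) {
        /* row[0..NCOLS-2] are the leading values, row[NCOLS-1] is the last */
        *sum += row[NCOLS - 1];
        rows++;
    }
    fclose(in);
    return rc == 0 ? rows : -1;   /* rc == 0 means clean end of file */
}
```

Only one row is held in memory at a time, so the memory use does not depend on the size of the file.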