
Parsing Large File in C

For a class, I've been given the task of writing radix sort in parallel using pthreads, OpenMP, and MPI. My language of choice in this case is C; I don't know C++ too well.

Anyway, the way I'm reading a text file causes a segmentation fault at around 500 MB file size. The files contain line-separated 32-bit numbers:

12351
1235234
12
53421
1234

I know C, but I don't know it well; I use the things I know, and in this case the things I know are terribly inefficient. My code for reading the text file is as follows:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

int main(int argc, char **argv){

 if(argc != 4) {
   printf("rs_pthreads requires three arguments to run\n");
   return -1;
 }

 char *fileName=argv[1];
 uint32_t radixBits=atoi(argv[2]);
 uint32_t numThreads=atoi(argv[3]);

 if(radixBits > 32){
   printf("radixBitx cannot be greater than 32\n");
   return -1;
 }

 FILE *fileForReading = fopen( fileName, "r" );
 if(fileForReading == NULL){
   perror("Failed to open the file\n");
   return -1;
 }
 char* charBuff = malloc(1024);

 if(charBuff == NULL){
   perror("Error with malloc for charBuff");
   return -1;
 }

 uint32_t numNumbers = 0;
 while(fgetc(fileForReading) != EOF){
   numNumbers++;
   fgets(charBuff, 1024, fileForReading);
 }

 uint32_t numbersToSort[numNumbers];

 rewind(fileForReading);
 int location;
 for(location = 0; location < numNumbers; location++){
   fgets(charBuff, 1024, fileForReading);
   numbersToSort[location] = atoi(charBuff);
 }

With a file of 50 million numbers (~500 MB), I'm getting a segmentation fault at rewind, of all places. My knowledge of how file streams work is almost non-existent. My guess is that it's trying to malloc without enough memory or something, but I don't know.

So, I've got a two-parter here: How is rewind causing a segmentation fault? Am I just doing a poor job before rewind and not checking some system call I should be?

And what is a more efficient way to read an arbitrary amount of numbers from a text file?

Any help is appreciated.

I think the most likely cause here is (ironically enough) a stack overflow. Your numbersToSort array is allocated on the stack, and the stack has a fixed size (it varies by compiler and operating system, but 1 MB is a typical number). You should dynamically allocate numbersToSort on the heap (which has much more available space) using malloc():

uint32_t *numbersToSort = malloc(sizeof(uint32_t) * numNumbers);

Don't forget to deallocate it later:

free(numbersToSort);

I would also point out that your first-pass loop, which is intended to count the number of lines, will fail if there are any blank lines. On a blank line the first character is '\n', and fgetc() will consume it; the next call to fgets() will then read the following line, so the blank line and the line after it are counted as one.
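The blank-line problem can be sidestepped by dropping the counting pass and reading the file once, growing a heap buffer as needed. The following is only a minimal sketch of that idea, not the original poster's code: read_numbers is a hypothetical helper name, the doubling growth policy and the initial capacity of 1024 are arbitrary choices, and it assumes every line is shorter than 63 characters. It also swaps atoi() for strtoul(), which lets blank or malformed lines be detected and skipped.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not from the original post): reads one unsigned
 * 32-bit decimal number per line from fp in a single pass.
 * Blank, malformed, or out-of-range lines are skipped.
 * Returns a malloc'd array the caller must free() and stores the number
 * of elements in *out_count; returns NULL if allocation fails. */
uint32_t *read_numbers(FILE *fp, size_t *out_count)
{
    assert(fp != NULL && out_count != NULL);

    size_t capacity = 1024;   /* initial guess; doubled whenever full */
    size_t count = 0;
    uint32_t *numbers = malloc(capacity * sizeof *numbers);
    if (numbers == NULL)
        return NULL;

    char line[64];            /* assumption: lines are under 63 characters */
    while (fgets(line, sizeof line, fp) != NULL) {
        char *end;
        errno = 0;
        unsigned long value = strtoul(line, &end, 10);
        if (end == line)      /* no digits found: blank or malformed line */
            continue;
        if (errno == ERANGE || value > UINT32_MAX)
            continue;         /* does not fit in 32 bits */

        if (count == capacity) {
            size_t new_capacity = capacity * 2;
            uint32_t *grown = realloc(numbers, new_capacity * sizeof *numbers);
            if (grown == NULL) {
                free(numbers);
                return NULL;
            }
            numbers = grown;
            capacity = new_capacity;
        }
        numbers[count++] = (uint32_t)value;
    }

    *out_count = count;
    return numbers;
}
```

A caller would open the file with fopen(), pass the stream and the address of a size_t to read_numbers(), check the result for NULL, and free() the array when the sort is done. Because the array grows on demand, rewind() is no longer needed at all.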

The problem is in this line:

uint32_t numbersToSort[numNumbers];

You are attempting to allocate a huge array on the stack. The stack is usually limited to a few megabytes at most, while 50 million 32-bit numbers need about 200 MB. (Moreover, variable-length arrays like this one are not allowed by C standards older than C99.) So you can try this:

uint32_t *numbersToSort; /* Declare it with other declarations */


/* Remove uint32_t numbersToSort[numNumbers]; */
/* Add the code below */
numbersToSort = malloc(sizeof(uint32_t) * numNumbers);
if (!numbersToSort) {
     /* No memory; do cleanup and bail out */
     return 1;
}
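To put numbers on it: each uint32_t takes 4 bytes, so 50 million of them need 200,000,000 bytes, far beyond any typical stack limit. The sketch below wraps the size computation and the malloc() call in two hypothetical helpers (numbers_bytes and alloc_numbers are names made up for illustration) and adds a guard against the multiplication overflowing size_t, which matters once the count comes from untrusted input.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: bytes needed to store `count` uint32_t values,
 * or 0 if the multiplication would overflow size_t. */
size_t numbers_bytes(size_t count)
{
    /* Assumption made explicit: 8-bit bytes, so uint32_t is 4 bytes. */
    assert(sizeof(uint32_t) == 4);

    if (count > SIZE_MAX / sizeof(uint32_t))
        return 0;
    return count * sizeof(uint32_t);
}

/* Hypothetical helper: heap-allocates room for `count` numbers.
 * Returns NULL if count is 0, the size overflows, or malloc() fails.
 * The caller releases the memory with free(). */
uint32_t *alloc_numbers(size_t count)
{
    size_t bytes = numbers_bytes(count);
    if (bytes == 0)
        return NULL;
    return malloc(bytes);
}
```

For the file in the question, numbers_bytes(50000000) evaluates to 200000000, which is the amount the original declaration was trying to take from the stack.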
