
Fast Processing of Text Files in C

I've written some C code (I'm not a C pro, though) that is supposed to be as fast as possible. The algorithm is finished and I'm pleased with its speed. But before it starts, I have to get some information from a text file, and that part is way too slow.

Right now, processing the text file takes about 3 seconds for bigger files, while Java code processes the same file in less than 1 second, because Java has ready-made methods like readline() in its framework, which alone contains more than 100 lines of pure code.

Is there any comparable framework for C? I couldn't find anything on Google: no matter how I rephrased my search requests, I got nothing but tutorials on how to use fopen()...

If you wonder why I don't use Java: the algorithm itself is way faster in C.

Here is the code I use in C. What needs to be done is to process a .cnf file in DIMACS format.

    while ((temp = fgetc(fp)) != EOF)
    {   
        if (temp == 'c')
        {
            // possibly change to 13 (carriage return) in the lab
            while ((temp =fgetc(fp)) != 10 && temp != EOF);
        }

        if (temp == 'p')
        {
            while ((temp =fgetc(fp)) < '0' ||  temp > '9');

            while (temp != 32)
            {
                variablen= (variablen * 10) + (temp - '0');
                temp=fgetc(fp);

            }

            while ((temp =fgetc(fp)) < '0' ||  temp > '9');

            while ((temp!= 32) && (temp != 10 ) )
            {
                klauseln= (klauseln * 10) + (temp - '0');
                temp=fgetc(fp);
            }

            while ((temp != 10) && (temp != EOF))
            {
                temp=fgetc(fp);
            }

            break;
        }
    }

    phi = (int *) malloc(klauseln * variablen * sizeof(int));

    int zaehler2 = 0;
    for (int j = 0; j < klauseln; ++j)
    {
        for (int i = 0; i < variablen; ++i)
        {
            phi[zaehler2++] = 0;
        }
    }

    int zeile = 0;

    while ((temp = fgetc(fp)) != EOF)
    {   
        if (temp == 'c')
        {
            while ((temp =fgetc(fp)) != 10 && temp != EOF);
        }
        else
        {
            while (temp != '0')
            {                        
                    int neg = 1;
                    int wert = 0;

                    while (temp != 32)
                    {
                        if (temp == '-') 
                        {
                            neg = -1;
                        }
                        else
                        {
                            wert = (wert * 10) + (temp - '0');
                        }

                        temp = fgetc(fp);
                    }
                    phi[wert - 1 + zeile] = neg;
                    temp = fgetc(fp);    
            }

            zeile = zeile + variablen;
            temp = fgetc(fp);    
        }
    }

To speed up code, you first check to see if there's a better algorithm.

There is nothing algorithmically wrong. You're processing each character, in sequence, without backtracking, so it's O(n), which is as good as you could expect.

So all you can do is try to find faster ways to do what you're already doing. To do that, you need to profile the code. You can't know where the time is being spent otherwise. If you don't know the biggest bottleneck, you'll waste a lot of time trying to optimize the wrong spot.

It's possible that reading the file character by character is slow, and you might be better off reading the file in large chunks and then processing the characters from memory. But it's also possible that the stdio library is already doing that for you behind the scenes (fgetc normally reads from an internal buffer), so it might not buy you anything.

Reducing the number of tests (comparisons) might help. For example, when you check for 10 (linefeed) or EOF, you have to do two tests for every character. If you read the file into memory first, you could append a sentinel 10 to the end of the buffer, and that loop would then have to check only for linefeeds.
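The sentinel idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the asker's parser: the names load_file and count_non_comment_lines are made up for this example, the file is opened in binary mode (so a carriage return, if present, stays in the buffer), and the loader always reserves one spare byte for the sentinel.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads the whole file into a heap buffer that has one
   spare byte at buf[*len] for the sentinel. Returns NULL on failure; the
   caller frees the buffer. */
char *load_file(const char *path, size_t *len)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    size_t cap = 1u << 16, n = 0;
    char *buf = malloc(cap + 1);           /* +1 leaves room for the sentinel */
    while (buf != NULL) {
        n += fread(buf + n, 1, cap - n, fp);
        if (n < cap)
            break;                          /* short read: end of file or error */
        cap *= 2;
        char *bigger = realloc(buf, cap + 1);
        if (bigger == NULL)
            free(buf);
        buf = bigger;
    }
    fclose(fp);
    if (buf != NULL)
        *len = n;
    return buf;
}

/* Hypothetical example loop: counts the lines that do not start with 'c'
   (blank lines included). buf[n] must be writable; it receives the sentinel
   linefeed, so the inner loop needs only one test per character. */
size_t count_non_comment_lines(char *buf, size_t n)
{
    assert(buf != NULL);
    const char *p = buf;
    const char *end = buf + n;
    size_t count = 0;

    buf[n] = '\n';                          /* sentinel: every line now ends in '\n' */
    while (p < end) {
        if (*p != 'c')
            count++;
        while (*p != '\n')                  /* no separate end-of-buffer test */
            p++;
        p++;                                /* step over the linefeed */
    }
    return count;
}
```

The end-of-buffer check happens once per line in the outer loop instead of once per character. Whether this is measurably faster than the fgetc loop still has to be confirmed by profiling.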

I ran a test that reads chars from a file in three ways: using fgetc() (the "OP" method), using getc() (the "e8" method), and using a buffered version (the "WV" method) that collects the chars from a local buffer.

#include<stdio.h>
#include<stdlib.h>
#include<time.h>

#define BUFLEN  1024

FILE *fp;
char fname[] = "test.txt";
int bufsize, bufind;

int getachar() {
    static unsigned char buf[BUFLEN];
    if (bufind >= bufsize) {
        bufsize = fread(buf, sizeof(char), BUFLEN, fp);
        if (bufsize == 0)
            return -1;
        bufind = 0;
    }
    return buf[bufind++];
}

void WVmethod (void) {
    int temp, count=0;
    bufsize = bufind = 0;
    if ((fp = fopen(fname, "rt")) == NULL)
        return;
    while ((temp = getachar()) != -1) count++;
    fclose(fp);
    printf ("WV method read %d chars. ", count);
}

void OPmethod (void) {
    int temp, count=0;
    if ((fp = fopen(fname, "rt")) == NULL)
        return;
    while ((temp = fgetc(fp)) != EOF) count++;
    fclose(fp);
    printf ("OP method read %d chars. ", count);
}

void e8method (void) {
    int temp, count=0;
    if ((fp = fopen(fname, "rt")) == NULL)
        return;
    while ((temp = getc(fp)) != EOF) count++;
    fclose(fp);
    printf ("e8 method read %d chars. ", count);
}

int main()
{
    clock_t start, elapsed;
    int loop;

    for (loop=0; loop<3; loop++) {
        start = clock();
        WVmethod();
        elapsed = clock() - start;
        printf ("Clock ticks = %d\n", (int)elapsed);

        start = clock();
        OPmethod();
        elapsed = clock() - start;
        printf ("Clock ticks = %d\n", (int)elapsed);

        start = clock();
        e8method();
        elapsed = clock() - start;
        printf ("Clock ticks = %d\n", (int)elapsed);

        printf ("\n");
    }
    return 0;
}

Program output:

WV method read 24494400 chars. Clock ticks = 265
OP method read 24494400 chars. Clock ticks = 1575
e8 method read 24494400 chars. Clock ticks = 1544

WV method read 24494400 chars. Clock ticks = 266
OP method read 24494400 chars. Clock ticks = 1591
e8 method read 24494400 chars. Clock ticks = 1544

WV method read 24494400 chars. Clock ticks = 265
OP method read 24494400 chars. Clock ticks = 1607
e8 method read 24494400 chars. Clock ticks = 1545

My guess is that you are looking for basic functions to read a file, and that reading characters one by one is not the direction you are looking for.

There are many functions to read and handle strings in C. stdio.h has some functions that could help you:

  • char * fgets ( char * str, int num, FILE * stream ) : reads characters until the end of the line or the end of the file, or until num - 1 characters have been stored, whichever comes first.
  • int sscanf ( const char * s, const char * format, ...); : reads formatted input from a string. For instance, sscanf(line,"%d",&nb); will read an integer and place it into nb. Each call to sscanf starts parsing at the beginning of the string it is given, so calling it repeatedly on the same string does not advance through the line. A workaround is to use strtok() from string.h to split the string, using a space " " as the separator.
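To illustrate how strtok() and sscanf() combine on a single clause line, here is a minimal sketch. The function name parse_clause and its parameters are assumptions made for this example only; it also treats tabs, carriage returns and the trailing linefeed as separators, which is a choice of this sketch rather than something the format requires.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: parses one clause line such as "1 -3 4 0" into out[].
   Stops at the terminating 0, at a malformed token, or after max literals.
   Returns the number of literals stored. The line is modified in place,
   because strtok overwrites separators with '\0'. */
int parse_clause(char *line, int *out, int max)
{
    assert(line != NULL && out != NULL);
    int count = 0;

    for (char *tok = strtok(line, " \t\r\n");
         tok != NULL && count < max;
         tok = strtok(NULL, " \t\r\n")) {
        int value;
        if (sscanf(tok, "%d", &value) != 1 || value == 0)
            break;                  /* malformed token or end-of-clause marker */
        out[count++] = value;
    }
    return count;
}
```

Calling parse_clause on a buffer filled by fgets() yields the literals of that clause. Each sscanf call sees only one token, so the restart-at-the-beginning behaviour no longer matters.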

Here is sample code that does the job:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>


#define MAX_LINE_SIZE 1000
#define MAX_SEQ_SIZE 100

int main()
{
    FILE * pFile;
    char line [MAX_LINE_SIZE];
    int nbvar,nbclauses;
    int* array=NULL;
    int i=0;int j;
    pFile = fopen ("example.txt" , "r");
    if (pFile == NULL){ perror ("Error opening file");}
    else {
        while (fgets(line, MAX_LINE_SIZE, pFile) != NULL){
            printf("%s",line);fflush(stdout);
            // parsing the line
            if(line[0]!='c' && line[0]!='\0'){
                if(line[0]=='p'){
                    sscanf(line,"%*s%*s%d%d",&nbvar,&nbclauses);
                    array=malloc(MAX_SEQ_SIZE*nbclauses*sizeof(int));
                }else{
                    char * temp;
                    char stop=0;
                    j=0;
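                    //strtok splits the line into tokens
                    temp=strtok(line," \t\r\n");
                    while(stop==0){
                        //a missing or malformed token ends the clause like a 0 would
                        if(temp==NULL || sscanf(temp,"%d",&array[i*(MAX_SEQ_SIZE)+j])!=1){
                            array[i*MAX_SEQ_SIZE+j]=0;
                        }
                        temp=strtok(NULL," \t\r\n");
                        if(array[i*MAX_SEQ_SIZE+j]==0){stop=1;}
                        j++;
                        printf("j %d\n",j );fflush(stdout);
                    }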
                    i++;
                }

            }
        }
        fclose (pFile);
    }
    if(array!=NULL){
        for(i=0;i<nbclauses;i++){
            j=0;
            while(array[i*MAX_SEQ_SIZE+j]!=0){
                printf("line %d seq item %d worths %d\n",i,j,array[i*MAX_SEQ_SIZE+j]);
                j++;
            }
        }
        free(array);
    }
    return 0;
}
