Using fgets and strtok() to read a text file - C

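I am trying to read text from stdin line by line using fgets() and store the text in a variable "text". However, when I use strtok() to split the text into words, it only works for a few lines before terminating. What should I change to make it run through the entire text?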


#define WORD_BUFFER_SIZE 50
#define TEXT_SIZE 200

int main(void) {
    char stopWords[TEXT_SIZE][WORD_BUFFER_SIZE];
    char word[WORD_BUFFER_SIZE];
    int numberOfWords = 0;
  
    while(scanf("%s", word) == 1){
      if (strcmp(word, "====") == 0){
        break;
      }
      strcpy(stopWords[numberOfWords], word);
      numberOfWords++;
    }

  char *buffer = malloc(sizeof(WORD_BUFFER_SIZE)*TEXT_SIZE);
  char *text = malloc(sizeof(WORD_BUFFER_SIZE)*TEXT_SIZE);
  
  while(fgets(buffer, WORD_BUFFER_SIZE*TEXT_SIZE, stdin) != NULL){  
    strcat(text, buffer);
  }
  
  char *k;
  k = strtok(text, " ");
  while (k != NULL) {
    printf("%s\n", k);
    k = strtok(NULL, " ");
  }
  
}

char *buffer = malloc(sizeof(WORD_BUFFER_SIZE)*TEXT_SIZE);
char *text = malloc(sizeof(WORD_BUFFER_SIZE)*TEXT_SIZE);

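sizeof(WORD_BUFFER_SIZE) is a constant: it is the size of an int, because WORD_BUFFER_SIZE expands to the integer literal 50. You probably meant WORD_BUFFER_SIZE * TEXT_SIZE. Alternatively, you can find the file size and calculate exactly how much memory you need.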

char *text = malloc(...)
strcat(text, buffer);

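text is not initialized and has no null terminator. strcat needs to know where text ends. You must set text[0] = '\0' before using strcat (unlike strcpy, which does not need the destination to hold a valid string).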

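#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) 
{
    //seeking only works if stdin is redirected from a regular file
    if (fseek(stdin, 0, SEEK_END) != 0)
    { printf("not using a file!\n"); return 0; }
    long filesize = ftell(stdin);
    rewind(stdin);
    if (filesize <= 0)
    { printf("not using a file!\n"); return 0; }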

    char word[1000] = { 0 };

    //while (scanf("%s", word) != 1)
    //    if (strcmp(word, "====") == 0)
    //        break;

    char* text = malloc(filesize + 1);
    if (!text)
        return 0;
    text[0] = '\0';
    while (fgets(word, sizeof(word), stdin) != NULL)
        strcat(text, word);

    char* k;
    k = strtok(text, " ");
    while (k != NULL) 
    {
        printf("%s\n", k);
        k = strtok(NULL, " ");
    }

    return 0;
}

According to the information you provided in the comments section, the input text is longer than 800 bytes.

However, in the line

char *text = malloc(sizeof(WORD_BUFFER_SIZE)*TEXT_SIZE);

which, on a platform where sizeof(int) is 4, is equivalent to

char *text = malloc(800);

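you only allocated 800 bytes of storage for text. Therefore, you did not allocate enough space to store the entire input in text. Attempting to store more than 800 bytes causes a buffer overflow, which invokes undefined behavior.

If you want to store the entire input in text, you must make sure that it is large enough.

However, this may not be necessary. Depending on your requirements, it may be sufficient to process one line at a time, like this: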

while( fgets( buffer, sizeof buffer, stdin ) != NULL )
{
    char *k = strtok( buffer, " " );

    while ( k != NULL )
    {
        printf( "%s\n", k );
        k = strtok( NULL, " " );
    }
}

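In that case, you do not need the array text. You only need the array buffer to store the contents of the current line. Note that sizeof buffer only yields the buffer's size if buffer is declared as an array, not as a pointer returned by malloc.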

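Since you did not provide any sample input, I cannot test the code above.


EDIT: According to your comments on this answer, your main problem seems to be how to read all input from stdin and store it as a string when you do not know the length of the input in advance.

One common solution is to allocate an initial buffer and to double its size every time it gets full. You can use the function realloc for this: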

#include <stdio.h>
#include <stdlib.h>

int main( void )
{
    char *buffer;
    size_t buffer_size = 1024;
    size_t input_size = 0;

    //allocate initial buffer
    buffer = malloc( buffer_size );
    if ( buffer == NULL )
    {
        fprintf( stderr, "allocation error!\n" );
        exit( EXIT_FAILURE );
    }

    //continuously fill the buffer with input, and
    //grow buffer as necessary
    for (;;) //infinite loop, equivalent to while(1)
    {
        //we must leave room for the terminating null character
        size_t to_read = buffer_size - input_size - 1;
        size_t ret;

        ret = fread( buffer + input_size, 1, to_read, stdin );

        input_size += ret;

        if ( ret != to_read )
        {
            //we have finished reading from input
            break;
        }

        //buffer was filled entirely (except for the space
        //reserved for the terminating null character), so
        //we must grow the buffer
        {
            void *temp;

            buffer_size *= 2;
            temp = realloc( buffer, buffer_size );

            if ( temp == NULL )
            {
                fprintf( stderr, "allocation error!\n" );
                exit( EXIT_FAILURE );
            }

            buffer = temp;
        }
    }

    //make sure that `fread` did not end due to an
    //error (it should only end due to end-of-file)
    if ( ferror(stdin) )
    {
        fprintf( stderr, "input error!\n" );
        exit( EXIT_FAILURE );
    }

    //add terminating null character
    buffer[input_size++] = '\0';

    //shrink buffer to required size
    {
        void *temp;

        temp = realloc( buffer, input_size );

        if ( temp == NULL )
        {
            fprintf( stderr, "allocation error!\n" );
            exit( EXIT_FAILURE );
        }

        buffer = temp;
    }

    //the entire contents is now stored in "buffer" as a
    //string, and can be printed
    printf( "contents of buffer:\n%s\n", buffer );

    free( buffer );
}

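The code above assumes that the input will be terminated by an end-of-file condition, which is likely the case if the input is piped or redirected from a file.

On second thought, instead of having one large string for the entire file, as you do in your code, you may want to use an array of char* pointing to individual strings, each of which represents one line. For example, lines[0] would be the string of the first line and lines[1] the string of the second line. That way, you can easily use strstr to find the " ==== " marker on each line and strchr to find the individual words, while still keeping all lines in memory for further processing.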

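I do not recommend using strtok in this case, because that function is destructive: it modifies the string by replacing the delimiters with null characters. If you need the strings for further processing, as you stated in the comments section, then this is probably not what you want. That is why I recommend using strchr instead.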

If a reasonable maximum number of lines is known at compile time, then the solution is rather simple:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE_LENGTH 1024
#define MAX_LINES 1024

int main( void )
{
    char *lines[MAX_LINES];
    int num_lines = 0;

    char buffer[MAX_LINE_LENGTH];

    //read one line per loop iteration
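    while ( fgets( buffer, sizeof buffer, stdin ) != NULL )
    {
        //verify that the "lines" array still has room
        if ( num_lines == MAX_LINES )
        {
            fprintf( stderr, "input exceeds maximum number of lines!\n" );
            exit( EXIT_FAILURE );
        }

        int line_length = strlen( buffer );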

        //verify that entire line was read in
        if ( buffer[line_length-1] != '\n' )
        {
            //treat end-of-file as equivalent to newline character
            if ( !feof( stdin ) )
            {
                fprintf( stderr, "input line exceeds maximum line length!\n" );
                exit( EXIT_FAILURE );
            }
        }
        else
        {
            //remove newline character from string
            buffer[--line_length] = '\0';
        }

        //allocate memory for new string and add to array
        lines[num_lines] = malloc( line_length + 1 );

        //verify that "malloc" succeeded
        if ( lines[num_lines] == NULL )
        {
            fprintf( stderr, "allocation error!\n" );
            exit( EXIT_FAILURE );
        }

        //copy line to newly allocated buffer
        strcpy( lines[num_lines], buffer );

        //increment counter
        num_lines++;
    }

    //All input lines have now been successfully read in, so
    //we can now do something with them.

    //handle one line per loop iteration
    for ( int i = 0; i < num_lines; i++ )
    {
        char *p, *q;

        //attempt to find the " ==== " marker
        p = strstr( lines[i], " ==== " );
        if ( p == NULL )
        {
            printf( "Warning: skipping line because unable to find \" ==== \".\n" );
            continue;
        }

        //skip the " ==== " marker
        p += 6;

        //split tokens on remainder of line using "strchr"
        while ( ( q = strchr( p, ' ') ) != NULL )
        {
            printf( "found token: %.*s\n", (int)(q-p), p );
            p = q + 1;
        }

        //output last token
        printf( "found token: %s\n", p );
    }

    //cleanup allocated memory
    for ( int i = 0; i < num_lines; i++ )
    {
        free( lines[i] );
    }
}

When the program above is run with the following input

first line before deliminator ==== first line after deliminator
second line before deliminator ==== second line after deliminator

it produces the following output:

found token: first
found token: line
found token: after
found token: deliminator
found token: second
found token: line
found token: after
found token: deliminator

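However, if no reasonable maximum number of lines is known at compile time, then the array lines must also be designed to grow, in a similar way to buffer in the previous program. The same applies to the maximum line length.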
