
Using fgets() and strtok() to read a text file in C

I'm trying to read text from stdin line by line using fgets() and store the text in a variable "text". However, when I use strtok() to split the text into words, it only works for a couple of lines before terminating. What should I change to make it run through the entire text?


#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define WORD_BUFFER_SIZE 50
#define TEXT_SIZE 200

int main(void) {
    char stopWords[TEXT_SIZE][WORD_BUFFER_SIZE];
    char word[WORD_BUFFER_SIZE];
    int numberOfWords = 0;
  
    while(scanf("%s", word) == 1){
      if (strcmp(word, "====") == 0){
        break;
      }
      strcpy(stopWords[numberOfWords], word);
      numberOfWords++;
    }

  char *buffer = malloc(sizeof(WORD_BUFFER_SIZE)*TEXT_SIZE);
  char *text = malloc(sizeof(WORD_BUFFER_SIZE)*TEXT_SIZE);
  
  while(fgets(buffer, WORD_BUFFER_SIZE*TEXT_SIZE, stdin) != NULL){  
    strcat(text, buffer);
  }
  
  char *k;
  k = strtok(text, " ");
  while (k != NULL) {
    printf("%s\n", k);
    k = strtok(NULL, " ");
  }
  
}

char *buffer = malloc(sizeof(WORD_BUFFER_SIZE)*TEXT_SIZE);
char *text = malloc(sizeof(WORD_BUFFER_SIZE)*TEXT_SIZE);

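sizeof(WORD_BUFFER_SIZE) is the same as sizeof(50), which is the size of an int (typically 4 bytes), not the value 50. You probably meant WORD_BUFFER_SIZE * TEXT_SIZE. Alternatively, if the input is redirected from a file, you can find the file size and calculate exactly how much memory you need.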
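To make the difference concrete, here is a minimal sketch that computes both sizes. The function names wrong_size and intended_size are hypothetical and exist only for this illustration; the macros are copied from the question.

```c
#include <stddef.h>

#define WORD_BUFFER_SIZE 50
#define TEXT_SIZE 200

/* What the question's malloc call actually requests:
   sizeof(WORD_BUFFER_SIZE) is sizeof(50), i.e. sizeof(int). */
size_t wrong_size(void)
{
    return sizeof(WORD_BUFFER_SIZE) * TEXT_SIZE;
}

/* What was probably intended: room for TEXT_SIZE words of
   WORD_BUFFER_SIZE bytes each. */
size_t intended_size(void)
{
    return (size_t)WORD_BUFFER_SIZE * TEXT_SIZE;
}
```

On a platform with a 4-byte int, wrong_size() yields 800 while intended_size() yields 10000.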

char *text = malloc(...)
strcat(text, buffer);

text is not initialized, so it does not contain a null terminator. strcat has to find the end of the string already stored in text, so you must set text[0] = '\0' before the first call to strcat (unlike strcpy, which does not read the destination).
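As a small illustration of that rule, the following sketch builds a string with strcat in freshly allocated memory. The helper name join_two is an assumption made for this example, not something from the question.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a freshly allocated buffer holding `a` followed by `b`,
   or NULL if the allocation fails. The caller must free() it. */
char *join_two(const char *a, const char *b)
{
    char *text = malloc(strlen(a) + strlen(b) + 1);
    if (text == NULL)
        return NULL;

    text[0] = '\0';   /* make text a valid empty string first */
    strcat(text, a);  /* strcat can now find the end of text */
    strcat(text, b);
    return text;
}
```

Without the text[0] = '\0' line, the first strcat would scan uninitialized memory looking for a terminator, which is undefined behavior.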

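#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    //seeking only works if stdin is redirected from a regular file
    if (fseek(stdin, 0, SEEK_END) != 0)
    { printf("not using a file!\n"); return 0; }

    //ftell returns a long, and -1 on failure
    long filesize = ftell(stdin);
    rewind(stdin);
    if (filesize <= 0)
    { printf("not using a file!\n"); return 0; }

    char word[1000] = { 0 };

    //uncomment to skip the stop words up to the "====" marker
    //while (scanf("%999s", word) == 1)
    //    if (strcmp(word, "====") == 0)
    //        break;

    char* text = malloc((size_t)filesize + 1);
    if (!text)
        return 0;
    text[0] = '\0'; //strcat needs a valid (empty) string to append to
    while (fgets(word, sizeof(word), stdin) != NULL)
        strcat(text, word);

    char* k;
    k = strtok(text, " ");
    while (k != NULL)
    {
        printf("%s\n", k);
        k = strtok(NULL, " ");
    }

    free(text);
    return 0;
}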

According to the information you provided in the comments section, the input text is longer than 800 bytes.

However, in the line

char *text = malloc(sizeof(WORD_BUFFER_SIZE)*TEXT_SIZE);

which, on a platform where sizeof(int) is 4, is equivalent to

char *text = malloc(800);

you only allocated 800 bytes as storage for text. Therefore, you did not allocate sufficient space to store the entire input into text. Attempting to store more than 800 bytes will result in a buffer overflow, which invokes undefined behavior.

If you want to store the entire input into text, then you must ensure that it is large enough.
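One way to guard against such an overflow is to check the remaining space before every append. The sketch below shows the idea; append_checked is a hypothetical helper introduced for this example and is not part of the standard library.

```c
#include <stddef.h>
#include <string.h>

/* Appends src to the string in dst, whose buffer holds `capacity` bytes.
   Returns 0 on success, or -1 (leaving dst unchanged) if src does not fit. */
int append_checked(char *dst, size_t capacity, const char *src)
{
    size_t used = strlen(dst);
    size_t extra = strlen(src);

    /* one byte must stay free for the terminating null character */
    if (used >= capacity || extra > capacity - used - 1)
        return -1;

    memcpy(dst + used, src, extra + 1); /* +1 also copies the terminator */
    return 0;
}
```

In the question's read loop, a call such as append_checked(text, allocated_size, buffer) could replace the unchecked strcat, with the loop stopping or reporting an error when it returns -1.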

However, this is probably not necessary. Depending on your requirements, it is probably sufficient to process one line at a time, like this:

while( fgets( buffer, sizeof buffer, stdin ) != NULL )
{
    char *k = strtok( buffer, " " );

    while ( k != NULL )
    {
        printf( "%s\n", k );
        k = strtok( NULL, " " );
    }
}

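In that case, you do not need the array text. You only need buffer for storing the contents of the current line. Note that sizeof buffer only yields the buffer size if buffer is declared as an array (for example char buffer[1024];). If buffer is a pointer returned by malloc, as in the question, you must pass the allocated size to fgets instead.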
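Keep in mind that fgets stores the newline character in the buffer, so splitting on " " alone leaves a trailing '\n' attached to the last word of each line. The following sketch splits one line on all common whitespace characters instead; the function name split_line and its parameters are assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

/* Splits `line` in place on whitespace and stores up to max_tokens
   token pointers in `tokens`. Returns the number of tokens stored.
   Note that strtok modifies `line`. */
size_t split_line(char *line, char *tokens[], size_t max_tokens)
{
    const char *delims = " \t\r\n"; /* includes the newline kept by fgets */
    size_t count = 0;

    for (char *k = strtok(line, delims);
         k != NULL && count < max_tokens;
         k = strtok(NULL, delims))
    {
        tokens[count++] = k;
    }
    return count;
}
```

Inside the fgets loop, one would call split_line(buffer, tokens, max) once per line and then process the returned tokens before reading the next line, since they point into buffer.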

Since you did not provide any sample input, I cannot test the code above.


EDIT: Based on your comments to this answer, it seems that your main problem is how to read in all of the input from stdin and store it as a string, when you do not know the length of the input in advance.

One common solution is to allocate an initial buffer, and to double its size every time it gets full. You can use the function realloc for this:

#include <stdio.h>
#include <stdlib.h>

int main( void )
{
    char *buffer;
    size_t buffer_size = 1024;
    size_t input_size = 0;

    //allocate initial buffer
    buffer = malloc( buffer_size );
    if ( buffer == NULL )
    {
        fprintf( stderr, "allocation error!\n" );
        exit( EXIT_FAILURE );
    }

    //continuously fill the buffer with input, and
    //grow buffer as necessary
    for (;;) //infinite loop, equivalent to while(1)
    {
        //we must leave room for the terminating null character
        size_t to_read = buffer_size - input_size - 1;
        size_t ret;

        ret = fread( buffer + input_size, 1, to_read, stdin );

        input_size += ret;

        if ( ret != to_read )
        {
            //we have finished reading from input
            break;
        }

        //buffer was filled entirely (except for the space
        //reserved for the terminating null character), so
        //we must grow the buffer
        {
            void *temp;

            buffer_size *= 2;
            temp = realloc( buffer, buffer_size );

            if ( temp == NULL )
            {
                fprintf( stderr, "allocation error!\n" );
                exit( EXIT_FAILURE );
            }

            buffer = temp;
        }
    }

    //make sure that `fread` did not stop due to an
    //error (it should only stop due to end-of-file)
    if ( ferror(stdin) )
    {
        fprintf( stderr, "input error!\n" );
        exit( EXIT_FAILURE );
    }

    //add terminating null character
    buffer[input_size++] = '\0';

    //shrink buffer to required size
    {
        void *temp;

        temp = realloc( buffer, input_size );

        if ( temp == NULL )
        {
            fprintf( stderr, "allocation error!\n" );
            exit( EXIT_FAILURE );
        }

        buffer = temp;
    }

    //the entire contents is now stored in "buffer" as a
    //string, and can be printed
    printf( "contents of buffer:\n%s\n", buffer );

    free( buffer );
}

The code above assumes that the input will be terminated by an end of file condition, which is probably the case if the input is piped from a file.

On second thought, instead of having one large string for the whole file, as you are doing in your code, you may rather want an array of char* pointing to the individual strings, each representing a line, so that for example lines[0] will be the string of the first line and lines[1] will be the string of the second line. That way, you can easily use strstr to find the " ==== " delimiter and strchr on each individual line to find the individual words, and still have all the lines in memory for further processing.

I don't recommend that you use strtok in this case, because that function is destructive in the sense that it modifies the string by replacing the delimiters with null characters. If you require the strings for further processing, as you stated in the comments section, then this is probably not what you want. That is why I recommend that you use strchr instead.
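The non-destructive approach can be sketched as follows. This is a variation that uses strspn and strcspn rather than strchr, so that runs of consecutive spaces do not produce empty tokens; the function name next_token and its interface are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Finds the next space-separated token at or after *cursor without
   modifying the string. On success, stores the token length in *len,
   advances *cursor past the token, and returns a pointer to the token's
   first character. Returns NULL when no tokens remain. */
const char *next_token(const char **cursor, size_t *len)
{
    const char *start = *cursor + strspn(*cursor, " "); /* skip spaces */

    if (*start == '\0')
        return NULL;

    *len = strcspn(start, " "); /* length up to next space or end */
    *cursor = start + *len;
    return start;
}
```

Because the tokens are not null-terminated, they have to be printed with a length, for example printf("%.*s\n", (int)len, token), in the same way as the strchr-based loop does.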

If a reasonable maximum number of lines is known at compile-time, then the solution is rather easy:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE_LENGTH 1024
#define MAX_LINES 1024

int main( void )
{
    char *lines[MAX_LINES];
    int num_lines = 0;

    char buffer[MAX_LINE_LENGTH];

    //read one line per loop iteration
    while ( fgets( buffer, sizeof buffer, stdin ) != NULL )
    {
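        size_t line_length = strlen( buffer );

        //make sure that there is still room in the array "lines"
        if ( num_lines == MAX_LINES )
        {
            fprintf( stderr, "too many input lines!\n" );
            exit( EXIT_FAILURE );
        }

        //verify that entire line was read in
        if ( line_length == 0 || buffer[line_length-1] != '\n' )
        {
            //treat end-of-file as equivalent to newline character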
            if ( !feof( stdin ) )
            {
                fprintf( stderr, "input line exceeds maximum line length!\n" );
                exit( EXIT_FAILURE );
            }
        }
        else
        {
            //remove newline character from string
            buffer[--line_length] = '\0';
        }

        //allocate memory for new string and add to array
        lines[num_lines] = malloc( line_length + 1 );

        //verify that "malloc" succeeded
        if ( lines[num_lines] == NULL )
        {
            fprintf( stderr, "allocation error!\n" );
            exit( EXIT_FAILURE );
        }

        //copy line to newly allocated buffer
        strcpy( lines[num_lines], buffer );

        //increment counter
        num_lines++;
    }

    //All input lines have now been successfully read in, so
    //we can now do something with them.

    //handle one line per loop iteration
    for ( int i = 0; i < num_lines; i++ )
    {
        char *p, *q;

        //attempt to find the " ==== " marker
        p = strstr( lines[i], " ==== " );
        if ( p == NULL )
        {
            printf( "Warning: skipping line because unable to find \" ==== \".\n" );
            continue;
        }

        //skip the " ==== " marker
        p += 6;

        //split tokens on remainder of line using "strchr"
        while ( ( q = strchr( p, ' ') ) != NULL )
        {
            printf( "found token: %.*s\n", (int)(q-p), p );
            p = q + 1;
        }

        //output last token
        printf( "found token: %s\n", p );
    }

    //cleanup allocated memory
    for ( int i = 0; i < num_lines; i++ )
    {
        free( lines[i] );
    }
}

When running the program above with the following input

first line before deliminator ==== first line after deliminator
second line before deliminator ==== second line after deliminator

it has the following output:

found token: first
found token: line
found token: after
found token: deliminator
found token: second
found token: line
found token: after
found token: deliminator

If, however, there is no reasonable maximum number of lines known at compile-time, then the array lines will also have to be designed to grow in a similar way as buffer in the previous program. The same applies for the maximum line length.
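A growable array of lines could look roughly like the sketch below, which applies the same doubling strategy with realloc. The type line_list, its functions, and the initial capacity of 16 are assumptions made for this illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable list of heap-allocated lines. */
struct line_list
{
    char **lines;
    size_t count;
    size_t capacity;
};

/* Appends a copy of `line`. Returns 0 on success, or -1 on
   allocation failure, in which case no line is added. */
int line_list_push(struct line_list *list, const char *line)
{
    if (list->count == list->capacity)
    {
        /* double the capacity, starting at 16 entries */
        size_t new_capacity = list->capacity ? list->capacity * 2 : 16;
        char **temp = realloc(list->lines, new_capacity * sizeof *temp);
        if (temp == NULL)
            return -1;
        list->lines = temp;
        list->capacity = new_capacity;
    }

    char *copy = malloc(strlen(line) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, line);

    list->lines[list->count++] = copy;
    return 0;
}

/* Frees all lines and resets the list to empty. */
void line_list_free(struct line_list *list)
{
    for (size_t i = 0; i < list->count; i++)
        free(list->lines[i]);
    free(list->lines);
    list->lines = NULL;
    list->count = list->capacity = 0;
}
```

A list must start out zero-initialized, for example struct line_list list = { NULL, 0, 0 };, after which line_list_push can replace the malloc and strcpy steps inside the fgets loop.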
