
Read from a text file and parse lines into words in C

I'm a beginner in C and systems programming. For a homework assignment, I need to write a program that reads input from stdin, parses lines into words, and sends the words to sort sub-processes using System V message queues (e.g., to count words). I got stuck at the input part. I'm trying to process the input, remove non-alphabetic characters, convert all words to lower case, and finally split each line into separate words. So far I can print all the words in lower case, but there are blank lines between some words, which I believe isn't correct. Can someone take a look and give me some suggestions?

Example from a text file: The Project Gutenberg EBook of The Iliad of Homer, by Homer

I think the correct output should be:

the
project
gutenberg
ebook
of
the
iliad
of
homer
by
homer

But my output is the following:

project
gutenberg
ebook
of
the
iliad
of
homer
                         <------ There is a blank line here
by
homer

I think the empty line is caused by the space between "," and "by". I tried things like "if isspace(c) then do nothing", but it doesn't work. My code is below. Any help or suggestion is appreciated.

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>
#include <string.h>


//Main Function
int main (int argc, char **argv)
{
    int c;
    char *input = argv[1];
    FILE *input_file;

    input_file = fopen(input, "r");

    if (input_file == 0)
    {
        //fopen returns 0, the NULL pointer, on failure
        perror("Cannot open input file");
        exit(-1);
    }
    else
    {        
        while ((c =fgetc(input_file)) != EOF )
        {
            //if it's an alpha, convert it to lower case
            if (isalpha(c))
            {
                c = tolower(c);
                putchar(c);
            }
            else if (isspace(c))
            {
                ;   //do nothing
            }
            else
            {
                c = '\n';
                putchar(c);
            }
        }
    }

    fclose(input_file);

    printf("\n");

    return 0;
}

EDIT **

I edited my code and finally got the correct output:

int main (int argc, char **argv)
{
    int c;
    char *input = argv[1];
    FILE *input_file;

    input_file = fopen(input, "r");

    if (input_file == 0)
    {
        //fopen returns 0, the NULL pointer, on failure
        perror("Cannot open input file");
        exit(-1);
    }
    else
    {
        int found_word = 0;

        while ((c =fgetc(input_file)) != EOF )
        {
            //if it's an alpha, convert it to lower case
            if (isalpha(c))
            {
                found_word = 1;
                c = tolower(c);
                putchar(c);
            }
            else {
                if (found_word) {
                    putchar('\n');
                    found_word=0;
                }
            }

        }
    }

    fclose(input_file);

    printf("\n");

    return 0;
}

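I think you just need to treat every non-alphabetic character (!isalpha(c)) as a separator and convert everything else to lower case. To avoid printing blank lines, keep track of whether you are currently inside a word, and print a newline only when a word has just ended.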

int found_word = 0;

while ((c =fgetc(input_file)) != EOF )
{
    if (!isalpha(c))
    {
        if (found_word) {
            putchar('\n');
            found_word = 0;
        }
    }
    else {
        found_word = 1;
        c = tolower(c);
        putchar(c);
    }
}

If you need to handle apostrophes within words such as "isn't", then this should do it:

int found_word = 0;
int found_apostrophe = 0;
while ((c = fgetc(input_file)) != EOF)
{
    if (!isalpha(c))
    {
        if (found_word) {
            if (!found_apostrophe && c == '\'') {
                found_apostrophe = 1;
            }
            else {
                found_apostrophe = 0;
                putchar('\n');
                found_word = 0;
            }
        }
    }
    else {
        if (found_apostrophe) {
            putchar('\'');
            found_apostrophe = 0;
        }
        found_word = 1;
        c = tolower(c);
        putchar(c);
    }
}

I suspect you really want to handle all non-alphabetical characters as separators, not just handle spaces as separators and ignore non-alphabetical characters. Otherwise, foo--bar would show up as a single word, foobar, right? The good news is that this makes things easier. You can remove the isspace clause and just use the else clause.

Meanwhile, whether you treat punctuation specially or not, you've got a problem: you print a newline for every separator character, even when several occur in a row. So a line that ends with \r\n or \n, or even a sentence that ends with "." followed by a space, will print a blank line. The obvious way around that is to keep track of the last character, or a flag, so that you only print a newline if the previous character you printed was a letter.

For example:

int last_c = 0;

while ((c = fgetc(input_file)) != EOF)
{
    //if it's an alpha, convert it to lower case
    if (isalpha(c))
    {
        c = tolower(c);
        putchar(c);
    }
    //otherwise end the word, but only if the previous character was a letter
    else if (isalpha(last_c))
    {
        putchar('\n');
    }
    last_c = c;
}

But do you really want to treat all punctuation the same? The problem statement implies that you do, but in real life, that's a bit odd. For example, foo--bar should probably show up as the separate words foo and bar, but should it's really show up as the separate words it and s? For that matter, using isalpha as your rule for "word characters" also means that, say, 2nd will show up as nd.

So, if isalpha isn't the appropriate rule for your use case to distinguish word characters from separator characters, you'll have to write your own function that makes the right distinction. You can easily express such a rule in logic (e.g., isalnum(c) || c == '\'') or with a table (just an array of 128 ints, so the function is c >= 0 && c < 128 && word_char_table[c]). Doing things that way has the added benefit that you can later extend your code to deal with Latin-1 or Unicode, or to handle program text (which has different word characters than English language text), or …
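As a rough sketch of that idea, the word-character rule can live in its own function so that the splitting loop never needs to change. The names is_word_char and split_words are made up for this illustration, and the rule chosen here (ASCII letters, digits, and the apostrophe) is only an assumption about what you might want. The loop writes into a buffer instead of calling putchar so that it is easy to check:

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical rule (an assumption, not part of the assignment):
   a word character is an ASCII letter, an ASCII digit, or an apostrophe. */
static int is_word_char(int c)
{
    return c >= 0 && c < 128 && (isalnum(c) || c == '\'');
}

/* Hypothetical helper: copies the words of src into dst in lower case,
   one word per line, and returns the number of words found.
   Output is truncated if dst is too small; dst is always NUL-terminated
   when dst_size is at least 1. */
static size_t split_words(const char *src, char *dst, size_t dst_size)
{
    size_t out = 0;     /* next write position in dst */
    size_t words = 0;   /* number of words seen so far */
    int in_word = 0;    /* same role as found_word in the loops above */

    if (dst_size == 0)
        return 0;

    for (; *src != '\0' && out + 1 < dst_size; src++) {
        unsigned char c = (unsigned char)*src;

        if (is_word_char(c)) {
            if (!in_word) {
                in_word = 1;
                words++;
            }
            dst[out++] = (char)tolower(c);
        } else if (in_word) {
            /* first separator after a word: end the line exactly once */
            dst[out++] = '\n';
            in_word = 0;
        }
    }

    /* terminate the final word if the input did not end with a separator */
    if (in_word && out + 1 < dst_size)
        dst[out++] = '\n';

    dst[out] = '\0';
    return words;
}
```

With this rule, "it's" and "2nd" stay whole while "--" still separates words. Note that this simple version also accepts a leading or trailing apostrophe as part of a word, so quoted text would need a stricter rule. Switching to a table only means changing the body of is_word_char.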

It appears that you are separating words by spaces, so I think just

while ((c =fgetc(input_file)) != EOF )
{
    if (isalpha(c))
    {
        c = tolower(c);
        putchar(c);
    }
    else if (isspace(c))
    {
       putchar('\n');
    }
}

will work too, provided your input text never has more than one space between words.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 