
How to read in words from an input file that ignores punctuation using fscanf?

I am trying to use fscanf to read from an input file while reading only the letters and ignoring special characters like commas, periods, etc. I tried the code below, but it does not print anything when I try to print each input word.

I have also tried "%20[a-zA-Z]" and "%20[a-zA-Z] " in the fscanf call.

char** input;
input = (char **)malloc(numWordsInput*sizeof(char*));

for (i = 0; i < numWordsInput; i++)
{
  fscanf(in_file, "%s", buffer);
  sLength = strlen(buffer)+1;
  input[i] = (char *)malloc(sLength*sizeof(char));
}
rewind(in_file);
for (i = 0; i < numWordsInput; i++)
{
  fscanf(in_file, "%20[a-zA-Z]%*[a-zA-Z]", input[i]);
}

It is unclear why you are allocating a pointer-to-pointer-to-char and then storage for each word, but if all you need is to classify which characters are [a-zA-Z], the C library provides a number of macros in ctype.h, like isalpha(), that do exactly that.

(OK, your comment about storing words came as I was done with this part of the answer, so I'll add the word handling in a minute)

To handle file input and check whether each character is [a-zA-Z], all you need to do is open the file, use a character-oriented input function like fgetc, and test each character with isalpha(). A short example that does just that is:

#include <stdio.h>
#include <ctype.h>

int main (int argc, char **argv) {

    int c;
    /* use filename provided as 1st argument (stdin by default) */
    FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;

    if (!fp) {  /* validate file open for reading */
        perror ("file open failed");
        return 1;
    }

    while ((c = fgetc (fp)) != EOF)     /* read each char in file */
        if (isalpha (c) || c == '\n')   /* is it a-zA-Z or \n  */
            putchar (c);                /* output it */

    if (fp != stdin) fclose (fp);   /* close file if not stdin */

    return 0;
}

(basic stream I/O is buffered anyway (8192 bytes on Linux), so reading a character at a time incurs no penalty compared with reading into a larger buffer)

Example Input File

So if you had a messy input file:

$ cat ../dat/10intmess.txt
8572,;a -2213,;--a 6434,;
a- 16330,;a

- The Quick
Brown%3034 Fox
12346Jumps Over
A
4855,;*;Lazy 16985/,;a
Dog.
11250
1495

Example Use/Output

... and simply wanted to pick the [a-zA-Z] characters from it (and the '\n' characters to preserve line spacing for the example), you would get:

$ ./bin/readalpha ../dat/10intmess.txt
aa
aa

TheQuick
BrownFox
JumpsOver
A
Lazya
Dog

If you also wanted to include [0-9], you would simply use isalnum (c) instead of isalpha (c).
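Since every ctype.h classifier takes an int and returns an int, switching between isalpha and isalnum can be reduced to passing a different function pointer. The following is a minimal sketch of that idea; filter_chars and its parameters are hypothetical names that are not part of the programs in this answer:

```c
#include <ctype.h>
#include <stddef.h>

/* filter_chars is a hypothetical helper, not part of the original answer.
 * Copies into dst only the characters of src accepted by keep(), which can
 * be any <ctype.h> classifier such as isalpha or isalnum. At most dstsz - 1
 * characters are stored and dst is nul-terminated whenever dstsz > 0.
 * Returns the number of characters stored.
 */
size_t filter_chars (const char *src, char *dst, size_t dstsz, int (*keep)(int))
{
    size_t n = 0;

    if (!dstsz)                         /* no room even for the terminator */
        return 0;

    for (; *src && n + 1 < dstsz; src++)
        if (keep ((unsigned char)*src)) /* cast avoids UB for negative char */
            dst[n++] = *src;

    dst[n] = 0;                         /* nul-terminate result */
    return n;
}
```

Called as filter_chars (line, out, sizeof out, isalpha) it keeps only letters; passing isalnum instead keeps the digits as well.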

You are also free to read a line at a time (or a word at a time, for that matter) and simply walk a pointer down the buffer doing the same thing. For instance you could do:

#include <stdio.h>
#include <ctype.h>

#define MAXC 4096

int main (int argc, char **argv) {

    char buf[MAXC];
    /* use filename provided as 1st argument (stdin by default) */
    FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;

    if (!fp) {  /* validate file open for reading */
        perror ("file open failed");
        return 1;
    }

    while (fgets (buf, MAXC, fp)) {             /* read each line in file */
        char *p = buf;                          /* pointer to buffer */
        while (*p) {                            /* loop over each char */
            /* is it a-zA-Z or \n (cast avoids UB for negative char values) */
            if (isalpha ((unsigned char)*p) || *p == '\n')
                putchar (*p);                   /* output it */
            p++;
        }
    }
    if (fp != stdin) fclose (fp);   /* close file if not stdin */

    return 0;
}

(output is the same)

Or if you prefer using indexes rather than a pointer, you could use:

    while (fgets (buf, MAXC, fp))               /* read each line in file */
        for (int i = 0; buf[i]; i++)            /* loop over each char */
            /* is it a-zA-Z or \n (cast avoids UB for negative char values) */
            if (isalpha ((unsigned char)buf[i]) || buf[i] == '\n')
                putchar (buf[i]);               /* output it */

(output is the same)

Look things over and let me know if you have questions. If you do need to process a word at a time, you will have to add significantly to your code to track your number of pointers and realloc as required. Give me a sec and I'll help there; in the meantime, digest the basic character classification above.
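As an aside, the scanset formats tried in the question can be made to work. "%20[a-zA-Z]%*[a-zA-Z]" never consumes a non-letter, so as soon as the next character in the file is a digit or punctuation the conversion fails, nothing is stored, and every later call fails at that same character. Skipping the non-letters first with "%*[^a-zA-Z]" avoids that. A minimal sketch, assuming ASCII letters only; next_alpha_word and WORDSZ are hypothetical names, and note that the meaning of a '-' range inside a scanset is implementation-defined, although common C libraries treat a-z as you would expect:

```c
#include <stdio.h>

#define WORDSZ 64   /* assumed size of the caller's word buffer */

/* next_alpha_word is a hypothetical helper, not part of the original answer.
 * Reads the next run of ASCII letters from fp into word, which must hold at
 * least WORDSZ bytes. Returns 1 if a word was stored, 0 when no input is left.
 */
int next_alpha_word (FILE *fp, char *word)
{
    /* 1. discard any run of non-letters; EOF here means the file is done.
     *    (a return of 0 just means the next char is already a letter)
     */
    if (fscanf (fp, "%*[^a-zA-Z]") == EOF)
        return 0;

    /* 2. store up to WORDSZ - 1 letters, then discard whatever letters
     *    remain of an over-long word; only the first conversion is counted
     */
    return fscanf (fp, "%63[a-zA-Z]%*[a-zA-Z]", word) == 1;
}
```

Unlike filtering a whitespace-delimited word with isalpha(), this treats every non-letter as a separator, so "Brown%3034" still gives "Brown", but something like "a-b" would give two words, "a" and "b", rather than "ab".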

Allocating And Storing Individual Words of Only Alpha-Characters

As you can imagine, dynamically allocating pointers and then allocating for, and storing, each word made up of only alpha-characters is a bit more involved. It's not any more difficult; you simply have to keep track of the number of pointers allocated and, if you have used all allocated pointers, reallocate and keep going.

The place where new C programmers usually get into trouble is failing to validate each required step. You must ensure each allocation succeeds to avoid writing to memory you don't own, which invokes Undefined Behavior.

Reading individual words with fscanf is fine. Then, to ensure you have alpha characters to store, it makes sense to extract the alpha characters into a separate temporary buffer and check whether any were actually stored before allocating storage for that word. The longest word in the non-medical unabridged dictionary is 29 characters, so a fixed buffer larger than that will suffice (1024 chars is used below -- Don't Skimp on Buffer Size!)
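One thing worth guarding when reading with "%s" is the buffer bound: without a field width, fscanf stores as many characters as the word contains, however large the buffer is. The width has to appear as a literal in the format string, so a common preprocessor idiom is to stringify a macro. A minimal sketch, where MAXW, STR_, STR and read_word are hypothetical names used only for illustration:

```c
#include <stdio.h>

#define MAXW 1023       /* assumed max chars per word (buffer is MAXW + 1) */
#define STR_(x) #x
#define STR(x)  STR_(x) /* expands MAXW first, then stringifies: "1023" */

/* read_word is a hypothetical helper, not part of the original answer.
 * Reads one whitespace-delimited word into buf, which must hold at least
 * MAXW + 1 bytes. Returns 1 on success, 0 when no more input is available.
 */
int read_word (FILE *fp, char *buf)
{
    /* the format string is assembled at compile time as "%1023s" */
    return fscanf (fp, "%" STR(MAXW) "s", buf) == 1;
}
```

A word longer than MAXW characters is then split across successive reads rather than written past the end of buf.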

So what you need for storing each word and tracking the number of pointers allocated and number of pointers used, as well as your fixed buffer to read into would be similar to:

#define NPTR    8   /* initial number of pointers */
#define MAXC 1024
...
    char **input,           /* pointers to words */
        buf[MAXC];          /* read buffer */
    size_t  nptr = NPTR,    /* number of allocated pointers */
            used = 0;       /* number of used pointers */

After allocating your initial number of pointers you can read each word and then parse the alpha-characters from it similar to the following:

    while (fscanf (fp, "%1023s", buf) == 1) {   /* read each word (max MAXC-1 chars) */
        size_t ndx = 0;                         /* alpha char index */
        char tmp[MAXC];                         /* temp buffer for alpha */
        if (used == nptr)                       /* check if realloc needed */
            input = xrealloc2 (input, sizeof *input, &nptr);    /* realloc */
        for (int i = 0; buf[i]; i++)            /* loop over each char */
            if (isalpha ((unsigned char)buf[i]))    /* is it a-zA-Z */
                tmp[ndx++] = buf[i];            /* store alpha chars */
        if (!ndx)                               /* if no alpha-chars */
            continue;                           /* get next word */
        tmp[ndx] = 0;                           /* nul-terminate chars */
        input[used] = dupstr (tmp);             /* allocate/copy tmp */
        if (!input[used]) {                     /* validate word storage */
            if (used)           /* if words already stored */
                break;          /* break, earlier words still good */
            else {              /* otherwise bail */
                fputs ("error: allocating 1st word.\n", stderr);
                return 1;
            }
        }
        used++;                                 /* increment used count */
    }

( note: when the number of used pointers equals the number allocated, then the input is reallocated to twice the current number of pointers)

The xrealloc2 and dupstr functions are simply helper functions. xrealloc2 simply calls realloc and doubles the size of the current allocation, validating the allocation and returning the reallocated pointer on success, or currently exiting on failure -- you can change it to return NULL to handle the error if you like.

/** realloc 'ptr' of 'nelem' of 'psz' to 'nelem * 2' of 'psz'.
 *  returns pointer to reallocated block of memory with new
 *  memory initialized to 0/NULL. return must be assigned to
 *  original pointer in caller.
 */
void *xrealloc2 (void *ptr, size_t psz, size_t *nelem)
{   void *memptr = realloc ((char *)ptr, *nelem * 2 * psz);
    if (!memptr) {
        perror ("realloc(): virtual memory exhausted.");
        exit (EXIT_FAILURE);
        /* return NULL; */
    }   /* zero new memory (optional) */
    memset ((char *)memptr + *nelem * psz, 0, *nelem * psz);
    *nelem *= 2;
    return memptr;
}
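If you would rather handle the failure in the caller, a NULL-returning variant could look like the following sketch. The name xrealloc2_null and the zero-size/overflow guard are assumptions added for illustration. The important part is that the original block is left untouched on failure, so the caller has to assign the result to a temporary pointer first:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* xrealloc2_null is a hypothetical variant of xrealloc2 that reports
 * failure instead of exiting. Doubles 'ptr' from '*nelem' to '*nelem * 2'
 * elements of 'psz' bytes and zeroes the new half. On failure returns NULL
 * and leaves both the block at 'ptr' and '*nelem' unchanged.
 *
 * Intended use in the caller:
 *     void *tmp = xrealloc2_null (input, sizeof *input, &nptr);
 *     if (!tmp)
 *         break;          (input is still valid and must still be freed)
 *     input = tmp;
 */
void *xrealloc2_null (void *ptr, size_t psz, size_t *nelem)
{
    void *memptr;

    /* reject zero sizes and requests that would overflow size_t */
    if (!psz || !*nelem || *nelem > SIZE_MAX / 2 / psz)
        return NULL;

    memptr = realloc (ptr, *nelem * 2 * psz);
    if (!memptr)
        return NULL;        /* original block is untouched */

    /* zero new memory (optional) */
    memset ((char *)memptr + *nelem * psz, 0, *nelem * psz);
    *nelem *= 2;
    return memptr;
}
```

Assigning the result of realloc straight back to input would lose the only pointer to the original block if the call failed, which is why the temporary is needed.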

The dupstr function is just a normal strdup, but since not all compilers provide strdup, it is used to ensure portability.

/** allocate storage for s + 1 chars and copy contents of s
 *  to allocated block returning new string on success,
 *  NULL otherwise.
 */
char *dupstr (const char *s)
{
    size_t len = strlen (s);
    char *str = malloc (len + 1);

    if (!str)
        return NULL;

    return memcpy (str, s, len + 1);
}

Using the helpers just keeps the main body of your code a tad cleaner rather than cramming it all into your loop.

Putting it all together you could do:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#define NPTR    8   /* initial number of pointers */
#define MAXC 1024

void *xrealloc2 (void *ptr, size_t psz, size_t *nelem);
char *dupstr (const char *s);

int main (int argc, char **argv) {

    char **input,           /* pointers to words */
        buf[MAXC];          /* read buffer */
    size_t  nptr = NPTR,    /* number of allocated pointers */
            used = 0;       /* number of used pointers */
    /* use filename provided as 1st argument (stdin by default) */
    FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;

    if (!fp) {  /* validate file open for reading */
        perror ("file open failed");
        return 1;
    }

    input = malloc (nptr * sizeof *input);  /* allocate nptr pointers */
    if (!input) {                           /* validate every allocation */
        perror ("malloc-input");
        return 1;
    }

    while (fscanf (fp, "%1023s", buf) == 1) {   /* read each word (max MAXC-1 chars) */
        size_t ndx = 0;                         /* alpha char index */
        char tmp[MAXC];                         /* temp buffer for alpha */
        if (used == nptr)                       /* check if realloc needed */
            input = xrealloc2 (input, sizeof *input, &nptr);    /* realloc */
        for (int i = 0; buf[i]; i++)            /* loop over each char */
            if (isalpha ((unsigned char)buf[i]))    /* is it a-zA-Z */
                tmp[ndx++] = buf[i];            /* store alpha chars */
        if (!ndx)                               /* if no alpha-chars */
            continue;                           /* get next word */
        tmp[ndx] = 0;                           /* nul-terminate chars */
        input[used] = dupstr (tmp);             /* allocate/copy tmp */
        if (!input[used]) {                     /* validate word storage */
            if (used)           /* if words already stored */
                break;          /* break, earlier words still good */
            else {              /* otherwise bail */
                fputs ("error: allocating 1st word.\n", stderr);
                return 1;
            }
        }
        used++;                                 /* increment used count */
    }
    if (fp != stdin) fclose (fp);   /* close file if not stdin */

    for (size_t i = 0; i < used; i++) {
        printf ("word[%3zu]: %s\n", i, input[i]);
        free (input[i]);    /* free storage when done with word */
    }
    free (input);           /* free pointers */

    return 0;
}

/** realloc 'ptr' of 'nelem' of 'psz' to 'nelem * 2' of 'psz'.
 *  returns pointer to reallocated block of memory with new
 *  memory initialized to 0/NULL. return must be assigned to
 *  original pointer in caller.
 */
void *xrealloc2 (void *ptr, size_t psz, size_t *nelem)
{   void *memptr = realloc ((char *)ptr, *nelem * 2 * psz);
    if (!memptr) {
        perror ("realloc(): virtual memory exhausted.");
        exit (EXIT_FAILURE);
        /* return NULL; */
    }   /* zero new memory (optional) */
    memset ((char *)memptr + *nelem * psz, 0, *nelem * psz);
    *nelem *= 2;
    return memptr;
}

/** allocate storage for s + 1 chars and copy contents of s
 *  to allocated block returning new string on success,
 *  NULL otherwise.
 */
char *dupstr (const char *s)
{
    size_t len = strlen (s);
    char *str = malloc (len + 1);

    if (!str)
        return NULL;

    return memcpy (str, s, len + 1);
}

(same input file is used)

Example Use/Output

$ ./bin/readalphadyn ../dat/10intmess.txt
word[  0]: a
word[  1]: a
word[  2]: a
word[  3]: a
word[  4]: The
word[  5]: Quick
word[  6]: Brown
word[  7]: Fox
word[  8]: Jumps
word[  9]: Over
word[ 10]: A
word[ 11]: Lazy
word[ 12]: a
word[ 13]: Dog

Memory Use/Error Check

In any code you write that dynamically allocates memory, you have 2 responsibilities regarding any block of memory allocated: (1) always preserve a pointer to the starting address for the block of memory so, (2) it can be freed when it is no longer needed.
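For the pointer-to-pointer layout used here, both responsibilities can be wrapped in one small cleanup routine. A minimal sketch; free_words is a hypothetical helper, not part of the program above, and it assumes every element of words is either NULL or a pointer returned by malloc:

```c
#include <stdlib.h>

/* free_words is a hypothetical helper, not part of the original answer.
 * Frees each of the n strings in words, then the pointer array itself.
 * Returns the number of non-NULL strings that were freed.
 */
size_t free_words (char **words, size_t n)
{
    size_t freed = 0;

    if (!words)
        return 0;

    /* index into the array rather than advancing 'words', so the
     * starting address of the block is preserved for the final free
     */
    for (size_t i = 0; i < n; i++) {
        if (words[i])
            freed++;
        free (words[i]);    /* free (NULL) is a harmless no-op */
    }
    free (words);           /* finally release the pointers */

    return freed;
}
```

In the program above, a call such as free_words (input, used) could replace the two free calls at the end once the words have been printed.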

It is imperative that you use a memory error checking program to ensure you do not attempt to access memory or write beyond/outside the bounds of your allocated block, attempt to read or base a conditional jump on an uninitialized value, and finally, to confirm that you free all the memory you have allocated.

For Linux, valgrind is the normal choice. There are similar memory checkers for every platform. They are all simple to use; just run your program through it.

$ valgrind ./bin/readalphadyn ../dat/10intmess.txt
==8765== Memcheck, a memory error detector
==8765== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==8765== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==8765== Command: ./bin/readalphadyn ../dat/10intmess.txt
==8765==
word[  0]: a
word[  1]: a
word[  2]: a
word[  3]: a
word[  4]: The
word[  5]: Quick
word[  6]: Brown
word[  7]: Fox
word[  8]: Jumps
word[  9]: Over
word[ 10]: A
word[ 11]: Lazy
word[ 12]: a
word[ 13]: Dog
==8765==
==8765== HEAP SUMMARY:
==8765==     in use at exit: 0 bytes in 0 blocks
==8765==   total heap usage: 17 allocs, 17 frees, 796 bytes allocated
==8765==
==8765== All heap blocks were freed -- no leaks are possible
==8765==
==8765== For counts of detected and suppressed errors, rerun with: -v
==8765== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Always confirm that you have freed all memory you have allocated and that there are no memory errors.

(note: There is no need to cast the return of malloc; it is unnecessary. See: Do I cast the result of malloc?)

To skip single-character words (or pick the limit you want), you can simply change the test on ndx to:

    if (ndx < 2)                            /* if 0/1 alpha-chars */
        continue;                           /* get next word */

Doing that would change your stored words to:

$ ./bin/readalphadyn ../dat/10intmess.txt
word[  0]: The
word[  1]: Quick
word[  2]: Brown
word[  3]: Fox
word[  4]: Jumps
word[  5]: Over
word[  6]: Lazy
word[  7]: Dog
