How to read in words from an input file that ignores punctuation using fscanf?

I am trying to use fscanf to read from an input file, keeping only the letters and ignoring special characters such as commas and periods. I tried the code below, but nothing is printed when I try to print each input word.

I have also tried "%20[a-zA-Z]" and "%20[a-zA-Z] " as the fscanf format string.

char** input;
input = (char **)malloc(numWordsInput*sizeof(char*));

for (i = 0; i < numWordsInput; i++)
{
  fscanf(in_file, "%s", buffer);
  sLength = strlen(buffer)+1;
  input[i] = (char *)malloc(sLength*sizeof(char));
}
rewind(in_file);
for (i = 0; i < numWordsInput; i++)
{
  fscanf(in_file, "%20[a-zA-Z]%*[a-zA-Z]", input[i]);
}

It is unclear why you are attempting to create a pointer-to-pointer to char and then allocating for each word, but to simply classify the characters that are [a-zA-Z], the C library provides a number of macros in ctype.h, like isalpha(), that do exactly that.

(OK, your comment about storing words came as I was finishing this part of the answer, so I'll add the word handling in a minute)

To handle file input and to check whether each character is [a-zA-Z], all you need to do is open the file, use a character-oriented input function like fgetc, and test each character with isalpha(). A short example that does just that is:

#include <stdio.h>
#include <ctype.h>

int main (int argc, char **argv) {

    int c;
    /* use filename provided as 1st argument (stdin by default) */
    FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;

    if (!fp) {  /* validate file open for reading */
        perror ("file open failed");
        return 1;
    }

    while ((c = fgetc (fp)) != EOF)     /* read each char in file */
        if (isalpha (c) || c == '\n')   /* is it a-zA-Z or \n  */
            putchar (c);                /* output it */

    if (fp != stdin) fclose (fp);   /* close file if not stdin */

    return 0;
}

(basic stream I/O is buffered anyway (typically 8192 bytes on Linux), so you don't incur a penalty from not reading into a larger buffer)

Example Input File

So if you had a messy input file:

$ cat ../dat/10intmess.txt
8572,;a -2213,;--a 6434,;
a- 16330,;a

- The Quick
Brown%3034 Fox
12346Jumps Over
A
4855,;*;Lazy 16985/,;a
Dog.
11250
1495

Example Use/Output

... and simply wanted to pick the [a-zA-Z] characters from it (and the '\n' characters to preserve line spacing for the example), you would get:

$ ./bin/readalpha ../dat/10intmess.txt
aa
aa

TheQuick
BrownFox
JumpsOver
A
Lazya
Dog

If you wanted to also include [0-9], you would simply use isalnum (c) instead of isalpha (c).
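To make that swap explicit, here is a minimal sketch that passes the classification function in as a parameter. The name filter_stream and its signature are hypothetical, chosen only for illustration; it does the same fgetc loop as the example above, but writes to a stream you supply and returns the number of characters kept.

```c
#include <ctype.h>
#include <stdio.h>

/* Copy from 'in' to 'out' every char accepted by keep() (e.g. isalpha
 * or isalnum), plus '\n' to preserve line breaks.
 * Returns the number of chars written. (hypothetical helper)
 */
size_t filter_stream (FILE *in, FILE *out, int (*keep)(int))
{
    size_t n = 0;
    int c;

    while ((c = fgetc (in)) != EOF)     /* read each char in stream */
        if (keep (c) || c == '\n') {    /* accepted by predicate or \n */
            fputc (c, out);             /* output it */
            n++;
        }

    return n;
}
```

Calling filter_stream (fp, stdout, isalpha) should behave like the loop above, and filter_stream (fp, stdout, isalnum) keeps the digits as well.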

You are also free to read a line at a time (or a word at a time) for that matter and simply walk a pointer down the buffer doing the same thing. For instance you could do:

#include <stdio.h>
#include <ctype.h>

#define MAXC 4096

int main (int argc, char **argv) {

    char buf[MAXC];
    /* use filename provided as 1st argument (stdin by default) */
    FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;

    if (!fp) {  /* validate file open for reading */
        perror ("file open failed");
        return 1;
    }

    while (fgets (buf, MAXC, fp)) {             /* read each line in file */
        char *p = buf;                          /* pointer to buffer */
        while (*p) {                            /* loop over each char */
            if (isalpha ((unsigned char)*p) || *p == '\n')  /* a-zA-Z or \n */
                putchar (*p);                   /* output it */
            p++;
        }
    }
    if (fp != stdin) fclose (fp);   /* close file if not stdin */

    return 0;
}

(output is the same)

Or if you prefer using indexes rather than a pointer, you could use:

    while (fgets (buf, MAXC, fp))               /* read each line in file */
        for (int i = 0; buf[i]; i++)            /* loop over each char */
            if (isalpha ((unsigned char)buf[i]) || buf[i] == '\n') /* a-zA-Z or \n */
                putchar (buf[i]);               /* output it */

(output is the same)

Look things over and let me know if you have questions. If you do need to do it a word at a time, you will have to add significantly to your code to track your number of pointers and realloc as required. Give me a sec and I'll help there; in the meantime, digest the basic character classification above.

Allocating And Storing Individual Words of Only Alpha-Characters

As you can imagine, dynamically allocating pointers and then allocating for and storing each word made up of only alpha-characters is a bit more involved. It's not any more difficult; you simply have to keep track of the number of pointers allocated, and if you have used all allocated pointers, reallocate and keep going.

The place where new C programmers usually get into trouble is failing to validate each required step to ensure each allocation succeeds. Skipping that check risks writing to memory you don't own, which invokes Undefined Behavior.

Reading individual words with fscanf is fine. Then, to ensure you have alpha characters to store, it makes sense to extract the alpha characters into a separate temporary buffer and check whether any were actually stored before allocating storage for that word. The longest word in the non-medical unabridged dictionary is 29 characters, so a fixed buffer larger than that will suffice (1024 chars is used below; Don't Skimp on Buffer Size!)
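The extraction step described above can be pulled out into a small helper. The following is only a sketch under stated assumptions: the name extract_alpha and its signature are hypothetical, and unlike the inline loop in this answer it takes the destination size and truncates instead of relying on both buffers being the same size.

```c
#include <ctype.h>
#include <stddef.h>

/* Copy only the alphabetic chars of 'src' into 'dst', which holds
 * 'dstsz' bytes including the nul-terminator. Extra chars are dropped.
 * Returns the number of alpha chars stored. (hypothetical helper)
 */
size_t extract_alpha (const char *src, char *dst, size_t dstsz)
{
    size_t n = 0;

    if (!dstsz)                     /* no room even for the terminator */
        return 0;

    for (; *src && n + 1 < dstsz; src++)        /* loop over each char */
        if (isalpha ((unsigned char)*src))      /* is it a letter? */
            dst[n++] = *src;                    /* store alpha char */

    dst[n] = 0;                     /* nul-terminate result */

    return n;
}
```

A return of 0 means the word held no letters at all (for example "16330,;"), so there is nothing to allocate for and you can move on to the next word.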

So what you need for storing each word, tracking the number of pointers allocated and the number of pointers used, and providing a fixed buffer to read into would be similar to:

#define NPTR    8   /* initial number of pointers */
#define MAXC 1024
...
    char **input,           /* pointers to words */
        buf[MAXC];          /* read buffer */
    size_t  nptr = NPTR,    /* number of allocated pointers */
            used = 0;       /* number of used pointers */

After allocating your initial number of pointers, you can read each word and then parse the alpha-characters from it, similar to the following:

    while (fscanf (fp, "%1023s", buf) == 1) {   /* read each word, MAXC-1 chars max */
        size_t ndx = 0;                         /* alpha char index */
        char tmp[MAXC];                         /* temp buffer for alpha */
        if (used == nptr)                       /* check if realloc needed */
            input = xrealloc2 (input, sizeof *input, &nptr);    /* realloc */
        for (int i = 0; buf[i]; i++)            /* loop over each char */
            if (isalpha ((unsigned char)buf[i]))    /* is it a-zA-Z */
                tmp[ndx++] = buf[i];            /* store alpha chars */
        if (!ndx)                               /* if no alpha-chars */
            continue;                           /* get next word */
        tmp[ndx] = 0;                           /* nul-terminate chars */
        input[used] = dupstr (tmp);             /* allocate/copy tmp */
        if (!input[used]) {                     /* validate word storage */
            if (used)           /* if words already stored */
                break;          /* break, earlier words still good */
            else {              /* otherwise bail */
                fputs ("error: allocating 1st word.\n", stderr);
                return 1;
            }
        }
        used++;                                 /* increment used count */
    }

(note: when the number of used pointers equals the number allocated, input is reallocated to twice the current number of pointers)

The xrealloc2 and dupstr functions are simply helper functions. xrealloc2 calls realloc to double the size of the current allocation, validates the allocation, and returns the reallocated pointer on success. On failure it currently exits; you can change it to return NULL and handle the error in the caller if you like.

/** realloc 'ptr' of 'nelem' of 'psz' to 'nelem * 2' of 'psz'.
 *  returns pointer to reallocated block of memory with new
 *  memory initialized to 0/NULL. return must be assigned to
 *  original pointer in caller.
 */
void *xrealloc2 (void *ptr, size_t psz, size_t *nelem)
{   void *memptr = realloc ((char *)ptr, *nelem * 2 * psz);
    if (!memptr) {
        perror ("realloc(): virtual memory exhausted.");
        exit (EXIT_FAILURE);
        /* return NULL; */
    }   /* zero new memory (optional) */
    memset ((char *)memptr + *nelem * psz, 0, *nelem * psz);
    *nelem *= 2;
    return memptr;
}
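If you would rather handle the failure yourself, as the commented-out return NULL suggests, one possible shape is sketched here. The names realloc2_or_null and grow_words are hypothetical and are not used elsewhere in this answer. The point of the sketch is the temporary pointer: assigning the result of a failed realloc straight to input would overwrite your only pointer to the original block.

```c
#include <stdlib.h>
#include <string.h>

/* Like xrealloc2, but returns NULL on failure and leaves both the
 * original block and *nelem untouched. (hypothetical variant)
 */
void *realloc2_or_null (void *ptr, size_t psz, size_t *nelem)
{
    void *memptr = realloc (ptr, *nelem * 2 * psz);

    if (!memptr)
        return NULL;            /* original block is still valid */

    /* zero new memory (optional) */
    memset ((char *)memptr + *nelem * psz, 0, *nelem * psz);
    *nelem *= 2;

    return memptr;
}

/* Grow an array of word pointers through a temporary pointer.
 * Returns 1 on success, 0 on failure. (hypothetical caller)
 */
int grow_words (char ***words, size_t *nptr)
{
    void *tmp = realloc2_or_null (*words, sizeof **words, nptr);

    if (!tmp)
        return 0;               /* *words still points to old block */

    *words = tmp;               /* assign only after validation */

    return 1;
}
```

On a 0 return the words stored so far are still reachable, so the caller can stop reading, use what it has, and free everything normally.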

The dupstr function is just a normal strdup, but since not all compilers provide strdup, dupstr is used to ensure portability.

/** allocate storage for s + 1 chars and copy contents of s
 *  to allocated block returning new string on success,
 *  NULL otherwise.
 */
char *dupstr (const char *s)
{
    size_t len = strlen (s);
    char *str = malloc (len + 1);

    if (!str)
        return NULL;

    return memcpy (str, s, len + 1);
}

Using the helpers just keeps the main body of your code a tad cleaner rather than cramming it all into your loop.

Putting it all together, you could do:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#define NPTR    8   /* initial number of pointers */
#define MAXC 1024

void *xrealloc2 (void *ptr, size_t psz, size_t *nelem);
char *dupstr (const char *s);

int main (int argc, char **argv) {

    char **input,           /* pointers to words */
        buf[MAXC];          /* read buffer */
    size_t  nptr = NPTR,    /* number of allocated pointers */
            used = 0;       /* number of used pointers */
    /* use filename provided as 1st argument (stdin by default) */
    FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;

    if (!fp) {  /* validate file open for reading */
        perror ("file open failed");
        return 1;
    }

    input = malloc (nptr * sizeof *input);  /* allocate nptr pointers */
    if (!input) {                           /* validate every allocation */
        perror ("malloc-input");
        return 1;
    }

    while (fscanf (fp, "%1023s", buf) == 1) {   /* read each word, MAXC-1 chars max */
        size_t ndx = 0;                         /* alpha char index */
        char tmp[MAXC];                         /* temp buffer for alpha */
        if (used == nptr)                       /* check if realloc needed */
            input = xrealloc2 (input, sizeof *input, &nptr);    /* realloc */
        for (int i = 0; buf[i]; i++)            /* loop over each char */
            if (isalpha ((unsigned char)buf[i]))    /* is it a-zA-Z */
                tmp[ndx++] = buf[i];            /* store alpha chars */
        if (!ndx)                               /* if no alpha-chars */
            continue;                           /* get next word */
        tmp[ndx] = 0;                           /* nul-terminate chars */
        input[used] = dupstr (tmp);             /* allocate/copy tmp */
        if (!input[used]) {                     /* validate word storage */
            if (used)           /* if words already stored */
                break;          /* break, earlier words still good */
            else {              /* otherwise bail */
                fputs ("error: allocating 1st word.\n", stderr);
                return 1;
            }
        }
        used++;                                 /* increment used count */
    }
    if (fp != stdin) fclose (fp);   /* close file if not stdin */

    for (size_t i = 0; i < used; i++) {
        printf ("word[%3zu]: %s\n", i, input[i]);
        free (input[i]);    /* free storage when done with word */
    }
    free (input);           /* free pointers */

    return 0;
}

/** realloc 'ptr' of 'nelem' of 'psz' to 'nelem * 2' of 'psz'.
 *  returns pointer to reallocated block of memory with new
 *  memory initialized to 0/NULL. return must be assigned to
 *  original pointer in caller.
 */
void *xrealloc2 (void *ptr, size_t psz, size_t *nelem)
{   void *memptr = realloc ((char *)ptr, *nelem * 2 * psz);
    if (!memptr) {
        perror ("realloc(): virtual memory exhausted.");
        exit (EXIT_FAILURE);
        /* return NULL; */
    }   /* zero new memory (optional) */
    memset ((char *)memptr + *nelem * psz, 0, *nelem * psz);
    *nelem *= 2;
    return memptr;
}

/** allocate storage for s + 1 chars and copy contents of s
 *  to allocated block returning new string on success,
 *  NULL otherwise.
 */
char *dupstr (const char *s)
{
    size_t len = strlen (s);
    char *str = malloc (len + 1);

    if (!str)
        return NULL;

    return memcpy (str, s, len + 1);
}

(same input file is used)

Example Use/Output

$ ./bin/readalphadyn ../dat/10intmess.txt
word[  0]: a
word[  1]: a
word[  2]: a
word[  3]: a
word[  4]: The
word[  5]: Quick
word[  6]: Brown
word[  7]: Fox
word[  8]: Jumps
word[  9]: Over
word[ 10]: A
word[ 11]: Lazy
word[ 12]: a
word[ 13]: Dog

Memory Use/Error Check

In any code you write that dynamically allocates memory, you have 2 responsibilities regarding any block of memory allocated: (1) always preserve a pointer to the starting address for the block of memory so that (2) it can be freed when it is no longer needed.

It is imperative that you use a memory error checking program to ensure you do not attempt to access memory or write beyond/outside the bounds of your allocated block, attempt to read or base a conditional jump on an uninitialized value, and finally, to confirm that you free all the memory you have allocated.

For Linux, valgrind is the normal choice. There are similar memory checkers for every platform. They are all simple to use: just run your program through it.

$ valgrind ./bin/readalphadyn ../dat/10intmess.txt
==8765== Memcheck, a memory error detector
==8765== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==8765== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==8765== Command: ./bin/readalphadyn ../dat/10intmess.txt
==8765==
word[  0]: a
word[  1]: a
word[  2]: a
word[  3]: a
word[  4]: The
word[  5]: Quick
word[  6]: Brown
word[  7]: Fox
word[  8]: Jumps
word[  9]: Over
word[ 10]: A
word[ 11]: Lazy
word[ 12]: a
word[ 13]: Dog
==8765==
==8765== HEAP SUMMARY:
==8765==     in use at exit: 0 bytes in 0 blocks
==8765==   total heap usage: 17 allocs, 17 frees, 796 bytes allocated
==8765==
==8765== All heap blocks were freed -- no leaks are possible
==8765==
==8765== For counts of detected and suppressed errors, rerun with: -v
==8765== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Always confirm that you have freed all memory you have allocated and that there are no memory errors.

(note: there is no need to cast the return of malloc; it is unnecessary. See: Do I cast the result of malloc?)

To skip single-character words (or pick the limit you want), you can simply change the check on ndx to:

    if (ndx < 2)                            /* if 0/1 alpha-chars */
        continue;                           /* get next word */

Doing that would change your stored words to:

$ ./bin/readalphadyn ../dat/10intmess.txt
word[  0]: The
word[  1]: Quick
word[  2]: Brown
word[  3]: Fox
word[  4]: Jumps
word[  5]: Over
word[  6]: Lazy
word[  7]: Dog
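As a side note on the original attempt: a "%20[a-zA-Z]" conversion fails as soon as the next character is not a letter, and it leaves that character in the stream, so every later call fails at the same spot. (Unlike "%s", the "%[" conversion also does not skip leading whitespace.) A minimal sketch of a scanset-only reader follows; the name next_alpha_word and the WORDLEN constant are hypothetical. It assumes your C library treats a-z inside a scanset as a range, which is common but implementation-defined.

```c
#include <stdio.h>

#define WORDLEN 20      /* assumed max letters stored per word */

/* Read the next run of letters from 'fp' into 'word', which must hold
 * at least WORDLEN + 1 chars. Returns 1 if a word was stored, 0 at
 * end of input. (hypothetical helper)
 */
int next_alpha_word (FILE *fp, char *word)
{
    /* discard any run of non-letters; matches nothing (and consumes
     * nothing) if a letter is next */
    fscanf (fp, "%*[^a-zA-Z]");

    /* store up to WORDLEN letters */
    if (fscanf (fp, "%20[a-zA-Z]", word) != 1)
        return 0;

    /* discard the remainder of an over-long run of letters */
    fscanf (fp, "%*[a-zA-Z]");

    return 1;
}
```

Be aware this splits on every non-letter, so "don't" would come back as "don" and "t", whereas the "%s"-then-filter approach above stores "dont".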
