C - Reading limited length words in unlimited length lines

I would like to read words from a file and know when a new line starts.

I know there can be three, four or zero words per line, and the words cannot be longer than a certain length. But the line length, including spaces, is not bounded, so it is not possible to just read a line into a string, parse it and continue. I would like to know whether there are three or four words in each line as I read it.

Currently I use fscanf and some problem-specific internal logic to decide whether the fourth word I read is the first word of a new line or the fourth word of the previous line. But this approach is fragile and easily broken.

I guess I could read char by char, ignore spaces and look for '\n'. Is there a more elegant way?

Thank you

EDIT: I am limited to using C99 and standard libraries.

Here is some code that does a job closely related to what you request. There are a couple of major differences:

  1. It doesn't trust that the data supplied will obey the stated rules, so it assumes that the user will abuse those rules.
  2. Consequently, it records all the words found on each line at full length, and therefore uses dynamic memory allocation.

It went through some fairly acid testing before I posted it. Compile with make UFLAGS=-DTEST to get shorter line fragments (64 bytes instead of the default 4096); that also gives you extra diagnostic output. I did a lot of testing with MAX_LINE_LEN at 6 instead of 64; it was good for debugging problems with words continued over multiple fragments of a line.
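As a minimal sketch of the fragment idea, assuming nothing beyond standard C99: fgets() returns at most one buffer's worth of a line at a time, and a fragment completes the line only if it ends with a newline. The helper names fragment_ends_line and count_fragments are hypothetical and are not part of the program below.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Non-zero if the fragment ends with a newline, i.e. it completes the line.
 * (Hypothetical helper for illustration only.) */
static int fragment_ends_line(const char *buf)
{
    size_t len;

    assert(buf != 0);
    len = strlen(buf);
    return len > 0 && buf[len - 1] == '\n';
}

/* Count the fgets() calls needed to read one whole line into a buffer
 * of bufsiz bytes; returns 0 if nothing could be read (EOF). */
static size_t count_fragments(FILE *fp, char *buf, size_t bufsiz)
{
    size_t n = 0;

    assert(fp != 0 && buf != 0 && bufsiz > 1);
    while (fgets(buf, (int)bufsiz, fp) != 0)
    {
        n++;
        if (fragment_ends_line(buf))
            break;      /* newline seen: the line is complete */
    }
    return n;
}
```

With a 6-byte buffer, a line such as "hello world" followed by a newline arrives as three fragments ("hello", " worl", and "d" plus the newline), which is the situation where a word is split across fragments.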

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { MAX_WORD_CNT = 8 };

#ifdef TEST
static int debug = 1;
enum { MAX_LINE_LEN = 64 };
#else
static int debug = 0;
enum { MAX_LINE_LEN = 4096 };
#endif /* TEST */

typedef struct Word
{
    size_t length;
    char  *word;
} Word;

typedef struct WordList
{
    size_t  num_words;
    size_t  max_words;
    Word   *words;
} WordList;

typedef struct LineControl
{
    size_t   line_length;
    bool     part_word;
    size_t   part_len;
    WordList list;
} LineControl;

static void init_wordlist(WordList *list)
{
    list->num_words = 0;
    list->max_words = 0;
    list->words = 0;
}

static void free_wordlist(WordList *list)
{
    assert(list != 0);
    for (size_t i = 0; i < list->num_words; i++)
        free(list->words[i].word);
    free(list->words);
    init_wordlist(list);
}

static void extend_word(const char *extn, size_t ext_len, Word *word)
{
    if (debug)
        printf("old (%zu) = [%s]; extra (%zu) = [%.*s]\n", word->length, word->word,
                ext_len, (int)ext_len, extn);
    size_t space = word->length + ext_len + 1;
    char *new_space = realloc(word->word, space);
    if (new_space == 0)
    {
        fprintf(stderr, "failed to reallocate %zu bytes of memory\n", space);
        exit(EXIT_FAILURE);
    }
    word->word = new_space;
    memmove(word->word + word->length, extn, ext_len);
    word->length += ext_len;
    word->word[word->length] = '\0';
    if (debug)
        printf("new (%zu) = [%s]\n", word->length, word->word);
}

static void addword_wordlist(const char *word, size_t word_len, WordList *list)
{
    if (list->num_words >= list->max_words)
    {
        assert(list->num_words == list->max_words);
        size_t new_max = list->max_words * 2 + 2;
        Word *new_words = realloc(list->words, new_max * sizeof(*new_words));
        if (new_words == 0)
        {
            fprintf(stderr, "failed to allocate %zu bytes of memory\n", new_max * sizeof(*new_words));
            exit(EXIT_FAILURE);
        }
        list->max_words = new_max;
        list->words = new_words;
    }
    list->words[list->num_words].word = malloc(word_len + 1);
    if (list->words[list->num_words].word == 0)
    {
        fprintf(stderr, "failed to allocate %zu bytes of memory\n", word_len + 1);
        exit(EXIT_FAILURE);
    }
    Word *wp = &list->words[list->num_words];
    wp->length = word_len;
    memmove(wp->word, word, word_len);
    wp->word[word_len] = '\0';
    list->num_words++;
}

static void init_linectrl(LineControl *ctrl)
{
    ctrl->line_length = 0;
    ctrl->part_word = false;
    ctrl->part_len = 0;
    init_wordlist(&ctrl->list);
}

static int parse_fragment(const char *line, LineControl *ctrl)
{
    char   whisp[] = " \t";
    size_t offset = 0;
    bool   got_eol = false;

    /* The only newline in the string is at the end, if it is there at all */
    assert(strchr(line, '\n') == strrchr(line, '\n'));
    assert(strchr(line, '\n') == 0 || *(strchr(line, '\n') + 1) == '\0');
    if (debug && ctrl->part_word)
    {
        assert(ctrl->list.num_words > 0);
        printf("Dealing with partial word on entry (%zu: [%s])\n",
               ctrl->part_len, ctrl->list.words[ctrl->list.num_words - 1].word);
    }

    size_t o_nonsp = 0;
    while (line[offset] != '\0')
    {
        size_t n_whisp = strspn(line + offset, whisp);
        size_t n_nonsp = strcspn(line + offset + n_whisp, whisp);
        if (debug)
            printf("offset %zu, whisp %zu, nonsp %zu\n", offset, n_whisp, n_nonsp);
        got_eol = false;
        ctrl->line_length += n_whisp + n_nonsp;
        if (line[offset + n_whisp + n_nonsp - 1] == '\n')
        {
            assert(n_nonsp > 0);
            got_eol = true;
            n_nonsp--;
        }
        if (n_whisp + n_nonsp == 0)
        {
            o_nonsp = 0;
            break;
        }

        if (n_whisp != 0)
        {
            ctrl->part_word = false;
            ctrl->part_len = 0;
        }

        /* Record the word: extend a partial word from the previous fragment, or add a new one */
        if (n_nonsp > 0)
        {
            const char *word = line + offset + n_whisp;
            if (ctrl->part_word)
            {
                assert(ctrl->list.num_words > 0);
                extend_word(word, n_nonsp,
                            &ctrl->list.words[ctrl->list.num_words - 1]);
            }
            else
            {
                addword_wordlist(word, n_nonsp, &ctrl->list);
            }
        }

        offset += n_whisp + n_nonsp;
        if (line[offset] != '\0')
        {
            ctrl->part_word = false;
            ctrl->part_len = 0;
        }
        o_nonsp = n_nonsp;
        if (got_eol)
            break;
    }

    /* Partial word detection */
    if (o_nonsp > 0 && !got_eol)
    {
        ctrl->part_word = true;
        ctrl->part_len += o_nonsp;
    }
    else
    {
        ctrl->part_word = false;
        ctrl->part_len = 0;
    }

    /* If seen newline; line complete */
    /* If No newline; line incomplete */
    return !got_eol;
}

int main(void)
{
    char line[MAX_LINE_LEN];
    size_t lineno = 0;

    while (fgets(line, sizeof(line), stdin) != 0)
    {
        LineControl ctrl;
        init_linectrl(&ctrl);
        lineno++;
        if (debug)
            printf("Line %zu: (%zu) [[%s]]\n", lineno, strlen(line), line);

        int extra = 0;
        while (parse_fragment(line, &ctrl) != 0 &&
               fgets(line, sizeof(line), stdin) != 0)
        {
            if (debug)
                printf("Extra %d for line %zu: (%zu) [[%s]]\n",
                       ++extra, lineno, strlen(line), line);
        }

        WordList *list = &ctrl.list;
        printf("Line %zu: length %zu, words = %zu\n",
               lineno, ctrl.line_length, list->num_words);
        size_t num_words = list->num_words;
        if (num_words > MAX_WORD_CNT)
            num_words = MAX_WORD_CNT;
        for (size_t i = 0; i < num_words; i++)
        {
            printf("  %zu: (%zu) %s\n",
                   i + 1, list->words[i].length, list->words[i].word);
        }
        putchar('\n');
        free_wordlist(&ctrl.list);
    }

    return 0;
}

I had a simpler version without the dynamic memory allocation, but it didn't work properly when a word was split across two fragments of a line. If the size of a line fragment was 6 (5 characters plus null byte) and the maximum length of a word was 16, say, then the code ran into difficulties assembling the fragments. Consequently, I adopted a simpler approach: store all of every word. It isn't clear from the question what the maximum word size is. If the code should object to anything other than 0, 3 or 4 words, the data is available to make those complaints. If the code should object to words that are longer than some length such as 32, the data is available to make those complaints too.
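A sketch of how such complaints could be made from the recorded data follows. It repeats the Word and WordList types from the code above so that it stands alone; the check_line function, the LINE_* result codes and the MAX_WORD_LEN value of 32 are assumptions for illustration, not part of the posted program.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_WORD_LEN = 32 };     /* assumed limit; the question does not state one */

/* Same layout as the types in the program above */
typedef struct Word
{
    size_t length;
    char  *word;
} Word;

typedef struct WordList
{
    size_t  num_words;
    size_t  max_words;
    Word   *words;
} WordList;

/* Hypothetical result codes */
enum { LINE_OK = 0, LINE_BAD_COUNT = 1, LINE_LONG_WORD = 2 };

/* Check one completed line: 0, 3 or 4 words, none longer than MAX_WORD_LEN */
static int check_line(const WordList *list)
{
    assert(list != 0);
    size_t n = list->num_words;
    if (n != 0 && n != 3 && n != 4)
        return LINE_BAD_COUNT;
    for (size_t i = 0; i < n; i++)
    {
        if (list->words[i].length > MAX_WORD_LEN)
            return LINE_LONG_WORD;
    }
    return LINE_OK;
}
```

In main(), a call such as check_line(&ctrl.list) could be made just before free_wordlist(), with the line number included in whatever diagnostic is printed.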

One of the simpler test files is test-data.1:

    a b   
    a b      c         d                                                        

1123xxsdfdsfsfdsfdssa          1234ddfxxyff            frrrdds
1123dfdffdfdxxxxxxxxxas                        1234ydfyyyzm   knsaaass      1234asdafxxfrrrfrrrsaa    
               1123werwetrretttrretertre       aaaa     bbbbbb      ccccc        
k
                                                apoplectic-catastrophe-mongers-of-the-world-unite-for-you-have-nothing-to-lose-but-your-bad-temper                              apoplectic-catastrophe-mongers-of-the-world-unite-for-you-have-nothing-to-lose-but-your-bad-temper                                      apoplectic-catastrophe-mongers-of-the-world-unite-for-you-have-nothing-to-lose-but-your-bad-temper                                                  apoplectic-catastrophe-mongers-of-the-world-unite-for-you-have-nothing-to-lose-but-your-bad-temper                                                           

That has all sorts of tabs in it, as demonstrated by this version of the same data, where tabs are shown as \t:

    a b   
    a b      c         d                                                        
\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t
1123xxsdfdsfsfdsfdssa          1234ddfxxyff            frrrdds
1123dfdffdfdxxxxxxxxxas                        1234ydfyyyzm   knsaaass      1234asdafxxfrrrfrrrsaa    
               1123werwetrretttrretertre       aaaa     bbbbbb      ccccc        
k
  \t\t \t \t\t\t \t \t \t\t\t\tapoplectic-catastrophe-mongers-of-the-world-unite-for-you-have-nothing-to-lose-but-your-bad-temper\t\t\t    \t\t\t\tapoplectic-catastrophe-mongers-of-the-world-unite-for-you-have-nothing-to-lose-but-your-bad-temper  \t \t \t \t\t\t\t \t \tapoplectic-catastrophe-mongers-of-the-world-unite-for-you-have-nothing-to-lose-but-your-bad-temper\t\t           \t\t\t\t \t \t \t \t\tapoplectic-catastrophe-mongers-of-the-world-unite-for-you-have-nothing-to-lose-but-your-bad-temper\t\t\t\t\t\t    \t \t \t \t      \t \t \t 

Running this awk script analyzes the data:

$ awk '{ printf "%3d %d [%s]\n", length($0) + 1, NF, $0 }' test-data.1
  1 0 []
  5 0 [    ]
 11 2 [    a b   ]
 81 4 [    a b      c         d                                                        ]
 20 0 [                                                     ]
 63 3 [1123xxsdfdsfsfdsfdssa          1234ddfxxyff            frrrdds]
103 4 [1123dfdffdfdxxxxxxxxxas                        1234ydfyyyzm   knsaaass      1234asdafxxfrrrfrrrsaa    ]
 82 4 [               1123werwetrretttrretertre       aaaa     bbbbbb      ccccc        ]
  2 1 [k]
494 4 [                                                 apoplectic-catastrophe-mongers-of-the-world-unite-for-you-have-nothing-to-lose-but-your-bad-temper                              apoplectic-catastrophe-mongers-of-the-world-unite-for-you-have-nothing-to-lose-but-your-bad-temper                                      apoplectic-catastrophe-mongers-of-the-world-unite-for-you-have-nothing-to-lose-but-your-bad-temper                      apoplectic-catastrophe-mongers-of-the-world-unite-for-you-have-nothing-to-lose-but-your-bad-temper                                           ]
$

The output from the program on that data file is:

Line 1: length 1, words = 0

Line 2: length 5, words = 0

Line 3: length 11, words = 2
  1: (1) a
  2: (1) b

Line 4: length 81, words = 4
  1: (1) a
  2: (1) b
  3: (1) c
  4: (1) d

Line 5: length 20, words = 0

Line 6: length 63, words = 3
  1: (21) 1123xxsdfdsfsfdsfdssa
  2: (12) 1234ddfxxyff
  3: (7) frrrdds

Line 7: length 103, words = 4
  1: (23) 1123dfdffdfdxxxxxxxxxas
  2: (12) 1234ydfyyyzm
  3: (8) knsaaass
  4: (22) 1234asdafxxfrrrfrrrsaa

Line 8: length 82, words = 4
  1: (25) 1123werwetrretttrretertre
  2: (4) aaaa
  3: (6) bbbbbb
  4: (5) ccccc

Line 9: length 2, words = 1
  1: (1) k

Line 10: length 494, words = 4
  1: (98) apoplectic-catastrophe-mongers-of-the-world-unite-for-you-have-nothing-to-lose-but-your-bad-temper
  2: (98) apoplectic-catastrophe-mongers-of-the-world-unite-for-you-have-nothing-to-lose-but-your-bad-temper
  3: (98) apoplectic-catastrophe-mongers-of-the-world-unite-for-you-have-nothing-to-lose-but-your-bad-temper
  4: (98) apoplectic-catastrophe-mongers-of-the-world-unite-for-you-have-nothing-to-lose-but-your-bad-temper

The line lengths and word counts reported by the awk script match those in the program's output.

This code is available in my SOQ (Stack Overflow Questions) repository on GitHub as files scan59.c, test-data.1, test-data.2 and test-data.3 in the /Users/jleffler/soq/src/so-5201-4002 sub-directory. The test-data.3 file, in particular, contains one line with 9955 characters and 693 words, as well as other lines that are less stringent tests.

The code compiles and runs cleanly on a Mac running macOS 10.13.6 High Sierra, using GCC 8.2.0 and Valgrind 3.14.0.GIT. (Although the makefile stipulates C11, there is nothing in this code that is specific to C11; it is fully compatible with C99. It also compiles cleanly with make SFLAGS='-std=c99 -pedantic'.)
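For comparison, the character-at-a-time approach mentioned in the question can be sketched as below. This is not the method used in scan59.c; it is a simplified alternative that assumes it is acceptable to keep only the first four words of a line and to truncate words longer than 32 characters. The names LineScan, scan_reset, scan_char and scan_line, and both limits, are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

enum { MAX_WORD_LEN = 32, MAX_WORDS = 4 };  /* assumed limits, not from the question */

typedef struct LineScan
{
    size_t count;       /* words seen so far on this line */
    size_t len;         /* characters stored for the current word */
    int    in_word;     /* non-zero while inside a word */
    char   words[MAX_WORDS][MAX_WORD_LEN + 1];
} LineScan;

static void scan_reset(LineScan *ls)
{
    assert(ls != 0);
    ls->count = 0;
    ls->len = 0;
    ls->in_word = 0;
}

/* Feed one character; returns 1 if it ends the line, else 0 */
static int scan_char(LineScan *ls, int c)
{
    assert(ls != 0);
    if (c == '\n')
        return 1;
    if (c == ' ' || c == '\t')
    {
        ls->in_word = 0;
        return 0;
    }
    if (!ls->in_word)
    {
        /* First character of a new word */
        ls->in_word = 1;
        ls->len = 0;
        ls->count++;
        if (ls->count <= MAX_WORDS)
            ls->words[ls->count - 1][0] = '\0';
    }
    /* Store only the first MAX_WORDS words, truncated to MAX_WORD_LEN */
    if (ls->count <= MAX_WORDS && ls->len < MAX_WORD_LEN)
    {
        char *w = ls->words[ls->count - 1];
        w[ls->len++] = (char)c;
        w[ls->len] = '\0';
    }
    return 0;
}

/* Read one line from fp; returns 0 only at EOF with nothing read */
static int scan_line(FILE *fp, LineScan *ls)
{
    int c;
    int seen_any = 0;

    assert(fp != 0);
    scan_reset(ls);
    while ((c = getc(fp)) != EOF)
    {
        seen_any = 1;
        if (scan_char(ls, c))
            break;
    }
    return seen_any;
}
```

Because the word count keeps increasing even after storage stops, a caller can still tell that a line had more than four words. The trade-off against the fragment-based code is that nothing beyond the assumed limits is retained, so over-long words can be detected only by adding a separate length counter.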
