
Word Frequency Statistics in C (not C++)

Given a string consisting of words separated by a single white space, print the words in descending order of the number of times they appear in the string.

For example, an input string of “ab bc bc” would generate the following output:

bc : 2
ab : 1

The problem would be easy to solve if a C++ data structure, such as a map, were used. But if the problem can only be solved in plain old C, it looks much harder.

What kind of data structures and algorithms should I use here? Please be as detailed as possible. I am weak in DS and Algo. :-(

One data structure you could use is a simple binary tree that contains words you could compare using strcmp. (I will ignore case issues for now.)

You will need to ensure the tree remains balanced as it grows. For this, look up AVL trees, 1-2 trees, or red-black trees on Wikipedia or elsewhere.

I will not give much more detail, except that to create a binary tree struct, each node has a left and a right sub-node, either of which can be null; for a leaf node, both sub-nodes are null. To make it simpler, use an "intrusive" node that holds the value and the two sub-nodes. Something like:

struct Node
{
  char * value;
  size_t frequency; 
  struct Node * left;
  struct Node * right;
};

and, this being C, you obviously need to do all the memory management yourself.

You will have a function that recurses down the tree, comparing and going left or right as appropriate. If the word is found, it just increments the frequency. If not, your function should be able to determine the place at which to insert the node, and then comes your insertion and rebalancing logic. Of course, the new node will contain the word with a frequency of 1.
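As a rough illustration of that descent, here is a minimal sketch that reuses the struct Node layout shown earlier. The names node_new, tree_add, tree_count and tree_free are hypothetical, and the rebalancing step is deliberately left out, so this is a plain (unbalanced) binary search tree rather than an AVL or red-black tree.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct Node
{
  char * value;
  size_t frequency;
  struct Node * left;
  struct Node * right;
};

/* Hypothetical helper: allocate a leaf that owns a private copy of word. */
static struct Node *node_new(const char *word)
{
  struct Node *n = malloc(sizeof *n);
  if (n == NULL)
    return NULL;
  n->value = malloc(strlen(word) + 1);
  if (n->value == NULL) {
    free(n);
    return NULL;
  }
  strcpy(n->value, word);
  n->frequency = 1;              /* a new word has been seen once */
  n->left = n->right = NULL;
  return n;
}

/* Count one occurrence of word and return the (possibly new) subtree root.
   No rebalancing is done here. If allocation fails, the word is dropped. */
struct Node *tree_add(struct Node *root, const char *word)
{
  assert(word != NULL);
  if (root == NULL)
    return node_new(word);
  int cmp = strcmp(word, root->value);
  if (cmp == 0)
    root->frequency++;
  else if (cmp < 0)
    root->left = tree_add(root->left, word);
  else
    root->right = tree_add(root->right, word);
  return root;
}

/* Return how often word was added, or 0 if it is not in the tree. */
size_t tree_count(const struct Node *root, const char *word)
{
  while (root != NULL) {
    int cmp = strcmp(word, root->value);
    if (cmp == 0)
      return root->frequency;
    root = (cmp < 0) ? root->left : root->right;
  }
  return 0;
}

/* Release every node and the string it owns. */
void tree_free(struct Node *root)
{
  if (root == NULL)
    return;
  tree_free(root->left);
  tree_free(root->right);
  free(root->value);
  free(root);
}
```

A caller would start with an empty tree (a NULL root), assign the result of tree_add back to the root for every word, and call tree_free at the end.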

At the end you will need a way to walk through your tree and print the results. In your case this can be a recursive function.

Note, by the way, that an alternative data structure would be some kind of hash table.
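For comparison, a hash table for this problem might look roughly like the sketch below. It assumes a fixed number of buckets with separate chaining and the well-known djb2 string hash; BUCKETS, struct Entry, struct Table and the table_* functions are illustrative names, not part of any library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BUCKETS 1024   /* assumed fixed size; this sketch never resizes */

struct Entry { char *word; size_t count; struct Entry *next; };
struct Table { struct Entry *buckets[BUCKETS]; };

/* djb2 string hash, reduced to a bucket index. */
static size_t bucket_of(const char *s)
{
  unsigned long h = 5381;
  while (*s)
    h = h * 33 + (unsigned char)*s++;
  return (size_t)(h % BUCKETS);
}

/* Count one occurrence of word.
   Returns the new count, or 0 if memory could not be allocated. */
size_t table_add(struct Table *t, const char *word)
{
  assert(t != NULL && word != NULL);
  struct Entry **head = &t->buckets[bucket_of(word)];
  for (struct Entry *e = *head; e != NULL; e = e->next)
    if (strcmp(e->word, word) == 0)
      return ++e->count;

  struct Entry *e = malloc(sizeof *e);
  if (e == NULL)
    return 0;
  e->word = malloc(strlen(word) + 1);
  if (e->word == NULL) {
    free(e);
    return 0;
  }
  strcpy(e->word, word);
  e->count = 1;
  e->next = *head;               /* push onto the front of the chain */
  *head = e;
  return 1;
}

/* Return the count stored for word, or 0 if it was never added. */
size_t table_count(const struct Table *t, const char *word)
{
  assert(t != NULL && word != NULL);
  for (const struct Entry *e = t->buckets[bucket_of(word)]; e != NULL; e = e->next)
    if (strcmp(e->word, word) == 0)
      return e->count;
  return 0;
}

/* Free every chain and leave the table empty. */
void table_free(struct Table *t)
{
  for (size_t i = 0; i < BUCKETS; i++) {
    struct Entry *e = t->buckets[i];
    while (e != NULL) {
      struct Entry *next = e->next;
      free(e->word);
      free(e);
      e = next;
    }
    t->buckets[i] = NULL;
  }
}
```

The table must start with all buckets set to NULL, for example by declaring it as struct Table t = { { NULL } };. As with the tree, the entries still have to be collected and sorted by count before printing.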

If you are looking for the most efficient solution and have a lot of memory at hand, you could use a data structure in which you branch on each letter as you encounter it (a trie). So the "a" leads to all the words beginning with a, then you move to the second letter, say "b", and so on. It is rather complicated to implement for someone who doesn't know data structures, so I would advise you to go with the simple binary tree.
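To give an idea of what branching on each letter means, here is a small sketch of such a structure. It makes a strong simplifying assumption, namely that words consist only of the lowercase letters 'a' to 'z' in an ASCII-like encoding; ALPHABET, struct TrieNode and the trie_* functions are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

#define ALPHABET 26   /* assumption: words use only 'a'..'z' */

struct TrieNode
{
  size_t count;                        /* words ending exactly at this node */
  struct TrieNode *child[ALPHABET];    /* one branch per letter */
};

/* Allocate an empty node. calloc zero-fills the memory, which yields NULL
   child pointers on mainstream platforms. */
struct TrieNode *trie_new(void)
{
  return calloc(1, sizeof(struct TrieNode));
}

/* Count one occurrence of word. Returns the new count, or 0 if the word
   contains an unsupported character or memory runs out. */
size_t trie_add(struct TrieNode *root, const char *word)
{
  assert(root != NULL && word != NULL);
  struct TrieNode *n = root;
  for (; *word != '\0'; word++) {
    if (*word < 'a' || *word > 'z')
      return 0;
    int i = *word - 'a';
    if (n->child[i] == NULL) {
      n->child[i] = trie_new();
      if (n->child[i] == NULL)
        return 0;
    }
    n = n->child[i];                   /* follow the branch for this letter */
  }
  return ++n->count;
}

/* Return how often word was added, or 0 if it is absent. */
size_t trie_count(const struct TrieNode *root, const char *word)
{
  assert(root != NULL && word != NULL);
  const struct TrieNode *n = root;
  for (; *word != '\0' && n != NULL; word++) {
    if (*word < 'a' || *word > 'z')
      return 0;
    n = n->child[*word - 'a'];
  }
  return (n != NULL) ? n->count : 0;
}

/* Free the whole trie. */
void trie_free(struct TrieNode *n)
{
  if (n == NULL)
    return;
  for (int i = 0; i < ALPHABET; i++)
    trie_free(n->child[i]);
  free(n);
}
```

Each node carries 26 pointers whether or not they are used, which is where the memory cost mentioned above comes from.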

Note that an in-order traversal prints the words alphabetically, not in descending order of frequency, so you would have to sort the results by frequency first. (In C++, using map, you would not get them in that order either.)
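One possible way to do that final sort, sketched under the assumption that the tree uses the struct Node shown earlier: collect pointers to all nodes into an array, then let qsort order them by descending frequency. The function names (tree_size, tree_collect, tree_sorted, tree_print_by_frequency) and the alphabetical tie-break are choices made for this example only.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct Node
{
  char * value;
  size_t frequency;
  struct Node * left;
  struct Node * right;
};

/* Number of nodes in the tree. */
static size_t tree_size(const struct Node *n)
{
  return (n == NULL) ? 0 : 1 + tree_size(n->left) + tree_size(n->right);
}

/* In-order walk that stores node pointers in out[], starting at index i.
   Returns the next free index. */
static size_t tree_collect(const struct Node *n, const struct Node **out, size_t i)
{
  if (n == NULL)
    return i;
  i = tree_collect(n->left, out, i);
  out[i++] = n;
  return tree_collect(n->right, out, i);
}

/* qsort comparator: higher frequency first, ties broken alphabetically. */
static int by_frequency_desc(const void *a, const void *b)
{
  const struct Node *x = *(const struct Node *const *)a;
  const struct Node *y = *(const struct Node *const *)b;
  if (x->frequency != y->frequency)
    return (x->frequency < y->frequency) ? 1 : -1;
  return strcmp(x->value, y->value);
}

/* Return a malloc'd array of *count node pointers sorted by descending
   frequency. The caller frees the array. Returns NULL (and *count == 0)
   for an empty tree or on allocation failure. */
const struct Node **tree_sorted(const struct Node *root, size_t *count)
{
  assert(count != NULL);
  *count = tree_size(root);
  if (*count == 0)
    return NULL;
  const struct Node **nodes = malloc(*count * sizeof *nodes);
  if (nodes == NULL) {
    *count = 0;
    return NULL;
  }
  tree_collect(root, nodes, 0);
  qsort(nodes, *count, sizeof *nodes, by_frequency_desc);
  return nodes;
}

/* Print "word : count" lines, most frequent word first. */
void tree_print_by_frequency(const struct Node *root)
{
  size_t count;
  const struct Node **nodes = tree_sorted(root, &count);
  for (size_t i = 0; i < count; i++)
    printf("%s : %zu\n", nodes[i]->value, nodes[i]->frequency);
  free(nodes);
}
```

Sorting an array of pointers leaves the tree itself untouched, so it can still be searched or freed afterwards.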

I would use a ternary search tree for this. The following article, in which Jon Bentley and Robert Sedgewick introduce the data structure, has an example in C.

http://www.cs.princeton.edu/~rs/strings/
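The sketch below is not the code from that article; it is only a minimal illustration of the idea, adapted to keep a per-word count. Each node holds one character and three links: lower, equal (next character of the word) and higher. The names struct TstNode, tst_add, tst_count and tst_free are made up for this example, and empty words are not supported.

```c
#include <assert.h>
#include <stdlib.h>

struct TstNode
{
  char split;                    /* character stored at this node */
  size_t count;                  /* words that end exactly here */
  struct TstNode *lo, *eq, *hi;  /* smaller char, next char, larger char */
};

/* Count one occurrence of a non-empty word; returns the subtree root.
   If allocation fails, the word is silently dropped. */
struct TstNode *tst_add(struct TstNode *n, const char *word)
{
  assert(word != NULL && *word != '\0');
  if (n == NULL) {
    n = calloc(1, sizeof *n);    /* zeroed count and links */
    if (n == NULL)
      return NULL;
    n->split = *word;
  }
  if (*word < n->split)
    n->lo = tst_add(n->lo, word);
  else if (*word > n->split)
    n->hi = tst_add(n->hi, word);
  else if (word[1] != '\0')
    n->eq = tst_add(n->eq, word + 1);
  else
    n->count++;                  /* last character matched: word ends here */
  return n;
}

/* Return how often word was added, or 0 if it is absent. */
size_t tst_count(const struct TstNode *n, const char *word)
{
  assert(word != NULL && *word != '\0');
  while (n != NULL) {
    if (*word < n->split)
      n = n->lo;
    else if (*word > n->split)
      n = n->hi;
    else if (word[1] != '\0') {
      n = n->eq;
      word++;
    } else
      return n->count;
  }
  return 0;
}

/* Free the whole tree. */
void tst_free(struct TstNode *n)
{
  if (n == NULL)
    return;
  tst_free(n->lo);
  tst_free(n->eq);
  tst_free(n->hi);
  free(n);
}
```

Compared with a node that has one pointer per letter, this uses three pointers per character regardless of the alphabet, at the cost of a few extra comparisons per step.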

Here's a sample of how I'd do it. The search in findWord() could be optimized. The number of allocations can also be reduced by allocating blocks of words instead of one at a time. One could implement a linked list for this case as well. It lacks memory deallocation. This should hopefully get you going.

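    #include <assert.h>
    #include <ctype.h>   /* character classification */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>  /* string and memory functions */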

    #define MAXWORDLEN 128

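    /* Return a pointer to the first whitespace character at or after text,
       or to the terminating '\0' if there is none. */
    const char* findWhitespace(const char* text)
    {
        /* The cast avoids undefined behavior for negative char values. */
        while (*text && !isspace((unsigned char)*text))
            text++;
        return text;
    }

    /* Return a pointer to the first non-whitespace character at or after
       text, or to the terminating '\0' if there is none. */
    const char* findNonWhitespace(const char* text)
    {
        while (*text && isspace((unsigned char)*text))
            text++;
        return text;
    }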

    typedef struct tagWord
    {
        char word[MAXWORDLEN + 1];
        int count;
    } Word;

    typedef struct tagWordList
    {
        Word* words;
        int count;
    } WordList;

    WordList* createWordList(unsigned int count);

    /* Grow the list by count empty entries, keeping the existing ones. */
    void extendWordList(WordList* wordList, const int count)
    {
        Word* newWords = (Word*)malloc(sizeof(Word) * (wordList->count + count));
        if (newWords == NULL) {
            fprintf(stderr, "out of memory\n");
            exit(EXIT_FAILURE);
        }
        if (wordList->words != NULL) {
            memcpy(newWords, wordList->words, sizeof(Word) * wordList->count);
            free(wordList->words);
        }
        for (int i = wordList->count; i < wordList->count + count; i++) {
            newWords[i].word[0] = '\0';
            newWords[i].count = 0;
        }
        wordList->words = newWords;
        wordList->count += count;
    }

    void addWord(WordList* wordList, const char* word)
    {
        assert(strlen(word) <= MAXWORDLEN);
        extendWordList(wordList, 1);
        Word* wordNode = &wordList->words[wordList->count - 1];
        strcpy(wordNode->word, word);
        wordNode->count++;  
    }

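    /* Portable case-insensitive equality test; stricmp() is not standard C. */
    static int equalsIgnoreCase(const char* a, const char* b)
    {
        while (*a && *b) {
            if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
                return 0;
            a++;
            b++;
        }
        return *a == *b;   /* equal only if both strings ended together */
    }

    /* Linear search; returns NULL if the word is not in the list yet. */
    Word* findWord(WordList* wordList, const char* word)
    {
        for (int i = 0; i < wordList->count; i++) {
            if (equalsIgnoreCase(word, wordList->words[i].word)) {
                return &wordList->words[i];
            }
        }
        return NULL;
    }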

    void updateWordList(WordList* wordList, const char* word)
    {
        Word* foundWord = findWord(wordList, word);
        if (foundWord == NULL) {
            addWord(wordList, word);
        } else {
            foundWord->count++;
        }
    }

    WordList* createWordList(unsigned int count)
    {
        WordList* wordList = (WordList*)malloc(sizeof(WordList));
        if (count > 0) {
            wordList->words = (Word*)malloc(sizeof(Word) * count);
            for(unsigned int i = 0; i < count; i++) {
                wordList->words[i].count = 0;
                wordList->words[i].word[0] = '\0';
            }
        }
        else {
            wordList->words = NULL;
        }
        wordList->count = count;    
        return wordList;
    }

    void printWords(WordList* wordList)
    {
        for (int i = 0; i < wordList->count; i++) {
            printf("%s: %d\n", wordList->words[i].word, wordList->words[i].count);
        }
    }

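    /* Order by descending count, as the question asks;
       ties are broken alphabetically. */
    int compareWord(const void* vword1, const void* vword2)
    {
        const Word* word1 = (const Word*)vword1;
        const Word* word2 = (const Word*)vword2;
        if (word1->count != word2->count)
            return (word1->count < word2->count) ? 1 : -1;
        return strcmp(word1->word, word2->word);
    }

    void sortWordList(WordList* wordList)
    {
        /* Nothing to sort for an empty list (words may be NULL) or one entry. */
        if (wordList->count > 1)
            qsort(wordList->words, wordList->count, sizeof(Word), compareWord);
    }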

    void countWords(const char* text)
    {
        WordList   *wordList = createWordList(0);
        const char *beg = findNonWhitespace(text);
        const char *end;
        char       word[MAXWORDLEN + 1];   /* room for the terminating '\0' */

        while (*beg) {
            /* end stops at whitespace or at the end of the string, so the
               last word is counted even without trailing whitespace. */
            end = findWhitespace(beg);
            assert(end - beg <= MAXWORDLEN);
            strncpy(word, beg, (size_t)(end - beg));
            word[end - beg] = '\0';
            updateWordList(wordList, word);
            beg = findNonWhitespace(end);
        }

        sortWordList(wordList);
        printWords(wordList);
    }

    int main(void)
    {
        const char* text = "abc 123 abc 456 def 789 \tyup this \r\ncan work yup 456 it can";
        countWords(text);
        return 0;
    }
