Word count from a text file in C++

This is the content of the file:

12.34.
.3
3..3
.3.4
..8
.this
test.this
test.12.34
test1.this
test1.12.34

This is the expected output:

COUNT | WORD 
------+------
   1  | .3
   1  | .3.4
   2  | 12.34
   2  | 3
   1  | 8
   2  | test
   1  | test1
   1  | test1.12.34
   3  | this

The requirement is to read each line from a text file and then extract the words from that line. Whenever a new word is encountered, the program should allocate an instance of the node from dynamic memory to contain the word and its count, and insert it into a linked list so that the list is always sorted. If the word already exists in the list, the count for that word should be incremented. Regarding the '.' separator: if the '.' character has a space, tab, newline or digit on its left and a digit on its right, it is treated as a decimal point and thus part of a word. Otherwise it is treated as a full stop and a word separator.

Words are sequences of alphabetic and numeric characters, the single quote, the underscore and the hyphen, separated by sequences of one or more separator characters. See below for a list of the separator characters. The input for this assignment will consist of words, integers and floating point numbers. The single quote character will always act as an apostrophe, and should be treated as part of a word. Thus, streamer, streamers, streamer's and streamers' should all be distinct words, but "streamers" and streamers should count as two occurrences of the word streamers.

I have come up with the code below, but I'm still stuck on treating the period as a word separator. Could anyone give me some hints?

bool isSeparator(const char c) {  
    if (std::isspace(c)) return true;

    const std::string pattern = ",;:\"~!#%^*()=+[]{}\\|<>?/";
    for (unsigned int i = 0; i < pattern.size(); i++) {
        if (pattern[i] == c) 
            return true;
    }
    return false;
}
void load(std::list<Node> &nodes, const char *file) {
    std::ifstream fin;
    std::string line = "";
    std::string word = "";

    fin.open(file);

    while (std::getline(fin, line)) {

        for (unsigned int i = 0; i < line.size(); i++) {
            if (isSeparator(line[i]) || i == (line.size() - 1)) {
                if (word.find('.') < word.size()) { // if there is a '.' in a word
                    if (word.find('.') == word.size() - 1) { // if '.' at the end of word
                        word.erase(word.find('.'), 1); // remove '.' in any case
                    }
                }
                if (word.size() != 0) {
                    nodes.push_back(Node(word));
                    word.clear();
                }
            } else {
                word += line[i];
            }
        }
    }

    fin.close();
}

I'm just starting out with C++, and the assignment requires using only std::list to store the nodes, plus some basic string manipulation.

I have modified the function you wrote (isSeparator) and added a new function (isDigit):

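bool isSeparator(const char c) {
    // Whitespace always separates words (isspace comes from <cctype>).
    if (isspace(static_cast<unsigned char>(c))) return true;

    // '.' is listed here; load() decides when it is a decimal point instead.
    const string pattern = ".,;:\"~!#%^*()=+[]{}\\|<>?/";
    for (unsigned int i = 0; i < pattern.size(); i++) {
        if (pattern[i] == c)
            return true;
    }
    return false;
}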

bool isDigit(const char c) {
    // '0' is 0x30 and '9' is 0x39 in ASCII.
    return c >= '0' && c <= '9';
}

The new function isDigit determines whether the character passed to it is a digit. I tried to gather the test cases needed to make sure the words are separated correctly. Here are the cases I considered:

word.12word.word
word.12.3word.word
word.word12.word
12.3.

I have also modified the load function. Your part is to find the places in my code that insert into the list (the push_back calls) and adapt them to your needs. Here is the modified load function:

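ifstream fin;
fin.open("file.in");
string line, word = "";
list<Node> node;
while (getline(fin, line)) {
    for (unsigned int i = 0; i < line.size(); i++) {
        // Decimal point: space, tab or digit on the left of '.', digit on the right.
        // The bounds check keeps line[i+2] inside the string.
        if ((line[i] == '\t' || line[i] == ' ' || isDigit(line[i])) &&
            i + 2 < line.size() && line[i+1] == '.' && isDigit(line[i+2])) {
            if (isDigit(line[i])) {
                word += line[i];
            } else if (word.size() > 0) {
                // Whitespace ends the previous word and is not part of the number.
                node.push_back(Node(word));
                word.clear();
            }
            word += ".";
            i += 2;
            // Copy the rest of the token, stopping at a separator or the end of the line.
            while (i < line.size() && !isSeparator(line[i])) word += line[i++];
            i--;
        } else if (!isSeparator(line[i])) {
            word += line[i];
        } else {
            if (word.size() > 0) {
                node.push_back(Node(word));
                //cout << word << endl; for debugging
                word.clear();
            }
        }
    }
    // Store the word that runs up to the end of the line.
    if (word.size() > 0) {
        node.push_back(Node(word));
        //cout << word << endl; for debugging
        word.clear();
    }
}
fin.close();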

Here is the output:

word
12word
word
word
12.3word
word
word
word12
word
12.3
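
The load function above appends every word with push_back, so repeated words are stored as separate nodes and the list is not sorted. Below is a minimal sketch of the sorted insert the question asks for. The Node layout (a word plus a count) and the helper name insertSorted are assumptions for illustration, since the question does not show the Node definition.

```cpp
#include <list>
#include <string>

// Hypothetical Node layout; the question does not show its definition.
struct Node {
    std::string word;
    int count;
    explicit Node(const std::string &w) : word(w), count(1) {}
};

// Hypothetical helper: keep `nodes` sorted by word. A word that is already
// present gets its count incremented instead of a second node.
void insertSorted(std::list<Node> &nodes, const std::string &word) {
    std::list<Node>::iterator it = nodes.begin();
    // Walk to the first node whose word is not less than `word`.
    while (it != nodes.end() && it->word < word) ++it;
    if (it != nodes.end() && it->word == word) {
        ++it->count;                    // existing word
    } else {
        nodes.insert(it, Node(word));   // new word, inserted before `it`
    }
}
```

With something like this in place, each node.push_back(Node(word)) call could become insertSorted(node, word). std::string's operator< compares character codes, which puts ".3" before "12.34" and digits before letters, the same order as the expected output in the question.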

Here is the procedure to follow when solving string matching problems like this one:

1- First determine the possible cases. I think the test cases I provided demonstrate that.
2- Start building your if statements according to the possible test cases/inputs.
3- Try to reduce the number of ifs and group redundant ones.
4- In the end, it all depends on your logic and your way of thinking.
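
As one way of grouping the cases, the decimal-point rule from the question can be written as a single test on the characters to the left and right of the '.'. The sketch below is only an illustration, not the assignment's reference solution. The names isWordChar, isDecimalPoint and splitWords are made up here, the start of a line is treated as "newline on the left", and any character that is neither a word character nor a decimal point is assumed to be a separator.

```cpp
#include <cctype>
#include <cstddef>
#include <list>
#include <string>

// Word characters per the question: letters, digits, single quote,
// underscore and hyphen.
bool isWordChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) ||
           c == '\'' || c == '_' || c == '-';
}

// True when the '.' at line[i] is a decimal point: space, tab, start of
// line or digit on the left, and a digit on the right.
bool isDecimalPoint(const std::string &line, std::size_t i) {
    const bool leftOk = i == 0 || line[i - 1] == ' ' || line[i - 1] == '\t' ||
                        std::isdigit(static_cast<unsigned char>(line[i - 1]));
    const bool rightOk = i + 1 < line.size() &&
                         std::isdigit(static_cast<unsigned char>(line[i + 1]));
    return leftOk && rightOk;
}

// Split one line into words. Assumption: every character that is not a
// word character or a decimal point acts as a separator.
std::list<std::string> splitWords(const std::string &line) {
    std::list<std::string> words;
    std::string word;
    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (isWordChar(c) || (c == '.' && isDecimalPoint(line, i))) {
            word += c;
        } else if (!word.empty()) {
            words.push_back(word);
            word.clear();
        }
    }
    if (!word.empty()) words.push_back(word);  // word ending at end of line
    return words;
}
```

Under these assumptions, splitWords("test1.12.34") gives the single word test1.12.34, splitWords("test.12.34") gives test and 12.34, and splitWords("3..3") gives 3 twice, which agrees with the expected output in the question.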

Good luck :)

Note: I have used the using namespace std; statement instead of writing std:: each time. Please correct me if I am wrong.
