
Word count from a text file in C++

This is the content of the file:

12.34.
.3
3..3
.3.4
..8
.this
test.this
test.12.34
test1.this
test1.12.34

This is the expected output:

COUNT | WORD 
------+------
   1  | .3
   1  | .3.4
   2  | 12.34
   2  | 3
   1  | 8
   2  | test
   1  | test1
   1  | test1.12.34
   3  | this

The requirement is to read each line from a text file and extract the words from it. Whenever a new word is encountered, the program should allocate a node from dynamic memory to hold the word and its count, and insert it into a linked list so that the list is always sorted. If the word already exists in the list, its count should be incremented. As for the '.' character: if it has a space, tab, newline or digit on its left and a digit on its right, it is treated as a decimal point and is therefore part of a word. Otherwise it is treated as a full stop and acts as a word separator.

Words are sequences of alphabetic and numeric characters, the single quote, the underscore and the hyphen, separated by sequences of one or more separator characters. See below for a list of the separator characters. The input for this assignment will consist of words, integers and floating-point numbers. The single quote character will always act as an apostrophe and should be treated as part of a word. Thus streamer, streamers, streamer's and streamers' should all be distinct words, but "streamers" and streamers should count as two occurrences of the word streamers.
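To make the word definition above concrete, here is a minimal sketch that splits a line on anything that is not a word character. The names isWordChar and splitSimple are hypothetical and do not come from the assignment, and the '.' rule is deliberately left out, so every '.' acts as a separator here.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: true for characters the assignment allows inside a word
// (letters, digits, single quote, underscore, hyphen).
bool isWordChar(char c) {
    unsigned char uc = static_cast<unsigned char>(c);
    return std::isalnum(uc) || c == '\'' || c == '_' || c == '-';
}

// Hypothetical helper: collects maximal runs of word characters.
// Assumption: the decimal-point rule for '.' is ignored in this sketch.
std::vector<std::string> splitSimple(const std::string& line) {
    std::vector<std::string> words;
    std::string word;
    for (std::size_t i = 0; i < line.size(); ++i) {
        if (isWordChar(line[i])) {
            word += line[i];
        } else if (!word.empty()) {
            words.push_back(word);  // a separator ends the current word
            word.clear();
        }
    }
    if (!word.empty()) words.push_back(word);  // last word of the line
    return words;
}
```

With this sketch, the line "streamers" streamers streamer's streamers' should yield streamers twice, followed by streamer's and streamers', because the double quote is a separator while the single quote is part of the word.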

This is what I have so far, but I'm still stuck on treating the period as a word separator. Could anyone give me some hints?

bool isSeparator(const char c) {  
    if (std::isspace(c)) return true;

    const std::string pattern = ",;:\"~!#%^*()=+[]{}\\|<>?/";
    for (unsigned int i = 0; i < pattern.size(); i++) {
        if (pattern[i] == c) 
            return true;
    }
    return false;
}
void load(std::list<Node> &nodes, const char *file) {
    std::ifstream fin;
    std::string line = "";
    std::string word = "";

    fin.open(file);

    while (std::getline(fin, line)) {

        for (unsigned int i = 0; i < line.size(); i++) {
            if (isSeparator(line[i]) || i == (line.size() - 1)) {
                if (word.find('.') < word.size()) { // if there is a '.' in a word
                    if (word.find('.') == word.size() - 1) { // if '.' at the end of word
                        word.erase(word.find('.'), 1); // remove '.' in any case
                    }
                }
                if (word.size() != 0) {
                    nodes.push_back(Node(word));
                    word.clear();
                }
            } else {
                word += line[i];
            }
        }
    }

    fin.close();
}

I'm just starting out with C++, and the assignment only allows std::list for storing the nodes, plus some basic string manipulation.

I have modified the function (isSeparator) you wrote and added a new function (isDigit):

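bool isSeparator(const char c) {
    // whitespace also ends a word (needs <cctype>)
    if (isspace(static_cast<unsigned char>(c))) return true;

    // '.' is listed here; the decimal-point case is handled before this check in load
    const string pattern = ".,;:\"~!#%^*()=+[]{}\\|<>?/";
    for (unsigned int i = 0; i < pattern.size(); i++) {
        if (pattern[i] == c)
            return true;
    }
    return false;
}

bool isDigit(const char c) {
    // '0'..'9' are the character codes 0x30..0x39
    return c >= '0' && c <= '9';
}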

The new function isDigit determines whether the character passed in is a digit. I tried to gather the test cases needed to make sure the words are separated correctly. Here are the cases I considered:

word.12word.word
word.12.3word.word
word.word12.word
12.3.

I have also modified the load function. Your part is to find the places in my code that insert into the list and adapt them to your needs. Here is the modified load function:

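ifstream fin;
fin.open("file.in");
string line, word = "";
list<Node> node;
while (getline(fin, line)) {
    for (unsigned int i = 0; i < line.size(); i++) {
        // A '.' with a space, tab or digit on its left and a digit on its right
        // is a decimal point; the bounds check keeps line[i+2] inside the line.
        if (i + 2 < line.size()
                && (line[i] == '\t' || line[i] == ' ' || isDigit(line[i]))
                && line[i+1] == '.' && isDigit(line[i+2])) {
            if (isDigit(line[i])) {
                word += line[i];
            } else if (word.size() > 0) {
                // whitespace ends the previous word and is not part of the number
                node.push_back(Node(word));
                word.clear();
            }
            word += ".";
            i += 2;
            // copy the rest of the token, stopping at a separator or the end of the line
            while (i < line.size() && !isSeparator(line[i])) word += line[i++];
            i--;
        } else if (!isSeparator(line[i])) {
            word += line[i];
        } else if (word.size() > 0) {
            node.push_back(Node(word));
            //cout << word << endl; for debugging
            word.clear();
        }
    }
    // flush the last word of the line
    if (word.size() > 0) {
        node.push_back(Node(word));
        word.clear();
    }
}
fin.close();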

Here is the output:

word
12word
word
word
12.3word
word
word
word12
word
12.3
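The load function appends every word with push_back, so the sorted, counted list that the assignment asks for still has to be built. A minimal sketch of that step follows. The Node layout is an assumption, because the original Node class is not shown in the post, and insertSorted is a hypothetical name.

```cpp
#include <list>
#include <string>

// Assumed layout; the real Node class is not shown in the post.
struct Node {
    std::string word;
    int count;
    explicit Node(const std::string& w) : word(w), count(1) {}
};

// Hypothetical helper: keeps the list sorted by word and counts repeats.
void insertSorted(std::list<Node>& nodes, const std::string& word) {
    std::list<Node>::iterator it = nodes.begin();
    // Skip every node whose word sorts before the new word.
    while (it != nodes.end() && it->word < word) ++it;
    if (it != nodes.end() && it->word == word) {
        ++it->count;                   // word already present: bump its count
    } else {
        nodes.insert(it, Node(word));  // new word: insert before the first larger one
    }
}
```

Calling insertSorted(node, word) in place of node.push_back(Node(word)) would keep the list in the plain character-code order of std::string, which puts '.' before digits and digits before letters, as in the expected output of the question.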

Here are the steps to follow when solving string-matching problems like this one:

1- First, determine the possible cases; I think the test cases I provided cover them.
2- Start building your if statements according to the possible test cases/inputs.
3- Try to reduce the number of ifs and group the redundant ones.
4- In the end, it all depends on your logic and your way of thinking.
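The steps above can be sketched as one small function with one branch per case. This is only an illustration under stated assumptions: isDecimalPoint, isSeparatorChar and splitLine are hypothetical names, the separator list is copied from the question, and the start of a line is treated as the "newline on the left" case because getline strips the newline.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Case 1: '.' is a decimal point when a digit follows it and a digit, space,
// tab or the start of the line precedes it.
bool isDecimalPoint(const std::string& line, std::size_t i) {
    if (line[i] != '.' || i + 1 >= line.size()) return false;
    if (!std::isdigit(static_cast<unsigned char>(line[i + 1]))) return false;
    if (i == 0) return true;  // start of line stands in for a newline on the left
    char left = line[i - 1];
    return std::isdigit(static_cast<unsigned char>(left)) || left == ' ' || left == '\t';
}

// Case 2: whitespace and the punctuation listed in the question end a word.
bool isSeparatorChar(char c) {
    if (std::isspace(static_cast<unsigned char>(c))) return true;
    return std::string(",;:\"~!#%^*()=+[]{}\\|<>?/").find(c) != std::string::npos;
}

// Splits one line into words, one branch per case.
std::vector<std::string> splitLine(const std::string& line) {
    std::vector<std::string> words;
    std::string word;
    for (std::size_t i = 0; i < line.size(); ++i) {
        bool ends = (line[i] == '.') ? !isDecimalPoint(line, i)  // full stop
                                     : isSeparatorChar(line[i]);
        if (!ends) {
            word += line[i];        // word character or decimal point
        } else if (!word.empty()) {
            words.push_back(word);  // a separator closes the current word
            word.clear();
        }
    }
    if (!word.empty()) words.push_back(word);  // last word of the line
    return words;
}
```

On the sample file from the question, this sketch should split 3..3 into 3 and 3, keep .3.4 and test1.12.34 whole, and split test.12.34 into test and 12.34. Grouping the two separator cases into a single flag is one way to apply step 3.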

Good luck :)

Note: I have used the using namespace std; statement instead of writing std:: each time. Please correct me if I am wrong.
