C++: specify delimiters for reading words from a text file

I have the following code, which prints each unique word and its count from a text file (containing >= 30k words). However, it separates words by whitespace only, so punctuation stays attached to the words, and I get results like this:

[image: enter image description here]

How can I modify the code to specify which delimiters to split on?

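#include <fstream>
#include <iostream>
#include <map>
#include <string>
using namespace std;

template <class KTy, class Ty>
void PrintMap(const map<KTy, Ty>& m)
{
    // "typename" is required because the iterator type depends on the template parameters.
    typedef typename map<KTy, Ty>::const_iterator iterator;
    for (iterator p = m.begin(); p != m.end(); ++p)
        cout << p->first << ": " << p->second << endl;
}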

void UniqueWords(string fileName) {
    // Will store the word and count.
    map<string, unsigned int> wordsCount;

    // Begin reading from file:
    ifstream fileStream(fileName);

    // Check if we've opened the file (as we should have).
    if (fileStream.is_open())
    {
        // Read word by word. The loop ends as soon as an extraction fails,
        // so no empty word gets counted at end of file.
        string word;
        while (fileStream >> word)
        {
            // operator[] inserts a missing key with a count of 0,
            // so a single increment covers both the first and later sightings.
            wordsCount[word]++;
        }
    }
    else  // We couldn't open the file. Report the error in the error stream.
    {
        cerr << "Couldn't open the file." << endl;
    }

    // Print the words map.
    PrintMap(wordsCount);
}

You can imbue() the stream with a locale whose std::ctype<char> facet treats whatever characters you choose as whitespace. Doing so would look something like this:

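#include <iostream>
#include <locale>
#include <string>

// Holds the classification table. It is a separate base class so that the
// table is fully built before the std::ctype<char> base receives a pointer to it.
struct myctype_table {
    std::ctype_base::mask table[std::ctype<char>::table_size];
    myctype_table(char const* spaces) : table() { // zero the table: nothing is classified yet
        for (; *spaces; ++spaces) {
            table[static_cast<unsigned char>(*spaces)] = std::ctype_base::space;
        }
    }
};
class myctype
    : private myctype_table
    , public std::ctype<char> {
public:
    myctype(char const* spaces)
        : myctype_table(spaces)
        // Qualified name: std::ctype<char> also has a member called table().
        , std::ctype<char>(myctype_table::table) {
    }
};

int main() {
     // The locale takes ownership of the facet and deletes it.
     std::locale myloc(std::locale(), new myctype(" \t\n\r?:.,!"));
     std::cin.imbue(myloc);
     for (std::string word; std::cin >> word; ) {
         // words are separated by the extended list of spaces
     }
}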

This code isn't tested right now - I'm typing on a mobile device. I probably misused some of the std::ctype<char> interfaces, but something along those lines should work after fixing the names, etc.
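
For anyone who wants to try the facet idea without stdin, here is a minimal sketch of the same approach wrapped in a function that counts words from any std::istream. The names delimiter_table, delimiter_ctype and count_words are made up for this example and are not part of the standard library. It assumes the table starts out empty, so every separator, including the usual whitespace characters, has to be listed explicitly.

```cpp
#include <cstddef>
#include <istream>
#include <locale>
#include <map>
#include <sstream>  // std::istringstream, handy for trying this without a file
#include <string>

// Hypothetical helper (base-from-member): owns the table so that it exists
// before the std::ctype<char> base class is handed a pointer to it.
struct delimiter_table {
    std::ctype_base::mask table[std::ctype<char>::table_size];

    explicit delimiter_table(const std::string& delimiters) : table() {
        // Only the listed characters are classified, and only as whitespace.
        for (std::size_t i = 0; i < delimiters.size(); ++i)
            table[static_cast<unsigned char>(delimiters[i])] = std::ctype_base::space;
    }
};

// Hypothetical facet: classifies exactly the given characters as whitespace.
class delimiter_ctype : private delimiter_table, public std::ctype<char> {
public:
    explicit delimiter_ctype(const std::string& delimiters)
        : delimiter_table(delimiters),
          std::ctype<char>(delimiter_table::table) {}
};

// Counts the words read from `in`, splitting on any character in `delimiters`.
std::map<std::string, unsigned int> count_words(std::istream& in,
                                                const std::string& delimiters) {
    // The locale takes ownership of the facet and deletes it when done.
    in.imbue(std::locale(in.getloc(), new delimiter_ctype(delimiters)));

    std::map<std::string, unsigned int> counts;
    for (std::string word; in >> word; )
        ++counts[word];
    return counts;
}
```

With a file you would pass an opened std::ifstream instead, e.g. count_words(fileStream, " \t\n\r?:.,!"). Note that the stream keeps the imbued locale afterwards.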

Since you expect the forbidden characters at the end of each word, you can remove them before putting the word into wordsCount:

if(word[word.length()-1] == ';' || word[word.length()-1] == ',' || ....){
   word.erase(word.length()-1);
}
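
If a word can end in more than one such character (for example "word;,"), the check above only strips the last one. A small sketch that removes all trailing unwanted characters could look like this. strip_trailing is a hypothetical name, and the set of unwanted characters is whatever you pass in.

```cpp
#include <string>

// Hypothetical helper: removes every trailing character of `word` that
// appears in `unwanted`. Characters in the middle of the word are kept.
std::string strip_trailing(std::string word, const std::string& unwanted) {
    const std::string::size_type last = word.find_last_not_of(unwanted);
    if (last == std::string::npos)
        word.clear();          // empty, or made up of unwanted characters only
    else
        word.erase(last + 1);  // drop everything after the last wanted character
    return word;
}
```

You would call it right after the extraction, e.g. word = strip_trailing(word, ";,.!?");, and skip the word if it comes back empty.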

Right after fileStream >> word; you can call this function. Take a look and see if it's clear:

// Returns a copy of word with every forbidden character removed.
string adapt(const string& word) {
    const string forbidden = "!?,.[];";
    string ret;
    for (string::size_type i = 0; i < word.size(); i++) {
        // Keep the character only if it is not in the forbidden set.
        if (forbidden.find(word[i]) == string::npos)
            ret.push_back(word[i]);
    }
    return ret;
}

Something like this:

fileStream >> word;
word = adapt(word);
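
One thing to watch with this filtering approach: a token such as "..." becomes an empty string after filtering, and a token like "a,b" turns into "ab" rather than two words. Here is a rough sketch of the counting loop that at least skips the empty results. remove_chars and count_filtered are names invented for this example, and the lambda assumes a C++11 compiler.

```cpp
#include <algorithm>
#include <istream>
#include <map>
#include <sstream>  // std::istringstream, handy for trying this without a file
#include <string>

// Hypothetical helper: returns `word` without any character found in `forbidden`.
std::string remove_chars(std::string word, const std::string& forbidden) {
    word.erase(std::remove_if(word.begin(), word.end(),
                              [&forbidden](char c) {
                                  return forbidden.find(c) != std::string::npos;
                              }),
               word.end());
    return word;
}

// Counts whitespace-separated words after filtering, ignoring tokens that
// consist of forbidden characters only.
std::map<std::string, unsigned int> count_filtered(std::istream& in,
                                                   const std::string& forbidden) {
    std::map<std::string, unsigned int> counts;
    for (std::string word; in >> word; ) {
        word = remove_chars(word, forbidden);
        if (!word.empty())
            ++counts[word];
    }
    return counts;
}
```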
