
Tokenization of a text file with frequency and line occurrence. Using C++

Once again I ask for help. I haven't coded anything for some time!

Now I have a text file filled with random gibberish. I already have a basic idea on how I will count the number of occurrences per word.

What really stumps me is how I will determine what line the word is in. Gut instinct tells me to look for the newline character at the end of each line. However, I have to do this while going through the text file the first time, right? If I do it afterwards it will do no good.

I already am getting the words via the following code:

vector<string> words;
string currentWord;

while(!inputFile.eof())
{
inputFile >> currentWord;
words.push_back(currentWord); 
}

This is for a text file with no set structure. Using the above code gives me a nice little (big) vector of words, but it doesn't give me the line they occur in.

Would I have to get the entire line, then process it into words to make this possible?

Use a std::map<std::string, int> to count the word occurrences -- the int is the number of times it exists.

If you need line-by-line input, use std::getline(std::istream&, std::string&), like this:

std::vector<std::string> lines;
std::ifstream file(...); // Fill in accordingly.
std::string currentLine;
while(std::getline(file, currentLine))
    lines.push_back(currentLine);

You can split a line apart by putting it into a std::istringstream first and then using operator>>. (Alternatively, you could cobble together some sort of splitter using std::find and other algorithmic primitives.)
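As a rough sketch of the std::istringstream approach, the function below splits one line into whitespace-separated words. The name splitLine is made up for illustration.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split a single line into whitespace-separated words.
// splitLine is a hypothetical helper name used only for this example.
std::vector<std::string> splitLine(const std::string& line)
{
    std::istringstream stream(line);
    std::vector<std::string> words;
    std::string word;
    while (stream >> word)  // operator>> skips leading whitespace
        words.push_back(word);
    return words;
}
```

Because operator>> skips any run of whitespace, repeated spaces and tabs do not produce empty words.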

EDIT: This is the same thing as in @dash-tom-bang's answer, but modified to be correct with respect to error handling:

vector<pair<string, int> > words; // each word paired with the line it was found on
int currentLine = 1; // or 0, however you wish to count...

string line;
while (getline(inputFile, line))
{
   istringstream inputString(line);
   string word;
   while (inputString >> word)
      words.push_back(make_pair(word, currentLine));
   ++currentLine; // advance once per line read
}
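Building on the same getline-then-split loop, one possible way to get both the frequency and the line occurrences in a single pass is to keep a small record per word. This is only a sketch: WordInfo and buildIndex are hypothetical names, and lines are assumed to be counted from 1.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Per-word record (hypothetical type, not from the original answers).
struct WordInfo {
    std::size_t count = 0;  // total number of occurrences
    std::set<int> lines;    // distinct 1-based line numbers where the word appears
};

// Read the stream line by line and index every whitespace-separated word.
std::map<std::string, WordInfo> buildIndex(std::istream& in)
{
    std::map<std::string, WordInfo> index;
    std::string line;
    int lineNumber = 0;
    while (std::getline(in, line)) {
        ++lineNumber;  // blank lines still advance the line number
        std::istringstream lineStream(line);
        std::string word;
        while (lineStream >> word) {
            WordInfo& info = index[word];
            ++info.count;
            info.lines.insert(lineNumber);
        }
    }
    return index;
}
```

A std::set is used so that a word repeated on the same line records that line only once, while count still reflects every occurrence.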

You're going to have to abandon reading into strings, because operator>>(istream&, string&) discards whitespace, and the contents of that whitespace (== '\n' or != '\n', that is the question...) are what will give you line numbers.

This is where OOP can save the day. You need to write a class to act as a "front end" for reading from the file. Its job will be to buffer data from the file, and return words one at a time to the caller.

Internally, the class needs to read data from the file a block (say, 4096 bytes) at a time. Then a string GetWord() (yes, returning by value here is good) method will:

  • First, read any white space characters, taking care to increment the object's lineNumber member every time it hits a \n.
  • Then read non-whitespace characters, putting them into the string object you'll be returning.
  • If it runs out of stuff to read, read the next block and continue.
  • If you hit the end of file, the string you have is the whole word (which may be empty) and should be returned.
  • If the function returns an empty string, that tells the caller that the end of file has been reached. (Files usually end with whitespace characters, so reading whitespace characters cannot imply that there will be a word later on.)

Then you can call this method at the same place in your code as your inputFile >> currentWord, and the rest of the code doesn't need to know the details of your block buffering.

An alternative approach is to read things a line at a time. The member functions that read into a character array, such as istream::getline, require you to create a fixed-size buffer beforehand, and if the line is longer than that buffer, you have to deal with it somehow; the free function std::getline, which reads into a std::string, has no such limit. With a fixed buffer, it could get more complicated than the class I described.

Short and sweet.

vector< map< string, size_t > > line_word_counts;

string line, word;
while ( getline( cin, line ) ) {
    line_word_counts.push_back( map< string, size_t >() ); // start an empty map for this line
    map< string, size_t > &word_counts = line_word_counts.back();

    istringstream line_is( line );
    while ( line_is >> word ) ++ word_counts[ word ];
}

cout << "'Hello' appears on line 5 " << line_word_counts[5-1]["Hello"]
     << " times\n";
