
How to separate sentence[i] (string) into words (string) in C++

I have a problem. In my project, I read sentences line by line from a dataset file that has one sentence per line. Then I need to separate each sentence into words, but I couldn't find out how to do this.

This is the code of the class that reads from the dataset:

class Input{
...
public:
    string *word;
    string *sentence;
    Couple *couple;    // int x, int y: order of sentence and word
    int number;
    int line;
...
    void readInput(string input);
};

This is the code of the read method:

void Input::readInput(string input)
{
    cout << "Reading the " << input << endl;

    ifstream infile;
    infile.open(input.c_str());

    if(!infile.is_open()){
        cerr << "Unable to open file: " << input << endl << endl;
        exit(-1);
    }

    for(int i=0; i<line; i++){
        getline(infile, sentence[i]);
        //infile >> sentence[i];
    }

    for(int j=0; j<line; j++){
        // I want to separate sentences into words
    }

    infile.close();
    cout << "Finished Reading the " << input << endl;
}

for(int j=0; j<line; ++j)
{
    std::istringstream iss(sentence[j]);
    for (std::string w; iss >> w; )
    {
        word[number] = w;
        ++number;
    }
}

You'll need to do something about punctuation, though, if you don't want it attached to your words: operator>> splits only on whitespace, so a token such as "end." keeps its period. That should be simple enough.
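One possible way to handle the punctuation is to trim it from both ends of each token after the whitespace split. This is only a sketch: trimPunctuation and splitWords are hypothetical helper names, not part of the question's Input class, and it returns a std::vector instead of filling the word array. It assumes that punctuation inside a token (the apostrophe in "It's") should be kept.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: strips leading and trailing punctuation from one token.
// Punctuation in the middle of the token (e.g. the apostrophe in "It's") stays.
std::string trimPunctuation(const std::string& token)
{
    std::size_t begin = 0;
    std::size_t end = token.size();
    while (begin < end && std::ispunct(static_cast<unsigned char>(token[begin])))
        ++begin;
    while (end > begin && std::ispunct(static_cast<unsigned char>(token[end - 1])))
        --end;
    return token.substr(begin, end - begin);
}

// Hypothetical helper: splits a sentence on whitespace, then trims each token.
// Tokens that consist only of punctuation (e.g. a lone "-") are dropped.
std::vector<std::string> splitWords(const std::string& sentence)
{
    std::vector<std::string> words;
    std::istringstream iss(sentence);
    for (std::string w; iss >> w; )
    {
        std::string cleaned = trimPunctuation(w);
        if (!cleaned.empty())
            words.push_back(cleaned);
    }
    return words;
}
```

Inside the loop over sentence[j], each element of splitWords(sentence[j]) could then be copied into word[number], provided the array is large enough.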

If it were me, in the method where you have:

for(int j=0;j<line ;j++){
    // I want to separate sentences into words
}

I would use a regex to match all the words in sentence[j]. Boost.Regex is a library I have used with great success in the past.
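A rough sketch of the regex idea follows. Note that it uses std::regex from the C++11 standard library rather than Boost.Regex, so that it compiles without extra dependencies. The function name matchWords and the pattern (runs of ASCII letters, digits, and apostrophes) are assumptions for illustration; adjust the pattern to whatever counts as a word in your dataset.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every match of the word pattern in a sentence.
// Assumption: a "word" is a run of ASCII letters, digits, or apostrophes.
std::vector<std::string> matchWords(const std::string& sentence)
{
    static const std::regex wordPattern("[A-Za-z0-9']+");
    std::vector<std::string> words;

    // A default-constructed sregex_iterator acts as the end-of-matches marker.
    for (std::sregex_iterator it(sentence.begin(), sentence.end(), wordPattern), end;
         it != end; ++it)
    {
        words.push_back(it->str());
    }
    return words;
}
```

Because the pattern matches only word characters, punctuation and whitespace are skipped in the same pass.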

You can also loop through the std::string representing each line, looking for end-of-word markers with std::string::find_first_of(). The parameter to find_first_of would be the set of characters that separate words in your text file (space, period, etc.).
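The find_first_of approach might look roughly like the sketch below. The function name splitOnDelimiters and the default delimiter set are assumptions; std::string::find_first_not_of is used alongside find_first_of so that runs of consecutive delimiters do not produce empty words.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits a sentence at any character in `delimiters`.
// The default delimiter set is an assumption; extend it to match your data.
std::vector<std::string> splitOnDelimiters(const std::string& sentence,
                                           const std::string& delimiters = " \t.,;:!?")
{
    std::vector<std::string> words;

    // Skip any leading delimiters to find the start of the first word.
    std::size_t start = sentence.find_first_not_of(delimiters);
    while (start != std::string::npos)
    {
        // The word ends at the next delimiter, or at the end of the string.
        std::size_t stop = sentence.find_first_of(delimiters, start);
        if (stop == std::string::npos)
        {
            words.push_back(sentence.substr(start));
            break;
        }
        words.push_back(sentence.substr(start, stop - start));

        // Skip the delimiter run to find the start of the next word.
        start = sentence.find_first_not_of(delimiters, stop);
    }
    return words;
}
```

With this version, deciding which punctuation to discard is just a matter of editing the delimiter string.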
