
C++: How to extract words from string with regex

I want to extract words from a string. There are two methods I can think of that would accomplish this:

  1. Extraction by a delimiter.
  2. Extraction by word pattern searching.

Before I get into the specifics of my problem, I want to clarify that while I do ask about the methods of extraction and their implementations, the main focus of my question is the regexes, not the implementations.

The words that I want to match can contain apostrophes (e.g. "Don't"), can be inside double or single quotes (apostrophes) (e.g. "Hello" and 'world'), or a combination of the two (e.g. "Didn't" and 'Won't'). They can also contain numbers (e.g. "2017" and "U2"), underscores, and hyphens (e.g. "hello_world" and "time-turner"). In-word apostrophes, underscores, and hyphens must be surrounded by other word characters. A final requirement is that strings containing random non-word characters (e.g. "Good mor~+%g.") should still have every run of word characters recognized as a word.

Example strings to extract words from and what I want the result to look like:

  1. "Hello, world!" should result in "Hello" and "world"
  2. "Aren't you clever?" should result in "Aren't", "you" and "clever"
  3. "'Later', she said." should result in "Later", "she" and "said"
  4. "'Maybe 5 o'clock?'" should result in "Maybe", "5" and "o'clock"
  5. "In the year 2017 ..." should result in "In", "the", "year" and "2017"
  6. "G2g, cya l8r" should result in "G2g", "cya" and "l8r"
  7. "hello_world.h" should result in "hello_world" and "h"
  8. "Hermione's time-turner." should result in "Hermione's" and "time-turner"
  9. "Good mor~+%g." should result in "Good", "mor" and "g"
  10. "Hi' Testing_ Bye-" should result in "Hi", "Testing" and "Bye"

Because – as far as I can tell – the two methods I proposed require quite different solutions I'll divide my question into two parts – one for each method.

1. Extraction by delimiter

This is the method I have spent most of my time on, and I have found a partially working solution. However, I suspect the regex I am using is not very efficient. My solution is this (using Boost.Regex because its Perl syntax supports lookbehinds):

#include <iostream>
#include <string>
#include <vector>
#include <boost/regex.hpp>

int main() {
    const std::vector<std::string> phrases{
        "Hello, world!", "Aren't you clever?",
        "'Later', she said.", "'Maybe 5 o'clock?'",
        "In the year 2017 ...", "G2g, cya l8r",
        "hello_world.h", "Hermione's time-turner.",
        "Good mor~+%g.", "Hi' Testing_ Bye-"};
    std::vector<std::string> words;

    // Pattern for the delimiters between words (Perl syntax, uses lookbehinds).
    boost::regex delimiterPattern("^'|[\\W]*(?<=\\W)'+\\W*|(?!\\w+(?<!')'(?!')\\w+)[^\\w']+|'$");
    boost::sregex_token_iterator end;
    for (const std::string& phrase : phrases) {
        // Submatch index -1 selects the text between delimiter matches.
        boost::sregex_token_iterator phraseIter(phrase.begin(), phrase.end(), delimiterPattern, -1);

        for ( ; phraseIter != end; ++phraseIter) {
            words.push_back(*phraseIter);
            std::cout << words.back() << '\n';
        }
    }
}

My biggest problem with this solution is the regex, which looks too complex and could probably be written much better. It also doesn't correctly handle apostrophes at the end of words, as in example 3. Here's a link to regex101.com with the regex and the example strings: Delimiter regex.
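For readers unfamiliar with the -1 submatch argument used above, here is a minimal sketch of delimiter-based splitting with std::regex instead of Boost. The helper name split_on_delimiter is hypothetical, and the deliberately naive delimiter \W+ used in the note below is an assumption for illustration, not the pattern from the question. Empty tokens, which appear when the input starts with a delimiter, are skipped.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the non-empty pieces of `text` that lie
// between matches of `delimiter`.
std::vector<std::string> split_on_delimiter(const std::string& text,
                                            const std::regex& delimiter) {
    std::vector<std::string> tokens;
    // Submatch index -1 asks the iterator for the text between matches
    // rather than for the matches themselves.
    std::sregex_token_iterator it(text.begin(), text.end(), delimiter, -1);
    const std::sregex_token_iterator end;
    for (; it != end; ++it) {
        if (it->length() > 0) {  // a leading delimiter yields an empty token
            tokens.push_back(it->str());
        }
    }
    return tokens;
}
```

Called with std::regex(R"(\W+)"), this splits "Aren't you clever?" into "Aren", "t", "you" and "clever", which shows why a usable delimiter has to treat in-word apostrophes specially.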

2. Extraction by word pattern searching

I haven't spent much time on this path myself and mainly included it as an alternative because my partial solution isn't necessarily the best one. My idea would be to repeatedly search the string for a pattern, removing each match from the string as you go until there are no more matches. I have a working regex for this method, but would still like input on it: "[A-Za-z0-9]+(['_-]?[A-Za-z0-9]+)?". Here's a link to regex101.com with the regex and the example strings: Word pattern regex.
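The repeated search described above could be sketched roughly as follows with std::regex_search. Instead of erasing each match from the string, this version moves the search start past the previous match, which is an assumption about the intended behaviour; the function name find_words is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every match of the word pattern from the
// question by searching repeatedly from the end of the previous match.
std::vector<std::string> find_words(const std::string& text) {
    static const std::regex word(R"([A-Za-z0-9]+(['_-]?[A-Za-z0-9]+)?)");
    std::vector<std::string> words;
    std::smatch match;
    auto start = text.cbegin();
    // Every match consumes at least one character, so the loop terminates.
    while (std::regex_search(start, text.cend(), match, word)) {
        words.push_back(match.str());
        start = match.suffix().first;  // continue after this match
    }
    return words;
}
```

Because the trailing group is optional rather than repeated, a word with more than one joiner is cut in two: find_words("mother-in-law") gives "mother-in" and "law".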

I want to emphasize again that I first and foremost want input on my regexes, but also appreciate help with implementing the methods.


Edit: Thanks @Galik for pointing out that possessive plurals can end in apostrophes. These apostrophes may be matched as part of a delimiter and do not have to be matched in a word pattern (i.e. "The kids' toys" should result in "The", "kids" and "toys").

You may use

[^\W_]+(?:['_-][^\W_]+)*

See the regex demo.

Pattern details:

  • [^\W_]+ - one or more characters that are neither non-word characters nor _ (i.e. one or more alphanumeric characters)
  • (?: - start of a non-capturing group that only groups subpatterns and matches:
    • ['_-] - a ', _ or -
    • [^\W_]+ - one or more alphanumeric characters
  • )* - end of the group, which is repeated zero or more times.

C++ demo:

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::regex r(R"([^\W_]+(?:['_-][^\W_]+)*)");
    std::string s = "Hello, world! Aren't you clever? 'Later', she said. Maybe 5 o'clock?' In the year 2017 ... G2g, cya l8r hello_world.h Hermione's time-turner. Good mor~+%g. Hi' Testing_ Bye- The kids' toys";
    // Iterate over every non-overlapping match and print it on its own line.
    for(std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
                             i != std::sregex_iterator();
                             ++i)
    {
        std::smatch m = *i;
        std::cout << m.str() << '\n';
    }
}
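To check the pattern against the individual examples from the question, the same iterator loop can be wrapped in a function that returns the words. This is only a sketch; the name extract_words is hypothetical and not part of the answer above.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical wrapper around the answer's pattern: returns every word
// found in `text`, in order of appearance.
std::vector<std::string> extract_words(const std::string& text) {
    static const std::regex word(R"([^\W_]+(?:['_-][^\W_]+)*)");
    std::vector<std::string> words;
    for (std::sregex_iterator it(text.begin(), text.end(), word), end;
         it != end; ++it) {
        words.push_back(it->str());
    }
    return words;
}
```

For instance, extract_words("'Maybe 5 o'clock?'") should give "Maybe", "5" and "o'clock", and extract_words("The kids' toys") should give "The", "kids" and "toys", matching the results requested in the question.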
