C++: How to extract words from a string with regex

I want to extract words from a string. There are two methods I can think of that would accomplish this:

  1. Extraction by a delimiter.
  2. Extraction by word pattern searching.

Before I get into the specifics of my problem, I want to clarify that while I do ask about the methods of extraction and their implementations, the main focus of my question is the regexes, not the implementations.

The words that I want to match can contain apostrophes (e.g. "Don't"), can be inside double or single quotes (apostrophes) (e.g. "Hello" and 'world'), or a combination of the two (e.g. "Didn't" and 'Won't'). They can also contain numbers (e.g. "2017" and "U2") as well as underscores and hyphens (e.g. "hello_world" and "time-turner"). In-word apostrophes, underscores, and hyphens must be surrounded by other word characters. A final requirement is that in strings containing random non-word characters (e.g. "Good mor¨+%g."), every run of word characters should still be recognized as a word.

Example strings to extract words from and what I want the result to look like:

  1. "Hello, world!" should result in "Hello" and "world"
  2. "Aren't you clever?" should result in "Aren't", "you" and "clever"
  3. "'Later', she said." should result in "Later", "she" and "said"
  4. "'Maybe 5 o'clock?'" should result in "Maybe", "5" and "o'clock"
  5. "In the year 2017 ..." should result in "In", "the", "year" and "2017"
  6. "G2g, cya l8r" should result in "G2g", "cya" and "l8r"
  7. "hello_world.h" should result in "hello_world" and "h"
  8. "Hermione's time-turner." should result in "Hermione's" and "time-turner"
  9. "Good mor~+%g." should result in "Good", "mor" and "g"
  10. "Hi' Testing_ Bye-" should result in "Hi", "Testing" and "Bye"

Because, as far as I can tell, the two methods I proposed require quite different solutions, I'll divide my question into two parts, one for each method.

1. Extraction by delimiter

This is the method I have spent most of my time developing, and I have found a partially working solution. However, I suspect the regex I am using is not very efficient. My solution is this (using Boost.Regex because its Perl syntax supports lookbehinds):

#include <iostream>
#include <string>
#include <vector>
#include <boost/regex.hpp>

int main() {
    const std::vector<std::string> phrases = {
        "Hello, world!", "Aren't you clever?",
        "'Later', she said.", "'Maybe 5 o'clock?'",
        "In the year 2017 ...", "G2g, cya l8r",
        "hello_world.h", "Hermione's time-turner.",
        "Good mor~+%g.", "Hi' Testing_ Bye-"};
    std::vector<std::string> words;

    // Matches the delimiters between words; the lookbehinds need Boost's Perl syntax.
    boost::regex delimiterPattern("^'|[\\W]*(?<=\\W)'+\\W*|(?!\\w+(?<!')'(?!')\\w+)[^\\w']+|'$");
    boost::sregex_token_iterator end;
    for (const std::string& phrase : phrases) {
        // Submatch index -1 yields the text between delimiter matches, i.e. the words.
        boost::sregex_token_iterator phraseIter(phrase.begin(), phrase.end(), delimiterPattern, -1);

        for ( ; phraseIter != end; ++phraseIter) {
            words.push_back(phraseIter->str());
            std::cout << words.back() << std::endl;
        }
    }
}

My largest problem with this solution is my regex, which I think looks too complex and could probably be done much better. It also doesn't correctly match apostrophes at the end of words, as in example 3. Here's a link to regex101.com with the regex and the example strings: Delimiter regex.

2. Extraction by word pattern searching

I haven't spent much time pursuing this path myself and mainly included it as an alternative because my partial solution isn't necessarily the best one. My suggestion for how to accomplish this would be to repeatedly search a string for a pattern, removing each match from the string as you go until there are no more matches. I have a working regex for this method, but would still like input on it: "[A-Za-z0-9]+(['_-]?[A-Za-z0-9]+)?". Here's a link to regex101.com with the regex and the example strings: Word pattern regex.

I want to emphasize again that I first and foremost want input on my regexes, but I would also appreciate help with implementing the methods.


Edit: Thanks @Galik for pointing out that possessive plurals can end in apostrophes. The apostrophes associated with these may be matched in a delimiter and do not have to be matched in a word pattern (i.e. "The kids' toys" should result in "The", "kids" and "toys").

You may use

[^\W_]+(?:['_-][^\W_]+)*

See the regex demo.

Pattern details:

  • [^\W_]+ - one or more chars other than non-word chars and _ (matches alphanumeric chars)
  • (?: - start of a non-capturing group that only groups subpatterns and matches:
    • ['_-] - a ', _ or -
    • [^\W_]+ - 1+ alphanumeric chars
  • )* - end of the group, which is repeated zero or more times.

C++ demo:

#include <iostream>
#include <regex>
#include <string>

int main() {
    // A word: alphanumeric runs joined by single ', _ or - characters.
    std::regex r(R"([^\W_]+(?:['_-][^\W_]+)*)");
    std::string s = "Hello, world! Aren't you clever? 'Later', she said. Maybe 5 o'clock?' In the year 2017 ... G2g, cya l8r hello_world.h Hermione's time-turner. Good mor~+%g. Hi' Testing_ Bye- The kids' toys";
    // Print every non-overlapping match on its own line.
    for(std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
                             i != std::sregex_iterator();
                             ++i)
    {
        std::smatch m = *i;
        std::cout << m.str() << '\n';
    }
}
