
C++ regex exclusion double quotes not working

I am considering input files with lines like

"20170103","MW JANE DOE","NL01 INGB 1234 5678 90","NL02 INGB 1234 5678 90","GT","Af","12,34","Internetbankieren","Mededeling_3"
"20170102","MW JANE DOE","NL01 INGB 1234 5678 90","NL02 INGB 1234 5678 90","GT","Af","12,34","Internetbankieren","Mededeling_2"
"20170101","MW JANE DOE","NL01 INGB 1234 5678 90","NL02 INGB 1234 5678 90","GT","Af","12,34","Internetbankieren","Mededeling_1"

I want to get the separate strings without the double quotes and store them in a std::vector<std::string>. So, for instance, I want to have 20170101, MW JANE DOE, NL01 INGB 1234 5678 90, NL02 INGB 1234 5678 90, GT, Af, 12,34, Internetbankieren, and Mededeling_1 as a result.

I try to do so with the code

std::regex re("\"(.*?)\"");
std::regex_iterator<std::string::iterator> it (line.begin(),line.end(),re);
std::regex_iterator<std::string::iterator> end;
std::vector<std::string> lineParts;
std::string linePart="";

// Split 'line' into line parts and save these in the vector 'lineParts'.
while (it!=end)
{
    linePart=it->str();
    std::cout<<linePart<<std::endl; // Print substring.
    lineParts.push_back(linePart);
    ++it;
}

However, the double quotes are still included in the elements of lineParts, even though I used the regex "\"(.*?)\"" so that supposedly only the part within the double quotes is saved, and not the double quotes themselves.

What am I doing wrong?

You have a pattern with a capturing group. So, when your regex finds a match, the double quotes are part of the whole match value (stored at index 0), but the captured part is stored at index 1.

So, you just need to access the contents of capturing group #1:

linePart=it->str(1);
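For completeness, here is a minimal sketch of the question's loop with that one change applied, wrapped in a function so it can be tried on a single line of input. The function name splitQuoted is made up for this illustration; the regex is the one from the question.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of the original code): returns the text
// found between each pair of double quotes in 'line', quotes excluded.
std::vector<std::string> splitQuoted(const std::string& line)
{
    static const std::regex re("\"(.*?)\"");
    std::vector<std::string> lineParts;

    // std::sregex_iterator is std::regex_iterator<std::string::const_iterator>.
    for (std::sregex_iterator it(line.begin(), line.end(), re), end; it != end; ++it)
    {
        // it->str() or it->str(0) would be the whole match, quotes included;
        // it->str(1) is capturing group 1, i.e. only the text inside the quotes.
        lineParts.push_back(it->str(1));
    }
    return lineParts;
}
```

Because the pattern matches from one quote to the next, a comma inside a field (such as 12,34) stays part of that field.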

See regular-expressions.info, Finding a Regex Match:

When the function call returns true, you can call the str(), position(), and length() member functions of the match_results object to get the text that was matched, or the starting position and its length of the match relative to the subject string. Call these member functions without a parameter or with 0 as the parameter to get the overall regex match. Call them passing 1 or greater to get the match of a particular capturing group. The size() member function indicates the number of capturing groups plus one for the overall match. Thus you can pass a value up to size()-1 to the other three member functions.
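To make the quoted description concrete, here is a small sketch that runs std::regex_search once and records what str(), position(), length(), and size() report for the question's pattern. The struct FieldInfo and the function firstField are names invented for this example only.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical struct for illustration: what match_results reports
// about the first quoted field of a line.
struct FieldInfo {
    std::string whole;        // str() / str(0): full match, quotes included
    std::string inner;        // str(1): capturing group 1, quotes excluded
    std::ptrdiff_t innerPos;  // position(1): offset of group 1 in the line
    std::ptrdiff_t innerLen;  // length(1): number of characters in group 1
    std::size_t count;        // size(): number of capturing groups plus one
};

// Hypothetical helper: fills 'out' and returns true if a quoted field is found.
bool firstField(const std::string& line, FieldInfo& out)
{
    static const std::regex re("\"(.*?)\"");
    std::smatch m;
    if (!std::regex_search(line, m, re))
        return false;

    out.whole = m.str();
    out.inner = m.str(1);
    out.innerPos = m.position(1);
    out.innerLen = m.length(1);
    out.count = m.size();
    return true;
}
```

With one capturing group in the pattern, size() is 2, so 1 is the largest index that can be passed to the other three functions.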

As others have said, regex_iterator::operator-> returns a match_results, and the argument of match_results::str defaults to 0:

The first sub_match (index 0) contained in a match_result always represents the full match within a target sequence made by a regex, and subsequent sub_matches represent sub-expression matches corresponding in sequence to the left parenthesis delimiting the sub-expression in the regex.

So the problem with your code is that you're not using linePart = it->str(1).

A better solution would be to use a regex_token_iterator, with which you could use your re to initialize lineParts directly:

vector<string> lineParts { sregex_token_iterator(cbegin(line), cend(line), re, 1), sregex_token_iterator() };
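A self-contained sketch of the same idea, assuming the line is already in a std::string. The function name splitQuotedTokens is only for illustration, and the names are spelled with std:: instead of relying on a using-directive.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: builds the vector directly from a token iterator range.
std::vector<std::string> splitQuotedTokens(const std::string& line)
{
    static const std::regex re("\"(.*?)\"");

    // The last constructor argument (1) tells the iterator to yield capturing
    // group 1 of each match instead of the whole match. A default-constructed
    // std::sregex_token_iterator marks the end of the sequence.
    return std::vector<std::string>(
        std::sregex_token_iterator(line.begin(), line.end(), re, 1),
        std::sregex_token_iterator());
}
```

Each std::ssub_match produced by the iterator converts implicitly to std::string, which is why the range constructor of the vector can be used without an explicit loop.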

But I'd just like to point out that C++14 introduced quoted, which does exactly what you're trying to do here, and more (it even handles escaped quotes for you!). It'd be a shame not to use it.

You are probably already getting your input from a stream, but in case you're not, you'd need to initialize an istringstream; for the purposes of this example I'll call mine line. Then you can use quoted to populate lineParts like this:

for(string linePart; line >> quoted(linePart); line.ignore(numeric_limits<streamsize>::max(), ',')) {
    lineParts.push_back(linePart);
}
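Put together as a compilable sketch, assuming a C++14 or later compiler and input held in a std::string. The wrapper function splitWithQuoted and its parameter text are invented for this example; the loop itself is the one shown above.

```cpp
#include <iomanip>   // std::quoted (C++14)
#include <ios>       // std::streamsize
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical wrapper: splits one line of quoted, comma-separated fields.
std::vector<std::string> splitWithQuoted(const std::string& text)
{
    std::istringstream line(text);
    std::vector<std::string> lineParts;

    // std::quoted strips the surrounding double quotes and unescapes \" and \\.
    // After each field, ignore() discards everything up to and including the
    // next comma; at the end of input the next extraction fails and the loop stops.
    for (std::string linePart; line >> std::quoted(linePart);
         line.ignore(std::numeric_limits<std::streamsize>::max(), ','))
    {
        lineParts.push_back(linePart);
    }
    return lineParts;
}
```

Unlike the regex, this also copes with a field that contains an escaped quote, for example "a \"b\" c".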

