How to avoid capturing whitespace after a newline with a regex in C++

Question

I am trying to capture comments from C/C++/Java files, but I cannot find a way to skip the whitespace that may follow a newline. My regex pattern is:

regex reg("(//.*|/\\*(.|\\n)*?\\*/)");

For example, in the following code (ignore the random code snippets; they could be anything) I capture the comments correctly:

// my  program in C++
#include <iostream>
/** playing around in
a new programming language **/
using namespace std;

and the output is:

// my  program in C++
/** playing around in
a new programming language **/

However, when I have code with leading whitespace inside a multiline comment, like:

int main(){
        /* start always points to the first node of the linked list.
           temp is used to point to the last node of the linked list.*/
        node *start,*temp;
        start = (node *)malloc(sizeof(node));
        temp = start;
        temp -> next = NULL;
        temp -> prev = NULL;
        /* Here in this code, we take the first node as a dummy node.
           The first node does not contain data, but it used because to avoid handling special cases
           in insert and delete functions.
         */
        printf("1. Insert\n");

I capture:

/* start always points to the first node of the linked list.
           temp is used to point to the last node of the linked list.*/
/* Here in this code, we take the first node as a dummy node.
           The first node does not contain data, but it used because to avoid handling special cases
           in insert and delete functions.
         */

instead of:

/* start always points to the first node of the linked list.
temp is used to point to the last node of the linked list.*/
/* Here in this code, we take the first node as a dummy node.
The first node does not contain data, but it used because to avoid handling special cases
in insert and delete functions.
*/

How can I change the regex pattern to avoid this?

NOTE: If possible, I would like to avoid string manipulation and solve this with a regex modification only.

Answer 1

Converting my comment above into an answer.

It is impossible to match discontinuous text with a single match. Instead, you can match a part of the text with a regex and then post-process the matched (or captured) value with another regex or with string manipulation.

Here is an example (not the best, just to show the concept):

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string data("int main(){// Singleline content\n        /* start always points to the first node of the linked list.\n           temp is used to point to the last node of the linked list.*/\n        node *start,*temp;\n        start = (node *)malloc(sizeof(node));\n        temp = start;\n        temp -> next = NULL;\n        temp -> prev = NULL;\n        /* Here in this code, we take the first node as a dummy node.\n           The first node does not contain data, but it used because to avoid handling special cases\n           in insert and delete functions.\n         */\n        printf(\"1. Insert\n\");");
    // Matches // comments and /* ... */ comments (unrolled-loop form).
    std::regex pattern(R"(//.*|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/)");
    std::smatch result;

    while (std::regex_search(data, result, pattern)) {
        // Drop horizontal whitespace at the start of the match and after each newline.
        std::cout << std::regex_replace(result[0].str(), std::regex(R"((^|\n)[^\S\r\n]+)"), "$1") << std::endl;
        data = result.suffix().str();  // continue searching after this match
    }
}

See the IDEONE demo.

NOTE: Raw string literals simplify the regex definition.

The pattern R"(//.*|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/)" has two alternatives. //.* matches // followed by zero or more characters other than a newline (single-line comments). /\*[^*]*\*+(?:[^/*][^*]*\*+)*/ matches /*, then zero or more non-* characters, then one or more * characters, then zero or more repetitions of the sequence "one character other than / and *, zero or more non-* characters, one or more * characters", and finally the closing / (multiline comments). This multiline branch is much more efficient than the one you have, since it is written according to the unroll-the-loop technique.
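To make the two alternatives concrete, here is a minimal sketch that collects every comment the pattern matches. The helper name extract_comments is hypothetical and not part of the original answer; it only wraps the same pattern in a std::sregex_iterator loop.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the original answer): returns every
// comment matched by the unrolled pattern, in order of appearance.
std::vector<std::string> extract_comments(const std::string& src) {
    // Left branch:  "//" up to the end of the line.
    // Right branch: "/*", non-stars, 1+ stars, then repeated groups of
    //               (one char that is neither '/' nor '*')(non-stars)(1+ stars),
    //               and finally the closing '/'.
    static const std::regex pattern(R"(//.*|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/)");

    std::vector<std::string> comments;
    for (std::sregex_iterator it(src.begin(), src.end(), pattern), end; it != end; ++it) {
        comments.push_back(it->str());
    }
    return comments;
}
```

Note that this sketch, like the pattern itself, does not know about string literals, so a // or /* inside a quoted string would also be reported as a comment.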

I removed the leading horizontal whitespace on each line with regex_replace(result[0].str(), std::regex(R"((^|\n)[^\S\r\n]+)"), "$1"): (^|\n) matches and captures either the start of the string or a newline, and [^\S\r\n]+ matches one or more characters other than non-whitespace, CR, and LF, that is, horizontal whitespace. Replacing the match with $1 keeps the newline and drops the indentation.
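The post-processing step can also be isolated in a small function, which makes it easier to check on its own. This is a sketch only; the name strip_line_indent is an assumption introduced here for illustration, and the regex is the one from the answer.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the original answer): removes horizontal
// whitespace at the start of the string and right after every newline.
std::string strip_line_indent(const std::string& comment) {
    // (^|\n)      captures the start of the string or a newline.
    // [^\S\r\n]+  matches 1+ whitespace characters that are not CR or LF.
    static const std::regex indent(R"((^|\n)[^\S\r\n]+)");
    // "$1" puts the captured newline (or nothing, at the start) back.
    return std::regex_replace(comment, indent, "$1");
}
```

Calling strip_line_indent on each matched comment gives the output the question asks for, while the comment pattern itself stays unchanged.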

Solution 1: 1 accepted, 2016-05-03 19:23:14
