Python regex pattern to find strings enclosed in preprocessor directives

Question

I'm trying to read C++ source files with Python to extract the included header files.
The header files are specified between #ifdef TYPEA and either #else or #endif. If there is an #else clause, the header files are always specified before it.

Let's assume an excerpt of the source content looks like this:

source_content = '\n'.join([
    '#ifdef TYPEZERO',
    '  int someint = 42;',
    '#endif',
    'void abc ( int value) {',
    '  return 5 ** 2.5',
    '}',
    '',
    'abc',
    '#ifdef TYPEA',                                 # <---- begin identifier, may contain leading/trailing whitespaces
    '#include "some_header.h"',                     # <---- I want these lines
    '           #include "some_other_header23.h"',  # <---- I want these lines
    '           #else        ',                     # optional stop identifier, may contain leading/trailing whitespaces
    'double in_fact_int = 5;',                      # some irrelevant content
    '         #endif    ',                          # final stop identifier, may contain leading/trailing whitespaces
    '',
    'a = 5',
    '#ifdef TYPEB',
    '  abc = 23.5;',
    '#endif',
])

I'd like to extract the lines marked in the comments above, excluding #ifdef TYPEA, #else and #endif, so that my result is:

desired_match = '#include "some_header.h"\n           #include "some_other_header23.h"'

print(desired_match)
# Out:    #include "some_header.h"
# Out:               #include "some_other_header23.h"

Removing the whitespace would be nice, but I'm fine with doing that separately from the regex.

My current approach is:

import re

pattern = re.compile(
    r'(\s*.*)#ifdef(\s+)TYPEA(\s*)(.*?)(?=((\s*)#else|(\s*)#endif))',
    re.DOTALL
)
match = re.match(pattern, source_content)

print(match.group())
# Out:    #ifdef TYPEZERO
# Out:      int someint = 42;
# Out:    #endif
# Out:    void abc ( int value) {
# Out:      return 5 ** 2.5
# Out:    }
# Out:    
# Out:    abc
# Out:    #ifdef TYPEA
# Out:    #include "some_header.h"
# Out:               #include "some_other_header23.h"

This cuts off #else or #endif correctly, but as you can see, #ifdef TYPEA and, even worse, all preceding lines are matched as well.
If I remove the leading (\s*.*) from the pattern (or change it to (\s*)), I don't get any match at all.

How can I exclude the lines before #ifdef TYPEA, and possibly #ifdef TYPEA itself, to get my desired match? Thanks in advance!

Answer 1

Here's a way using named groups. You had solved most of the problem already.

Notice the change in the regex: the part that follows #ifdef TYPEA is now wrapped in the named group (?P<M>...), so it can be retrieved on its own.

import re

source_content = '\n'.join([
    '#ifdef TYPEZERO',
    '  int someint = 42;',
    '#endif',
    'void abc ( int value) {',
    '  return 5 ** 2.5',
    '}',
    '',
    'abc',
    '#ifdef TYPEA',                                 # <---- begin identifier, may contain leading/trailing whitespaces
    '#include "some_header.h"',                     # <---- I want these lines
    '           #include "some_other_header23.h"',  # <---- I want these lines
    '           #else        ',                     # optional stop identifier, may contain leading/trailing whitespaces
    'double in_fact_int = 5;',                      # some irrelevant content
    '         #endif    ',                          # final stop identifier, may contain leading/trailing whitespaces
    '',
    'a = 5',
    '#ifdef TYPEB',
    '  abc = 23.5;',
    '#endif',
])

pattern = re.compile(
    r'(\s*.*)#ifdef(\s+)TYPEA(\s*)(?P<M>(.*?)(?=((\s*)#else|(\s*)#endif)))',
    re.DOTALL
)
match = re.match(pattern, source_content)

print(match.group("M"))
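If the indentation should go as well, the named group can be post-processed line by line. The following is only a sketch on a shortened, made-up input (sample and headers are names introduced here, not part of the question); the pattern is the one from this answer.

```python
import re

# Shortened, hypothetical sample in the same shape as source_content.
sample = '\n'.join([
    'abc',
    '#ifdef TYPEA',
    '#include "some_header.h"',
    '           #include "some_other_header23.h"',
    '           #else        ',
    'double in_fact_int = 5;',
    '         #endif    ',
])

pattern = re.compile(
    r'(\s*.*)#ifdef(\s+)TYPEA(\s*)(?P<M>(.*?)(?=((\s*)#else|(\s*)#endif)))',
    re.DOTALL,
)
match = pattern.match(sample)

# Split the captured block into lines and strip the surrounding whitespace.
headers = [line.strip() for line in match.group('M').splitlines()]
print(headers)
```

This keeps the regex unchanged and does the whitespace cleanup in plain Python, which is what the question said would be acceptable.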

Answer 2

You can use re.search instead of re.match, and use group numbers to get parts of your regex results.

pattern = re.compile(
    r'\s+#ifdef\s+TYPEA\s*(.*?)(?=(\s*#else|\s*#endif))',
    re.DOTALL
)
match = re.search(pattern, source_content)

print(match.group(1))

Does that solve your problem?
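To illustrate why the original attempt found nothing once the leading group was removed: re.match only tries the pattern at position 0, while re.search scans the whole string. Below is a small sketch on a made-up input (text and at_start are hypothetical, not from the question). It also shows one thing to be aware of with this pattern: the leading \s+ requires at least one whitespace character before #ifdef, so a string that begins directly with #ifdef TYPEA is not matched.

```python
import re

pattern = re.compile(
    r'\s+#ifdef\s+TYPEA\s*(.*?)(?=(\s*#else|\s*#endif))',
    re.DOTALL,
)

# Hypothetical input: the TYPEA block is not at the start of the string.
text = 'int x;\n#ifdef TYPEA\n#include "a.h"\n#endif'

# match() only tries position 0, where the string starts with 'int x;'.
print(pattern.match(text))

# search() scans forward and finds the block; group 1 is the captured body.
print(pattern.search(text).group(1))

# Hypothetical input that starts directly with the directive: the leading
# \s+ has nothing to consume, so even search() finds no match here.
at_start = '#ifdef TYPEA\n#include "a.h"\n#endif'
print(pattern.search(at_start))
```

If the directive can appear on the very first line, relaxing \s+ to \s* (or dropping it) is one possible adjustment.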

Answer 3

If you only want what's between #ifdef TYPEA and #else or #endif, you can match the whole thing and create a group between those keywords. re.findall will return the group:

import re
comment_pattern = re.compile(r'#ifdef TYPEA(.*?)(?:#else|#endif)', re.MULTILINE | re.DOTALL)
print(*re.findall(comment_pattern, source_content), sep='\n-------------\n')

Output:

#include "some_header.h"
           #include "some_other_header23.h"
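Because the group here captures everything between the keywords, each result also carries the newline after #ifdef TYPEA and any indentation before #else or #endif. A possible way to clean that up, sketched on a made-up input with two TYPEA blocks (text, blocks and headers are hypothetical names). re.MULTILINE is left out in this sketch since the pattern contains no ^ or $ anchors for it to affect.

```python
import re

# Hypothetical input containing two TYPEA blocks.
text = '\n'.join([
    '#ifdef TYPEA',
    '  #include "first.h"',
    '#else',
    'int x;',
    '#endif',
    '#ifdef TYPEA',
    '#include "second.h"',
    '#endif',
])

comment_pattern = re.compile(r'#ifdef TYPEA(.*?)(?:#else|#endif)', re.DOTALL)

# findall returns one captured string per matched block.
blocks = comment_pattern.findall(text)

# Flatten the blocks into stripped, non-empty lines.
headers = [
    line.strip()
    for block in blocks
    for line in block.splitlines()
    if line.strip()
]
print(headers)
```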

Answer 4

You can use the following regex with re.search (note that re.match only returns matches found at the start of a string, so re.search is more versatile):

#ifdef\s+TYPEA\s*(.*?)(?=\s*#(?:else|endif))

If you need multiple matches, you can plug this regex into re.findall.

See the regex demo. Details:

#ifdef - a fixed string
\s+ - one or more whitespaces
TYPEA - a fixed string
\s* - zero or more whitespaces
(.*?) - Group 1: any zero or more chars, as few as possible (also matches line break chars, as re.DOTALL is used)
(?=\s*#(?:else|endif)) - a positive lookahead that matches a location immediately followed by zero or more whitespaces and then either #else or #endif.

See the Python demo, too.
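Putting the pieces together, a minimal sketch of how this regex could be wrapped for reuse. The names TYPEA_BLOCK and extract_typea_headers are made up for this illustration, the input is a shortened version of the question's sample, and re.finditer is used in place of re.findall only to make the group access explicit.

```python
import re

# The regex from this answer, compiled with re.DOTALL so that .*? can
# span several lines.
TYPEA_BLOCK = re.compile(r'#ifdef\s+TYPEA\s*(.*?)(?=\s*#(?:else|endif))', re.DOTALL)


def extract_typea_headers(source):
    """Return the stripped lines of every '#ifdef TYPEA' block in source.

    Hypothetical helper, written only for this example.
    """
    headers = []
    for match in TYPEA_BLOCK.finditer(source):
        # Group 1 holds the text between the directive and #else/#endif.
        headers.extend(line.strip() for line in match.group(1).splitlines())
    return headers


# Shortened version of the sample from the question.
source = '\n'.join([
    '#ifdef TYPEZERO',
    '  int someint = 42;',
    '#endif',
    '#ifdef TYPEA',
    '#include "some_header.h"',
    '           #include "some_other_header23.h"',
    '           #else        ',
    'double in_fact_int = 5;',
    '         #endif    ',
])

result = extract_typea_headers(source)
print(result)
```

Note that TYPEA in this pattern would also match the beginning of a longer macro name such as TYPEAB; adding \b after TYPEA is one way to guard against that if such names can occur.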
