
Python regex pattern to find string enclosed in preprocessor directives

I'm trying to read C++ source files with Python to extract the header files they load.
The header files are specified between #ifdef TYPEA and either #else or #endif. If there is an #else clause, the header files are always specified before it.

Let's assume an excerpt of the source looks like this:

source_content = '\n'.join([
    '#ifdef TYPEZERO',
    '  int someint = 42;',
    '#endif',
    'void abc ( int value) {',
    '  return 5 ** 2.5',
    '}',
    '',
    'abc',
    '#ifdef TYPEA',                                 # <---- begin identifier, may contain leading/trailing whitespaces
    '#include "some_header.h"',                     # <---- I want these lines
    '           #include "some_other_header23.h"',  # <---- I want these lines
    '           #else        ',                     # optional stop identifier, may contain leading/trailing whitespaces
    'double in_fact_int = 5;',                      # some irrelevant content
    '         #endif    ',                          # final stop identifier, may contain leading/trailing whitespaces
    '',
    'a = 5',
    '#ifdef TYPEB',
    '  abc = 23.5;',
    '#endif',
])

I'd like to extract the lines marked with comments above, excluding #ifdef TYPEA, #else, and #endif, so that my result is:

desired_match = '#include "some_header.h"\n           #include "some_other_header23.h"'

print(desired_match)
# Out:    #include "some_header.h"
# Out:               #include "some_other_header23.h"

Removing the whitespace would be nice, but I'm fine with doing that separately from the regex.

My current approach is:

import re

pattern = re.compile(
    r'(\s*.*)#ifdef(\s+)TYPEA(\s*)(.*?)(?=((\s*)#else|(\s*)#endif))',
    re.DOTALL
)
match = re.match(pattern, source_content)

print(match.group())
# Out:    #ifdef TYPEZERO
# Out:      int someint = 42;
# Out:    #endif
# Out:    void abc ( int value) {
# Out:      return 5 ** 2.5
# Out:    }
# Out:    
# Out:    abc
# Out:    #ifdef TYPEA
# Out:    #include "some_header.h"
# Out:               #include "some_other_header23.h"

This works fine for cutting off #else or #endif, but as you can see, #ifdef TYPEA and, even worse, all preceding lines are also matched.
If I remove the leading (\s*.*) from the pattern (or change it to (\s*)), I get no match at all.

How can I exclude the lines before #ifdef TYPEA, and ideally #ifdef TYPEA itself, to get my desired match? Thanks in advance!

Here's a way using named groups. You had solved most of the problem already.

Notice the change in the regex: the part that follows #ifdef TYPEA is now wrapped in a named group, (?P<M>...).

import re

source_content = '\n'.join([
    '#ifdef TYPEZERO',
    '  int someint = 42;',
    '#endif',
    'void abc ( int value) {',
    '  return 5 ** 2.5',
    '}',
    '',
    'abc',
    '#ifdef TYPEA',                                 # <---- begin identifier, may contain leading/trailing whitespaces
    '#include "some_header.h"',                     # <---- I want these lines
    '           #include "some_other_header23.h"',  # <---- I want these lines
    '           #else        ',                     # optional stop identifier, may contain leading/trailing whitespaces
    'double in_fact_int = 5;',                      # some irrelevant content
    '         #endif    ',                          # final stop identifier, may contain leading/trailing whitespaces
    '',
    'a = 5',
    '#ifdef TYPEB',
    '  abc = 23.5;',
    '#endif',
])

pattern = re.compile(
    r'(\s*.*)#ifdef(\s+)TYPEA(\s*)(?P<M>(.*?)(?=((\s*)#else|(\s*)#endif)))',
    re.DOTALL
)
match = re.match(pattern, source_content)

print(match.group( "M" ))

You can use re.search instead of re.match and use group numbers to get parts of your regex results.

pattern = re.compile(
    r'\s+#ifdef\s+TYPEA\s*(.*?)(?=(\s*#else|\s*#endif))',
    re.DOTALL
)
match = re.search(pattern, source_content)

print(match.group(1))

Does this solve your problem?
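One caveat with the pattern above: its leading \s+ requires at least one whitespace character before #ifdef, so it finds nothing when the directive is the very first thing in the text. The sketch below shows this on a made-up input; the `relaxed` variant with \s* is an assumption for illustration, not part of the answer.

```python
import re

# Pattern from the answer above: requires whitespace before "#ifdef".
strict = re.compile(r'\s+#ifdef\s+TYPEA\s*(.*?)(?=(\s*#else|\s*#endif))', re.DOTALL)
# Assumed variant: leading whitespace is optional.
relaxed = re.compile(r'\s*#ifdef\s+TYPEA\s*(.*?)(?=(\s*#else|\s*#endif))', re.DOTALL)

# Hypothetical input where the directive is the first line of the file.
text = '#ifdef TYPEA\n#include "first.h"\n#endif\n'

strict_match = strict.search(text)
relaxed_match = relaxed.search(text)

print(strict_match)            # None: no whitespace precedes "#ifdef"
print(relaxed_match.group(1))  # #include "first.h"
```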

If you only want what's between #ifdef TYPEA and #else or #endif, you can match the whole thing and create a group between those keywords. re.findall will return the group:

import re
comment_pattern = re.compile(r'#ifdef TYPEA(.*?)(?:#else|#endif)', re.MULTILINE | re.DOTALL)
print(*re.findall(comment_pattern, source_content), sep='\n-------------\n')

Output:

#include "some_header.h"
           #include "some_other_header23.h"
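Note that the captured group also contains the newline after #ifdef TYPEA and the indentation before #else. A minimal sketch of tidying it with str.strip(), using a shortened copy of the question's sample (the variable names `source` and `raw` are chosen for this illustration):

```python
import re

source = '\n'.join([
    '#ifdef TYPEA',
    '#include "some_header.h"',
    '           #include "some_other_header23.h"',
    '           #else        ',
    'double in_fact_int = 5;',
    '         #endif    ',
])

comment_pattern = re.compile(r'#ifdef TYPEA(.*?)(?:#else|#endif)', re.DOTALL)

raw = comment_pattern.findall(source)[0]
print(repr(raw))    # the raw capture starts with '\n' and ends with indentation
print(raw.strip())  # surrounding whitespace removed
```

The inner indentation of the second include line survives strip(); removing it needs a per-line pass.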

You can use the following regex with re.search (note that re.match only returns a match found at the start of the string, so re.search is more versatile):

#ifdef\s+TYPEA\s*(.*?)(?=\s*#(?:else|endif))
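To illustrate the difference between re.match and re.search mentioned above, here is a minimal sketch. The sample `text` is made up for the illustration; the pattern is the one just shown, compiled with re.DOTALL.

```python
import re

pattern = re.compile(r'#ifdef\s+TYPEA\s*(.*?)(?=\s*#(?:else|endif))', re.DOTALL)

# Hypothetical source text: the TYPEA block is not at the very beginning.
text = 'int x = 1;\n#ifdef TYPEA\n#include "a.h"\n#else\nint y;\n#endif\n'

at_start = pattern.match(text)   # only tries to match at position 0
anywhere = pattern.search(text)  # scans forward for the first match

print(at_start)           # None
print(anywhere.group(1))  # #include "a.h"
```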

If you need multiple matches, you can plug this regex into re.findall.

See the regex demo. Details:
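As a sketch of the multiple-match case, assuming a made-up source with two TYPEA blocks: because the pattern has exactly one capturing group, re.findall returns a list with that group's text for each match.

```python
import re

pattern = re.compile(r'#ifdef\s+TYPEA\s*(.*?)(?=\s*#(?:else|endif))', re.DOTALL)

# Hypothetical source with two separate TYPEA blocks.
text = '\n'.join([
    '#ifdef TYPEA',
    '#include "a.h"',
    '#endif',
    'int x;',
    '#ifdef TYPEA',
    '  #include "b.h"',
    '#else',
    'int y;',
    '#endif',
])

blocks = pattern.findall(text)
print(blocks)  # ['#include "a.h"', '#include "b.h"']
```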

  • #ifdef - a fixed string
  • \s+ - one or more whitespace characters
  • TYPEA - a fixed string
  • \s* - zero or more whitespace characters
  • (.*?) - Group 1: any zero or more characters, as few as possible (this also matches line break characters because re.DOTALL is used)
  • (?=\s*#(?:else|endif)) - a positive lookahead that matches a location immediately followed by zero or more whitespace characters and then either #else or #endif.
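Putting the pieces above together, here is a minimal sketch that also strips the whitespace from each extracted line. The helper name `extract_headers`, the use of re.escape, and the added \b (so that TYPEA does not also match TYPEAB) are assumptions for illustration and not part of the pattern above.

```python
import re


def extract_headers(source, macro):
    """Return stripped lines between '#ifdef <macro>' and '#else'/'#endif'.

    Hypothetical helper for illustration only.
    """
    pattern = re.compile(
        r'#ifdef\s+' + re.escape(macro) + r'\b\s*(.*?)(?=\s*#(?:else|endif))',
        re.DOTALL,
    )
    match = pattern.search(source)
    if match is None:
        return []
    # Strip each line separately and drop lines that are empty afterwards.
    return [line.strip() for line in match.group(1).splitlines() if line.strip()]


sample = '\n'.join([
    '#ifdef TYPEAB',
    '#include "wrong.h"',
    '#endif',
    '  #ifdef TYPEA  ',
    '#include "some_header.h"',
    '           #include "some_other_header23.h"',
    '           #else        ',
    'double in_fact_int = 5;',
    '         #endif    ',
])

headers = extract_headers(sample, 'TYPEA')
print(headers)
```

This sketch does not handle nested #ifdef blocks inside the TYPEA section; the lazy group stops at the first #else or #endif it sees.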

See the Python demo, too.
