Python regex pattern to find string enclosed in preprocessor directives
I'm trying to read C++ source files with Python to extract the header files they include.
The header files are specified between #ifdef TYPEA and either #else or #endif. If there is an #else clause, the header files are always specified before it.
Let's assume an excerpt of the source content looks like this:
source_content = '\n'.join([
'#ifdef TYPEZERO',
' int someint = 42;',
'#endif',
'void abc ( int value) {',
' return 5 ** 2.5',
'}',
'',
'abc',
'#ifdef TYPEA', # <---- begin identifier, may contain leading/trailing whitespaces
'#include "some_header.h"', # <---- I want these lines
' #include "some_other_header23.h"', # <---- I want these lines
' #else ', # optional stop identifier, may contain leading/trailing whitespaces
'double in_fact_int = 5;', # some irrelevant content
' #endif ', # final stop identifier, may contain leading/trailing whitespaces
'',
'a = 5',
'#ifdef TYPEB',
' abc = 23.5;',
'#endif',
])
I'd like to extract the lines marked with comments, excluding #ifdef TYPEA, #else, and #endif, so that my result is:
desired_match = '#include "some_header.h"\n #include "some_other_header23.h"'
print(desired_match)
# Out: #include "some_header.h"
# Out: #include "some_other_header23.h"
Removing the whitespace would be nice, but I'm fine with doing this separately from the regex.
My current approach is:
import re
pattern = re.compile(
r'(\s*.*)#ifdef(\s+)TYPEA(\s*)(.*?)(?=((\s*)#else|(\s*)#endif))',
re.DOTALL
)
match = re.match(pattern, source_content)
print(match.group())
# Out: #ifdef TYPEZERO
# Out: int someint = 42;
# Out: #endif
# Out: void abc ( int value) {
# Out: return 5 ** 2.5
# Out: }
# Out:
# Out: abc
# Out: #ifdef TYPEA
# Out: #include "some_header.h"
# Out: #include "some_other_header23.h"
This works fine for cutting off #else or #endif, but as you can see, #ifdef TYPEA and, even worse, all preceding lines are also matched.
If I remove the leading (\s*.*) from the pattern (or change it to (\s*)), I don't get any match at all.
How can I exclude the lines before #ifdef TYPEA, and possibly #ifdef TYPEA itself, to get my desired match? Thanks in advance!
Here's a way using named groups. You had solved most of the problem already.
Notice the change in the regex: the part that follows #ifdef TYPEA is now wrapped in the named group (?P<M>...), so it can be retrieved on its own.
import re
source_content = '\n'.join([
'#ifdef TYPEZERO',
' int someint = 42;',
'#endif',
'void abc ( int value) {',
' return 5 ** 2.5',
'}',
'',
'abc',
'#ifdef TYPEA', # <---- begin identifier, may contain leading/trailing whitespaces
'#include "some_header.h"', # <---- I want these lines
' #include "some_other_header23.h"', # <---- I want these lines
' #else ', # optional stop identifier, may contain leading/trailing whitespaces
'double in_fact_int = 5;', # some irrelevant content
' #endif ', # final stop identifier, may contain leading/trailing whitespaces
'',
'a = 5',
'#ifdef TYPEB',
' abc = 23.5;',
'#endif',
])
pattern = re.compile(
r'(\s*.*)#ifdef(\s+)TYPEA(\s*)(?P<M>(.*?)(?=((\s*)#else|(\s*)#endif)))',
re.DOTALL
)
match = re.match(pattern, source_content)
print(match.group("M"))
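The question also mentions that removing the surrounding whitespace would be nice. A minimal sketch of doing that as a separate step after the match is shown here; the shortened source_content is a hypothetical stand-in for the full excerpt, and the name headers is introduced only for illustration.

```python
import re

# Shortened, hypothetical stand-in for the source_content used above.
source_content = '\n'.join([
    'abc',
    '#ifdef TYPEA',
    '#include "some_header.h"',
    '    #include "some_other_header23.h"',
    '  #else  ',
    'double in_fact_int = 5;',
    '  #endif  ',
])

pattern = re.compile(
    r'(\s*.*)#ifdef(\s+)TYPEA(\s*)(?P<M>(.*?)(?=((\s*)#else|(\s*)#endif)))',
    re.DOTALL,
)
match = pattern.match(source_content)

# Strip each captured line and drop any line that ends up empty.
headers = [line.strip() for line in match.group('M').splitlines() if line.strip()]
print(headers)
```

Keeping the cleanup outside the regex leaves the pattern unchanged and makes it easy to adjust the post-processing later.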
You can use re.search instead of re.match and use group numbers to get parts of your regex results.
pattern = re.compile(
r'#ifdef\s+TYPEA\s*(.*?)(?=(\s*#else|\s*#endif))',
re.DOTALL
)
match = re.search(pattern, source_content)
print(match.group(1))
Does this solve your problem?
If you only want what's between #ifdef TYPEA and #else or #endif, you can match the whole thing and create a group between those keywords. re.findall will return the group:
import re
comment_pattern = re.compile(r'#ifdef TYPEA(.*?)(?:#else|#endif)', re.MULTILINE | re.DOTALL)
print(*re.findall(comment_pattern, source_content), sep='\n-------------\n')
Output:
#include "some_header.h"
#include "some_other_header23.h"
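One caveat with a literal '#ifdef TYPEA' prefix: it requires exactly one space and also matches longer identifiers such as TYPEAB. The sketch below compares it with a stricter variant that uses \s+ and a word boundary; the input text and the names loose and strict are hypothetical and only serve the comparison.

```python
import re

# Hypothetical input: one block guarded by TYPEAB, one by TYPEA (two spaces).
text = '\n'.join([
    '#ifdef TYPEAB',
    '#include "not_wanted.h"',
    '#endif',
    '#ifdef  TYPEA',
    '#include "wanted.h"',
    '#endif',
])

# Pattern from the answer above (re.MULTILINE omitted: no ^ or $ is used).
loose = re.compile(r'#ifdef TYPEA(.*?)(?:#else|#endif)', re.DOTALL)
# Stricter variant: any amount of whitespace, and TYPEA must end at a word boundary.
strict = re.compile(r'#ifdef\s+TYPEA\b(.*?)(?:#else|#endif)', re.DOTALL)

loose_blocks = [block.strip() for block in loose.findall(text)]
strict_blocks = [block.strip() for block in strict.findall(text)]
print(loose_blocks)   # picks up the TYPEAB block and misses the two-space TYPEA block
print(strict_blocks)  # only the TYPEA block
```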
You can use the following regex with re.search (note that re.match only returns matches found at the start of the string, so re.search is more versatile):
#ifdef\s+TYPEA\s*(.*?)(?=\s*#(?:else|endif))
If you need multiple matches, you can plug this regex into re.findall.
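To make the difference between the three calls concrete, here is a small sketch using the regex above. The input text, with two TYPEA blocks, is made up for illustration, as are the variable names.

```python
import re

# Hypothetical input with two TYPEA blocks; neither starts at position 0.
text = '\n'.join([
    'int x = 1;',
    '#ifdef TYPEA',
    '#include "first.h"',
    '#else',
    'int y = 2;',
    '#endif',
    '  #ifdef TYPEA  ',
    '  #include "second.h"',
    '  #endif',
])

pattern = re.compile(r'#ifdef\s+TYPEA\s*(.*?)(?=\s*#(?:else|endif))', re.DOTALL)

anchored = pattern.match(text)      # only tries position 0, so it finds nothing
first = pattern.search(text)        # scans forward to the first match
all_blocks = pattern.findall(text)  # group 1 of every non-overlapping match

print(anchored)
print(first.group(1))
print(all_blocks)
```

This is also why the original pattern stopped matching once the leading (\s*.*) was removed: with re.match, something has to consume the text that precedes #ifdef TYPEA.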
See the regex demo. Details:
#ifdef - a fixed string
\s+ - one or more whitespaces
TYPEA - a fixed string
\s* - zero or more whitespaces
(.*?) - Group 1: any zero or more chars, as few as possible (also matches line break chars because re.DOTALL is used)
(?=\s*#(?:else|endif)) - a positive lookahead that matches a location immediately followed by zero or more whitespaces and then either #else or #endif.
See the Python demo, too.
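The breakdown above can also be kept next to the pattern itself with re.VERBOSE. This is only a sketch of that idea: in verbose mode an unescaped # starts a comment, so the literal # characters are written as \#. The sample text and the name headers are assumptions for illustration.

```python
import re

pattern = re.compile(r'''
    \#ifdef                   # fixed string "#ifdef"
    \s+                       # one or more whitespace characters
    TYPEA                     # fixed string "TYPEA"
    \s*                       # zero or more whitespace characters
    (.*?)                     # group 1: any characters, as few as possible
    (?=\s*\#(?:else|endif))   # lookahead: optional whitespace, then else/endif directive
''', re.DOTALL | re.VERBOSE)

# Hypothetical, shortened version of the source from the question.
text = '\n'.join([
    'abc',
    '#ifdef TYPEA',
    '#include "some_header.h"',
    '  #include "some_other_header23.h"',
    '  #else  ',
    'double in_fact_int = 5;',
    '  #endif  ',
])

match = pattern.search(text)
# Trim leading and trailing whitespace from each captured line.
headers = [line.strip() for line in match.group(1).splitlines()]
print(headers)
```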