RegEx - How to get only a multiline block of text that repeats in a large output?

I'm parsing a large output (25 MB, provided here) from a quantum chemistry software package. The software performs a calculation using two methods: CASSCF and NEVPT2. Each method performs the same calculation, leading to different results. I've set up my script to run the calculation several times for different configurations, so in the end I have something organized like this:

JOB 1
CASSCF RESULTS
***
Lots of text
***
end
NEVPT2 RESULTS
***
Lots of text
***
end

JOB 2
CASSCF RESULTS
***
Lots of text
***
end
NEVPT2 RESULTS
***
Lots of text
***
end
………………
JOB 31
CASSCF RESULTS
***
Lots of text
***
end
NEVPT2 RESULTS
***
Lots of text
***
end

I only want the NEVPT2 results, and I've set up my regular expression like this (it is applied to the actual output; my example above only shows the organization):

NEVPT2_Section = r"(?:AILFT MATRIX ELEMENTS \(NEVPT2\)\n-+\n\n)([\s\S]*)(?:\n\n--------------\nCASSCF TIMINGS)"
NEVPT2_Section_mathes = re.finditer(NEVPT2_Section, inp_content, re.MULTILINE)

for xyz in NEVPT2_Section_mathes:
    my_xyz = xyz.group(1)
    print(my_xyz)

If I'm working with a file that has only one job, it works fine, starting from “NEVPT2 RESULTS” and stopping at the first “end”. With the multi-job file, however, it finds the first “NEVPT2 RESULTS” and goes on until the last “end”, catching everything in between.

So, after wasting the whole Sunday trying to figure this out, I'm asking for your advice, guys. How can I get only the bits from each NEVPT2 section?

You could use

^NEVPT2.+?^end

with both the single-line (dotall) and multiline flags enabled; see a demo on regex101.com.
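In Python, those two modes correspond to the `re.DOTALL` and `re.MULTILINE` flags. The following is a minimal sketch of the lazy pattern, run on a made-up miniature of the layout from the question (the `sample` text and the name `lazy_blocks` are illustrative assumptions, not part of the real output):

```python
import re

# Hypothetical miniature of the output layout shown in the question.
sample = """JOB 1
CASSCF RESULTS
***
casscf text 1
***
end
NEVPT2 RESULTS
***
nevpt2 text 1
***
end

JOB 2
CASSCF RESULTS
***
casscf text 2
***
end
NEVPT2 RESULTS
***
nevpt2 text 2
***
end
"""

# re.DOTALL lets "." cross newlines; re.MULTILINE makes "^" match at every line start.
# The lazy ".+?" stops at the first following line that begins with "end".
lazy_pattern = re.compile(r"^NEVPT2.+?^end", re.DOTALL | re.MULTILINE)

lazy_blocks = [m.group() for m in lazy_pattern.finditer(sample)]
for block in lazy_blocks:
    print(block)
    print("-----")
```

Note that `^end` only requires the line to begin with `end`, so a line such as `ending ...` inside a block would also terminate the match.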

As an alternative, you could match the header line with ^NEVPT2.*\n and then, using the negative lookahead (?!end$), keep matching every following line that is not exactly end. This needs only the multiline flag.

^NEVPT2.*\n(?:(?!end$).*\n)*end$

Regex demo | Python demo

For example

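import re

# Match the NEVPT2 header line, then every line that is not exactly "end", then the closing "end".
NEVPT2_Section = r"^NEVPT2.*\n(?:(?!end$).*\n)*end$"
NEVPT2_Section_mathes = re.finditer(NEVPT2_Section, inp_content, re.MULTILINE)

for xyz in NEVPT2_Section_mathes:
    print(xyz.group())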
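If only the text between the header line and the closing end is wanted, a capturing group can be wrapped around the repeated part. This is a sketch under the assumption that the blocks look like the simplified layout in the question; the helper name `extract_nevpt2_bodies` and the `demo_text` string are made up for illustration:

```python
import re

def extract_nevpt2_bodies(text):
    """Return the lines between each 'NEVPT2...' header line and its closing 'end' line."""
    # Group 1 collects every line that is not exactly "end".
    pattern = re.compile(r"^NEVPT2.*\n((?:(?!end$).*\n)*)end$", re.MULTILINE)
    return [m.group(1) for m in pattern.finditer(text)]

# Hypothetical two-job input mirroring the layout in the question.
demo_text = (
    "JOB 1\nCASSCF RESULTS\n***\ncasscf 1\n***\nend\n"
    "NEVPT2 RESULTS\n***\nnevpt2 1\n***\nend\n"
    "\n"
    "JOB 2\nCASSCF RESULTS\n***\ncasscf 2\n***\nend\n"
    "NEVPT2 RESULTS\n***\nnevpt2 2\n***\nend\n"
)

bodies = extract_nevpt2_bodies(demo_text)
print(bodies)
```

Because the lookahead is checked once per line rather than once per character, this form may also backtrack less than the lazy version on a file of this size, though that has not been measured here.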
