Python regex to get everything until a specific string

Question

I have the following string:

This is the most recent email of this thread

More text

From: a@a.com
Date: 13 August, 2018

More text...

From: a@a.com
Sent: Tuesday 23 July
To: b@b.com, c@c.com
Subject: Test

I need to extract everything until this string combination:

From: *
Sent: *
To: *
Subject: *

The * acts as a wildcard.

So my result should be:

This is the most recent email of this thread

More text

From: a@a.com
Date: 13 August, 2018

More text...

I want to filter this with a regular expression, but I am not able to figure it out. Any pointers?

This is the regex pattern I tried in regex101, but for some reason it does not work in my Python script: r"([\\w\\W\\n]+?)\\n((?:from:[^\\n]+)\\n+((?:\\s*sent:[^\\n]+)\\n+(?:\\s*to:[^\\n]+)\\n*(?:\\s*cc:[^\\n]+)*\\n*(?:\\s*bcc:[^\\n]+)*\\n*(?:\\s*subject:[^\\n]+)*))"

Thanks!

Answer 1

You could try using re.findall with a positive lookahead. The approach here is to match everything from the start of the string up to, but not including, the block of text which should stop the match.

import re

inp = """This is the most recent email of this thread

More text

From: a@a.com
Date: 13 August, 2018

More text...

From: a@a.com
Sent: Tuesday 23 July
To: b@b.com, c@c.com
Subject: Test"""

stop_text = """From: a@a.com
Sent: Tuesday 23 July
To: b@b.com, c@c.com
Subject: Test"""
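# re.escape keeps characters such as "." in stop_text from acting as regex metacharacters
matches = re.findall(r'^.*?(?=' + re.escape(stop_text) + ')', inp, flags=re.DOTALL)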
print(matches)

This prints:

['This is the most recent email of this thread\n\nMore text\n\nFrom: a@a.com\nDate: 13 August, 2018\n\nMore text...\n\n']
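
The literal stop_text above only matches this one exact header. Since the question treats * as a wildcard, the same lookahead idea can be sketched with a pattern in place of the literal block. This is a minimal sketch, not part of the original answer: the name header_pattern is made up for illustration, and it assumes the four header lines appear consecutively, one per line, in exactly this order.

```python
import re

inp = """This is the most recent email of this thread

More text

From: a@a.com
Date: 13 August, 2018

More text...

From: a@a.com
Sent: Tuesday 23 July
To: b@b.com, c@c.com
Subject: Test"""

# Hypothetical wildcard header: [^\n]* stands in for each "*" in the question.
# Assumes From/Sent/To/Subject sit on consecutive lines in this order.
header_pattern = r"From:[^\n]*\nSent:[^\n]*\nTo:[^\n]*\nSubject:[^\n]*"

# re.DOTALL lets "." cross newlines; the lazy ".*?" stops at the first
# position where the lookahead sees the full four-line header.
match = re.match(r".*?(?=" + header_pattern + ")", inp, flags=re.DOTALL)

# Fall back to the whole input if no such header is present.
result = match.group(0) if match else inp
print(result)
```

The first From: line is skipped because it is followed by Date: rather than Sent:, so the lookahead only succeeds at the second block.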

Answer 2

Considering that the example you provided has the regex options gim, maybe you just need to enable the re.IGNORECASE flag?

import re

text = """
This is the most recent email of this thread

More text

From: a@a.com
Date: 13 August, 2018

More text...

From: a@a.com
Sent: Tuesday 23 July
To: b@b.com, c@c.com
Subject: Test
"""
# Raw string so that \w and \s reach the regex engine unchanged
pattern = r"([\w\W\n]+?)\n((?:from:[^\n]+)\n+((?:\s*sent:[^\n]+)\n+(?:\s*to:[^\n]+)\n*(?:\s*cc:[^\n]+)*\n*(?:\s*bcc:[^\n]+)*\n*(?:\s*subject:[^\n]+)*))"
print(re.findall(pattern, text, re.MULTILINE|re.IGNORECASE))

This prints:

[('\nThis is the most recent email of this thread\n\nMore text\n\nFrom: a@a.com\nDate: 13 August, 2018\n\nMore text...\n', 'From: a@a.com\nSent: Tuesday 23 July\nTo: b@b.com, c@c.com\nSubject: Test', 'Sent: Tuesday 23 July\nTo: b@b.com, c@c.com\nSubject: Test')]
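
To see why the flag matters: the pattern spells the header names in lowercase (from:, sent:, to:), while the email uses From:, Sent:, To:. The short sketch below compares the two calls and pulls out only the first group, which is the text before the header. It uses a trimmed-down version of the pattern and a shortened sample text, both chosen here purely for illustration.

```python
import re

text = """Newest message

From: a@a.com
Date: 13 August, 2018

More text...

From: a@a.com
Sent: Tuesday 23 July
To: b@b.com, c@c.com
Subject: Test
"""

# Simplified pattern for illustration: lowercase header names, as in the question.
pattern = r"([\w\W]+?)\n(from:[^\n]+\n+\s*sent:[^\n]+\n+\s*to:[^\n]+\n+\s*subject:[^\n]+)"

# Without re.IGNORECASE, "from:" cannot match "From:".
no_flag = re.search(pattern, text)

# With re.IGNORECASE the header is found; group(1) is everything before it.
with_flag = re.search(pattern, text, re.IGNORECASE)

print(no_flag)  # None
print(with_flag.group(1))
```

Using re.search and group(1) instead of re.findall returns just the wanted text rather than a list of tuples.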

Answer 3

You can make it simple with grouping:

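import re

# Named "text" rather than "str" to avoid shadowing the built-in type
text = """This is the most recent email of this thread

More text

From: a@a.com
Date: 13 August, 2018

More text...

From: a@a.com
Sent: Tuesday 23 July
To: b@b.com, c@c.com
Subject: Test"""

# With re.DOTALL, "." also matches newlines, so the greedy group captures
# everything up to the last From/Sent/To/Subject block.
# Note: the To line must contain a comma for this pattern to match.
x = re.match(r"""(.+?.+)
From:.+?
Sent:.+?
To: .+?,.+?
Subject:.+?.+""", text, flags=re.DOTALL | re.MULTILINE)
print(x.groups())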

x.groups() gives the following result:

('This is the most recent email of this thread\n\nMore text\n\nFrom: a@a.com\nDate: 13 August, 2018\n\nMore text...\n',)
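
The same text can also be obtained without capturing groups, by locating the header and slicing the string. The sketch below is one possible variant, not the answer's own method: the helper name text_before_header is invented here, and it assumes each header field fits on a single line. With re.MULTILINE, ^ anchors at the start of every line, and because re.DOTALL is not set, each .* stays on its own line and acts as the per-line wildcard.

```python
import re

# Hypothetical helper; assumes From/Sent/To/Subject are consecutive single lines.
HEADER = re.compile(r"^From:.*\nSent:.*\nTo:.*\nSubject:.*", flags=re.MULTILINE)


def text_before_header(text):
    """Return the text before the first From/Sent/To/Subject block."""
    match = HEADER.search(text)
    # If no such header exists, return the text unchanged.
    return text if match is None else text[:match.start()]


sample = (
    "Hello\n\n"
    "From: a@a.com\nDate: 13 August, 2018\n\n"
    "From: a@a.com\nSent: Tuesday 23 July\nTo: b@b.com\nSubject: Test"
)
print(repr(text_before_header(sample)))
```

In this variant the To: line does not need to contain a comma, and input without a matching header is returned as-is instead of raising an error on a failed match.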

3 answers

Answer 1
1 2020-03-30 17:24:44

Answer 2
0 2020-03-30 18:06:35

Answer 3
0 2020-03-31 09:53:52