Python regex: get everything until specific strings
I have the following string:
This is the most recent email of this thread
More text
From: a@a.com
Date: 13 August, 2018
More text...
From: a@a.com
Sent: Tuesday 23 July
To: b@b.com, c@c.com
Subject: Test
I need to extract everything until this string combination:
From: *
Sent: *
To: *
Subject: *
The * acts as a wildcard.
So my result should be:
This is the most recent email of this thread
More text
From: a@a.com
Date: 13 August, 2018
More text...
I want to filter this with a regular expression, but I am not able to figure it out. Any pointers?
This is the regex pattern I tried in regex101, but for some reason it does not work in my Python script:
r"([\\w\\W\\n]+?)\\n((?:from:[^\\n]+)\\n+((?:\\s*sent:[^\\n]+)\\n+(?:\\s*to:[^\\n]+)\\n*(?:\\s*cc:[^\\n]+)*\\n*(?:\\s*bcc:[^\\n]+)*\\n*(?:\\s*subject:[^\\n]+)*))"
Thanks!
You could try using re.findall with a positive lookahead. The approach here is to match everything from the start of the string up to, but not including, the block of text which should stop the match.
import re

inp = """This is the most recent email of this thread

More text

From: a@a.com
Date: 13 August, 2018

More text...

From: a@a.com
Sent: Tuesday 23 July
To: b@b.com, c@c.com
Subject: Test"""
stop_text = """From: a@a.com
Sent: Tuesday 23 July
To: b@b.com, c@c.com
Subject: Test"""
# re.escape keeps characters such as "." in stop_text literal;
# the lookahead stops the lazy match just before the stop block.
matches = re.findall(r'^.*?(?=' + re.escape(stop_text) + ')', inp, flags=re.DOTALL)
print(matches)
This prints:
['This is the most recent email of this thread\n\nMore text\n\nFrom: a@a.com\nDate: 13 August, 2018\n\nMore text...\n\n']
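The lookahead above is built from a fixed stop block, while the question asks for wildcards after each header name. A minimal sketch of that variant follows, assuming the four headers always appear on consecutive lines in the order From, Sent, To, Subject. The names HEADER_BLOCK and text_before_headers are hypothetical and only serve the illustration.

```python
import re

# Hypothetical pattern: a From/Sent/To/Subject block on four consecutive
# lines, with any content after each header name (the "*" wildcard).
HEADER_BLOCK = r"^From:[^\n]*\nSent:[^\n]*\nTo:[^\n]*\nSubject:[^\n]*"

# Lazily capture from the start of the string up to, but not including,
# the first position where the header block begins.
BEFORE_HEADERS = re.compile(
    r"\A(.*?)(?=" + HEADER_BLOCK + r")",
    re.DOTALL | re.MULTILINE,
)

def text_before_headers(text):
    """Return the text before the first header block, or text if none is found."""
    match = BEFORE_HEADERS.search(text)
    return match.group(1) if match else text

inp = """This is the most recent email of this thread

More text

From: a@a.com
Date: 13 August, 2018

More text...

From: a@a.com
Sent: Tuesday 23 July
To: b@b.com, c@c.com
Subject: Test"""

print(repr(text_before_headers(inp)))
```

The first "From:" line is skipped because it is followed by "Date:" rather than "Sent:", so the lookahead only succeeds at the quoted header block.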
Considering the example you provided has the regex options gim, maybe you just need to enable the flag re.IGNORECASE?
import re

text = """
This is the most recent email of this thread

More text

From: a@a.com
Date: 13 August, 2018

More text...

From: a@a.com
Sent: Tuesday 23 July
To: b@b.com, c@c.com
Subject: Test
"""
# Raw string so the regex escapes (\w, \s, \n) reach the re module unchanged.
pattern = r"([\w\W\n]+?)\n((?:from:[^\n]+)\n+((?:\s*sent:[^\n]+)\n+(?:\s*to:[^\n]+)\n*(?:\s*cc:[^\n]+)*\n*(?:\s*bcc:[^\n]+)*\n*(?:\s*subject:[^\n]+)*))"
print(re.findall(pattern, text, re.MULTILINE|re.IGNORECASE))
This prints:
[('\nThis is the most recent email of this thread\n\nMore text\n\nFrom: a@a.com\nDate: 13 August, 2018\n\nMore text...\n', 'From: a@a.com\nSent: Tuesday 23 July\nTo: b@b.com, c@c.com\nSubject: Test', 'Sent: Tuesday 23 July\nTo: b@b.com, c@c.com\nSubject: Test')]
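Besides the missing flag, the pattern in the question is written with doubled backslashes inside a raw string, whereas the pattern in this answer uses single ones. If the script really contained the doubled form (an assumption; the doubling may only be an artifact of how the question was pasted), that alone would stop it from matching. The short sketch below, with a made-up sample string, shows the difference.

```python
import re

sample = "More text...\nFrom: a@a.com\nSent: Tuesday 23 July"

# In a raw string, \n reaches the regex engine as a newline escape.
single = r"\nfrom:[^\n]+"
# In a raw string, \\n reaches the regex engine as a literal backslash
# followed by the letter "n", which never occurs in the sample.
doubled = r"\\nfrom:[^\\n]+"

found_single = re.search(single, sample, re.IGNORECASE)
found_doubled = re.search(doubled, sample, re.IGNORECASE)

print(found_single is not None)
print(found_doubled is not None)
```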
You can make it simpler with grouping:
import re

# "text" rather than "str", so the built-in str type is not shadowed
text = """This is the most recent email of this thread

More text

From: a@a.com
Date: 13 August, 2018

More text...

From: a@a.com
Sent: Tuesday 23 July
To: b@b.com, c@c.com
Subject: Test"""
# Group 1 captures everything before the From/Sent/To/Subject block.
x = re.match(r"""(.+?.+)
From:.+?
Sent:.+?
To: .+?,.+?
Subject:.+?.+""", text, flags=re.DOTALL|re.MULTILINE)
print(x.groups())
groups() gives the following result:
('This is the most recent email of this thread\n\nMore text\n\nFrom: a@a.com\nDate: 13 August, 2018\n\nMore text...\n',)
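Note that re.match returns None when the text contains no header block, in which case x.groups() raises an AttributeError. A minimal guarded sketch follows. The helper name text_before_last_headers and the simplified pattern are assumptions for illustration, not the pattern from the answer above.

```python
import re

# Hypothetical pattern: greedy group 1, so the match ends at the LAST
# From/Sent/To/Subject block found in the text.
HEADERS = re.compile(
    r"(.+)\nFrom:[^\n]*\nSent:[^\n]*\nTo:[^\n]*\nSubject:[^\n]*",
    flags=re.DOTALL,
)

def text_before_last_headers(text):
    """Return the text before the last header block, or text if none is found."""
    m = HEADERS.match(text)
    # Guard against None instead of calling m.group() unconditionally.
    return m.group(1) if m else text

print(repr(text_before_last_headers(
    "Latest reply\n\nFrom: a@a.com\nSent: Tuesday 23 July\n"
    "To: b@b.com, c@c.com\nSubject: Test"
)))
print(repr(text_before_last_headers("no headers here")))
```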