Regex findall with \n in text
I have the following test string:
================================================================================\nCorporate Participants\n================================================================================\n * Kirk Walters\n Chittenden Corporation - EVP and Chief Financial Officer and Treasurer and CTC\n\n================================================================================\nConference Call Participants\n================================================================================\n * Beth Messmore\n Merrill Lynch - Analyst\n * Troy Ward\n A.G. Edwards - Analyst\n * Lori Hasiner\n FBR - Analyst\n * Tom Doheny\n Sandler O\'Neill - Analyst\n * Gerard Cassidy\n RBC Capital Markets - Analyst\n * Faye Elliott-Gurney\n Lehman Brothers - Analyst\n\n================================================================================
I want to get * Kirk Walters\n
and
* Beth Messmore\n Merrill Lynch - Analyst\n * Troy Ward\n A.G. Edwards - Analyst\n * Lori Hasiner\n FBR - Analyst\n * Tom Doheny\n Sandler O\'Neill - Analyst\n * Gerard Cassidy\n RBC Capital Markets - Analyst\n * Faye Elliott-Gurney\n Lehman Brothers - Analyst\n
My first attempt was:
participants_corp = re.findall('Corporate Participants\n================================================================================\n (.*)\n\n================================================================================\nConference Call Participants', str)
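I think the problem has something to do with the backslashes in the newline escapes. I tried using four backslashes instead of one, but that did not change anything. Can you give me some advice?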
Without a regular expression, you can use:
import io
buf = io.StringIO(text)
data = []
for line in buf:
    line = line.strip()
    # A line starting with '*' holds the participant's name.
    if line.startswith('*'):
        # The following line holds "company - job".
        line1 = next(buf).strip().split('-')
        data.append({'name': line[1:].strip(),
                     'company': line1[0].strip(),
                     'job': line1[1].strip()})
print(data)
# Output
[{'name': 'Kirk Walters',
  'company': 'Chittenden Corporation',
  'job': 'EVP and Chief Financial Officer and Treasurer and CTC'},
 {'name': 'Beth Messmore', 'company': 'Merrill Lynch', 'job': 'Analyst'},
 {'name': 'Troy Ward', 'company': 'A.G. Edwards', 'job': 'Analyst'},
 {'name': 'Lori Hasiner', 'company': 'FBR', 'job': 'Analyst'},
 {'name': 'Tom Doheny', 'company': "Sandler O'Neill", 'job': 'Analyst'},
 {'name': 'Gerard Cassidy',
  'company': 'RBC Capital Markets',
  'job': 'Analyst'},
 {'name': 'Faye Elliott-Gurney',
  'company': 'Lehman Brothers',
  'job': 'Analyst'}]
Setup:
text = """\
================================================================================
Corporate Participants
================================================================================
* Kirk Walters
Chittenden Corporation - EVP and Chief Financial Officer and Treasurer and CTC
================================================================================
Conference Call Participants
================================================================================
* Beth Messmore
Merrill Lynch - Analyst
* Troy Ward
A.G. Edwards - Analyst
* Lori Hasiner
FBR - Analyst
* Tom Doheny
Sandler O'Neill - Analyst
* Gerard Cassidy
RBC Capital Markets - Analyst
* Faye Elliott-Gurney
Lehman Brothers - Analyst
================================================================================"""
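If the section each participant belongs to also matters, the same line-by-line idea can be extended to remember the most recent title. The following is only a sketch under assumptions: the function name parse_sections is hypothetical, the sample uses shortened ruler lines, any line that is neither blank, a ruler, nor a "*" line is treated as a section title, and "company - job" is split on the first " - " only.

```python
import io

def parse_sections(text):
    """Group participants under the most recent section title (hypothetical helper)."""
    sections = {}
    current = None
    buf = io.StringIO(text)
    for line in buf:
        line = line.strip()
        if not line or set(line) == {"="}:
            continue  # skip blank lines and ruler lines made only of '='
        if line.startswith("*"):
            # The line after a "* name" line is assumed to hold "company - job".
            company, _, job = next(buf, "").strip().partition(" - ")
            sections.setdefault(current, []).append(
                {"name": line[1:].strip(), "company": company, "job": job})
        else:
            current = line  # any other line is treated as a section title
            sections.setdefault(current, [])
    return sections

# Shortened, made-up sample in the same layout as the transcript header.
sample = """\
=====
Corporate Participants
=====
 * Kirk Walters
 Chittenden Corporation - EVP

=====
Conference Call Participants
=====
 * Beth Messmore
 Merrill Lynch - Analyst
"""

sections = parse_sections(sample)
print(sections)
```

Splitting with partition(" - ") rather than split('-') keeps a hyphen inside a company name or job title from being cut apart, at the cost of requiring spaces around the separator.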
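The problem with your regex is that the group (.*) matches any character except line terminators, while the string you want contains line terminators. Try this regex instead: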
import re
your_string = "================================================================================\nCorporate Participants\n================================================================================\n * Kirk Walters\n Chittenden Corporation - EVP and Chief Financial Officer and Treasurer and CTC\n\n================================================================================\nConference Call Participants\n================================================================================\n * Beth Messmore\n Merrill Lynch - Analyst\n * Troy Ward\n A.G. Edwards - Analyst\n * Lori Hasiner\n FBR - Analyst\n * Tom Doheny\n Sandler O\'Neill - Analyst\n * Gerard Cassidy\n RBC Capital Markets - Analyst\n * Faye Elliott-Gurney\n Lehman Brothers - Analyst\n\n================================================================================"
participants_corp = re.findall(r"Corporate Participants\n================================================================================\n ([\S\s]*)\n\n================================================================================\nConference Call Participants", your_string)
print(participants_corp)
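The difference is easy to check on a short string. The following is a minimal sketch using a made-up, shortened sample (the "Header" and "Footer" markers are placeholders, not part of the real transcript). It compares the default dot, re.DOTALL, and the [\S\s] class, with a non-greedy quantifier so that the match stops at the first blank line.

```python
import re

# Hypothetical shortened sample, not the full transcript header.
sample = "Header\n * Kirk Walters\n Chittenden Corporation - EVP\n\nFooter"

# By default "." stops at a line terminator, so a two-line body cannot match.
default_dot = re.findall(r"Header\n (.*)\n\nFooter", sample)

# re.DOTALL lets "." match "\n" as well.
dotall = re.findall(r"Header\n (.*?)\n\nFooter", sample, flags=re.DOTALL)

# [\S\s] matches any character without needing a flag.
char_class = re.findall(r"Header\n ([\S\s]*?)\n\nFooter", sample)

print(default_dot)
print(dotall)
print(char_class)
```

Doubling the backslashes does not help because the escapes were never the problem: in the regex, \n already matches a newline, and it is the dot that refuses to cross one.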
First, split your text into two blocks: Corporate Participants and Conference Call Participants.
import re
given_text = "================================================================================\nCorporate Participants\n================================================================================\n * Kirk Walters\n Chittenden Corporation - EVP and Chief Financial Officer and Treasurer and CTC\n\n================================================================================\nConference Call Participants\n================================================================================\n * Beth Messmore\n Merrill Lynch - Analyst\n * Troy Ward\n A.G. Edwards - Analyst\n * Lori Hasiner\n FBR - Analyst\n * Tom Doheny\n Sandler O\'Neill - Analyst\n * Gerard Cassidy\n RBC Capital Markets - Analyst\n * Faye Elliott-Gurney\n Lehman Brothers - Analyst\n\n================================================================================"
p_block = re.compile(r"================================================================================\n([^\n]*)\n================================================================================\n([^=]*)", re.MULTILINE | re.DOTALL)
blocks = p_block.findall(given_text)
print(blocks)
Output:
[
(
'Corporate Participants',
' * Kirk Walters\n Chittenden Corporation - EVP and Chief Financial Officer and Treasurer and CTC\n\n'),
(
'Conference Call Participants',
" * Beth Messmore\n Merrill Lynch - Analyst\n * Troy Ward\n A.G. Edwards - Analyst\n * Lori Hasiner\n FBR - Analyst\n * Tom Doheny\n Sandler O'Neill - Analyst\n * Gerard Cassidy\n RBC Capital Markets - Analyst\n * Faye Elliott-Gurney\n Lehman Brothers - Analyst\n\n")
]
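The blocks are captured across several lines because the negated character class [^=]* matches line terminators as well. The two flags do not change the result for this particular pattern: re.MULTILINE only affects the anchors ^ and $, and re.DOTALL only makes . match line terminators, and the pattern contains neither anchors nor a dot.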
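For reference, here is a small sketch of what the two flags change in Python's re module, using a made-up two-line string. It also shows that a negated character class matches newlines regardless of flags.

```python
import re

sample = "first line\nsecond line"

# Without re.MULTILINE, "^" anchors only at the very start of the string.
starts_default = re.findall(r"^\w+", sample)
# With re.MULTILINE, "^" also anchors right after every newline.
starts_multiline = re.findall(r"^\w+", sample, flags=re.MULTILINE)

# Without re.DOTALL, "." stops at the newline.
dot_default = re.findall(r"first.*", sample)
# With re.DOTALL, "." runs across the newline.
dot_dotall = re.findall(r"first.*", sample, flags=re.DOTALL)

# A negated class such as [^=] matches "\n" with or without flags.
class_no_flags = re.findall(r"[^=]+", "a\nb=c")

print(starts_default, starts_multiline)
print(dot_default, dot_dotall)
print(class_no_flags)
```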
After that, parse each block using the asterisk and the newline as delimiters.
p_participant = re.compile(r"\* (.*)\n", re.MULTILINE)
for title, content in blocks:
    participants = p_participant.findall(content)
    print(title, participants)
Output:
Corporate Participants ['Kirk Walters']
Conference Call Participants ['Beth Messmore', 'Troy Ward', 'Lori Hasiner', 'Tom Doheny', 'Gerard Cassidy', 'Faye Elliott-Gurney']
This pattern does not use re.DOTALL, because here the dot should not match newlines: each name must end at the end of its line.
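The two steps can be folded into one helper that returns a mapping from section title to names. This is a sketch under assumptions: the function name participants_by_section is hypothetical, the ruler is matched with =+ instead of a literal run of equals signs, and the sample is a shortened, made-up version of the transcript header.

```python
import re

# Same two-step idea: first find (title, body) blocks, then names inside each body.
# Assumption: "=+" stands in for the literal ruler line.
BLOCK = re.compile(r"=+\n([^\n]*)\n=+\n([^=]*)")
PARTICIPANT = re.compile(r"\* (.*)\n")

def participants_by_section(text):
    """Map each section title to the list of names found under it."""
    return {title: PARTICIPANT.findall(body)
            for title, body in BLOCK.findall(text)}

rule = "=" * 80
sample = (
    f"{rule}\nCorporate Participants\n{rule}\n"
    " * Kirk Walters\n Chittenden Corporation - EVP\n\n"
    f"{rule}\nConference Call Participants\n{rule}\n"
    " * Beth Messmore\n Merrill Lynch - Analyst\n"
    " * Troy Ward\n A.G. Edwards - Analyst\n\n"
    f"{rule}"
)

result = participants_by_section(sample)
print(result)
```

One limitation to keep in mind: because the body is captured with [^=]*, a block is cut short if any line inside it contains an equals sign.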