
Regex findall with \n in text

I have the following test string:

================================================================================\nCorporate Participants\n================================================================================\n   *  Kirk Walters\n      Chittenden Corporation - EVP and Chief Financial Officer and Treasurer and CTC\n\n================================================================================\nConference Call Participants\n================================================================================\n   *  Beth Messmore\n      Merrill Lynch - Analyst\n   *  Troy Ward\n      A.G. Edwards - Analyst\n   *  Lori Hasiner\n      FBR - Analyst\n   *  Tom Doheny\n      Sandler O\'Neill - Analyst\n   *  Gerard Cassidy\n      RBC Capital Markets - Analyst\n   *  Faye Elliott-Gurney\n      Lehman Brothers - Analyst\n\n================================================================================

I want to get * Kirk Walters\n and the following:

*  Beth Messmore\n      Merrill Lynch - Analyst\n   *  Troy Ward\n      A.G. Edwards - Analyst\n   *  Lori Hasiner\n      FBR - Analyst\n   *  Tom Doheny\n      Sandler O\'Neill - Analyst\n   *  Gerard Cassidy\n      RBC Capital Markets - Analyst\n   *  Faye Elliott-Gurney\n      Lehman Brothers - Analyst\n

My first attempt was:

participants_corp = re.findall('Corporate Participants\n================================================================================\n   (.*)\n\n================================================================================\nConference Call Participants', str)

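I think it has something to do with the backslash of the newline sequence. I tried using four backslashes instead of one, but that did not change anything. Can you give me some advice?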

Without a regex, you can use:

import io

buf = io.StringIO(text)
data = []
for line in buf:
    line = line.strip()
    if line.startswith('*'):
        line1 = next(buf).strip().split('-')
        data.append({'name': line[1:].strip(),
                     'company': line1[0].strip(),
                     'job': line1[1].strip()})
print(data)

# Output
[{'name': 'Kirk Walters',
  'company': 'Chittenden Corporation',
  'job': 'EVP and Chief Financial Officer and Treasurer and CTC'},
 {'name': 'Beth Messmore', 'company': 'Merrill Lynch', 'job': 'Analyst'},
 {'name': 'Troy Ward', 'company': 'A.G. Edwards', 'job': 'Analyst'},
 {'name': 'Lori Hasiner', 'company': 'FBR', 'job': 'Analyst'},
 {'name': 'Tom Doheny', 'company': "Sandler O'Neill", 'job': 'Analyst'},
 {'name': 'Gerard Cassidy',
  'company': 'RBC Capital Markets',
  'job': 'Analyst'},
 {'name': 'Faye Elliott-Gurney',
  'company': 'Lehman Brothers',
  'job': 'Analyst'}]

Setup:

text = """\
================================================================================
Corporate Participants
================================================================================
   *  Kirk Walters
      Chittenden Corporation - EVP and Chief Financial Officer and Treasurer and CTC

================================================================================
Conference Call Participants
================================================================================
   *  Beth Messmore
      Merrill Lynch - Analyst
   *  Troy Ward
      A.G. Edwards - Analyst
   *  Lori Hasiner
      FBR - Analyst
   *  Tom Doheny
      Sandler O'Neill - Analyst
   *  Gerard Cassidy
      RBC Capital Markets - Analyst
   *  Faye Elliott-Gurney
      Lehman Brothers - Analyst

================================================================================"""
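The line-by-line approach above splits the detail line on every '-', which would misread a company or job title that itself contains a hyphen. The following is a minimal sketch of a more defensive variant, assuming the separator is always ' - ' with spaces on both sides. The function name parse_participants and the sample entry are made up for illustration.

```python
import io

def parse_participants(text):
    """Collect name/company/job records from '*'-prefixed entries."""
    records = []
    buf = io.StringIO(text)
    for line in buf:
        line = line.strip()
        if not line.startswith('*'):
            continue
        # The line after the name holds "company - job"; use '' at end of input.
        detail = next(buf, '').strip()
        # partition() splits on the first ' - ' only, so hyphens without
        # surrounding spaces (e.g. "Vice-President") are left intact.
        company, _, job = detail.partition(' - ')
        records.append({'name': line[1:].strip(),
                        'company': company.strip(),
                        'job': job.strip()})
    return records

# Made-up entry with hyphens in both the company and the job title.
sample = "   *  Jane Doe\n      Smith-Jones & Co - Vice-President\n"
print(parse_participants(sample))
```

Using next(buf, '') instead of next(buf) also avoids a StopIteration if the input ends right after a name line.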

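The problem with your regex is that the group (.*) matches any character except line terminators, while the string you want contains line terminators. For your purpose, try this regex instead: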

import re

your_string = "================================================================================\nCorporate Participants\n================================================================================\n   *  Kirk Walters\n      Chittenden Corporation - EVP and Chief Financial Officer and Treasurer and CTC\n\n================================================================================\nConference Call Participants\n================================================================================\n   *  Beth Messmore\n      Merrill Lynch - Analyst\n   *  Troy Ward\n      A.G. Edwards - Analyst\n   *  Lori Hasiner\n      FBR - Analyst\n   *  Tom Doheny\n      Sandler O\'Neill - Analyst\n   *  Gerard Cassidy\n      RBC Capital Markets - Analyst\n   *  Faye Elliott-Gurney\n      Lehman Brothers - Analyst\n\n================================================================================"
participants_corp = re.findall(r"Corporate Participants\n================================================================================\n   ([\S\s]*)\n\n================================================================================\nConference Call Participants", your_string)
print(participants_corp)
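To make the difference concrete, here is a small sketch on a shortened stand-in for the transcript (the Header/Next text is made up, not the real data). It compares the default dot, the [\s\S] class, and the re.DOTALL flag, and it also checks the backslash question: one backslash is fine with or without a raw string, whereas four backslashes search for a literal backslash character.

```python
import re

# Shortened, made-up stand-in for the transcript text.
text = "Header\n----\n   * A\n      B - C\n\n----\nNext"

# '.' stops at a line terminator by default, so a two-line group cannot match.
default_dot = re.findall(r"Header\n----\n   (.*)\n\n----\nNext", text)

# Option 1: a character class that matches everything, newlines included.
any_char = re.findall(r"Header\n----\n   ([\s\S]*?)\n\n----\nNext", text)

# Option 2: keep '.' and pass re.DOTALL so that it also matches '\n'.
dotall = re.findall(r"Header\n----\n   (.*?)\n\n----\nNext", text, re.DOTALL)

print(default_dot)  # empty list
print(any_char)
print(dotall)

# Backslashes: both spellings of the pattern match a real newline ...
assert re.search("a\nb", "a\nb") and re.search(r"a\nb", "a\nb")
# ... while four backslashes look for a literal backslash followed by 'n'.
assert re.search("a\\\\nb", "a\nb") is None
```

The lazy quantifier *? is a design choice for this sketch: it stops at the first place where the rest of the pattern fits, which matters once the same header can appear more than once in the input.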

First, split your text into two blocks: Corporate Participants and Conference Call Participants.

import re

given_text = "================================================================================\nCorporate Participants\n================================================================================\n   *  Kirk Walters\n      Chittenden Corporation - EVP and Chief Financial Officer and Treasurer and CTC\n\n================================================================================\nConference Call Participants\n================================================================================\n   *  Beth Messmore\n      Merrill Lynch - Analyst\n   *  Troy Ward\n      A.G. Edwards - Analyst\n   *  Lori Hasiner\n      FBR - Analyst\n   *  Tom Doheny\n      Sandler O\'Neill - Analyst\n   *  Gerard Cassidy\n      RBC Capital Markets - Analyst\n   *  Faye Elliott-Gurney\n      Lehman Brothers - Analyst\n\n================================================================================"

p_block = re.compile(r"================================================================================\n([^\n]*)\n================================================================================\n([^=]*)", re.MULTILINE | re.DOTALL)
blocks = p_block.findall(given_text)
print(blocks)

Output:

[
    (
        'Corporate Participants', 
        '   *  Kirk Walters\n      Chittenden Corporation - EVP and Chief Financial Officer and Treasurer and CTC\n\n'), 
    (
        'Conference Call Participants', 
        "   *  Beth Messmore\n      Merrill Lynch - Analyst\n   *  Troy Ward\n      A.G. Edwards - Analyst\n   *  Lori Hasiner\n      FBR - Analyst\n   *  Tom Doheny\n      Sandler O'Neill - Analyst\n   *  Gerard Cassidy\n      RBC Capital Markets - Analyst\n   *  Faye Elliott-Gurney\n      Lehman Brothers - Analyst\n\n")
]

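Neither flag is strictly required for this pattern: re.MULTILINE only changes where ^ and $ match, and re.DOTALL only makes . match line terminators, and the pattern uses none of them. Each block is captured across several lines because the negated class [^=]* matches any character except =, newlines included.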
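To check what the two flags contribute, the sketch below runs a shortened version of the block pattern with and without them, and then shows one pattern where each flag does change the result. The sample string is made up and uses a two-character separator for brevity.

```python
import re

# Made-up miniature of the transcript layout.
sample = "==\nTitle\n==\n  * one\n  * two\n\n=="
block = r"==\n([^\n]*)\n==\n([^=]*)"

# The negated classes already cross line boundaries, so the flags change nothing.
no_flags = re.findall(block, sample)
with_flags = re.findall(block, sample, re.MULTILINE | re.DOTALL)
print(no_flags == with_flags)

# re.MULTILINE lets '^' match at the start of every line, not only of the string.
caret_default = re.findall(r"^\s*\* (\w+)", sample)
caret_multiline = re.findall(r"^\s*\* (\w+)", sample, re.MULTILINE)
print(caret_default, caret_multiline)

# re.DOTALL lets '.' match '\n' as well.
dot_default = re.findall(r"Title.*one", sample)
dot_dotall = re.findall(r"Title.*one", sample, re.DOTALL)
print(dot_default, dot_dotall)
```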

After that, parse each block using the asterisk and the newline.

p_participant = re.compile(r"\*  (.*)\n", re.MULTILINE)
for title, content in blocks:
    participants = p_participant.findall(content)
    print(title, participants)

Output:

Corporate Participants ['Kirk Walters']
Conference Call Participants ['Beth Messmore', 'Troy Ward', 'Lori Hasiner', 'Tom Doheny', 'Gerard Cassidy', 'Faye Elliott-Gurney']

It does not use re.DOTALL because, in this regex, the dot must stop at the end of the line so that each match captures a single name.
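The two steps can be combined into one helper. This is only a sketch: the name participants_by_section is hypothetical, the separator is matched with =+ rather than a fixed run of characters, and the demo string is a shortened version of the transcript. Like the pattern above, it assumes that no '=' character occurs inside a block body.

```python
import re

# Title between two separator lines, then everything up to the next '='.
P_BLOCK = re.compile(r"=+\n([^\n]*)\n=+\n([^=]*)")
# A participant line: '*', two spaces, then the name up to the end of the line.
P_NAME = re.compile(r"\*  (.*)\n")

def participants_by_section(text):
    """Map each section title to the list of names found in that section."""
    return {title: P_NAME.findall(body) for title, body in P_BLOCK.findall(text)}

RULE = "=" * 80  # separator line, used only to build the demo input
demo = (f"{RULE}\nCorporate Participants\n{RULE}\n"
        "   *  Kirk Walters\n      Chittenden Corporation - EVP\n\n"
        f"{RULE}\nConference Call Participants\n{RULE}\n"
        "   *  Beth Messmore\n      Merrill Lynch - Analyst\n\n"
        f"{RULE}")
print(participants_by_section(demo))
```

A dict keyed by title keeps the result easy to query, at the cost of merging sections if two of them ever share the same title.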

