
Python regular expression for pattern re.search

I want to extract the lines between a keyword and a sentence from text data. Here is my data:

CUSTOMER SUPPLIED DATA: 
- topic: Sign in & Password Support
- First Name: Brenda
- Last Name: Delacruz
- Account number: xxxxxxxxx
- U-verse 4-digit PIN: My PIN is
- 4 digit PIN: xxxx
- Email: deedelacruz28806@yahoo.com
- I need help with: Forgot password or ID

  ***  System::[chat.automatonClientOutcome] Hello! How may I help you today?   ***  System::[chat.queueWaitDisplayed] We are currently experiencing very high chat volumes which may cause long delays. An agent will be with you as soon as possible.

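Please help me extract the lines under the keyword "CUSTOMER SUPPLIED DATA:" and before the "*** System" line begins (that is, the lines between CUSTOMER SUPPLIED DATA: and the *** System line).

I tried the code below: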

import re

m = re.search(r'CUSTOMER SUPPLIED DATA:\s*([^\n]+)', dt["chat_consolidation"][546])

m.group(1)

This gives me only the first line between CUSTOMER SUPPLIED DATA: and the *** System line.

The output looks like this:

[out]: - topic: Sign in & Password Support

But the output I need should look like this:

[Out]: - topic: Sign in & Password Support
- First Name: Brenda
- Last Name: Delacruz
- Account number: xxxxxxxxx
- U-verse 4-digit PIN: My PIN is
- 4 digit PIN: xxxx
- Email: deedelacruz28806@yahoo.com
- I need help with: Forgot password or ID

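Thanks in advance for your help.

For this you need the third-party regex module: the pattern below uses \K and \G, which the standard re module does not support.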

x="""CUSTOMER SUPPLIED DATA: 
- topic: Sign in & Password Support
- First Name: Brenda
- Last Name: Delacruz
- Account number: xxxxxxxxx
- U-verse 4-digit PIN: My PIN is
- 4 digit PIN: xxxx
- Email: deedelacruz28806@yahoo.com
- I need help with: Forgot password or ID

  ***  System::[chat.automatonClientOutcome] Hello! How may I help you today?   ***  System::[chat.queueWaitDisplayed] We are currently experiencing very high chat volumes which may cause long delays. An agent will be with you as soon as possible.
- topic: Sign in & Password Support
- First Name: Brenda  
  """
import regex
print(regex.findall(r"CUSTOMER SUPPLIED DATA: \n\K|\G(?!^)(-[^\n]+)\n", x, flags=regex.VERSION1))

Output: ['', '- topic: Sign in & Password Support', '- First Name: Brenda', '- Last Name: Delacruz', '- Account number: xxxxxxxxx', '- U-verse 4-digit PIN: My PIN is', '- 4 digit PIN: xxxx', '- Email: deedelacruz28806@yahoo.com', '- I need help with: Forgot password or ID']

See the demo.

https://regex101.com/r/naH3C7/2
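If installing regex is not an option, a similar list of lines can probably be produced with the standard re module in two steps. The following is only a sketch, not part of the original answer: the helper name customer_lines is hypothetical, and it assumes the block always ends at the first "***" marker.

```python
import re

text = """CUSTOMER SUPPLIED DATA:
- topic: Sign in & Password Support
- First Name: Brenda
- I need help with: Forgot password or ID

  ***  System::[chat.automatonClientOutcome] Hello! How may I help you today?
- topic: Sign in & Password Support
"""

def customer_lines(text):
    """Hypothetical helper: return the '-' lines between the header and the first '***'."""
    # Step 1: isolate the block between the header and the first "***" marker.
    block = re.search(r"CUSTOMER SUPPLIED DATA:\s*(.*?)\s*\*\*\*", text, re.DOTALL)
    if block is None:
        return []
    # Step 2: collect every line inside that block that starts with "-".
    return re.findall(r"^-[^\n]*", block.group(1), re.MULTILINE)

print(customer_lines(text))
```

Unlike the \G-based pattern, this version does not return a leading empty string, and the "- topic" line that follows the System line is ignored because it lies outside the isolated block.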

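@vks is right: the regex module is the better choice if you want the lines split out like this. However, if you really only want what you asked for (a single string containing everything between CUSTOMER SUPPLIED DATA: and "*** System:"), changing the regular expression to something like this also works: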

re.search(r"CUSTOMER SUPPLIED DATA:\s*(.+?)\*\*\*  System:", x, re.DOTALL)

With "([^\n]+)" you are asking it to capture everything up to the first \n, which is probably not what you want.
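The difference between the two patterns can be seen side by side in a small sketch. The sample text is shortened for illustration, and the variable names first_line and whole_block are made up for this example.

```python
import re

text = (
    "CUSTOMER SUPPLIED DATA: \n"
    "- topic: Sign in & Password Support\n"
    "- First Name: Brenda\n"
    "\n"
    "  ***  System::[chat.automatonClientOutcome] Hello!\n"
)

# [^\n]+ cannot cross a newline, so the capture ends with the first line.
first_line = re.search(r"CUSTOMER SUPPLIED DATA:\s*([^\n]+)", text).group(1)

# With re.DOTALL, "." also matches newlines; the lazy ".+?" stops at the
# first "***  System:" instead of running to the last one.
whole_block = re.search(
    r"CUSTOMER SUPPLIED DATA:\s*(.+?)\*\*\*  System:", text, re.DOTALL
).group(1).strip()

print(first_line)
print(whole_block)
```

The .strip() call removes the blank line and indentation that sit just before the "***" marker; drop it if that whitespace matters.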

