Python regular expression for pattern re.search

I want to extract the lines between a keyword and a sentence from text data. Here is my data:

CUSTOMER SUPPLIED DATA: 
- topic: Sign in & Password Support
- First Name: Brenda
- Last Name: Delacruz
- Account number: xxxxxxxxx
- U-verse 4-digit PIN: My PIN is
- 4 digit PIN: xxxx
- Email: deedelacruz28806@yahoo.com
- I need help with: Forgot password or ID

  ***  System::[chat.automatonClientOutcome] Hello! How may I help you today?   ***  System::[chat.queueWaitDisplayed] We are currently experiencing very high chat volumes which may cause long delays. An agent will be with you as soon as possible.

Please help me extract the lines under the keyword "CUSTOMER SUPPLIED DATA:", up to where the "*** System" line starts (that is, the lines between CUSTOMER SUPPLIED DATA: and the *** System line).

I have tried the following code:

import re

# dt["chat_consolidation"][546] holds the chat text shown above
m = re.search(r'CUSTOMER SUPPLIED DATA:\s*([^\n]+)', dt["chat_consolidation"][546])
m.group(1)

which gives me only a single line between CUSTOMER SUPPLIED DATA: and the *** System line.

The output is like this:

[out]: - topic: Sign in & Password Support

But my required output should be like this:

[Out]: - topic: Sign in & Password Support
- First Name: Brenda
- Last Name: Delacruz
- Account number: xxxxxxxxx
- U-verse 4-digit PIN: My PIN is
- 4 digit PIN: xxxx
- Email: deedelacruz28806@yahoo.com
- I need help with: Forgot password or ID

Thanks in advance for helping me.

You would need the third-party regex module for this.

x="""CUSTOMER SUPPLIED DATA: 
- topic: Sign in & Password Support
- First Name: Brenda
- Last Name: Delacruz
- Account number: xxxxxxxxx
- U-verse 4-digit PIN: My PIN is
- 4 digit PIN: xxxx
- Email: deedelacruz28806@yahoo.com
- I need help with: Forgot password or ID

  ***  System::[chat.automatonClientOutcome] Hello! How may I help you today?   ***  System::[chat.queueWaitDisplayed] We are currently experiencing very high chat volumes which may cause long delays. An agent will be with you as soon as possible.
- topic: Sign in & Password Support
- First Name: Brenda  
  """
import regex  # third-party module (pip install regex), not the standard-library re
print(regex.findall(r"CUSTOMER SUPPLIED DATA: \n\K|\G(?!^)(-[^\n]+)\n", x, flags=regex.VERSION1))

Output: ['', '- topic: Sign in & Password Support', '- First Name: Brenda', '- Last Name: Delacruz', '- Account number: xxxxxxxxx', '- U-verse 4-digit PIN: My PIN is', '- 4 digit PIN: xxxx', '- Email: deedelacruz28806@yahoo.com', '- I need help with: Forgot password or ID']

See demo.

https://regex101.com/r/naH3C7/2

@vks is correct that the regex module would be better if you want to split it up like that. However, if you really just want what you asked for (a string with everything between CUSTOMER SUPPLIED DATA: and "*** System:"), changing the regexp to something like this works as well:

re.search(r"CUSTOMER SUPPLIED DATA:\s*(.+?)\*\*\*  System:", x, re.DOTALL)
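As a minimal sketch of how that call could be used end to end, the snippet below runs it on a shortened copy of the sample data and splits the captured block into lines. The `text`, `block`, and `lines` names, and the shortened sample itself, are assumptions made for illustration and are not part of the original post.

```python
import re

# Shortened copy of the sample chat text from the question (assumed layout).
text = (
    "CUSTOMER SUPPLIED DATA: \n"
    "- topic: Sign in & Password Support\n"
    "- First Name: Brenda\n"
    "- I need help with: Forgot password or ID\n"
    "\n"
    "  ***  System::[chat.automatonClientOutcome] Hello! How may I help you today?\n"
)

# re.DOTALL lets "." match newlines; the lazy ".+?" stops at the first "***  System:".
m = re.search(r"CUSTOMER SUPPLIED DATA:\s*(.+?)\*\*\*  System:", text, re.DOTALL)

# Guard against a missing match, then trim the trailing blank line and indentation.
block = m.group(1).strip() if m else ""
lines = block.splitlines()
print(lines)
```

Checking `m` before calling `group(1)` avoids an AttributeError on records that do not contain the keyword, which seems likely when looping over many chat rows.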

With "([^\n]+)" you ask it to match everything until it hits a \n, so the capture ends at the first line break, which is probably not what you want.
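To make the difference concrete, here is a small sketch that compares the original single-line capture with a standard-library pattern that keeps consuming lines as long as they start with a dash. The `sample` string and the repeated-group pattern are illustrative assumptions, not something proposed in the answers above.

```python
import re

# Two data lines followed by a blank line and the system line (assumed sample).
sample = (
    "CUSTOMER SUPPLIED DATA: \n"
    "- topic: Sign in\n"
    "- First Name: Brenda\n"
    "\n"
    "  ***  System::[chat.automatonClientOutcome] Hello!\n"
)

# [^\n]+ matches characters that are not a newline, so it ends at the first line break.
one_line = re.search(r"CUSTOMER SUPPLIED DATA:\s*([^\n]+)", sample).group(1)

# Repeating "a dash line plus its newline" continues until a line does not start with "-".
many = re.search(r"CUSTOMER SUPPLIED DATA:\s*((?:-[^\n]*\n)+)", sample).group(1)

print(one_line)
print(many.splitlines())
```

The second pattern relies on every wanted line starting with "-", as in the sample data. If that assumption does not hold, the re.DOTALL version is the safer choice.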


 