Python regular expression for pattern re.search

I want to extract the lines between a keyword and a marker line in text data. Here is my data:

CUSTOMER SUPPLIED DATA: 
- topic: Sign in & Password Support
- First Name: Brenda
- Last Name: Delacruz
- Account number: xxxxxxxxx
- U-verse 4-digit PIN: My PIN is
- 4 digit PIN: xxxx
- Email: deedelacruz28806@yahoo.com
- I need help with: Forgot password or ID

  ***  System::[chat.automatonClientOutcome] Hello! How may I help you today?   ***  System::[chat.queueWaitDisplayed] We are currently experiencing very high chat volumes which may cause long delays. An agent will be with you as soon as possible.

Please help me extract the lines under the keyword "CUSTOMER SUPPLIED DATA:", up to where the "*** System" line starts (that is, the lines between CUSTOMER SUPPLIED DATA: and the *** System line).

I have tried the following code,

import re

m = re.search(r'CUSTOMER SUPPLIED DATA:\s*([^\n]+)', dt["chat_consolidation"][546])

m.group(1)

which gives me only the first line between CUSTOMER SUPPLIED DATA: and the *** System line.

The output is like this:

[out]: - topic: Sign in & Password Support

But my required output should be like this,

[Out]: - topic: Sign in & Password Support
- First Name: Brenda
- Last Name: Delacruz
- Account number: xxxxxxxxx
- U-verse 4-digit PIN: My PIN is
- 4 digit PIN: xxxx
- Email: deedelacruz28806@yahoo.com
- I need help with: Forgot password or ID

Thanks in advance for helping me.

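To get each line as a separate match in one pass, you can use the third-party regex module (pip install regex), which supports \K and \G; the built-in re module does not.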

x="""CUSTOMER SUPPLIED DATA: 
- topic: Sign in & Password Support
- First Name: Brenda
- Last Name: Delacruz
- Account number: xxxxxxxxx
- U-verse 4-digit PIN: My PIN is
- 4 digit PIN: xxxx
- Email: deedelacruz28806@yahoo.com
- I need help with: Forgot password or ID

  ***  System::[chat.automatonClientOutcome] Hello! How may I help you today?   ***  System::[chat.queueWaitDisplayed] We are currently experiencing very high chat volumes which may cause long delays. An agent will be with you as soon as possible.
- topic: Sign in & Password Support
- First Name: Brenda  
  """
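import regex  # third-party module, not the built-in re

# \K discards the header from the match; \G resumes at the end of the previous match,
# so only the consecutive "- ..." lines directly after the header are collected.
print(regex.findall(r"CUSTOMER SUPPLIED DATA: \n\K|\G(?!^)(-[^\n]+)\n", x, flags=regex.VERSION1))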

Output: ['', '- topic: Sign in & Password Support', '- First Name: Brenda', '- Last Name: Delacruz', '- Account number: xxxxxxxxx', '- U-verse 4-digit PIN: My PIN is', '- 4 digit PIN: xxxx', '- Email: deedelacruz28806@yahoo.com', '- I need help with: Forgot password or ID']

See demo.

https://regex101.com/r/naH3C7/2

@vks is correct that the regex module would be better if you want to split it up like that. However, if you really just want what you ask for (a string with everything between CUSTOMER SUPPLIED DATA: and "*** System:"), changing the regexp to something like this works as well:

re.search(r"CUSTOMER SUPPLIED DATA:\s*(.+?)\*\*\*  System:", x, re.DOTALL).group(1)

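With "([^\n]+)" you ask it to match everything until it hits the first \n, which is why only a single line is captured. With re.DOTALL, "." also matches newlines, and the lazy "(.+?)" stops at the first "***  System:".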
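
As a rough, self-contained sketch of the re.DOTALL approach using only the standard library: the function name extract_customer_data and the shortened sample text are illustrative assumptions, not part of the original question. The pattern also adds \s* before the asterisks so the captured block does not end in trailing blank lines.

```python
import re

# Shortened sample modelled on the question's data (assumption: the real text
# would come from something like dt["chat_consolidation"][546]).
text = """CUSTOMER SUPPLIED DATA: 
- topic: Sign in & Password Support
- First Name: Brenda
- I need help with: Forgot password or ID

  ***  System::[chat.automatonClientOutcome] Hello! How may I help you today?
"""


def extract_customer_data(text):
    """Return the lines between the header and the first '*** System:' marker."""
    # re.DOTALL lets '.' match newlines; '.+?' is lazy, so it stops at the
    # first marker. The '\s*' before the asterisks trims trailing whitespace.
    m = re.search(
        r"CUSTOMER SUPPLIED DATA:\s*(.+?)\s*\*{3}\s+System:",
        text,
        re.DOTALL,
    )
    if m is None:
        return []  # header or marker not found
    return m.group(1).splitlines()


lines = extract_customer_data(text)
print("\n".join(lines))
```

Returning a list keeps both uses open: join it with "\n" for a single string, or iterate over it to process each "- key: value" line separately.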
