python regex to select everything before and after a particular string

I am trying to apply a regex to one of the columns in a pandas DataFrame. The column holds text data, and I want to extract a specific block from it. This is a sample of what my data looks like:

Patient Name :
NHI:  ABC2134
DOB:  10/03/1737

Patient Referred from: WTH ABC
Exam performed at:  XYZ Hospital Radiology
Reference:   ABCADAFAD
Date of exam:   12/11/2019
Examination(s) included in this report:
 CT Head

INDICATION:
Fall some time ago with ataxia since. Recent admission with 
tachybrady syndrome. 

It's because your regex begins at "Patient Referred", so the lines before that string are never part of the match:

(?s)Patient Referred(.*?)(?:(?:\r*\n){2})

Can you try re.match(r'(?sm).+CT Head', st).group(0)?
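As a quick check of both patterns, here is a minimal sketch that runs them on a shortened stand-in for the report. The st value below is an assumption for illustration, not the asker's real data.

```python
import re

# Shortened stand-in for the report text; the real column values are
# assumed to follow the same layout.
st = (
    "Patient Name :\n"
    "NHI:  ABC2134\n"
    "\n"
    "Patient Referred from: WTH ABC\n"
    "Examination(s) included in this report:\n"
    " CT Head\n"
    "\n"
    "INDICATION:\n"
    "Fall some time ago with ataxia since.\n"
)

# Original pattern: the capture group only starts after "Patient Referred".
original = re.search(r'(?s)Patient Referred(.*?)(?:(?:\r*\n){2})', st).group(1)

# Suggested pattern: everything from the start of the string through "CT Head".
suggested = re.match(r'(?sm).+CT Head', st).group(0)

print(repr(original))
print(repr(suggested))
```

Because .+ is greedy, the suggested pattern runs to the last "CT Head" in the string, which is worth keeping in mind if that phrase can appear again later in a report.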

In pandas, you can use the extract method as well.

import pandas as pd
import re

# Create a sample dataframe
df = pd.DataFrame([
    {'diagnosis': '''Patient Name :
NHI:  ABC2134
DOB:  10/03/1737

Patient Referred from: WTH ABC
Exam performed at:  XYZ Hospital Radiology
Reference:   ABCADAFAD
Date of exam:   12/11/2019
Examination(s) included in this report:
 CT Head

INDICATION:
Fall some time ago with ataxia since. Recent admission with 
tachybrady syndrome.'''}
])

pat = re.compile(r'^(.*Patient Referred.*?)(?:\r?\n){2}', re.DOTALL)
df_extracted = df.diagnosis.str.extract(pat, expand=True)
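To see what the capture group in pat picks up without going through pandas, the same compiled pattern can be tried on a plain string. This is only a sketch: the sample text is a shortened, assumed version of the report, and pat.search stands in for what str.extract would do per row.

```python
import re

# Same pattern as in the pandas example.
pat = re.compile(r'^(.*Patient Referred.*?)(?:\r?\n){2}', re.DOTALL)

# Shortened, assumed version of one cell of the 'diagnosis' column.
sample = (
    "Patient Name :\n"
    "NHI:  ABC2134\n"
    "\n"
    "Patient Referred from: WTH ABC\n"
    "Examination(s) included in this report:\n"
    " CT Head\n"
    "\n"
    "INDICATION:\n"
    "Fall some time ago with ataxia since."
)

match = pat.search(sample)
# group(1) runs from the start of the text up to the first blank line
# that follows "Patient Referred".
extracted = match.group(1) if match else None
print(extracted)
```

The greedy .* before "Patient Referred" swallows the earlier blank line, and the lazy .*? after it stops at the first blank line that follows.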

You could match the following (with re.DOTALL):

^.+\r?\n *CT Head\r?\n

This regular expression can be broken down as follows.

^            # match beginning of string
.+           # match one or more characters, including line terminators
\r?\n        # match line terminator (CR/LF or LF)
[ ]*CT Head  # match zero or more spaces followed by "CT Head"
\r?\n        # match line terminator (CR/LF or LF)

In the above I put the space in a character class ([ ]) merely to make it visible. \r? is needed for text with Windows line endings (CR/LF).
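A minimal sketch of this pattern in Python, tried against both LF and CR/LF versions of a shortened report. The sample text and variable names are assumptions for illustration.

```python
import re

# The pattern from this answer, compiled with re.DOTALL so "." also
# matches line terminators.
pattern = re.compile(r'^.+\r?\n *CT Head\r?\n', re.DOTALL)

# Shortened, assumed report text with Unix (LF) line endings.
lf_text = (
    "Patient Name :\n"
    "Examination(s) included in this report:\n"
    " CT Head\n"
    "\n"
    "INDICATION:\n"
    "Fall some time ago with ataxia since.\n"
)
# The same text with Windows (CR/LF) line endings.
crlf_text = lf_text.replace("\n", "\r\n")

lf_match = pattern.match(lf_text).group(0)
crlf_match = pattern.match(crlf_text).group(0)

print(repr(lf_match))
print(repr(crlf_match))
```

In both cases the match includes the line terminator after "CT Head", so a trailing .rstrip() may be wanted if only the text is needed.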


Alternatively, you could replace the match of the following regular expression (with re.DOTALL) with an empty string.

(?:(?<= CT Head\n)|(?<= CT Head\r\n)).*

This regular expression can be broken down as follows.

(?:                 # begin non-capture group 
  (?<= CT Head\n)   # current position is preceded by " CT Head\n"   
|
  (?<= CT Head\r\n) # current position is preceded by " CT Head\r\n"   
)
.*                  # match zero or more characters (to end of string)

(?<=...) is a positive lookbehind. Note that Python's re module does not support variable-length lookbehinds such as

(?<= CT Head\r?\n)

which is why two lookbehinds are needed.
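The replacement approach can be sketched with re.sub as follows. The sample text is a shortened, assumed version of the report, and the try/except only illustrates the error the standard re module raises for the variable-length form.

```python
import re

# Everything after " CT Head" plus its line terminator is matched and removed.
tail = re.compile(r'(?:(?<= CT Head\n)|(?<= CT Head\r\n)).*', re.DOTALL)

# Shortened, assumed report text with LF and CR/LF line endings.
lf_text = (
    "Patient Name :\n"
    "Examination(s) included in this report:\n"
    " CT Head\n"
    "\n"
    "INDICATION:\n"
    "Fall some time ago with ataxia since.\n"
)
crlf_text = lf_text.replace("\n", "\r\n")

kept_lf = tail.sub('', lf_text)
kept_crlf = tail.sub('', crlf_text)
print(repr(kept_lf))
print(repr(kept_crlf))

# The single variable-length lookbehind is rejected when compiled with re.
lookbehind_error = None
try:
    re.compile(r'(?<= CT Head\r?\n).*')
except re.error as exc:
    lookbehind_error = exc
print(lookbehind_error)
```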
