I am trying to apply a regex to one of the columns in a pandas DataFrame. The column contains text data, and I want to extract a specific block from it. This is a sample of what my data looks like:
Patient Name :
NHI: ABC2134
DOB: 10/03/1737
Patient Referred from: WTH ABC
Exam performed at: XYZ Hospital Radiology
Reference: ABCADAFAD
Date of exam: 12/11/2019
Examination(s) included in this report:
CT Head
INDICATION:
Fall some time ago with ataxia since. Recent admission with
tachybrady syndrome.
It's because your regex is:
(?s)Patient Referred(.*?)(?:(?:\r*\n){2})
Can you try re.match(r'(?sm).+CT Head', st).group(0)?
In pandas, you can use the str.extract method as well.
import pandas as pd
import re
# Create a sample dataframe
df = pd.DataFrame([
{'diagnosis': '''Patient Name :
NHI: ABC2134
DOB: 10/03/1737
Patient Referred from: WTH ABC
Exam performed at: XYZ Hospital Radiology
Reference: ABCADAFAD
Date of exam: 12/11/2019
Examination(s) included in this report:
CT Head
INDICATION:
Fall some time ago with ataxia since. Recent admission with
tachybrady syndrome.'''}
])
pat = re.compile(r'^(.*Patient Referred.*?)(?:\r?\n){2}', re.DOTALL)
df_extracted = df.diagnosis.str.extract(pat, expand=True)
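As a variation on the snippet above, here is a minimal sketch that anchors on the "CT Head" line instead of on a blank line, since a pattern ending in (?:\r?\n){2} only matches when the text contains two consecutive line terminators. The two-row frame and the name header are made up for illustration; the second row shows that rows without a match come back as missing values.

```python
import re
import pandas as pd

# Hypothetical two-row frame: one report with a "CT Head" line, one without.
df = pd.DataFrame({
    "diagnosis": [
        "Patient Referred from: WTH ABC\n"
        "Examination(s) included in this report:\n"
        "CT Head\n"
        "INDICATION:\n"
        "Fall some time ago with ataxia since.",
        "No matching section here.",
    ]
})

# One capture group: everything from the start of the text through "CT Head".
# re.DOTALL lets "." cross line boundaries; expand=False returns a Series.
header = df["diagnosis"].str.extract(
    r"^(.+\r?\n *CT Head)\r?\n", flags=re.DOTALL, expand=False
)
```

Rows that do not match are left as NaN rather than raising, so the result can be filtered with header.notna() if needed.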
You could match (with re.DOTALL):
^.+\r?\n *CT Head\r?\n
This regular expression can be broken down as follows.
^ # match beginning of string
.+ # match one or more characters, including line terminators
\r?\n # match line terminator (CR/LF or LF)
[ ]*CT Head # match zero or more spaces followed by "CT Head"
\r?\n # match line terminator (CR/LF or LF)
In the above I put the space in a character class ([ ]) merely to make it visible. \r? is needed for files created on Windows.
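A minimal sketch of this match on a plain string, assuming a shortened, made-up report (the names report, header_pattern and header are illustrative only):

```python
import re

# Hypothetical, shortened report; everything after the "CT Head" line
# should be excluded from the match.
report = (
    "Patient Name :\n"
    "NHI: ABC2134\n"
    "Examination(s) included in this report:\n"
    "CT Head\n"
    "INDICATION:\n"
    "Fall some time ago with ataxia since.\n"
)

# re.DOTALL makes "." match line terminators, so ".+" can span lines.
header_pattern = re.compile(r"^.+\r?\n *CT Head\r?\n", re.DOTALL)

match = header_pattern.match(report)
header = match.group(0) if match else None

# The same pattern on Windows-style (CR/LF) line endings.
match_crlf = header_pattern.match(report.replace("\n", "\r\n"))
header_crlf = match_crlf.group(0) if match_crlf else None
```

Because .+ is greedy, the match runs to the last "CT Head" line in the text; if a report could contain several such lines and the first one is wanted, .+? would be the variant to try.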
Alternatively, you could replace the match of the following regular expression (with re.DOTALL) with an empty string.
(?:(?<= CT Head\n)|(?<= CT Head\r\n)).*
This regular expression can be broken down as follows.
(?: # begin non-capture group
(?<= CT Head\n) # current position is preceded by " CT Head\n"
|
(?<= CT Head\r\n) # current position is preceded by " CT Head\r\n"
)
.* # match zero or more characters (to end of string)
(?<=...) is a positive lookbehind. Note that Python's re module does not support variable-length lookbehinds such as (?<= CT Head\r?\n), which is why two lookbehinds are needed.
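A minimal sketch of the replacement approach with re.sub, assuming a made-up report in which "CT Head" is preceded by a space, as the lookbehinds require (the names report, tail_pattern and trimmed are illustrative only):

```python
import re

# Hypothetical report; note the leading space before "CT Head",
# which the lookbehinds as written expect.
report = (
    "Examination(s) included in this report:\n"
    " CT Head\n"
    "INDICATION:\n"
    "Fall some time ago with ataxia since.\n"
)

# Two fixed-width lookbehinds, one per line-ending style.
tail_pattern = re.compile(
    r"(?:(?<= CT Head\n)|(?<= CT Head\r\n)).*", re.DOTALL
)

# Replace everything after the " CT Head" line with an empty string.
trimmed = tail_pattern.sub("", report)

# The same substitution on Windows-style (CR/LF) line endings.
trimmed_crlf = tail_pattern.sub("", report.replace("\n", "\r\n"))
```

If the "CT Head" line has no leading space in the real data, the space inside each lookbehind would need to be dropped.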