Python regex to select everything before and after a particular string
I am trying to apply a regex to a column of a pandas DataFrame that contains text data, in order to extract a specific block. Here is a sample of my data:
Patient Name :
NHI: ABC2134
DOB: 10/03/1737
Patient Referred from: WTH ABC
Exam performed at: XYZ Hospital Radiology
Reference: ABCADAFAD
Date of exam: 12/11/2019
Examination(s) included in this report:
CT Head
INDICATION:
Fall some time ago with ataxia since. Recent admission with
tachybrady syndrome.
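This happens because your regular expression, shown below, stops at the first blank line (two consecutive line breaks) after "Patient Referred":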
(?s)Patient Referred(.*?)(?:(?:\r*\n){2})
Could you try re.match(r'(?sm).+CT Head', st).group(0) instead?
In pandas, you can also use the extract method.
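As a rough sketch of that suggestion, the snippet below runs the same pattern on a shortened, made-up stand-in for one cell of the column. The variable st and its contents are assumptions for illustration, and slicing at m.end() is just one possible way to get the text after the marker.

```python
import re

# Hypothetical sample text standing in for one cell of the column.
st = (
    "Patient Referred from: WTH ABC\n"
    "Examination(s) included in this report:\n"
    "CT Head\n"
    "INDICATION:\n"
    "Fall some time ago with ataxia since.\n"
)

# (?s) lets '.' match line breaks, so the greedy '.+' runs from the start
# of the text up to and including the last "CT Head".
m = re.match(r'(?sm).+CT Head', st)

before = m.group(0) if m else None   # everything up to and including "CT Head"
after = st[m.end():] if m else None  # everything that follows the marker

print(before)
print(after)
```

Checking m for None before calling group(0) avoids an AttributeError on rows that do not contain the marker.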
import pandas as pd
import re
# Create a sample dataframe
df = pd.DataFrame([
{'diagnosis': '''Patient Name :
NHI: ABC2134
DOB: 10/03/1737
Patient Referred from: WTH ABC
Exam performed at: XYZ Hospital Radiology
Reference: ABCADAFAD
Date of exam: 12/11/2019
Examination(s) included in this report:
CT Head
INDICATION:
Fall some time ago with ataxia since. Recent admission with
tachybrady syndrome.'''}
])
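# Capture everything from the start of the text through the "Patient Referred"
# block, stopping at the first blank line (two consecutive line breaks).
pat = re.compile(r'^(.*Patient Referred.*?)(?:\r?\n){2}', re.DOTALL)
# extract returns one column per capture group; rows without a match become NaN.
df_extracted = df.diagnosis.str.extract(pat, expand=True)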
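To make the behaviour of this pattern easier to see, here is a small sketch on made-up data. The two rows are assumptions: the first contains a blank line after the header block, the second has no header at all. It passes the pattern as a string with flags=re.DOTALL, which should behave the same as the compiled pattern above.

```python
import re
import pandas as pd

# Hypothetical two-row frame: only the first row contains the header block
# followed by a blank line.
df_demo = pd.DataFrame({'diagnosis': [
    "Patient Referred from: WTH ABC\nReference: ABCADAFAD\n\nCT Head\nINDICATION:\nFall",
    "no header here",
]})

# Same pattern as above, given as a string with the DOTALL flag.
extracted = df_demo['diagnosis'].str.extract(
    r'^(.*Patient Referred.*?)(?:\r?\n){2}',
    flags=re.DOTALL,
    expand=True,
)

print(extracted)
```

The first row yields the header block up to the blank line, and the second row yields NaN because the pattern needs both "Patient Referred" and two consecutive line breaks to match.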
You can match the following (using re.DOTALL):
^.+\r?\n *CT Head\r?\n
This regular expression can be broken down as follows.
^ # match beginning of string
.+ # match one or more characters, including line terminators
\r?\n # match line terminator (CR/LF or LF)
[ ]*CT Head # match zero or more spaces followed by "CT Head"
\r?\n # match line terminator (CR/LF or LF)
In the above, I put the space in a character class ([ ]) only to make it visible. The \r? is needed for files created on Windows.
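A minimal sketch of how this pattern might be used follows. The helper split_report and the shortened sample strings are hypothetical, and the sample assumes "CT Head" is preceded by a space and sits on its own line.

```python
import re

# The pattern from above; re.DOTALL lets '.' match line terminators.
HEADER = re.compile(r'^.+\r?\n *CT Head\r?\n', re.DOTALL)

def split_report(text):
    """Return (text up to and including the "CT Head" line, remainder), or None."""
    m = HEADER.match(text)
    if m is None:
        return None
    return m.group(0), text[m.end():]

# Hypothetical samples with LF and CR/LF line terminators.
lf = "NHI: ABC2134\n CT Head\nINDICATION:\nFall some time ago\n"
crlf = lf.replace("\n", "\r\n")

for sample in (lf, crlf):
    print(split_report(sample))
```

Because of the optional \r, the same pattern handles both kinds of line terminator without normalizing the text first.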
Alternatively, you can replace matches of the following regular expression (using re.DOTALL) with an empty string.
(?:(?<= CT Head\n)|(?<= CT Head\r\n)).*
This regular expression can be broken down as follows.
(?: # begin non-capture group
(?<= CT Head\n) # current position is preceded by " CT Head\n"
|
(?<= CT Head\r\n) # current position is preceded by " CT Head\r\n"
)
.* # match zero or more characters (to end of string)
(?<=...)
is a positive lookbehind. Note that Python's re module does not support variable-length lookbehinds such as
(?<= CT Head\r?\n)
which is why two lookbehinds are needed.
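The replacement approach could be sketched as below. The sample strings are made up and assume a space before "CT Head", as the lookbehinds require. The last part checks that the single variable-length lookbehind is rejected by the re module.

```python
import re

# Matches everything that follows a " CT Head" line, for LF or CR/LF endings.
TAIL = re.compile(r'(?:(?<= CT Head\n)|(?<= CT Head\r\n)).*', re.DOTALL)

# Hypothetical samples with LF and CR/LF line terminators.
lf = "NHI: ABC2134\n CT Head\nINDICATION:\nFall some time ago\n"
crlf = lf.replace("\n", "\r\n")

# Replacing the match with an empty string keeps only the leading part.
trimmed_lf = TAIL.sub('', lf)
trimmed_crlf = TAIL.sub('', crlf)
print(repr(trimmed_lf))
print(repr(trimmed_crlf))

# A single lookbehind with an optional \r has variable width, which the
# re module refuses to compile.
try:
    re.compile(r'(?<= CT Head\r?\n).*')
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)
```

In pandas, the same pattern could presumably be passed to the column's str.replace method with regex=True, though that is not tested here.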