How to extract 3 or more words after a specific word
I've been trying to extract 3 or more words after Diagnosis: or diagnosis: to no avail.
This is the code I've been trying:
'diagnosis: \s+((?:\w+(?:\s+|$)){2})'
It prints an empty result.
I have managed to make this code work:
"Diagnosis: (\w+)",
"diagnosis: (\w+)",
which gives me the word immediately after Diagnosis: or diagnosis:. How can I make it work for 3 or more words?
#@title Extract Diagnosis { form-width: "20%" }
def extract_Diagnosis(clinical_information):
    PATTERNS = [
        "diagnosis: (\w+).",
        "Diagnosis: (\w+).",
    ]
    for pattern in PATTERNS:
        matches = re.findall(pattern, clinical_information)
        if len(matches) > 0:
            break
    Diagnosis = ''.join([t for t in matches if t.isalpha()])
    return Diagnosis

for index, text in enumerate(texts):
    print(extract_Diagnosis(text))
    print("#"*79, index)
What I'm looking for is 3 or more words that come after diagnosis: or Diagnosis: in 20 PDFs. I've already converted the PDFs to text and extracted the paragraph that contains "diagnosis:" (clinical information).
OK, a new answer that focuses more on the problems with your code than on problems with your regular expression. First of all, your regular expression needs to be tweaked just a little bit by removing the literal space after the colon (combined with \s+ it demands at least two whitespace characters) and changing 2 to 3:
diagnosis:\s+((?:\w+(?:\s+|$)){3})
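To see why the tweak matters, here is a minimal sketch comparing the original pattern with the adjusted one. The sample text is made up for illustration and is not taken from the PDFs.

```python
import re

# Made-up sample text; assumed to have a single space after the colon.
text = "diagnosis: acute viral pharyngitis with fever"

# Original pattern: the literal space followed by \s+ demands at least two
# whitespace characters after the colon, so nothing matches here.
original = r"diagnosis: \s+((?:\w+(?:\s+|$)){2})"

# Tweaked pattern: no literal space, and exactly three words are captured.
tweaked = r"diagnosis:\s+((?:\w+(?:\s+|$)){3})"

print(re.findall(original, text))  # []
print(re.findall(tweaked, text))   # ['acute viral pharyngitis ']
```

Note that each captured match keeps its trailing whitespace, because the whitespace after every word is part of the repeated group.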
Your code has a number of issues. Here's a version of your code that kinda works, although it may not be doing exactly what you want:
import re

def extract_Diagnosis(clinical_information):
    PATTERNS = [r"diagnosis:\s+((?:\w+(?:\s+|$)){3})"]
    matches = []
    for pattern in PATTERNS:
        matches = re.findall(pattern, clinical_information)
        if len(matches) > 0:
            break
    Diagnosis = ''.join([t for t in matches])
    return Diagnosis

texts = ["diagnosis: a b c blah blah blah diagnosis: asdf asdf asdf x x x "]
for index, text in enumerate(texts):
    print(extract_Diagnosis(text))
    print("#"*79, index)
Result:
a b c asdf asdf asdf
Here are the things I fixed in your code:
- I added r to the front of the string constant containing the regular expression. This specifies a "raw string" in Python. You need to either do this or double up your backslashes.
- I removed the if t.isalpha() filter on the results. Given your expression, this will always be False because what you are matching will always contain spaces as well as word characters. I see no reason for this test anyway, since you know exactly what you're getting because what you get matched your regular expression.
I hope this helps!
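If you want a single pattern that covers both Diagnosis: and diagnosis: and returns three or more words rather than exactly three, one possible sketch is below. The helper name extract_diagnosis_words and the sample strings are hypothetical, and the choice of re.IGNORECASE plus the {3,} quantifier is an assumption about what you are after, not something taken from your PDFs.

```python
import re

# Hypothetical compiled pattern: re.IGNORECASE covers "Diagnosis:" and
# "diagnosis:"; {3,} means "three or more" words (use {3} for exactly three).
DIAGNOSIS_RE = re.compile(r"diagnosis:\s+((?:\w+(?:\s+|$)){3,})", re.IGNORECASE)

def extract_diagnosis_words(clinical_information):
    """Return each run of 3+ words found after 'diagnosis:', whitespace-trimmed."""
    return [m.strip() for m in DIAGNOSIS_RE.findall(clinical_information)]

# Made-up sample strings for illustration only.
print(extract_diagnosis_words("Diagnosis: acute viral pharyngitis with fever"))
print(extract_diagnosis_words("diagnosis: asthma"))  # fewer than three words
```

One limitation to keep in mind: a word only counts if it is followed by whitespace or the end of the string, so a word directly followed by punctuation (for example "pharyngitis.") ends the run and is not included.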