I've been trying to extract 3 or more words after Diagnosis: or diagnosis: to no avail.
This is the pattern I've been trying:
'diagnosis: \s+((?:\w+(?:\s+|$)){2})'
It prints an empty result.
I have managed to make these patterns work:
"Diagnosis: (\w+)",
"diagnosis: (\w+)",
which give me the word immediately after Diagnosis: or diagnosis:. How can I make it work for 3 or more words?
#@title Extract Diagnosis { form-width: "20%" }
def extract_Diagnosis(clinical_information):
    PATTERNS = [
        "diagnosis: (\w+).",
        "Diagnosis: (\w+).",
    ]
    for pattern in PATTERNS:
        matches = re.findall(pattern, clinical_information)
        if len(matches) > 0:
            break
    Diagnosis = ''.join([t for t in matches if t.isalpha()])
    return Diagnosis

for index, text in enumerate(texts):
    print(extract_Diagnosis(text))
    print("#"*79, index)
What I'm looking for is 3 or more words that come after diagnosis: or Diagnosis: in 20 PDFs. I've already converted each PDF to text and extracted the paragraph that contains "diagnosis:" (the clinical information).
OK, a new answer that focuses more on the problems with your code than on the problems with your regular expression. First of all, your regular expression needs to be tweaked just a little bit by removing the literal space that sits between the colon and \s+, and by changing 2 to 3:
diagnosis:\s+((?:\w+(?:\s+|$)){3})
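To see why the original attempt printed nothing, here is a minimal sketch comparing the two patterns. The sample sentence is made up for illustration and is not from your PDFs.

```python
import re

# Made-up sample text (an assumption, not from the original post).
text = "diagnosis: acute viral bronchitis with fever"

# Original attempt: a literal space followed by \s+ demands at least two
# whitespace characters after the colon, so nothing matches here.
original = r"diagnosis: \s+((?:\w+(?:\s+|$)){2})"

# Tweaked version: no literal space, and three words instead of two.
tweaked = r"diagnosis:\s+((?:\w+(?:\s+|$)){3})"

original_matches = re.findall(original, text)
tweaked_matches = re.findall(tweaked, text)

print(original_matches)  # []
print(tweaked_matches)   # ['acute viral bronchitis ']
```

Note that the captured group keeps the trailing whitespace after the third word.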
Your code has a number of issues. Here's a version of your code that kinda works, although it may not be doing exactly what you want:
import re

def extract_Diagnosis(clinical_information):
    PATTERNS = [r"diagnosis:\s+((?:\w+(?:\s+|$)){3})"]
    matches = []
    for pattern in PATTERNS:
        matches = re.findall(pattern, clinical_information)
        if len(matches) > 0:
            break
    Diagnosis = ''.join([t for t in matches])
    return Diagnosis

texts = ["diagnosis: a b c blah blah blah diagnosis: asdf asdf asdf x x x "]
for index, text in enumerate(texts):
    print(extract_Diagnosis(text))
    print("#"*79, index)
Result:
a b c asdf asdf asdf
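Notice that both matches end up glued into one string, and that {3} captures exactly three words. If what you want is each phrase separately, with "3 or more" words and either capitalization, one possible variant is sketched below. The helper name extract_diagnoses, the {3,} quantifier, and the re.IGNORECASE flag are my own assumptions about what you are after, and the sample text is made up.

```python
import re

# Hypothetical variant (not the code from the answer): {3,} means
# "three or more words", and re.IGNORECASE covers both "Diagnosis:"
# and "diagnosis:" with a single pattern.
DIAGNOSIS_RE = re.compile(r"diagnosis:\s+((?:\w+(?:\s+|$)){3,})", re.IGNORECASE)

def extract_diagnoses(clinical_information):
    """Return every matched phrase as its own stripped string."""
    return [m.strip() for m in DIAGNOSIS_RE.findall(clinical_information)]

# Made-up sample text for illustration only.
sample = "Diagnosis: acute viral bronchitis\nPlan: rest. diagnosis: type 2 diabetes mellitus"
print(extract_diagnoses(sample))
# ['acute viral bronchitis', 'type 2 diabetes mellitus']
```

Be aware that {3,} is greedy: it keeps consuming words until it reaches a word that is followed by something other than whitespace or the end of the text, such as a comma, period, or colon. It also finds nothing when fewer than three words precede the first such character.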
Here are the things I fixed with your code:
- Added r to the front of the string constant containing the regular expression. This specifies a "raw string" in Python. You should either do this or double up your backslashes, so that Python does not try to interpret sequences such as \s as string escapes.
- Removed if t.isalpha(). Given your expression, this will always be False because what you are matching will always contain spaces as well as word characters. I see no reason for this test anyway, since you know exactly what you're getting: whatever you get has already matched your regular expression.
I hope this helps!
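For what it's worth, both points are easy to check in a Python session. This is just an illustrative sketch; the variable names are mine.

```python
# In a normal string "\b" is a single backspace character, while in a
# raw string it stays as backslash + b, which the regex engine reads
# as a word boundary.
plain_len = len("\b")           # 1
raw_len = len(r"\b")            # 2

# A raw string and a doubled backslash produce the same pattern text.
doubled_same = r"\s" == "\\s"   # True

# str.isalpha() is False as soon as the string contains a space,
# which every three-word match does.
matched = "a b c "
whole_is_alpha = matched.isalpha()                           # False
each_word_is_alpha = all(w.isalpha() for w in matched.split())  # True

print(plain_len, raw_len, doubled_same, whole_is_alpha, each_word_is_alpha)
```

If you did want an alphabetic check, it would have to be applied word by word, as in the last line, rather than to the whole match.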