
How to extract three or more words after a specific word

I've been trying to extract three or more words after Diagnosis: or diagnosis:, to no avail.

This is the code I've been trying:

'diagnosis: \s+((?:\w+(?:\s+|$)){2})'

It prints an empty result.

I have managed to make this code work:

"Diagnosis: (\w+)",
       "diagnosis: (\w+)",

which gives me the word immediately after Diagnosis: or diagnosis:. How can I make it work for three or more words?

 #@title Extract Diagnosis { form-width: "20%" }


 def extract_Diagnosis(clinical_information):
  PATTERNS = [
    "diagnosis: (\w+).",
    "Diagnosis: (\w+).",
    

     ]

 for pattern in PATTERNS:
    matches = re.findall(pattern, clinical_information)
    if len(matches) > 0:
        break

   Diagnosis = ''.join([t for t in matches if t.isalpha()])

   return Diagnosis

    for index, text in enumerate(texts):
     print(extract_Diagnosis(text))
      print("#"*79, index)

What I'm looking for is three or more words that come after diagnosis: or Diagnosis: in 20 PDFs. I've already converted the PDFs to text and extracted the paragraph that contains "diagnosis:" (the clinical information).

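OK, here is a new answer that focuses more on the problems with your code than on the problems with your regular expression. First, the regular expression needs a small tweak: remove the literal space after the colon and change the 2 to 3. With that space in place, the pattern demands a space followed by at least one more whitespace character (\s+), so text with a single space after the colon never matches: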

diagnosis:\s+((?:\w+(?:\s+|$)){3})
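
To see the difference between the two expressions, here is a small comparison. The sample sentence is made up for illustration and is not from your PDFs:

```python
import re

# Pattern from the question: a literal space followed by \s+, two words.
original = r"diagnosis: \s+((?:\w+(?:\s+|$)){2})"
# Tweaked pattern: \s+ only, three words.
fixed = r"diagnosis:\s+((?:\w+(?:\s+|$)){3})"

# Made-up sample text with a single space after the colon.
sample = "diagnosis: acute viral bronchitis suspected"

print(re.findall(original, sample))  # [] (a second whitespace character is required)
print(re.findall(fixed, sample))     # ['acute viral bronchitis ']
```

Note that the captured text keeps its trailing whitespace, because the whitespace after each word is part of the group.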

Your code has a number of issues. Here's a version of your code that kinda works, although it may not be doing exactly what you want:

import re

def extract_Diagnosis(clinical_information):
    PATTERNS = [r"diagnosis:\s+((?:\w+(?:\s+|$)){3})"]
    matches = []
    for pattern in PATTERNS:
        matches = re.findall(pattern, clinical_information)
        if len(matches) > 0:
            break
    Diagnosis = ''.join([t for t in matches])
    return Diagnosis


texts = ["diagnosis: a b c    blah blah blah      diagnosis:   asdf asdf asdf  x x x "]

for index, text in enumerate(texts):
    print(extract_Diagnosis(text))
    print("#"*79, index)

Result:

a b c    asdf asdf asdf
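
The two matches run together in that output because the function joins them with an empty string. If you would rather get each diagnosis separately, one possible variation is sketched below. The name extract_diagnoses is my own and hypothetical, and I am assuming you want a list with the surrounding whitespace stripped:

```python
import re

# Same expression as above, compiled once for reuse.
PATTERN = re.compile(r"diagnosis:\s+((?:\w+(?:\s+|$)){3})")

def extract_diagnoses(clinical_information):
    """Return each three-word match as its own string, whitespace stripped."""
    return [m.strip() for m in PATTERN.findall(clinical_information)]

text = "diagnosis: a b c    blah blah blah      diagnosis:   asdf asdf asdf  x x x "
print(extract_diagnoses(text))  # ['a b c', 'asdf asdf asdf']
```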

Here are the things I fixed with your code:

  1. I replaced the two regular expressions with the one expression in your question, with the modifications mentioned above.
  2. I added an r to the front of the string literal containing the regular expression. This specifies a "raw string" in Python. You need to either do this or double up your backslashes.
  3. You were filtering your results with the expression if t.isalpha(). Given your expression, this will always be False, because what you are matching will always contain whitespace as well as word characters. I see no reason for this test anyway, since you know exactly what you're getting: it matched your regular expression.
  4. I fixed indentation so that everything worked. It may be that you had that right in your original code and it just got messed up moving it into your question.
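
The version above only matches lowercase diagnosis: and captures exactly three words. If you need both spellings and "three or more" words, something along these lines might work. This is only a sketch: the pattern is a different one from the expression above, the function name and the sample text are made up, and I am assuming a diagnosis ends at the first character that is neither a word character nor whitespace (such as a period):

```python
import re

# re.IGNORECASE covers both "diagnosis:" and "Diagnosis:".
# One word followed by {2,} more words means three or more words in total.
PATTERN = re.compile(r"diagnosis:\s+(\w+(?:\s+\w+){2,})", re.IGNORECASE)

def extract_diagnosis_phrases(clinical_information):
    """Return every run of three or more words that follows the keyword."""
    return PATTERN.findall(clinical_information)

sample = "Diagnosis: acute viral bronchitis with fever. Follow up in two weeks."
print(extract_diagnosis_phrases(sample))                    # ['acute viral bronchitis with fever']
print(extract_diagnosis_phrases("diagnosis: mild asthma"))  # [] (only two words)
```

Because the repetition is greedy, it keeps consuming words until it hits punctuation, so you may want to cap it (for example {2,5}) if your paragraphs have no punctuation after the diagnosis.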

I hope this helps!
