
How to extract 3 or more words after a specific word

I've been trying to extract 3 or more words after Diagnosis: or diagnosis:, to no avail.

This is the code I've been trying:

'diagnosis: \s+((?:\w+(?:\s+|$)){2})'

It prints an empty result.

I have managed to make this code work:

"Diagnosis: (\w+)",
       "diagnosis: (\w+)",

which gives me the word immediately after Diagnosis: or diagnosis:. How can I make it work for 3 or more words?

 #@title Extract Diagnosis { form-width: "20%" }


 def extract_Diagnosis(clinical_information):
  PATTERNS = [
    "diagnosis: (\w+).",
    "Diagnosis: (\w+).",
    

     ]

 for pattern in PATTERNS:
    matches = re.findall(pattern, clinical_information)
    if len(matches) > 0:
        break

   Diagnosis = ''.join([t for t in matches if t.isalpha()])

   return Diagnosis

    for index, text in enumerate(texts):
     print(extract_Diagnosis(text))
      print("#"*79, index)

What I'm looking for is the 3 or more words that come after diagnosis: or Diagnosis: in 20 PDFs. I've already converted the PDFs to text and extracted the paragraph that contains "diagnosis:" (the clinical information).

OK, a new answer that focuses more on the problems with your code than on the problems with your regular expression. First of all, your regular expression needs a small tweak: remove the literal space before \s+ and change 2 to 3:

diagnosis:\s+((?:\w+(?:\s+|$)){3})
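To see why the original pattern came back empty, here is a small sketch comparing the two patterns. The sample sentence is made up for illustration. The literal space followed by \s+ requires at least two whitespace characters after the colon, so text with a single space never matches.

```python
import re

# Pattern from the question: a literal space followed by \s+ means at least
# two whitespace characters must follow the colon, and {2} stops at two words.
original = r"diagnosis: \s+((?:\w+(?:\s+|$)){2})"
# Tweaked pattern: no literal space, and {3} for three words.
tweaked = r"diagnosis:\s+((?:\w+(?:\s+|$)){3})"

# Made-up sample text with a single space after the colon.
sample = "diagnosis: acute viral bronchitis with cough"

original_matches = re.findall(original, sample)
tweaked_matches = re.findall(tweaked, sample)

print(original_matches)  # []
print(tweaked_matches)   # ['acute viral bronchitis ']
```

Note that the captured text keeps the whitespace that follows the third word.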

Your code has a number of issues. Here's a version of your code that kind of works, although it may not do exactly what you want:

import re

def extract_Diagnosis(clinical_information):
    PATTERNS = [r"diagnosis:\s+((?:\w+(?:\s+|$)){3})"]
    matches = []
    for pattern in PATTERNS:
        matches = re.findall(pattern, clinical_information)
        if len(matches) > 0:
            break
    Diagnosis = ''.join([t for t in matches])
    return Diagnosis


texts = ["diagnosis: a b c    blah blah blah      diagnosis:   asdf asdf asdf  x x x "]

for index, text in enumerate(texts):
    print(extract_Diagnosis(text))
    print("#"*79, index)

Result:

a b c    asdf asdf asdf
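The pattern in the code above only matches the lowercase diagnosis: and captures exactly three words. If the goal is to match Diagnosis: as well and to keep every word after the third, one possible variant is sketched below. It assumes that re.IGNORECASE and the {3,} quantifier are acceptable for your data; the function name extract_diagnoses and the sample texts are hypothetical.

```python
import re

# Assumptions: "Diagnosis:" and "diagnosis:" should both match, and all words
# after the first three are wanted too, so {3,} replaces {3}.
PATTERN = re.compile(r"diagnosis:\s+((?:\w+(?:\s+|$)){3,})", re.IGNORECASE)

def extract_diagnoses(clinical_information):
    """Return one whitespace-normalized string per 'diagnosis:' occurrence."""
    return [" ".join(m.split()) for m in PATTERN.findall(clinical_information)]

# Made-up sample texts.
texts = [
    "Diagnosis: chronic kidney disease stage three",
    "diagnosis: a b c    blah",
]

results = [extract_diagnoses(text) for text in texts]
print(results)  # [['chronic kidney disease stage three'], ['a b c blah']]
```

Because each word must be followed by whitespace or the end of the string, a word directly followed by punctuation (for example a trailing period) ends the capture and is not included.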

Here are the things I fixed in your code:

  1. I replaced the two regular expressions with the single expression from your question, with the modifications mentioned above.
  2. I added an r to the front of the string constant containing the regular expression. This specifies a "raw string" in Python. You need to either do this or double up your backslashes.
  3. You were filtering your results with the expression if t.isalpha(). Given your expression, this will always be False, because what you are matching will always contain spaces as well as word characters. I see no reason for this test anyway, since everything returned has already matched your regular expression.
  4. I fixed the indentation so that everything works. It may be that you had it right in your original code and it just got messed up when you moved it into your question.
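Points 2 and 3 can be checked directly. This is only a small sketch with made-up strings: it confirms that a raw string and a string with doubled backslashes are the same pattern, and that isalpha() rejects anything containing a space.

```python
# Point 2: r"\s" and "\\s" are the same two characters (backslash, s),
# which is what the regex engine needs to receive.
same_pattern = r"diagnosis:\s+(\w+)" == "diagnosis:\\s+(\\w+)"
print(same_pattern)  # True

# Point 3: str.isalpha() is False as soon as the string contains a space,
# so the filter in the question discards every multi-word match.
three_words = "a b c "
one_word = "abc"
print(three_words.isalpha())  # False
print(one_word.isalpha())     # True
```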

I hope this helps!
