Python RegEx code to detect specific features in a sentence

I created a simple word feature detector. So far it can find particular features (jumbled within the string), but the algorithm gets confused by certain sequences of words. Let me illustrate:

from nltk.tokenize import word_tokenize
negative_descriptors = ['no', 'unlikely', 'no evidence of']
negative_descriptors = '|'.join(negative_descriptors)
negative_trailers = ['not present', 'not evident']
negative_trailers = '|'.join(negative_descriptors)

keywords = ['disc prolapse', 'vertebral osteomyelitis', 'collection']

def feature_match(message, keywords, negative_descriptors):
    if re.search(r"("+negative_descriptors+")" + r".*?" + r"("+keywords+")", message): return True
    if re.search(r"("+keywords+")" + r".*?" + r"("+negative_trailers+")", message): return True

The above returns True for the following messages:

message = 'There is no evidence of a collection.' 
message = 'A collection is not present.'

That is correct, as it implies that the keyword/condition I am looking for is NOT present. However, it returns None for the following messages:

message = 'There is no evidence of disc prolapse, collection or vertebral osteomyelitis.'
message = 'There is no evidence of disc prolapse/vertebral osteomyelitis/ collection.'

It seems to be matching 'or vertebral osteomyelitis' in the first message and '/ collection' in the second message as negative matches, but this is wrong and implies that the message reads 'the condition that I am looking for IS present'. It should really be returning 'True' instead.

How do I prevent this?

There are several problems with the code you posted:

  1. negative_trailers = '|'.join(negative_descriptors) should be negative_trailers = '|'.join(negative_trailers)
  2. You should also convert your keywords list to a string, as you did for the other lists, so that it can be passed to a regex
  3. There is no need to repeat the r prefix three times in your regex
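As a small illustration of point 2: concatenating a str with a list raises a TypeError before re.search is ever called, whereas the joined string forms a valid alternation. The variable names error, keyword_pattern and match below are only for this sketch.

```python
import re

keywords = ['disc prolapse', 'vertebral osteomyelitis', 'collection']

# Concatenating a str with a list fails before any regex is evaluated.
error = None
try:
    pattern = "(" + keywords + ")"
except TypeError as exc:
    error = str(exc)

# Joining with '|' produces an alternation string the regex engine accepts.
keyword_pattern = "(" + "|".join(keywords) + ")"
match = re.search(keyword_pattern, "A collection is not present.")
print(error)
print(keyword_pattern, match.group(1))
```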

After these corrections your code should look like this:

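import re

# Join each list into a regex alternation such as 'a|b|c'.
negative_descriptors = ['no', 'unlikely', 'no evidence of']
negative_descriptors = '|'.join(negative_descriptors)
negative_trailers = ['not present', 'not evident']
negative_trailers = '|'.join(negative_trailers)

keywords = ['disc prolapse', 'vertebral osteomyelitis', 'collection']
keywords = '|'.join(keywords)

# Example message taken from the question.
message = 'There is no evidence of disc prolapse, collection or vertebral osteomyelitis.'
neg_desc_present = False
# A negative descriptor followed by a keyword, e.g. 'no evidence of a collection'.
if re.search(r"("+negative_descriptors+").*("+keywords+")", message): neg_desc_present = True
# A keyword followed by a negative trailer, e.g. 'collection is not present'.
if re.search(r"("+keywords+").*("+negative_trailers+")", message): neg_desc_present = True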
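The corrected logic could also be wrapped in a reusable function. The sketch below is one possible variant, not the original poster's code: the helper names alternation and is_negated, the use of re.escape, the \b word boundaries and the case-insensitive flag are all assumptions added here. The word boundaries are meant to stop the short descriptor 'no' from matching inside longer words such as 'known'.

```python
import re

NEGATIVE_DESCRIPTORS = ['no evidence of', 'no', 'unlikely']
NEGATIVE_TRAILERS = ['not present', 'not evident']
KEYWORDS = ['disc prolapse', 'vertebral osteomyelitis', 'collection']


def alternation(phrases):
    """Build a word-bounded, non-capturing alternation from literal phrases."""
    return r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b"


# Negative descriptor somewhere before a keyword.
NEG_BEFORE = re.compile(
    alternation(NEGATIVE_DESCRIPTORS) + r".*?" + alternation(KEYWORDS),
    re.IGNORECASE,
)
# Keyword somewhere before a negative trailer.
NEG_AFTER = re.compile(
    alternation(KEYWORDS) + r".*?" + alternation(NEGATIVE_TRAILERS),
    re.IGNORECASE,
)


def is_negated(message):
    """Return True if a keyword is preceded by a descriptor or followed by a trailer."""
    return bool(NEG_BEFORE.search(message) or NEG_AFTER.search(message))


messages = [
    'There is no evidence of a collection.',
    'A collection is not present.',
    'There is no evidence of disc prolapse, collection or vertebral osteomyelitis.',
    'There is no evidence of disc prolapse/vertebral osteomyelitis/ collection.',
    'There is a collection.',
]
for m in messages:
    print(is_negated(m), m)
```

Note that `.*?` still spans the rest of the sentence, so a negation in one clause can be paired with a keyword in a later, unrelated clause. Splitting the text into sentences or clauses first would probably be needed for anything beyond simple reports.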
