Python RegEx code to detect specific features in a sentence
I created a simple word feature detector. So far I have been able to find particular features (jumbled within) the string, but the algorithm gets confused by certain sequences of words. Let me illustrate:
import re
from nltk.tokenize import word_tokenize
negative_descriptors = ['no', 'unlikely', 'no evidence of']
negative_descriptors = '|'.join(negative_descriptors)
negative_trailers = ['not present', 'not evident']
negative_trailers = '|'.join(negative_descriptors)
keywords = ['disc prolapse', 'vertebral osteomyelitis', 'collection']
def feature_match(message, keywords, negative_descriptors):
    if re.search(r"("+negative_descriptors+")" + r".*?" + r"("+keywords+")", message):
        return True
    if re.search(r"("+keywords+")" + r".*?" + r"("+negative_trailers+")", message):
        return True
The above returns True for the following messages:
message = 'There is no evidence of a collection.'
message = 'A collection is not present.'
That is correct, as it implies that the keyword/condition I am looking for is NOT present. However, it returns None for the following messages:
message = 'There is no evidence of disc prolapse, collection or vertebral osteomyelitis.'
message = 'There is no evidence of disc prolapse/vertebral osteomyelitis/ collection.'
It seems to be matching 'or vertebral osteomyelitis' in the first message and '/ collection' in the second message as negative matches, but this is wrong and implies that the message reads 'the condition that I am looking for IS present'. It should really be returning True instead.
How do I prevent this?
There are several problems with the code you posted:
negative_trailers = '|'.join(negative_descriptors)
should be
negative_trailers = '|'.join(negative_trailers)
The keywords list also has to be joined into a pattern string, keywords = '|'.join(keywords), before it is concatenated into the regex.
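To see why the typo matters, here is a small sketch that prints the pattern each join produces. The variable names buggy_trailer_pattern and fixed_trailer_pattern are made up for this illustration and do not appear in the question.

```python
negative_descriptors = ['no', 'unlikely', 'no evidence of']
negative_trailers = ['not present', 'not evident']

# The typo from the question: the descriptor list is joined a second time,
# so the trailer phrases never make it into the pattern.
buggy_trailer_pattern = '|'.join(negative_descriptors)

# The intended join uses the trailer list.
fixed_trailer_pattern = '|'.join(negative_trailers)

print(buggy_trailer_pattern)  # no|unlikely|no evidence of
print(fixed_trailer_pattern)  # not present|not evident
```

With the typo, the second re.search looks for a keyword followed by 'no', 'unlikely' or 'no evidence of', and never for 'not present' or 'not evident'.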
After these corrections your code should look like this:
import re

negative_descriptors = ['no', 'unlikely', 'no evidence of']
negative_descriptors = '|'.join(negative_descriptors)
negative_trailers = ['not present', 'not evident']
negative_trailers = '|'.join(negative_trailers)
keywords = ['disc prolapse', 'vertebral osteomyelitis', 'collection']
keywords = '|'.join(keywords)

message = 'There is no evidence of a collection.'  # sample message from the question
neg_desc_present = False  # stays False when neither pattern matches
if re.search(r"("+negative_descriptors+").*("+keywords+")", message):
    neg_desc_present = True
if re.search(r"("+keywords+").*("+negative_trailers+")", message):
    neg_desc_present = True
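If you want this as a reusable function, one possible sketch is below. It is not the code from the question: the names alternation and is_negated are invented here, and the use of re.escape, \b word boundaries and re.IGNORECASE are my own assumptions about what you want. The word boundaries stop 'no' from matching inside words such as 'not', and sorting the phrases longest first means 'no evidence of' is tried before 'no'.

```python
import re

NEGATIVE_DESCRIPTORS = ['no', 'unlikely', 'no evidence of']
NEGATIVE_TRAILERS = ['not present', 'not evident']
KEYWORDS = ['disc prolapse', 'vertebral osteomyelitis', 'collection']


def alternation(phrases):
    """Build a non-capturing group that matches any phrase as whole words."""
    # Longest phrases first, so 'no evidence of' is tried before 'no'.
    ordered = sorted(phrases, key=len, reverse=True)
    return r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b"


def is_negated(message, keywords=KEYWORDS):
    """Return True if a negative phrase precedes, or a negative trailer follows, a keyword."""
    neg_before = alternation(NEGATIVE_DESCRIPTORS) + r".*?" + alternation(keywords)
    neg_after = alternation(keywords) + r".*?" + alternation(NEGATIVE_TRAILERS)
    return bool(
        re.search(neg_before, message, re.IGNORECASE)
        or re.search(neg_after, message, re.IGNORECASE)
    )


messages = [
    'There is no evidence of a collection.',
    'A collection is not present.',
    'There is no evidence of disc prolapse, collection or vertebral osteomyelitis.',
    'There is no evidence of disc prolapse/vertebral osteomyelitis/ collection.',
]
for message in messages:
    print(is_negated(message), message)
```

Because the function wraps the result in bool(), it returns False rather than None when nothing matches. Note that .*? can still span clause and sentence boundaries, so a negative word in one sentence may be paired with a keyword in the next; if that matters, split the text into sentences first.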