
Regex fuzzy word match

Tough regex question: I want to use regexes to extract information from news sentences about crackdowns. Here are some examples:

doc1 = "5 young students arrested"
doc2 = "10 rebels were reported killed"

I want to match sentences based on lists of entities and outcomes:

entities = ['students','rebels']
outcomes = ['arrested','killed']

How can I use a regex to extract the number of participants (0 to 99999), any of the entities, and any of the outcomes, all while ignoring intervening text (such as 'young' or 'were reported')? This is what I have:

re.findall(r'\d{1,5} \D{1,50}'+ '|'.join(entities) + '\D{1,50}' + '|'.join(outcomes),doc1)

That is, a number, some optional random text, an entity, some more optional random text, and an outcome. Something is going wrong, I think because of the OR statements. Thanks for your help!

This regex should match your two examples:

pattern = r'\d+\s+.*?(' + '|'.join(entities) + r').*?(' + '|'.join(outcomes) + ')'

What you were missing was parentheses around the ORs.
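As a minimal sketch of the grouped pattern above, the snippet below runs it against both example sentences. The helper name extract, the PATTERN constant, and the extra capture group around the number are assumptions added for illustration, not part of the original answer; re.escape is used only as a precaution in case a list entry ever contains regex metacharacters.

```python
import re

entities = ['students', 'rebels']
outcomes = ['arrested', 'killed']

# Hypothetical compiled pattern: number, lazy filler, entity, lazy filler, outcome.
PATTERN = re.compile(
    r'(\d+)\s+.*?('
    + '|'.join(map(re.escape, entities))
    + r').*?('
    + '|'.join(map(re.escape, outcomes))
    + r')'
)

def extract(sentence):
    """Return (number, entity, outcome), or None if the sentence does not match."""
    m = PATTERN.search(sentence)
    if m is None:
        return None
    return int(m.group(1)), m.group(2), m.group(3)

print(extract("5 young students arrested"))       # (5, 'students', 'arrested')
print(extract("10 rebels were reported killed"))  # (10, 'rebels', 'killed')
print(extract("no numbers here"))                 # None
```

Checking for None before calling group() avoids an AttributeError on sentences that do not fit the pattern.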

However, using only regex likely won't give you good results. Consider using a Natural Language Processing library such as NLTK, which can parse sentences.

As @ReutSharabani already answered, this is not a proper way to do NLP, but this answers the literal question.

The regex should read:

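import re

doc1 = "5 young students arrested"  # example sentence from the question
entities = ['students', 'rebels']
outcomes = ['arrested', 'killed']
# Group each alternation so that '|' only separates the listed words.
p = re.compile(r'(\d{1,5})\D{1,50}(' + '|'.join(entities) + r')\D{1,50}(' + '|'.join(outcomes) + ')')
m = p.match(doc1)  # returns None if the sentence does not fit the pattern
number = m.group(1)   # '5'
entity = m.group(2)   # 'students'
outcome = m.group(3)  # 'arrested'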

You forgot to group your OR operations with parentheses (). Instead, what you generated was a|b|\\W|c|d|\\W (short version). Because | has the lowest precedence in a regex, each ungrouped | splits the entire expression into separate alternatives.
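To see concretely what the missing grouping does, here is a small sketch that builds the ungrouped expression from the question and a grouped variant, then runs both on the two example sentences. The variable names ungrouped and grouped are illustrative only, and the non-capturing groups are a choice made here so that findall returns the whole match.

```python
import re

entities = ['students', 'rebels']
outcomes = ['arrested', 'killed']
docs = ["5 young students arrested", "10 rebels were reported killed"]

# Ungrouped, as in the question: '|' splits the whole expression into three
# alternatives: '\d{1,5} \D{1,50}students', 'rebels\D{1,50}arrested', 'killed'.
ungrouped = (r'\d{1,5} \D{1,50}' + '|'.join(entities)
             + r'\D{1,50}' + '|'.join(outcomes))

# Grouped with non-capturing (?:...) so each '|' only separates the listed words.
grouped = (r'\d{1,5}\D{1,50}(?:' + '|'.join(entities)
           + r')\D{1,50}(?:' + '|'.join(outcomes) + ')')

ungrouped_results = [re.findall(ungrouped, d) for d in docs]
grouped_results = [re.findall(grouped, d) for d in docs]

print(ungrouped_results)  # [['5 young students'], ['killed']]
print(grouped_results)    # [['5 young students arrested'], ['10 rebels were reported killed']]
```

The ungrouped version stops after the entity in the first sentence and matches only the bare word 'killed' in the second, which is why the original findall call appeared to misbehave.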

You ought to try out the third-party regex module! It has built-in fuzzy matching capabilities. The other answers seem much more robust and sleek, but this could be done simply with fuzzy matching as well!

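import regex  # third-party module (pip install regex), not the standard-library re
# The fuzzy constraint applies to the preceding item, so wrap the whole pattern in a group.
pattern = r'(?:\d{1,5}(%(entities)s)(%(outcomes)s)){i}' % {'entities': '|'.join(entities), 'outcomes': '|'.join(outcomes)}
regex.match(pattern, news_sentence)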

What's happening here is that {i} indicates you want a match with any number of insertions. The problem is that it could also insert characters into one of the entities or outcomes and still yield a match. If you want to accept slight spelling variations in any of your outcomes or entities, you could also use {e<=1} or something similar. Read more in the provided link about approximate matching!
