
Regex fuzzy word match

Tough regex question: I want to use regexes to extract information from news sentences about crackdowns. Here are some examples:

doc1 = "5 young students arrested"
doc2 = "10 rebels were reported killed"

I want to match sentences based on lists of entities and outcomes:

entities = ['students','rebels']
outcomes = ['arrested','killed']

How can I use a regex to extract the number of participants from 0-99999, any of the entities, any of the outcomes, all while ignoring random text (such as 'young' or 'were reported')? This is what I have:

re.findall(r'\d{1,5} \D{1,50}'+ '|'.join(entities) + '\D{1,50}' + '|'.join(outcomes),doc1)

That is: a number, some optional random text, an entity, some more optional random text, and an outcome. Something is going wrong, I think because of the OR statements. Thanks for your help!

This regex should match your two examples:

pattern = r'\d+\s+.*?(' + '|'.join(entities) + r').*?(' + '|'.join(outcomes) + ')'

What you were missing were parentheses around the ORs.
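As a quick check, here is a minimal sketch that runs this kind of pattern against both example sentences. Two things are additions of mine and not part of the answer above: a capture group around the number, and `re.escape` on each list item so that any regex metacharacters in the lists are treated literally.

```python
import re

entities = ['students', 'rebels']
outcomes = ['arrested', 'killed']

# Parentheses limit each '|' to its own list of alternatives.
# The group around \d+ is an addition so the number is captured too.
pattern = re.compile(
    r'(\d+)\s+.*?(' + '|'.join(map(re.escape, entities)) + r').*?('
    + '|'.join(map(re.escape, outcomes)) + r')'
)

for doc in ["5 young students arrested", "10 rebels were reported killed"]:
    m = pattern.search(doc)
    print(m.groups() if m else None)
# ('5', 'students', 'arrested')
# ('10', 'rebels', 'killed')
```

`search` is used here instead of `match` so that a sentence with leading text before the number would still be found.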

However, using regex alone likely won't give you good results. Consider using a Natural Language Processing library such as NLTK, which can parse sentences.

As @ReutSharabani already answered, this is not a proper way to do NLP, but this answers the literal question.

The regex should read:

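import re

doc1 = "5 young students arrested"
entities = ['students', 'rebels']
outcomes = ['arrested', 'killed']
# Raw strings keep \D intact; the parentheses confine each '|' to its own list
p = re.compile(r'(\d{1,5})\D{1,50}(' + '|'.join(entities) + r')\D{1,50}(' + '|'.join(outcomes) + r')')
m = p.match(doc1)     # match() anchors at the start of the string
number = m.group(1)   # '5'
entity = m.group(2)   # 'students'
outcome = m.group(3)  # 'arrested'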

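You forgot to group your OR operations with parentheses. Alternation has the lowest precedence in a regex, so what you generated was parsed as three separate alternatives: \d{1,5} \D{1,50}students, or rebels\D{1,50}arrested, or killed.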
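To make the precedence problem concrete, here is a small sketch comparing the ungrouped pattern from the question with the grouped one. The variable names `ungrouped`, `grouped`, `ungrouped_hits` and `grouped_hits` are only illustrative.

```python
import re

entities = ['students', 'rebels']
outcomes = ['arrested', 'killed']
doc1 = "5 young students arrested"
doc2 = "10 rebels were reported killed"

# Ungrouped: '|' splits the WHOLE pattern into three alternatives:
#   \d{1,5} \D{1,50}students   |   rebels\D{1,50}arrested   |   killed
ungrouped = (r'\d{1,5} \D{1,50}' + '|'.join(entities)
             + r'\D{1,50}' + '|'.join(outcomes))

# Grouped: each '|' only chooses between the words inside its parentheses.
grouped = (r'(\d{1,5})\D{1,50}(' + '|'.join(entities) + r')\D{1,50}('
           + '|'.join(outcomes) + r')')

ungrouped_hits = [re.findall(ungrouped, d) for d in (doc1, doc2)]
grouped_hits = [re.findall(grouped, d) for d in (doc1, doc2)]

print(ungrouped_hits)  # [['5 young students'], ['killed']]
print(grouped_hits)    # [[('5', 'students', 'arrested')], [('10', 'rebels', 'killed')]]
```

The ungrouped version still returns something for each sentence, which is why the bug is easy to miss: it matches a fragment that satisfies just one of the three alternatives.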

You ought to try out the regex module (a third-party package, not the built-in re)! It has built-in fuzzy matching capabilities. The other answers seem much more robust and sleek, but this could be done simply with fuzzy matching as well!

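import regex  # third-party module (pip install regex), not the standard-library re
# The outer (?:...) makes {i} apply to the whole pattern, not only to the outcomes group
pattern = r'(?:\d{1,5}(%(entities)s)(%(outcomes)s)){i}' % {'entities': '|'.join(entities), 'outcomes': '|'.join(outcomes)}
regex.match(pattern, news_sentence)  # news_sentence is one input sentence, e.g. doc1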

What's happening here is that {i} tells the regex module to accept a match with any number of inserted characters. The drawback is that characters can also be inserted inside one of the entities or outcomes and the pattern will still match. If you want to accept slight spelling variations in any of your outcomes or entities, you could instead use something like {e<=1}, which permits at most one error. Read more about approximate matching in the regex module's documentation!
