
Regex best practice: is it ok to use regex to match multiple phrases?

I have a list of short phrases, each no more than five words long, and I want to check whether any of these phrases appear in a given text.

I want to write something like this:

my_phrases = ['Hello world', 'bye world', 'something something'....]
my_regex = re.compile('|'.join(my_phrases))

my_iter = re.finditer(my_regex, text)

But I'm kind of worried that this (line 2, where the pattern is built by joining the phrases with '|') is not considered good practice. Can someone tell me if this is an OK thing to do? If not, what is the best way to match multiple phrases in a text?

I would say your approach is missing just one thing to count as good practice: handling special characters in the original list of phrases. Imagine the list is:

['oh, really?', 'definitely!', 'no, never.']

Then your regex would also match "oh, reall this is", because ? makes the preceding "y" optional. It would also match "no, neverending story", because . means "any character" (any character except a newline, by default).
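To see the problem concretely, here is a minimal sketch using the phrase list above. The names phrases and unescaped are just illustrative, not anything from the question:

```python
import re

phrases = ['oh, really?', 'definitely!', 'no, never.']

# Joined without escaping, so '?' and '.' keep their regex meaning.
unescaped = re.compile('|'.join(phrases))

# Both searches succeed even though neither text contains a listed phrase.
first = unescaped.search('oh, reall this is')
second = unescaped.search('no, neverending story')

print(first.group())   # the 'y' was treated as optional
print(second.group())  # the '.' consumed the following 'e'
```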

To make the code "best practice", you need to pass the strings through a function that escapes such special characters. Luckily, re.escape is exactly that function, so you can simply map it over all your strings:

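import re

# Escape each phrase so characters like ? and . are matched literally.
my_phrases = ['Hello world', 'bye world', 'something something']  # ... and so on
my_regex = re.compile('|'.join(map(re.escape, my_phrases)))
my_iter = my_regex.finditer(text)  # text is the string being searched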

Or (more readable):

my_phrases = ['Hello world', 'bye world', 'something something']  # ... and so on
my_phrases_escaped = map(re.escape, my_phrases)  # a lazy iterator in Python 3; join consumes it once
my_regex = re.compile('|'.join(my_phrases_escaped))
my_iter = my_regex.finditer(text)
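For a version you can run end to end, here is a small sketch. The sample text and the names escaped and found are made up for illustration. It uses the punctuation-heavy phrases from earlier to show that, once escaped, only the literal phrases match:

```python
import re

phrases = ['oh, really?', 'definitely!', 'no, never.']

# re.escape turns each phrase into a pattern that matches it literally.
escaped = re.compile('|'.join(map(re.escape, phrases)))

text = 'oh, reall this is a no, neverending story. oh, really? definitely!'

# Collect the matched substrings in the order they occur in the text.
found = [m.group() for m in escaped.finditer(text)]
print(found)  # ['oh, really?', 'definitely!']
```

The near-misses "oh, reall" and "no, nevere" are no longer reported, because ? and . are now treated as ordinary characters.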

I don't see any problem from a 'best practices' point of view. After all, the only algorithm I can think of is to try the phrases one after another until one matches, and your regex does exactly that. If anything, it may be a bit too rigid: it won't match 'Hello  world' written with two spaces instead of one. If you need that, regexes are still the way to go; you'd just write the phrases as 'Hello\\s+world' (or the raw string r'Hello\s+world') and so on.
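One way to get that flexibility without hand-writing every pattern is sketched below. The helper phrase_to_pattern is hypothetical (it is not part of the re module), and it assumes the words in each phrase are separated by whitespace: it escapes each word on its own and joins the words with \s+.

```python
import re

def phrase_to_pattern(phrase):
    """Escape each word, then allow any run of whitespace between words."""
    return r'\s+'.join(re.escape(word) for word in phrase.split())

phrases = ['Hello world', 'bye world']
flexible = re.compile('|'.join(phrase_to_pattern(p) for p in phrases))

# Three spaces in the first phrase, a newline inside the second.
text = 'Hello   world, and then bye\nworld'
found = [m.group() for m in flexible.finditer(text)]
print(found)
```

Escaping word by word matters here: calling re.escape on the whole phrase first would also escape the space, leaving nothing simple to replace with \s+.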
