
Search for like and not-like keywords in a string in Python

I have a list in which each entity has some like and not-like keywords mapped to it. For example:

Entity - batteries plus
Like keywords - %batteri%pl%
Not-like keywords - %interstate batteri%|%el toro water%|%osibatteries%

In total the list is 2000 entities long. Each entity has 3-4 like keywords on average, and only some entities have not-like keywords, 2-3 on average.

The keywords are separated by |. A % between words within a single keyword means the words need not be consecutive. A % at the beginning and end means the keyword can appear anywhere in the input string.
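To make these semantics concrete, here is a minimal sketch that translates one keyword into a regular expression. The helper names `like_to_regex` and `split_keywords` are hypothetical, and the case-insensitive matching is an assumption, not something the keyword file specifies.

```python
import re

def split_keywords(field):
    """Split a '|'-separated keyword field into individual keywords."""
    return [k.strip() for k in field.split("|") if k.strip()]

def like_to_regex(keyword):
    """Compile one LIKE-style keyword into a regex.

    Assumption: '%' is the only wildcard and stands for any run of
    characters (including none); everything else is literal text.
    """
    parts = [re.escape(part) for part in keyword.split("%")]
    # Joining the literal parts with '.*' turns each '%' into a wildcard.
    # Use .fullmatch() so that a missing leading/trailing '%' anchors the match.
    return re.compile(".*".join(parts), re.IGNORECASE | re.DOTALL)

pattern = like_to_regex("%batteri%pl%")
print(bool(pattern.fullmatch("Purchase batteri xx02 pl jacksonville Fl")))
```

Because the order of the literal parts is preserved, `%batteri%pl%` requires "batteri" to occur before "pl" in the input.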

My input string is 8 words long on average, and shorter than 10 words 99% of the time. I need to find which entities are present in the transaction. How can I do this efficiently in terms of time complexity?

EDIT

Input string sample : Purchase batteri xx02 pl jacksonville Fl

Expected Output : batteries plus

Explanation : the like keyword is present in the input string and none of the not-like keywords is, so the entity is batteries plus

I will answer with the approach that is currently in use:

Create an inverted dict whose keys are all the words in the keywords file and whose values are tuples of (like keyword, not-like keyword, entity name). For the single example in the question it would be:

{'batteri': ('%batteri%pl%', '%interstate batteri%|%el toro water%|%osibatteries%', 'batteries plus'),
 'pl': ('%batteri%pl%', '%interstate batteri%|%el toro water%|%osibatteries%', 'batteries plus')}
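A minimal sketch of how such an inverted dict could be built. The function name `build_index` and the `(entity, like_field, not_like_field)` input shape are assumptions made for illustration. It also assumes each word maps to a list of tuples rather than a single tuple, since with 2000 entities several of them may share a word.

```python
from collections import defaultdict

def build_index(rules):
    """Build the inverted dict: word -> [(like keyword, not-like field, entity)].

    rules is an iterable of (entity, like_field, not_like_field) tuples,
    where each field is a '|'-separated string of LIKE-style keywords.
    """
    index = defaultdict(list)
    for entity, like_field, not_like_field in rules:
        for keyword in like_field.split("|"):
            # Treat '%' as a word separator to recover the literal words.
            words = set(keyword.replace("%", " ").lower().split())
            for word in words:
                index[word].append((keyword, not_like_field, entity))
    return index

rules = [
    ("batteries plus",
     "%batteri%pl%",
     "%interstate batteri%|%el toro water%|%osibatteries%"),
]
index = build_index(rules)
print(sorted(index))
```

One consequence of keying on whole words: an input token such as "batteries" is not a key, even though %batteri% would match it as a substring, so the lookup only finds entities whose keyword words appear as separate tokens in the input.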

Now split the input string on spaces and look up each word in this inverted dict. For each hit, take the full like keyword and the full not-like keyword and run a complete regex search with them against the input string. If the like keyword is present and no not-like keyword is, i.e. a successful match, return the corresponding entity.
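The lookup-and-verify step might look like the sketch below. `INVERTED`, `matches_any` and `find_entities` are hypothetical names. The dict values are lists of tuples (an assumption, so that several entities can share a word), matching is case-insensitive by assumption, and the function collects every matching entity instead of stopping at the first one.

```python
import re

# Inverted dict for the single example entity; values are lists so that
# more than one entity can be registered under the same word (assumption).
NOT_LIKE = "%interstate batteri%|%el toro water%|%osibatteries%"
INVERTED = {
    "batteri": [("%batteri%pl%", NOT_LIKE, "batteries plus")],
    "pl": [("%batteri%pl%", NOT_LIKE, "batteries plus")],
}

def matches_any(field, text):
    """True if any '|'-separated LIKE keyword in field matches the whole text."""
    for keyword in field.split("|"):
        if not keyword.strip():
            continue  # an empty field means "no keywords"
        # Each '%' becomes '.*'; the remaining parts are matched literally.
        regex = ".*".join(re.escape(part) for part in keyword.split("%"))
        if re.fullmatch(regex, text, re.IGNORECASE | re.DOTALL):
            return True
    return False

def find_entities(text, inverted):
    """Return entities whose like keyword matches and no not-like keyword does."""
    found = []
    for word in text.lower().split():
        for like, not_like, entity in inverted.get(word, ()):
            if entity in found:
                continue  # already confirmed via another word
            if matches_any(like, text) and not matches_any(not_like, text):
                found.append(entity)
    return found

print(find_entities("Purchase batteri xx02 pl jacksonville Fl", INVERTED))
# prints ['batteries plus']
```

Only candidates reached through a dict hit are ever checked with a regex, so the expensive step runs on a handful of entities rather than all 2000.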

Time complexity: since the input string is almost always shorter than 10 words, this takes about 10 dict get operations. After that comes the set intersection between the set of input words (from the space split) and the like and not-like keywords, with an early exit at the first success.
