
Fastest way to search for a long list of patterns in a text

Given a "large" list of patterns and a "short" text, what is the best/fastest way to search for and tag those patterns in the text, where each pattern must be found as a substring of the text? If a pattern matches multiple times in a text, we ideally want to find all of the matches.

To be more specific, the texts are streaming queries and the patterns to look for are named entities. An entire pattern must match in full. Training a NER model to tag entities is not an option. By "large" list, I mean a few hundred thousand entities to look up. By "short" text, I mean an average of 10 words.

E.g.:

Text: the actor who plays the black widow in the avengers .

I am considering tries and FSTs, and I am trying to understand the pros and cons of both in this particular scenario. Any pointers would be appreciated.

You could take a look at the Aho-Corasick algorithm. It constructs a finite-state machine from all search patterns: basically a trie with some extra (failure) edges. It then uses this automaton to search an input string for all search patterns simultaneously. The time complexity is O(n + m + z), where n is the length of the input text, m is the total number of characters in all search patterns, and z is the number of occurrences of search patterns in your input text.

However, this time complexity assumes you build the trie for each search. Since your search patterns do not seem to change, you can build the trie once up front and keep it in memory. I think each search against the precomputed trie (finite-state machine) then costs O(n + z) going forward: O(n) to scan the text, plus the time needed to report the z matches.
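To make the build-once, search-many idea concrete, here is a minimal pure-Python sketch of an Aho-Corasick automaton. The class name `AhoCorasick`, the helper `whole_word_matches`, and the entity list are all made up for illustration; none of them come from the question. The whitespace boundary check is one possible reading of "match in full" and may need adjusting for punctuation.

```python
from collections import deque


class AhoCorasick:
    """Trie plus failure links, built once and reused for every query."""

    def __init__(self, patterns):
        self.goto = [{}]   # goto[state][char] -> next state
        self.fail = [0]    # fail[state] -> state for the longest proper suffix
        self.out = [[]]    # out[state] -> patterns that end at this state
        for pattern in patterns:
            self._insert(pattern)
        self._build_failure_links()

    def _insert(self, pattern):
        state = 0
        for ch in pattern:
            nxt = self.goto[state].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto[state][ch] = nxt
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
            state = nxt
        self.out[state].append(pattern)

    def _build_failure_links(self):
        # Breadth-first, so shallower states are finished before deeper ones.
        queue = deque(self.goto[0].values())  # depth-1 states fail to the root
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit patterns that end at the failure state.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def search(self, text):
        """Return (start_index, pattern) for every occurrence in text."""
        state = 0
        matches = []
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for pattern in self.out[state]:
                matches.append((i - len(pattern) + 1, pattern))
        return matches


def whole_word_matches(automaton, text):
    """Keep only matches delimited by whitespace or the ends of the text."""
    kept = []
    for start, pattern in automaton.search(text):
        end = start + len(pattern)
        left_ok = start == 0 or text[start - 1].isspace()
        right_ok = end == len(text) or text[end].isspace()
        if left_ok and right_ok:
            kept.append((start, pattern))
    return kept


# Hypothetical entity list; build the automaton once at startup.
entities = ["black widow", "the avengers", "avengers"]
automaton = AhoCorasick(entities)

# Reuse the same automaton for every incoming query.
query = "the actor who plays the black widow in the avengers ."
found = whole_word_matches(automaton, query)
```

With a few hundred thousand entities, dict-per-state storage in pure Python may use a lot of memory, so a compiled implementation is probably a better fit in production; the sketch is only meant to show the structure of the approach.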

