
Fastest way to search for a long list of patterns in a text

Original 2021-12-08 04:51:28 4 1 search/ substring/ information-retrieval/ trie/ fst

Given a "large" list of patterns and a "short" text, what is the best/fastest way to search for and tag those patterns in the text, where each pattern must be found as a substring of the text? If a pattern matches multiple times in a text, we ideally want to find all of the matches.

To be more specific, the texts are actually streaming queries and the patterns to look for are named entities. We need an entire pattern to match in full. Training an NER model to tag entities is not an option. By "large" list, I mean a few hundred thousand entities to look up. By "short" text, I mean an average of 10 words.

e.g.:

Text: the actor who plays the black widow in the avengers .

I am considering tries and FSTs, and I am trying to understand the pros and cons of both in this particular scenario. Any pointers would be appreciated.

1 answer

You could take a look at the Aho-Corasick algorithm. It constructs a finite state machine from all of the search patterns: essentially a trie with some extra (failure) edges. It then uses this automaton to search an input string for all search patterns simultaneously. The time complexity is O(n + m + z), where n is the length of the input text, m is the total number of characters in all search patterns, and z is the number of occurrences of search patterns in your input text.

However, this time complexity assumes you build the trie for each search. If you build the trie up front (it seems your search patterns do not change) and keep it in memory, I think you can then search each string against the precomputed trie (finite state machine) in O(n + z) going forward, i.e. linear in the length of the text plus the number of matches reported.
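
To make the build-once, search-many idea concrete, here is a minimal Python sketch of an Aho-Corasick automaton. It is only an illustration under stated assumptions, not a tuned implementation: the class name `AhoCorasick`, the method `find_all`, and the sample entity list are all hypothetical and do not come from the question. It matches at the character level, so it reports every substring occurrence.

```python
from collections import deque


class AhoCorasick:
    """Illustrative Aho-Corasick automaton (hypothetical names, not tuned for speed)."""

    def __init__(self, patterns):
        # Empty patterns would match at every position, so they are skipped.
        self.patterns = [p for p in patterns if p]
        self.goto = [{}]   # goto[state][char] -> next state (the trie edges)
        self.fail = [0]    # fail[state] -> state for the longest proper suffix
        self.out = [[]]    # out[state] -> indices of patterns ending at state
        for index, pattern in enumerate(self.patterns):
            self._insert(pattern, index)
        self._link()

    def _insert(self, pattern, index):
        # Plain trie insertion, one state per distinct prefix.
        state = 0
        for ch in pattern:
            nxt = self.goto[state].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto[state][ch] = nxt
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
            state = nxt
        self.out[state].append(index)

    def _link(self):
        # Breadth-first pass that adds the "extra edges" (failure links).
        queue = deque(self.goto[0].values())  # depth-1 states fail to the root
        while queue:
            state = queue.popleft()
            for ch, child in self.goto[state].items():
                queue.append(child)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit patterns that end at the suffix state.
                self.out[child] = self.out[child] + self.out[self.fail[child]]

    def find_all(self, text):
        """Return (start_offset, pattern) for every occurrence in text."""
        matches = []
        state = 0
        for end, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for index in self.out[state]:
                pattern = self.patterns[index]
                matches.append((end - len(pattern) + 1, pattern))
        return matches


# Build once (assumed sample entities), then reuse for every incoming query.
entities = ["black widow", "the avengers", "avengers", "widow"]
automaton = AhoCorasick(entities)
query = "the actor who plays the black widow in the avengers ."
hits = automaton.find_all(query)
```

Because the automaton works on characters, a pattern such as "art" would also be reported inside "part". If entities must align with whole words, one option is to check that each hit starts and ends at a word boundary, or to build the automaton over tokens instead of characters.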


