
Fastest way to check over 1500 regex patterns against the same string

I have over 1500 regular expression patterns that need to be run against the same 100-200 KB text files, returning the list of patterns that matched. The files come from outside, so I can't make any assumptions about their content.

The question is: can I somehow make processing faster than running all of these regexes against the same text?

Logically, the input file is the same each time, so later regexes could reuse information that has already been processed. If we treat each regex as a finite automaton, then running 1500 finite automata over the same text is surely slower than running one joined automaton. So the question is: can I somehow create that joined regex?

This is a perfect opportunity to take advantage of threading. Read the file to be processed into a string, then spin up a pool of consumer threads. Have your main thread put each regular expression into a queue, then have each consumer take the next item off the queue, compile the regex, and run it on the string. Because memory is shared, several expressions can run on the same string at once, and even on a weak computer (2 cores, no hyperthreading) you should notice a speed boost if you keep your consumer pool to a reasonable size. On a really big server, say 32 cores with hyperthreading, you can run a much larger pool and get through those regular expressions far faster.
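A minimal sketch of that worker-pool idea in Python, assuming the patterns are plain strings and only "did it match anywhere" matters. The function name `matching_patterns` and the sample patterns are made up for illustration. One caveat worth stating: in CPython the global interpreter lock generally keeps pure `re` matching from running in parallel across threads, so the same structure with `concurrent.futures.ProcessPoolExecutor` may be needed to see a real speedup there.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def matching_patterns(patterns, text, workers=4):
    """Return the patterns (in input order) that match somewhere in text."""

    def check(pattern):
        # Each worker compiles one pattern and searches the shared string.
        return re.compile(pattern).search(text) is not None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The executor queues the patterns internally; idle workers pull
        # the next one. map() returns results in input order.
        hits = list(pool.map(check, patterns))
    return [p for p, hit in zip(patterns, hits) if hit]


# Hypothetical sample data, not from the question.
patterns = [r"\bfoo\d+", r"^bar", r"baz$", r"qu+x"]
text = "foo42 and quux"
print(matching_patterns(patterns, text))
```

The text is read once and shared by every worker; only the pattern differs per task, which matches the queue-of-regexes layout described above.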

I think it's possible in theory, but it seems like a non-trivial task. A possible approach could be:

  1. Convert all regexes to finite state machines.
  2. Combine these into a single FSM.
  3. Optimize the generated states.

Optimization would be a key step since the inputs are lengthy (100-200 KB). Without it, memory could become a concern and performance could actually get worse. I don't know if a library exists for this purpose, but here's a theoretical answer.
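As a rough illustration of step 2 in the list above (separate from the linked answer), here is a sketch of a lazily built product automaton in Python. It rests on several assumptions: the component DFAs are written by hand rather than compiled from regexes (so step 1 is not shown), no minimization is done (so step 3 is not shown), and `DFA`, `CombinedDFA`, `contains_pair` and `contains_digit` are hypothetical names invented for this example. A combined state is simply the tuple of the component states, and transitions are memoized the first time they are seen.

```python
from dataclasses import dataclass


@dataclass
class DFA:
    start: int
    accepting: frozenset
    delta: dict    # (state, char) -> next state
    default: dict  # state -> next state for any other char

    def step(self, state, ch):
        return self.delta.get((state, ch), self.default[state])


def contains_pair(x, y):
    """DFA accepting texts that contain x immediately followed by y (x != y)."""
    return DFA(0, frozenset({2}),
               {(0, x): 1, (1, x): 1, (1, y): 2},
               {0: 0, 1: 0, 2: 2})


def contains_digit():
    """DFA accepting texts that contain at least one ASCII digit."""
    return DFA(0, frozenset({1}),
               {(0, d): 1 for d in "0123456789"},
               {0: 0, 1: 1})


class CombinedDFA:
    """Product automaton over several DFAs, built lazily while scanning."""

    def __init__(self, dfas):
        self.dfas = dfas
        self.start = tuple(d.start for d in dfas)
        self.cache = {}  # (product state, char) -> product state

    def step(self, state, ch):
        key = (state, ch)
        nxt = self.cache.get(key)
        if nxt is None:
            # First visit: advance every component once and remember it.
            nxt = tuple(d.step(s, ch) for d, s in zip(self.dfas, state))
            self.cache[key] = nxt
        return nxt

    def run(self, text):
        """Scan text once; return indexes of the component DFAs that accept."""
        state = self.start
        for ch in text:
            state = self.step(state, ch)
        return [i for i, (d, s) in enumerate(zip(self.dfas, state))
                if s in d.accepting]


combined = CombinedDFA([contains_pair("a", "b"),
                        contains_pair("b", "a"),
                        contains_digit()])
print(combined.run("xxaab"))
print(combined.run("ba7"))
```

Once a transition is cached, each input character costs one dictionary lookup regardless of how many patterns were combined. The catch is the one raised above: the number of distinct product states can grow exponentially with the number of patterns in the worst case, so memory use has to be watched.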


 