
Optimizing 1000 regex searches across 10,000 files in C#

Given a file that contains around 1000 individual regex searches that must be applied across roughly 10,000 other files, I'm looking for a nice way to make this run in less than a day. No search-and-replace is going on, just straight match checking. I can manually go through and combine many of these, but I'm wondering if there is an automated tool to do this and perhaps even get it down to a single search.

Also wondering if this hasn't been asked 1000 times already, but my google-fu is failing me here.

Edit: Using an external tool is a possibility, but this has to be run through a script where those 1000 or so searches are passed through and all the results are prettied up into a nice report. I have that side of things done in C#, but it is painfully slow, hence the combine idea or other ways of making it go faster. BTW, I do have it threaded already as well.

First place for improvement is to make sure your loops are nested in the right order.

Loop over the files, and for each file, try all 1000 patterns before moving to the next file. That way you only have to open each file once, not 1000 times.
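A minimal sketch of that loop order, assuming the patterns sit one per line in a hypothetical patterns.txt and the files live under a data directory (both names are made up here):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Text.RegularExpressions;

    class SearchRunner
    {
        static void Main()
        {
            var patterns = File.ReadAllLines("patterns.txt");
            var files = Directory.EnumerateFiles("data", "*", SearchOption.AllDirectories);

            // Compile each pattern once, up front -- not once per file.
            var regexes = new List<Regex>();
            foreach (var p in patterns)
                regexes.Add(new Regex(p, RegexOptions.Compiled));

            // Outer loop over files, inner loop over patterns:
            // each file is read from disk exactly once.
            foreach (var file in files)
            {
                string text = File.ReadAllText(file);
                foreach (var re in regexes)
                {
                    if (re.IsMatch(text))
                        Console.WriteLine($"{file}: {re}");
                }
            }
        }
    }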

Second idea is to use a parser generator like flex+yacc+bison to precompile a single DFA (deterministic finite automaton) that covers all the patterns. In the same way that matching against an entire dictionary can use a trie, matching against a list of patterns can generally be done with a state machine using far less computation than matching each pattern separately (basically: where and how one pattern fails to match carries information about which patterns might still fit in that region).
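As a rough illustration of the combine-everything idea in C# itself, here's a sketch that joins a few made-up patterns with alternation into one regex. RegexOptions.NonBacktracking is an assumption that only holds on .NET 7 or newer (it uses an automaton-based engine, close in spirit to the DFA idea); on older runtimes, drop that flag and use RegexOptions.Compiled. Note the combined pattern only says that *some* pattern matched, not which one:

    using System;
    using System.Text.RegularExpressions;

    class CombinedMatcher
    {
        static void Main()
        {
            string[] patterns = { @"foo\d+", @"bar(baz)?", @"qu+x" };

            // Wrap each pattern in a non-capturing group and join with
            // alternation, so one scan of the text tests all of them.
            string combined = "(?:" + string.Join(")|(?:", patterns) + ")";

            // .NET 7+ only; the automaton engine also rejects constructs
            // like backreferences and lookarounds.
            var re = new Regex(combined, RegexOptions.NonBacktracking);

            Console.WriteLine(re.IsMatch("see foo123 here")); // True
        }
    }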

Found something interesting:

Here is a Regex combiner that runs a Perl script to take all the strings and put them into the One String to rule them all.

Of course, the string output from my source strings was roughly 7000 characters long, and that caused C#'s Regex implementation to blow up spectacularly after roughly 1024 characters - because 1024 characters is all any sane person would need. 2^10 is very magical here.

So who knows if it ran faster after combining, because the combination failed too :)

Edit: Hmm, changed a couple of minor things within the search string (removed \\ on quotes) and it runs. Now, time to see if the results of the One String match the results of the 1000 strings. It does appear to run about 10x faster though, which is very nice!
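A quick sanity check along those lines might look like this sketch; the names and the two tiny patterns are illustrative, and the real check would feed both engines the same 10,000 files:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    class CombineCheck
    {
        // True when the combined pattern agrees with the individual
        // patterns on whether `text` contains a match at all.
        static bool SameVerdict(string text, IEnumerable<Regex> individual, Regex combined)
        {
            bool anyIndividual = individual.Any(r => r.IsMatch(text));
            return anyIndividual == combined.IsMatch(text);
        }

        static void Main()
        {
            var individual = new List<Regex> { new Regex(@"foo\d+"), new Regex(@"bar") };
            var combined = new Regex(@"(?:foo\d+)|(?:bar)");

            Console.WriteLine(SameVerdict("foo42", individual, combined));   // True
            Console.WriteLine(SameVerdict("nothing", individual, combined)); // True
        }
    }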

I ended up taking a big step back, realizing that 99% of the searches could be easily converted into straight string searches with just a little bit of work, and then using a simple hashset lookup per word. The remaining searches that genuinely needed a regex I left alone, and it ran plenty fast. KISS.
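A sketch of that final approach, with a made-up literals set and file name; the point is that a set lookup costs the same whether the set holds 3 terms or 990:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Text.RegularExpressions;

    class LiteralSearch
    {
        static void Main()
        {
            // Hypothetical set of the ~99% of searches that turned out
            // to be plain literal words.
            var literals = new HashSet<string>(
                new[] { "ERROR", "deprecated", "TODO" },
                StringComparer.Ordinal);

            string text = File.ReadAllText("example.txt");

            // One pass: split the file into words and probe the set.
            // Each lookup is O(1), regardless of how many terms there are.
            foreach (var word in Regex.Split(text, @"\W+"))
            {
                if (literals.Contains(word))
                    Console.WriteLine($"hit: {word}");
            }
        }
    }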
