Fastest way to check if a string matches a regexp in ruby?

What is the fastest way to check if a string matches a regular expression in Ruby?

My problem is that I have to "egrep" through a huge list of strings to find the ones that match a regexp that is given at runtime. I only care about whether the string matches the regexp, not where it matches, nor what the content of the matching groups is. I hope this assumption can be used to reduce the amount of time my code spends matching regexps.

I load the regexp with:

pattern = Regexp.new(ptx).freeze

I have found that string =~ pattern is slightly faster than string.match(pattern).

Are there other tricks or shortcuts that can be used to make this test even faster?

Starting with Ruby 2.4.0, you may use Regexp#match?:

pattern.match?(string)

Regexp#match? is explicitly listed as a performance enhancement in the release notes for 2.4.0, as it avoids object allocations performed by other methods such as Regexp#match and =~:

Regexp#match?
Added Regexp#match?, which executes a regexp match without creating a back reference object and changing $~ to reduce object allocation.
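To make the quoted note concrete, here is a minimal sketch showing that =~ populates $~ while match? leaves it alone. The variable names are only illustrative, and it assumes Ruby 2.4 or later.

```ruby
pattern = /(\d+)/

# =~ sets the special variable $~ (a MatchData object) on success.
"test123" =~ pattern
after_match_operator = $~

# A failed =~ resets $~ to nil.
"no digits" =~ pattern
after_failed_match = $~

# match? only returns true or false and does not touch $~.
result = pattern.match?("test123")
after_match_predicate = $~

puts after_match_operator.inspect
puts after_failed_match.inspect
puts result
puts after_match_predicate.inspect
```

Because no MatchData is built, match? is only suitable when you do not need captures or match positions afterwards.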

This is a simple benchmark:

require 'benchmark'

"test123" =~ /1/
=> 4
Benchmark.measure{ 1000000.times { "test123" =~ /1/ } }
=>   0.610000   0.000000   0.610000 (  0.578133)

"test123"[/1/]
=> "1"
Benchmark.measure{ 1000000.times { "test123"[/1/] } }
=>   0.718000   0.000000   0.718000 (  0.750010)

"test123".match(/1/)
=> #<MatchData "1">
Benchmark.measure{ 1000000.times { "test123".match(/1/) } }
=>   1.703000   0.000000   1.703000 (  1.578146)

So =~ is faster, but it depends on what you want as the return value. If you just want to check whether the text matches a regex, use =~.

This is the benchmark I ran after finding some articles around the net.

With 2.4.0 the winner is re.match?(str) (as suggested by @wiktor-stribiżew); on previous versions, re =~ str seems to be fastest, although str =~ re is almost as fast.

#!/usr/bin/env ruby
require 'benchmark'

str = "aacaabc"
re = Regexp.new('a+b').freeze

N = 4_000_000

Benchmark.bm do |b|
    b.report("str.match re\t") { N.times { str.match re } }
    b.report("str =~ re\t")    { N.times { str =~ re } }
    b.report("str[re]  \t")    { N.times { str[re] } }
    b.report("re =~ str\t")    { N.times { re =~ str } }
    b.report("re.match str\t") { N.times { re.match str } }
    if re.respond_to?(:match?)
        b.report("re.match? str\t") { N.times { re.match? str } }
    end
end

Results MRI 1.9.3-p551:

$ ./bench-re.rb  | sort -t $'\t' -k 2
       user     system      total        real
re =~ str         2.390000   0.000000   2.390000 (  2.397331)
str =~ re         2.450000   0.000000   2.450000 (  2.446893)
str[re]           2.940000   0.010000   2.950000 (  2.941666)
re.match str      3.620000   0.000000   3.620000 (  3.619922)
str.match re      4.180000   0.000000   4.180000 (  4.180083)

Results MRI 2.1.5:

$ ./bench-re.rb  | sort -t $'\t' -k 2
       user     system      total        real
re =~ str         1.150000   0.000000   1.150000 (  1.144880)
str =~ re         1.160000   0.000000   1.160000 (  1.150691)
str[re]           1.330000   0.000000   1.330000 (  1.337064)
re.match str      2.250000   0.000000   2.250000 (  2.255142)
str.match re      2.270000   0.000000   2.270000 (  2.270948)

Results MRI 2.3.3 (there seems to be a regression in regex matching):

$ ./bench-re.rb  | sort -t $'\t' -k 2
       user     system      total        real
re =~ str         3.540000   0.000000   3.540000 (  3.535881)
str =~ re         3.560000   0.000000   3.560000 (  3.560657)
str[re]           4.300000   0.000000   4.300000 (  4.299403)
re.match str      5.210000   0.010000   5.220000 (  5.213041)
str.match re      6.000000   0.000000   6.000000 (  6.000465)

Results MRI 2.4.0:

$ ./bench-re.rb  | sort -t $'\t' -k 2
       user     system      total        real
re.match? str     0.690000   0.010000   0.700000 (  0.682934)
re =~ str         1.040000   0.000000   1.040000 (  1.035863)
str =~ re         1.040000   0.000000   1.040000 (  1.042963)
str[re]           1.340000   0.000000   1.340000 (  1.339704)
re.match str      2.040000   0.000000   2.040000 (  2.046464)
str.match re      2.180000   0.000000   2.180000 (  2.174691)
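Applied to the original "egrep" problem, the results above suggest something like the following sketch. The helper name egrep and the sample data are hypothetical; the respond_to? check mirrors the one in the benchmark script so the same code runs on Rubies older than 2.4.

```ruby
# Hypothetical helper: returns the strings that match a pattern supplied at runtime.
def egrep(strings, pattern_source)
  re = Regexp.new(pattern_source)   # compile once, outside the loop
  if re.respond_to?(:match?)
    # Ruby >= 2.4: boolean test without allocating MatchData
    strings.select { |s| re.match?(s) }
  else
    # Older Rubies: re =~ str was the fastest option in the results above
    strings.select { |s| re =~ s }
  end
end

lines   = ["aacaabc", "xyz", "ab", "ba"]
matches = egrep(lines, 'a+b')
puts matches.inspect
```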

What about re === str (case compare)?

Since it evaluates to true or false and has no need to store matches or return a match index, I wonder if it would be an even faster way of matching than =~.


OK, I tested this. =~ is still faster than ===, even if you have multiple capture groups; however, === is faster than the other options.
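For reference, a small sketch of what re === str returns. The method name classify and the sample patterns are only illustrative. Note that, unlike match?, Regexp#=== still sets $~, which is what lets captures be used inside a case/when branch.

```ruby
re = /(\d+)/

hit      = re === "test123"   # true
captured = $~[1]              # === still sets $~, so captures are available
miss     = re === "test"      # false
not_str  = re === nil         # false, and no exception for a non-string

# case/when uses === under the hood.
def classify(s)
  case s
  when /\A\d+\z/    then :number
  when /\A[a-z]+\z/ then :word
  else :other
  end
end

puts hit, captured, miss, not_str
puts classify("123"), classify("abc"), classify("a1")
```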

BTW, what good is freeze? I couldn't measure any performance boost from it.

Depending on how complicated your regular expression is, you could possibly just use simple string slicing. I'm not sure about the practicality of this for your application or whether or not it would actually offer any speed improvements.

'testsentence'['stsen']
=> 'stsen' # evaluates to true
'testsentence'['koala']
=> nil # evaluates to false
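If the runtime pattern turns out to be a fixed substring, String#include?, start_with? and end_with? give a plain boolean without slicing out a new string. This is an alternative that is not part of the answer above, shown as a sketch; whether it is actually faster for your data would need measuring.

```ruby
text = 'testsentence'

contains = text.include?('stsen')      # substring test, returns true/false
missing  = text.include?('koala')
prefix   = text.start_with?('test')    # fixed prefix check
suffix   = text.end_with?('tence')     # fixed suffix check

puts contains, missing, prefix, suffix
```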

What I am wondering is whether there is any strange way to make this check even faster, maybe by exploiting some strange method in Regexp or some weird construct.

Regexp engines vary in how they implement searches, but, in general, anchor your patterns for speed, and avoid greedy matches, especially when searching long strings.

The best thing to do, until you're familiar with how a particular engine works, is to do benchmarks and add/remove anchors, try limiting searches, use wildcards vs. explicit matches, etc.
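A rough sketch of such an experiment follows, comparing an unanchored pattern with a \A-anchored one on a long string. The haystack and iteration count are made up, no timings are claimed, and match? assumes Ruby 2.4 or later. Keep in mind that the two patterns are not equivalent: anchoring only applies when the match really must start at the beginning of the string.

```ruby
require 'benchmark'

# Long input where the interesting text sits at the very end.
haystack   = ("x" * 10_000) + "CVE-2018-1589"
unanchored = /CVE-[0-9]{4}-[0-9]{4,}/
anchored   = /\ACVE-[0-9]{4}-[0-9]{4,}/   # \A anchors to the start of the string

n = 10_000
Benchmark.bm(12) do |b|
  b.report("unanchored") { n.times { unanchored.match?(haystack) } }
  b.report("anchored")   { n.times { anchored.match?(haystack) } }
end

# The results differ as well as the timings:
found_anywhere = unanchored.match?(haystack)
found_at_start = anchored.match?(haystack)
puts found_anywhere, found_at_start
```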

The Fruity gem is very useful for quickly benchmarking things, because it's smart. Ruby's built-in Benchmark code is also useful, though if you are not careful you can write tests that fool you.

I've used both in many answers here on Stack Overflow, so you can search through my answers and will see lots of little tricks and results to give you ideas of how to write faster code.

The biggest thing to remember is, it's bad to prematurely optimize your code before you know where the slowdowns occur.
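One way a benchmark can fool you, sketched here with the built-in Benchmark module, is measuring regexp compilation instead of matching. The labels and sizes are arbitrary; Benchmark.bmbm runs a rehearsal pass before the measured pass, which can help reduce warm-up and GC noise.

```ruby
require 'benchmark'

source   = 'a+b'                 # pattern text, as if supplied at runtime
str      = "aacaabc"
n        = 100_000
compiled = Regexp.new(source)    # compiled once, outside the timed block

results = Benchmark.bmbm do |b|
  # Times compilation plus matching on every iteration.
  b.report("compile each time") { n.times { Regexp.new(source).match?(str) } }
  # Times matching only.
  b.report("compile once")      { n.times { compiled.match?(str) } }
end
```

If the first report is the one you accidentally wrote, the numbers say little about the cost of the match itself.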

To complement Wiktor Stribiżew's and Dougui's answers, I would say that /regex/.match?("string") is about as fast as "string".match?(/regex/).

Ruby 2.4.0 (10 000 000 iterations, ~2 sec)

2.4.0 > require 'benchmark'
 => true 
2.4.0 > Benchmark.measure{ 10000000.times { /^CVE-[0-9]{4}-[0-9]{4,}$/.match?("CVE-2018-1589") } }
 => #<Benchmark::Tms:0x005563da1b1c80 @label="", @real=2.2060338060000504, @cstime=0.0, @cutime=0.0, @stime=0.04000000000000001, @utime=2.17, @total=2.21> 
2.4.0 > Benchmark.measure{ 10000000.times { "CVE-2018-1589".match?(/^CVE-[0-9]{4}-[0-9]{4,}$/) } }
 => #<Benchmark::Tms:0x005563da139eb0 @label="", @real=2.260814556000696, @cstime=0.0, @cutime=0.0, @stime=0.010000000000000009, @utime=2.2500000000000004, @total=2.2600000000000007> 

Ruby 2.6.2 (100 000 000 iterations, ~20 sec)

irb(main):001:0> require 'benchmark'
=> true
irb(main):005:0> Benchmark.measure{ 100000000.times { /^CVE-[0-9]{4}-[0-9]{4,}$/.match?("CVE-2018-1589") } }
=> #<Benchmark::Tms:0x0000562bc83e3768 @label="", @real=24.60139879199778, @cstime=0.0, @cutime=0.0, @stime=0.010000999999999996, @utime=24.565644999999996, @total=24.575645999999995>
irb(main):004:0> Benchmark.measure{ 100000000.times { "CVE-2018-1589".match?(/^CVE-[0-9]{4}-[0-9]{4,}$/) } }
=> #<Benchmark::Tms:0x0000562bc846aee8 @label="", @real=24.634255946999474, @cstime=0.0, @cutime=0.0, @stime=0.010046, @utime=24.598276, @total=24.608321999999998>

Note: times vary; sometimes /regex/.match?("string") is faster and sometimes "string".match?(/regex/) is. The differences may be due only to machine activity.
