Ruby string search: which is faster, split or regex?

Question

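This is a two-part question. Given an array of strings that can each be split at a character (e.g. email addresses at '@' or file names at '.'), which is the most performant way of getting the characters before the split character?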

my_string.split(char)[0]

or

my_string[/regex/]

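The second part of the question is how to write a regex that gets everything before the first instance of a character. The regex below finds certain characters before a '.' (because '.' is not in the pattern), but that was my hacky way of getting to a solution.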

my_string[/[A-Za-z0-9\_-]+/]

Thanks!

Answer 1

The easiest way to answer the first part is, as always, to benchmark it with your real data. For example:

require 'benchmark'
Benchmark.bm do |x|
  x.report { 50000.times { a = 'a@b.c'.split('@')[0] } }
  x.report { 50000.times { a = 'a@b.c'[/[^@]+/] } }
end

says (on my setup):

      user     system      total        real
  0.130000   0.010000   0.140000 (  0.130946)
  0.090000   0.000000   0.090000 (  0.096260)

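So the regex solution looks a little faster, but the difference is barely noticeable even with 50,000 iterations. OTOH, the regex solution says exactly what you mean ("give me everything before the first @"), whereas the split solution gets your desired result in a slightly roundabout way.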

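The split approach is probably slower because it has to scan the whole string to split it into pieces, then build an array of those pieces, and finally extract the first element of the array and throw the rest away. I don't know whether the VM is clever enough to recognize that it doesn't need to build the array, so that is just a bit of quick guesswork.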
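As an aside that is not part of the benchmark above: `String#split` accepts an optional limit argument, which caps the number of pieces returned. The sketch below uses a made-up input with two '@' signs to show the difference. Whether the limit actually makes split faster on your data is an assumption you would need to benchmark.

```ruby
# Hypothetical input with two '@' signs, chosen only for illustration.
address = 'user@example.com@extra'

all_parts = address.split('@')     # splits at every '@'
two_parts = address.split('@', 2)  # at most two pieces: splits at the first '@' only

p all_parts     # ["user", "example.com", "extra"]
p two_parts     # ["user", "example.com@extra"]
p two_parts[0]  # "user"
```

Either way, the first element is the text before the first '@'.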

As for your second question, say what you mean:

my_string[/[^.]+/]

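If you want everything before the first period, then say "everything up to a period" rather than "the first chunk made up of these characters (which happen not to contain a period)".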
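A short sketch of that pattern on a few made-up file names (the names and variables are hypothetical and only there for illustration):

```ruby
# Hypothetical sample file names.
names = ['report.final.pdf', 'archive.tar.gz', 'README']

# /[^.]+/ matches the first run of characters that are not a period.
prefixes = names.map { |name| name[/[^.]+/] }
p prefixes  # ["report", "archive", "README"]

# Edge case: when the string starts with a period, the unanchored
# pattern skips past it and matches the text after the period.
unanchored = '.bashrc'[/[^.]+/]
anchored   = '.bashrc'[/\A[^.]*/]  # anchored at the start, allows an empty match
p unanchored  # "bashrc"
p anchored    # ""
```

If your strings can begin with the separator, the anchored form is probably closer to "everything before the first period".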

Answer 2

partition will be faster than split, since it doesn't keep checking after the first match.

A regular slice with index will be faster than a regexp slice.

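The regexp slice also slows down significantly as the part of the string before the match gets longer. It becomes slower than the original split after about 10 characters, and gets worse from there. If you have a regexp without a + or * match, I think it does a little better.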
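Before comparing speed, it may help to check that the four approaches agree on the result. The sketch below wraps each one in a helper (the helper names are made up for illustration). The nil guard in the index version is an addition that the benchmark code does not have: `index` returns nil when '@' is absent, and passing nil as a slice length raises a TypeError.

```ruby
# Hypothetical helpers; each returns the text before the first '@'.
def before_split(s)
  s.split('@')[0]
end

def before_partition(s)
  s.partition('@').first
end

def before_regex(s)
  s[/[^@]+/]
end

def before_index(s)
  i = s.index('@')
  i ? s[0, i] : s  # guard added here: index returns nil when '@' is absent
end

helpers = [:before_split, :before_partition, :before_regex, :before_index]

with_at    = helpers.map { |h| send(h, 'some_name@regulardomain.com') }
without_at = helpers.map { |h| send(h, 'no_at_sign') }
leading_at = helpers.map { |h| send(h, '@domain') }

p with_at     # all four return "some_name"
p without_at  # all four return "no_at_sign"
p leading_at  # the regexp version returns "domain"; the others return ""
```

So the variants are interchangeable only when the string does not start with the separator.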

require 'benchmark'
n=1000000

def bench(n, email)
  printf "\n%s %s times\n", email, n
  Benchmark.bm do |x|
    x.report('split    ') { n.times { email.split('@')[0] } }
    x.report('partition') { n.times { email.partition('@').first } }
    x.report('slice reg') { n.times { email[/[^@]+/] } }
    x.report('slice ind') { n.times { email[0, email.index('@')] } }
  end
end


bench n, 'a@be.pl'
bench n, 'some_name@regulardomain.com'
bench n, 'some_really_long_long_email_name@regulardomain.com'
bench n, 'some_name@rediculously-extra-long-silly-domain.com'
bench n, 'some_really_long_long_email_name@rediculously-extra-long-silly-domain.com'
bench n, 'a'*254 + '@' + 'b'*253    # rfc limits
bench n, 'a'*1000 + '@' + 'b'*1000  # for other string processing

Results on 1.9.3p484:

a@be.pl 1000000 times
       user     system      total        real
split      0.405000   0.000000   0.405000 (  0.410023)
partition  0.375000   0.000000   0.375000 (  0.368021)
slice reg  0.359000   0.000000   0.359000 (  0.357020)
slice ind  0.312000   0.000000   0.312000 (  0.309018)

some_name@regulardomain.com 1000000 times
       user     system      total        real
split      0.421000   0.000000   0.421000 (  0.432025)
partition  0.374000   0.000000   0.374000 (  0.379021)
slice reg  0.421000   0.000000   0.421000 (  0.411024)
slice ind  0.312000   0.000000   0.312000 (  0.315018)

some_really_long_long_email_name@regulardomain.com 1000000 times
       user     system      total        real
split      0.593000   0.000000   0.593000 (  0.589034)
partition  0.531000   0.000000   0.531000 (  0.529030)
slice reg  0.764000   0.000000   0.764000 (  0.771044)
slice ind  0.484000   0.000000   0.484000 (  0.478027)

some_name@rediculously-extra-long-silly-domain.com 1000000 times
       user     system      total        real
split      0.483000   0.000000   0.483000 (  0.481028)
partition  0.390000   0.016000   0.406000 (  0.404023)
slice reg  0.406000   0.000000   0.406000 (  0.411024)
slice ind  0.312000   0.000000   0.312000 (  0.344020)

some_really_long_long_email_name@rediculously-extra-long-silly-domain.com 1000000 times
       user     system      total        real
split      0.639000   0.000000   0.639000 (  0.646037)
partition  0.609000   0.000000   0.609000 (  0.596034)
slice reg  0.764000   0.000000   0.764000 (  0.773044)
slice ind  0.499000   0.000000   0.499000 (  0.491028)

a<254>@b<253> 1000000 times
       user     system      total        real
split      0.952000   0.000000   0.952000 (  0.960055)
partition  0.733000   0.000000   0.733000 (  0.731042)
slice reg  3.432000   0.000000   3.432000 (  3.429196)
slice ind  0.624000   0.000000   0.624000 (  0.625036)

a<1000>@b<1000> 1000000 times
       user     system      total        real
split      1.888000   0.000000   1.888000 (  1.892108)
partition  1.170000   0.016000   1.186000 (  1.188068)
slice reg 12.885000   0.000000  12.885000 ( 12.914739)
slice ind  1.108000   0.000000   1.108000 (  1.097063)

2.1.3p242 shows about the same percentage differences but is roughly 10-30% faster at everything, except the regexp slice, where it slows down even more.
