Fastest way to find whether an array contains any member of another array?

Question

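I am trying to detect whether a file that I read in as a string is:

Text (some kind of single-byte encoding).
A multi-byte encoding, binary, etc.

I have a "blacklist" array of characters/bytes that do not normally occur in "text":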

bad_bytes = [0, 1, 2, 3, 4, 5, 6, 11, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 127]

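and my_bytes = File.binread('some_file').bytes.

I can think of:

(my_bytes & bad_bytes).empty? and
my_bytes == (my_bytes - bad_bytes)

Both produce the correct result, and my gut feeling is that the latter is probably a little faster. Or perhaps they are exactly equivalent? Either way, both seem very inefficient for my purpose. I don't need the complete intersection, nor do I need every instance of the second array's members removed from the first; finding a single common element is enough.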

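Am I missing some existing method that already does this? Is there a faster technique? If not, which of the two above is faster? Or am I going about this all wrong?

Also, for bonus points: is there a math / computer science / fancy term for what I am trying to do here?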
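The two candidate expressions from the question can be checked side by side on a small input. This is only a sketch: the helper names and the shortened blacklist are made up for illustration, and it says nothing about speed. It shows that both expressions agree while each builds a whole new array before answering.

```ruby
# Hypothetical helper names, for illustration only.
def clean_by_intersection?(bytes, bad)
  # Builds the full intersection, then checks that it is empty.
  (bytes & bad).empty?
end

def clean_by_difference?(bytes, bad)
  # Builds a copy without the bad bytes, then compares it to the original.
  bytes == (bytes - bad)
end

bad   = [0, 1, 2, 127]     # shortened blacklist (assumption for the example)
clean = "hello".bytes
dirty = "he\x00llo".bytes

p [clean_by_intersection?(clean, bad), clean_by_difference?(clean, bad)]
p [clean_by_intersection?(dirty, bad), clean_by_difference?(dirty, bad)]
```

Both helpers return true for the clean input and false for the input that contains a zero byte.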

Answer 1

You can avoid the conversion to a byte array by using a regular expression and String#[]:

bad_bytes_pattern = /[#{ Regexp.escape(bad_bytes.map(&:chr).join) }]/n
#=> /[\x00\x01\x02\x03\x04\x05\x06\v\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1C\x1D\x1E\x1F\x7F]/

str = File.binread('some_file')

if str[bad_bytes_pattern]
  # contains bad bytes
else
  # ...
end

You can simplify the regular expression by using character ranges:

bad_bytes_pattern = /[\x00-\x06\x0B\x0E-\x1A\x1C-\x1F\x7F]/n
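As a sanity check, the range-based pattern can be compared against the original blacklist by trying every possible byte value. The helper name below is hypothetical; the pattern and the list are the ones from this thread.

```ruby
# Hypothetical helper: list every byte value (0..255) that +pattern+ matches.
def bytes_matched_by(pattern)
  # [b].pack('C') builds a one-byte binary (ASCII-8BIT) string.
  (0..255).select { |b| [b].pack('C').match?(pattern) }
end

bad_bytes = [0, 1, 2, 3, 4, 5, 6, 11, 14, 15, 16, 17, 18, 19, 20, 21, 22,
             23, 24, 25, 26, 28, 29, 30, 31, 127]
range_pattern = /[\x00-\x06\x0B\x0E-\x1A\x1C-\x1F\x7F]/n

# true when the ranges cover exactly the blacklisted bytes.
puts bytes_matched_by(range_pattern) == bad_bytes
```

If the ranges are ever edited, a check like this catches a byte that was dropped or added by accident.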

Answer 2

Couldn't you use none? to check whether any of the bytes appears in the bad_bytes list:

my_bytes.none? { |b| bad_bytes.include? b }

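The advantage is that the loop stops as soon as the first byte matches the predicate, rather than going through the whole array.

You can optimize this further by making bad_bytes a Set: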

require 'set'

bad_bytes = Set[0, 1, 2, 3, 4, 5, 6, 11, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 127]
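The following is a minimal sketch of the none? approach combined with a Set, with a counter added to make the early exit visible. The helper name text_like? and the sample inputs are assumptions for illustration.

```ruby
require 'set'

# Hypothetical helper: true if none of +bytes+ is in +bad_set+.
def text_like?(bytes, bad_set)
  bytes.none? { |b| bad_set.include?(b) }
end

bad_set = Set[0, 1, 2, 3, 4, 5, 6, 11, 14, 15, 16, 17, 18, 19, 20, 21, 22,
              23, 24, 25, 26, 28, 29, 30, 31, 127]

puts text_like?("plain text\n".bytes, bad_set)   # no blacklisted byte
puts text_like?("bad\x00byte".bytes, bad_set)    # contains a zero byte

# Count how many elements none? inspects before it gives up.
checked = 0
[65, 0, 66, 67].none? { |b| checked += 1; bad_set.include?(b) }
puts checked   # 2: iteration stops at the zero byte
```

Set#include? is a hash lookup, so the cost per byte does not grow with the size of the blacklist, whereas Array#include? scans the list.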

Answer 3

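Gentlemen, start your engines!

Below is a benchmark comparison of the three answers given so far. My main reason for doing this was to assess the relative efficiency of @Stefan's solution, which uses a regular expression. I was under the impression that regular expressions are generally relatively inefficient, but as the results below show, that is definitely not the case here.

@Uri's solution and mine show how much can be gained by converting the array of bad characters to a set and by reading the file byte by byte. My apologies, @Uri, if I have not read the file into an array the way you would have.

I would like to see more benchmarks in SO answers. They are neither difficult nor time-consuming, and they can provide useful insights. I find that most of the time goes into preparing the test cases. Note that I have put the methods to be tested in a module, so to benchmark another method you only need to add it to the module; no other code has to be touched.

Methods compared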

module Methods
  require 'set'

  Bad_bytes_pattern = /[\x00-\x06\x0B\x0E-\x1A\x1C-\x1F\x7F]/n
  Bad_bytes = [*0..6, 11, *14..26, *28..31, 127]
  Bad_chars = Bad_bytes.map(&:chr)
  Bad_bytes_set = Set[*Bad_bytes]
  Bad_chars_set = Set[*Bad_chars]

  def stefan(fname)
    File.read(fname)[Bad_bytes_pattern]
  end

  def uri_with_array(fname)
    !File.read(fname).each_char.map(&:ord).none? { |b|
      Bad_bytes.include? b }
  end

  def uri_with_set(fname)
    !File.read(fname).each_char.map(&:ord).none? { |b|
      Bad_bytes_set.include? b }
  end

  def cary(fname)
    f = File.new fname
    f.each_char.any? { |c| Bad_chars_set.include?(c) }
  end
end

Include the module

include Methods
@methods = Methods.instance_methods(false)
  #=> [:stefan, :uri_with_array, :uri_with_set, :cary]

Create the test files

def make_test_files(prefix, nbr_files, file_size, prob_bad_byte)
  nbr_bad_bytes = Bad_bytes.size
  nbr_files.times.with_object([]) do |i, fnames|
    str = 'x'*file_size
    str[rand(file_size)] = Bad_chars[rand(nbr_bad_bytes)] if
      rand < prob_bad_byte
    fname = "#{prefix}.#{i}"
    File.write(fname, str)
    fnames << fname
  end
end

N = 50
M = 100_000
Prob_bad_byte = 0.5

@test_files = make_test_files('test', N, M, Prob_bad_byte)

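Create a helper method

Call method m on every test file and return an array of true / false values, where true means that a bad byte was found in the corresponding file: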

def compute(m)
  @test_files.each_with_object([]) { |fname,arr|
    arr << (send(m, fname) ? true : false) }
end

Write the test header

puts "#{N} files of size #{M}.\n" +
  "Each file contains zero or one bad characters, the probability of the " +
  "latter being #{Prob_bad_byte}. If a bad character is present, it is at " +
  "a random location in the file.\n\n"

Confirm that all methods return the same values for the test data

unless @methods.map { |m| compute(m) }.uniq.size == 1
  print "Not all methods agree"
  exit
end

Write the benchmark

require 'benchmark'

@indent = @methods.map { |m| m.to_s.size }.max

Benchmark.bm(@indent) do |bm|
  @methods.each do |m|
    bm.report m.to_s do
      compute(m)
    end
  end
end

Clean up afterwards

@test_files.each { |fname| File.delete fname }

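Results for hand-coded test parameters

50 files of size 10000. Each file contains zero or one bad characters, the probability of the latter being 0.5. If a bad character is present, it is at a random location in the file.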

                                 user     system      total        real
stefan                       0.000000   0.000000   0.000000 (  0.003874)
uri_with_array               0.560000   0.000000   0.560000 (  0.565312)
uri_with_set                 0.170000   0.010000   0.180000 (  0.173694)
cary                         0.100000   0.000000   0.100000 (  0.100730)

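50 files of size 100000. Each file contains zero or one bad characters, the probability of the latter being 0.5. If a bad character is present, it is at a random location in the file.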

                                 user     system      total        real
stefan                       0.030000   0.000000   0.030000 (  0.027062)
uri_with_array               5.340000   0.040000   5.380000 (  5.387314)
uri_with_set                 1.640000   0.040000   1.680000 (  1.683844)
cary                         0.930000   0.010000   0.940000 (  0.929722)

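50 files of size 100000. Each file contains zero or one bad characters, the probability of the latter being 1.0. If a bad character is present, it is at a random location in the file.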

                                 user     system      total        real
stefan                       0.020000   0.010000   0.030000 (  0.022462)
uri_with_array               4.410000   0.030000   4.440000 (  4.447397)
uri_with_set                 1.520000   0.040000   1.560000 (  1.560788)
cary                         0.740000   0.010000   0.750000 (  0.747580)
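To put the gaps in perspective, the "real" times can be expressed as multiples of the fastest method. The sketch below uses the figures from the run with 50 files of size 100000 and probability 0.5; the helper name is hypothetical, and the ratios only describe that one run on the author's machine.

```ruby
# Hypothetical helper: express each time as a multiple of the fastest one.
def slowdown_vs_fastest(times)
  fastest = times.values.min
  times.transform_values { |t| (t / fastest).round(1) }
end

# "real" (wall-clock) seconds copied from the table for
# 50 files of size 100000, probability 0.5.
real = {
  stefan:         0.027062,
  uri_with_array: 5.387314,
  uri_with_set:   1.683844,
  cary:           0.929722
}

slowdown_vs_fastest(real).each { |name, ratio| puts "#{name}: #{ratio}x" }
```

In that run the regular expression is roughly two orders of magnitude faster than the array lookup, and the set-based variants fall in between.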

Answer 4

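I suggest doing two things to improve efficiency:

Read the file byte by byte (it is read in blocks behind the scenes) until a bad character is found or the whole file has been read without finding one.
Convert the array of bad bytes to a set of characters to speed up the lookups.

Code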

require 'set'

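def bad_byte?(text, bad_bytes)
  # Note: text is not used; the data is read back from the file named 'test'.
  bb = Set.new(bad_bytes.map(&:chr))
  # The block form of File.open closes the file when the block returns.
  File.open('test') { |f| f.each_char.any? { |c| bb.include?(c) } }
end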

Example

bad_bytes = [*0..6, 11, *14..26, *28..31, 127]
  #=> [ 0,  1,  2,  3,  4,  5,  6, 11, 14, 15, 16, 17,  18,
  #    19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 127]

Read a 'good' test string from a file named `'test'`.

text = "Now is the time for all good people"
File.write('test', text)
bad_byte?(text, bad_bytes) #=> false

Read a 'bad' test string from a file named `'test'`.

text = "Now is the time " + 3.chr + "for all good people"
File.write('test', text)
bad_byte?(text, bad_bytes) #=> true
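A possible variant of the same idea takes the file name as a parameter and compares integer byte values directly, so the blacklist does not have to be converted to characters. This is a sketch under assumptions: the helper name file_has_bad_byte? and the use of a temporary file are not from the answer.

```ruby
require 'set'
require 'tempfile'

# Hypothetical helper: true if the file at +fname+ contains at least one
# byte listed in +bad_bytes+.
def file_has_bad_byte?(fname, bad_bytes)
  bad = Set.new(bad_bytes)
  # 'rb' opens the file in binary mode; the block form closes it on exit.
  File.open(fname, 'rb') do |f|
    # each_byte yields Integers, and any? stops at the first hit.
    f.each_byte.any? { |b| bad.include?(b) }
  end
end

bad_bytes = [*0..6, 11, *14..26, *28..31, 127]

# Write a sample with one bad byte to a temporary file and test it.
Tempfile.create('sample') do |tmp|
  tmp.write("Now is the time " + 3.chr + "for all good people")
  tmp.flush
  puts file_has_bad_byte?(tmp.path, bad_bytes)
end
```

Working on bytes rather than characters also sidesteps any question about the encoding in which the file is read.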

4 solutions

Solution 1
3 2014-12-02 12:09:39

Solution 2
1 2014-12-02 11:19:49

Solution 3
1 accepted 2014-12-04 00:02:04

Solution 4
0 2014-12-03 04:03:01
