
Matching pattern list between 2 files in Ruby

I'm trying to print the lines of Input.txt that contain any of the strings in ValuesToSearch.txt. My current script, shown below, prints the correct output, but with real data, where Input.txt has 9.5 million lines and ValuesToSearch.txt has 300 lines, processing is very slow.

How can I modify the script to produce the output faster? Thanks.

Input.txt

ID       HM    PRAO  LN  AC
1401144  851    2    45   32
1401145  6D2    4    45   32
1401146  B33    1    45   32
1401147  EEC    9    45   32
1401148  730    1    45   32
1401149  C08    3    45   32
1401150  B91    4    45   32
1401151  978    1    45   32
1401152  6A9    0    45   32

ValuesToSearch.txt

1401176
1401148
1401149
1401151

My script:

ruby -e '
a=File.foreach("Input.txt").map {|l| l.split(" ")}
b=File.foreach("ValuesToSearch.txt").map {|l| l.split(" ")}.flatten

b.map{ |z| 
    a.map{ |i| puts i.join(" ") if i.include?(z) } 
}'

1401148 730 1 45 32
1401149 C08 3 45 32
1401151 978 1 45 32

What about this?

dict = File.read('/tmp/ValuesToSearch.txt').split.inject({}) do |acc, word|
  acc[word] = true
  acc
end

File.foreach('/tmp/Input.txt') do |line|
  puts line if line.split.any? { |word| dict[word] }
end

This approach stores the values to search for in a Hash.
Each lookup therefore takes O(1) time on average, instead of an O(N) scan over the list of values.

You also don't need to iterate twice over the words of Input.txt.
The lines you want can be printed in a single pass.
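To make the O(1) versus O(N) difference concrete, here is a rough timing sketch. The sample data is invented for illustration (300 search values and 100,000 candidate IDs, both hypothetical), and absolute timings will vary by machine and Ruby version, so treat it as an indication rather than a measurement of the real 9.5-million-line case.

```ruby
require 'benchmark'

# Hypothetical sample data: 300 values to search for, 100_000 candidate IDs.
values     = (1..300).map { |n| (1_401_000 + n * 7).to_s }
candidates = (1..100_000).map { |n| (1_400_000 + n).to_s }

as_array = values
as_hash  = values.each_with_object({}) { |v, acc| acc[v] = true }

array_hits = 0
hash_hits  = 0

# Array#include? scans the array element by element for every candidate.
array_time = Benchmark.realtime do
  array_hits = candidates.count { |c| as_array.include?(c) }
end

# Hash#key? hashes the candidate once and looks it up directly.
hash_time = Benchmark.realtime do
  hash_hits = candidates.count { |c| as_hash.key?(c) }
end

puts format('Array#include?: %.4fs (%d hits)', array_time, array_hits)
puts format('Hash#key?:      %.4fs (%d hits)', hash_time, hash_hits)
```

Both variants find the same matches; only the cost per lookup differs, and that cost is multiplied by the number of input lines.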

As @tadman suggested, put this script in a file and run it with ruby myscript.rb.
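One possible layout for such a file is sketched below. The method name print_matching_lines and the idea of passing both file names on the command line are assumptions made for illustration; the answer itself hard-codes the paths.

```ruby
# myscript.rb
# Sketch only: the method name and the command-line arguments are assumptions.

# Prints every line of input_path that has at least one whitespace-separated
# field listed in values_path.
def print_matching_lines(input_path, values_path, out = $stdout)
  dict = File.read(values_path).split.each_with_object({}) do |word, acc|
    acc[word] = true
  end

  File.foreach(input_path) do |line|
    out.puts line if line.split.any? { |word| dict[word] }
  end
end

# Runs only when both file names are given on the command line.
print_matching_lines(*ARGV) if ARGV.size == 2
```

It could then be run as ruby myscript.rb Input.txt ValuesToSearch.txt.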

First, let's create the two files.

VTS_FName = "ValuesToSearch.txt"
vts_data = <<-_
1401176
1401148
1401149
1401151
_
File.write(VTS_FName, vts_data)
  #=> 32

IT_FName = "Input.txt"
it_data = <<-_
ID       HM    PRAO  LN  AC
1401144  851    2    45   32
1401145  6D2    4    45   32
1401146  B33    1    45   32
1401147  EEC    9    45   32
1401148  730    1    45   32
1401149  C08    3    45   32
1401150  B91    4    45   32
1401151  978    1    45   32
1401152  6A9    0    45   32
_
File.write(IT_FName, it_data)
  #=> 289

The key to efficiency here is to make the content of VTS_FName a set rather than an array.

require 'set'

vts_set = File.readlines(VTS_FName).map(&:chomp).to_set
File.foreach(IT_FName) { |line| puts line if vts_set.include?(line[/\d+/]) }
1401148  730    1    45   32
1401149  C08    3    45   32
1401151  978    1    45   32

To save the matching lines rather than print them, use the following (after creating vts_set).

File.foreach(IT_FName).with_object([]) { |line, arr|
  arr << line.chomp if vts_set.include?(line[/\d+/]) }
  #=> ["1401148  730    1    45   32",
  #    "1401149  C08    3    45   32",
  #    "1401151  978    1    45   32"]
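With millions of input lines, collecting every match in an array may use more memory than necessary. If the goal is to save the matches to disk, one alternative is to write each matching line as it is found. The helper name write_matching_lines and the output path are hypothetical; this is a sketch that reuses the same set lookup, not part of the original answer.

```ruby
require 'set'

# Hypothetical helper: copies the lines of input_path whose first run of
# digits appears in values_path to output_path, and returns the number of
# lines written.
def write_matching_lines(input_path, values_path, output_path)
  wanted = File.readlines(values_path).map(&:chomp).to_set
  written = 0

  File.open(output_path, 'w') do |out|
    File.foreach(input_path) do |line|
      # line[/\d+/] is nil for lines without digits, such as the header.
      next unless wanted.include?(line[/\d+/])

      out.write(line)
      written += 1
    end
  end

  written
end
```

For the files created above, it could be called as write_matching_lines(IT_FName, VTS_FName, "Matches.txt"), where Matches.txt is an assumed output name.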

