Matching pattern list between 2 files in Ruby
I'm trying to print the lines from Input.txt that contain any of the strings in ValuesToSearch.txt. My current script, shown below, prints the correct output, but with my actual data, where Input.txt has 9.5 million lines and ValuesToSearch.txt has 300 lines, processing is very slow.
How can the script be modified to produce the output faster? Thanks
Input.txt
ID HM PRAO LN AC
1401144 851 2 45 32
1401145 6D2 4 45 32
1401146 B33 1 45 32
1401147 EEC 9 45 32
1401148 730 1 45 32
1401149 C08 3 45 32
1401150 B91 4 45 32
1401151 978 1 45 32
1401152 6A9 0 45 32
ValuesToSearch.txt
1401176
1401148
1401149
1401151
My script:
ruby -e '
a=File.foreach("Input.txt").map {|l| l.split(" ")}
b=File.foreach("ValuesToSearch.txt").map {|l| l.split(" ")}.flatten
b.map{ |z|
a.map{ |i| puts i.join(" ") if i.include?(z) }
}'
1401148 730 1 45 32
1401149 C08 3 45 32
1401151 978 1 45 32
What about this?
dict = File.read('/tmp/ValuesToSearch.txt').split.inject({}) do |acc, word|
acc[word] = true
acc
end
File.foreach('/tmp/Input.txt') do |line|
puts line if line.split.any? { |word| dict[word] }
end
In this approach, I'm using a Hash to store the "values to search", so each lookup takes O(1) time on average, instead of the O(N) needed for a linear scan of an array of N values.
You also don't need to iterate over the words of Input.txt twice: File.foreach reads the file line by line, and the matching lines are printed in a single pass.
And as suggested by @tadman, put this script in a file and execute it using ruby myscript.rb.
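To get a feel for the difference between a linear scan and a hash lookup, here is a rough benchmark sketch. The sizes and the generated values are made up for illustration (they are not the asker's data), and the timings will vary by machine; only the hit counts are deterministic.

```ruby
require 'benchmark'

# Hypothetical data: 300 search values and 20,000 probe values.
needles = (1..300).map { |i| (1_000_000 + i * 7).to_s }
lookup  = needles.each_with_object({}) { |word, hash| hash[word] = true }
probes  = (1..20_000).map { |i| (1_000_000 + i).to_s }

array_hits = nil
hash_hits  = nil

# Array#include? compares against the elements one by one.
array_time = Benchmark.realtime do
  array_hits = probes.count { |p| needles.include?(p) }
end

# Hash#key? finds the entry by hashing the probe string.
hash_time = Benchmark.realtime do
  hash_hits = probes.count { |p| lookup.key?(p) }
end

puts "Array#include?: #{array_hits} hits in #{array_time.round(4)}s"
puts "Hash#key?:      #{hash_hits} hits in #{hash_time.round(4)}s"
```

Both approaches should report the same number of hits; the gap in elapsed time is expected to widen as the list of search values grows.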
First let's create the two files.
VTS_FName = "ValuesToSearch.txt"
vts_data = <<-_
1401176
1401148
1401149
1401151
_
File.write(VTS_FName, vts_data)
#=> 32
IT_FName = "Input.txt"
it_data = <<-_
ID HM PRAO LN AC
1401144 851 2 45 32
1401145 6D2 4 45 32
1401146 B33 1 45 32
1401147 EEC 9 45 32
1401148 730 1 45 32
1401149 C08 3 45 32
1401150 B91 4 45 32
1401151 978 1 45 32
1401152 6A9 0 45 32
_
File.write(IT_FName, it_data)
#=> 289
The key to efficiency here is to make the content of VTS_FName a set rather than an array.
require 'set'
vts_set = File.readlines(VTS_FName).map(&:chomp).to_set
File.foreach(IT_FName) { |line| puts line if vts_set.include?(line[/\d+/]) }
1401148 730 1 45 32
1401149 C08 3 45 32
1401151 978 1 45 32
To save the matching lines, rather than printing them, use the following (after creating vts_set).
File.foreach(IT_FName).with_object([]) { |line, arr|
arr << line.chomp if vts_set.include?(line[/\d+/]) }
#=> ["1401148 730 1 45 32",
# "1401149 C08 3 45 32",
# "1401151 978 1 45 32"]
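With millions of input lines, collecting every match in an array may use more memory than necessary. A possible variant, sketched below, streams each matching line straight to an output IO instead. The helper name copy_matching_lines is hypothetical, and the sketch assumes the ID is always the first whitespace-delimited field of a line, which is slightly stricter than the line[/\d+/] test above. StringIO stands in for real files so the example is self-contained.

```ruby
require 'set'
require 'stringio'

# Copies each line of `input` whose first field is in `wanted` to `output`.
# Returns the number of lines written. (Hypothetical helper, for illustration.)
def copy_matching_lines(input, output, wanted)
  count = 0
  input.each_line do |line|
    key = line[/\S+/] # first whitespace-delimited field, nil for blank lines
    next unless wanted.include?(key)
    output.write(line)
    count += 1
  end
  count
end

wanted = Set["1401176", "1401148", "1401149", "1401151"]
input = StringIO.new(<<~DATA)
  ID HM PRAO LN AC
  1401147 EEC 9 45 32
  1401148 730 1 45 32
  1401149 C08 3 45 32
  1401150 B91 4 45 32
  1401151 978 1 45 32
DATA
output = StringIO.new

written = copy_matching_lines(input, output, wanted)
puts output.string
```

For real data, the same method should work with File objects, for example an input opened with File.open(IT_FName) and an output file opened in "w" mode.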