I'm trying to print the lines from Input.txt that contain any of the strings in ValuesToSearch.txt. My current script, shown below, prints the correct output, but with the actual data, where Input.txt has 9.5 million lines and ValuesToSearch.txt has 300 lines, it is very slow.
How can the script be modified to produce the output faster? Thanks
Input.txt
ID HM PRAO LN AC
1401144 851 2 45 32
1401145 6D2 4 45 32
1401146 B33 1 45 32
1401147 EEC 9 45 32
1401148 730 1 45 32
1401149 C08 3 45 32
1401150 B91 4 45 32
1401151 978 1 45 32
1401152 6A9 0 45 32
ValuesToSearch.txt
1401176
1401148
1401149
1401151
My script:
ruby -e '
a=File.foreach("Input.txt").map {|l| l.split(" ")}
b=File.foreach("ValuesToSearch.txt").map {|l| l.split(" ")}.flatten
b.map{ |z|
a.map{ |i| puts i.join(" ") if i.include?(z) }
}'
1401148 730 1 45 32
1401149 C08 3 45 32
1401151 978 1 45 32
What about this?
dict = File.read('/tmp/ValuesToSearch.txt').split.inject({}) do |acc, word|
acc[word] = true
acc
end
File.foreach('/tmp/Input.txt') do |line|
puts line if line.split.any? { |word| dict[word] }
end
This approach uses a Hash to store the values to search for, so each lookup takes O(1) time on average (instead of O(N) when scanning an array).
You also don't need to loop over Input.txt once per search value: a single pass over the file prints every matching line.
And as suggested by @tadman, put this script in a file and execute it using ruby myscript.rb.
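To make the difference concrete, here is a rough sketch that runs both strategies on synthetic data and times them. The helper names (filter_nested, filter_hashed, elapsed) and the generated data are made up for illustration and are not part of the original scripts; the actual timings depend on your machine and data. Note one behavioural difference: the nested version emits matches in the order of the search values, while the single-pass version emits them in the order they appear in the input.

```ruby
# Hypothetical helper: the question's strategy, scanning every line once per value.
def filter_nested(lines, values)
  values.flat_map do |v|
    lines.select { |line| line.split.include?(v) }
  end
end

# Hypothetical helper: one pass over the lines, one Hash lookup per field.
def filter_hashed(lines, values)
  dict = values.to_h { |v| [v, true] }
  lines.select { |line| line.split.any? { |word| dict[word] } }
end

# Hypothetical helper: returns [block result, elapsed seconds].
def elapsed
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0]
end

# Synthetic data, assumed for illustration only.
lines  = (1..5_000).map { |i| "#{1_400_000 + i} ABC 1 45 32" }
values = (1..100).map { |i| (1_400_000 + i * 50).to_s }

nested, t_nested = elapsed { filter_nested(lines, values) }
hashed, t_hashed = elapsed { filter_hashed(lines, values) }

puts format('nested: %.4fs, hashed: %.4fs', t_nested, t_hashed)
puts nested.sort == hashed.sort   # both strategies select the same lines
```

The nested version does work proportional to (number of values) x (number of lines), which is why it degrades so badly at 300 values and 9.5 million lines.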
First let's create the two files.
VTS_FName = "ValuesToSearch.txt"
vts_data = <<-_
1401176
1401148
1401149
1401151
_
File.write(VTS_FName, vts_data)
#=> 32
IT_FName = "Input.txt"
it_data = <<-_
ID HM PRAO LN AC
1401144 851 2 45 32
1401145 6D2 4 45 32
1401146 B33 1 45 32
1401147 EEC 9 45 32
1401148 730 1 45 32
1401149 C08 3 45 32
1401150 B91 4 45 32
1401151 978 1 45 32
1401152 6A9 0 45 32
_
File.write(IT_FName, it_data)
#=> 197
The key to efficiency here is to load the contents of VTS_FName into a set rather than an array, so that each membership test takes constant time on average.
require 'set'
vts_set = File.readlines(VTS_FName).map(&:chomp).to_set
File.foreach(IT_FName) { |line| puts line if vts_set.include?(line[/\d+/]) }
1401148 730 1 45 32
1401149 C08 3 45 32
1401151 978 1 45 32
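One thing to be aware of: line[/\d+/] picks the first run of digits anywhere in the line, which is the ID only if the ID is always the first field and no earlier text contains digits. If that assumption might not hold, a possible variant is to compare the first whitespace-delimited field instead. The helper name lines_with_wanted_id and the shortened sample data below are assumptions for illustration:

```ruby
require 'set'
require 'stringio'

# Hypothetical helper: returns the lines whose FIRST field is in `wanted`.
def lines_with_wanted_id(io, wanted)
  io.each_line.select do |line|
    id = line.split(' ', 2).first   # first field, or nil for a blank line
    wanted.include?(id)
  end
end

wanted = Set['1401176', '1401148', '1401149', '1401151']
input = StringIO.new(<<~DATA)
  ID HM PRAO LN AC
  1401147 EEC 9 45 32
  1401148 730 1 45 32
  1401150 B91 4 45 32
  1401151 978 1 45 32
DATA

puts lines_with_wanted_id(input, wanted)
# prints:
# 1401148 730 1 45 32
# 1401151 978 1 45 32
```

Because the helper accepts any IO-like object, it can be passed File.open(IT_FName) for the real file. If a value may appear in any column, as in the question's original script, test every field with line.split.any? { |w| wanted.include?(w) } instead.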
To save the matching lines rather than printing them, use the following (after creating vts_set).
File.foreach(IT_FName).with_object([]) { |line, arr|
arr << line.chomp if vts_set.include?(line[/\d+/]) }
#=> ["1401148 730 1 45 32",
# "1401149 C08 3 45 32",
# "1401151 978 1 45 32"]
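With 9.5 million input lines you may prefer to stream the matches straight to a file instead of collecting them in an array. The following is only a sketch of that idea: the helper write_matches and the output file name Matches.txt are hypothetical, and it keeps the same assumption as above that the first run of digits on a line is the ID.

```ruby
require 'set'
require 'tmpdir'

# Hypothetical helper: copies matching lines from in_path to out_path
# and returns how many lines were written.
def write_matches(in_path, out_path, wanted)
  count = 0
  File.open(out_path, 'w') do |out|
    File.foreach(in_path) do |line|
      next unless wanted.include?(line[/\d+/])
      out.write(line)
      count += 1
    end
  end
  count
end

# Small self-contained demo in a temporary directory.
Dir.mktmpdir do |dir|
  in_path  = File.join(dir, 'Input.txt')
  out_path = File.join(dir, 'Matches.txt')   # hypothetical output name
  File.write(in_path, "ID HM PRAO LN AC\n1401148 730 1 45 32\n1401150 B91 4 45 32\n")

  puts write_matches(in_path, out_path, Set['1401148'])
  puts File.read(out_path)
end
```

Only one input line is held in memory at a time, so memory use stays flat regardless of how many lines match.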