
Matching pattern list between 2 files in Ruby

I'm trying to print the lines from Input.txt that contain any of the strings in ValuesToSearch.txt. My current script, shown below, prints the correct output, but with my actual data, where Input.txt has 9.5 million lines and ValuesToSearch.txt has 300 lines, processing is very slow.

How can I modify the script to produce the output faster? Thanks.

Input.txt

ID       HM    PRAO  LN  AC
1401144  851    2    45   32
1401145  6D2    4    45   32
1401146  B33    1    45   32
1401147  EEC    9    45   32
1401148  730    1    45   32
1401149  C08    3    45   32
1401150  B91    4    45   32
1401151  978    1    45   32
1401152  6A9    0    45   32

ValuesToSearch.txt

1401176
1401148
1401149
1401151

My script:

ruby -e '
a=File.foreach("Input.txt").map {|l| l.split(" ")}
b=File.foreach("ValuesToSearch.txt").map {|l| l.split(" ")}.flatten

b.map{ |z| 
    a.map{ |i| puts i.join(" ") if i.include?(z) } 
}'

Output:

1401148 730 1 45 32
1401149 C08 3 45 32
1401151 978 1 45 32

What about this?

dict = File.read('/tmp/ValuesToSearch.txt').split.each_with_object({}) do |word, acc|
  acc[word] = true
end

File.foreach('/tmp/Input.txt') do |line|
  puts line if line.split.any? { |word| dict[word] }
end

This approach stores the "values to search" in a Hash.
Each lookup is then O(1) on average, instead of the O(N) scan that an Array needs.

You also don't need to iterate over the words of Input.txt twice.
A single pass over the file prints the lines you want.
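
To make the O(1) versus O(N) point concrete, here is a rough timing sketch. The helper name time_lookups and the data sizes are made up for illustration, and the timings will vary by machine and Ruby version, so no numbers are claimed here.

```ruby
require 'benchmark'

# Counts how many probes are found in collection and times the whole scan.
# Works for both Array and Hash, since Hash#include? tests for a key.
def time_lookups(collection, probes)
  hits = 0
  seconds = Benchmark.realtime do
    probes.each { |probe| hits += 1 if collection.include?(probe) }
  end
  [hits, seconds]
end

# Made-up sizes: 300 search values, 20,000 probe strings.
values_array = (1..300).map { |i| (1_400_000 + i).to_s }
values_hash  = values_array.each_with_object({}) { |v, h| h[v] = true }
probes       = (1..20_000).map { |i| (1_400_000 + i).to_s }

array_hits, array_seconds = time_lookups(values_array, probes)
hash_hits,  hash_seconds  = time_lookups(values_hash, probes)

puts "Array#include?: #{array_hits} hits in #{array_seconds.round(4)}s"
puts "Hash#include?:  #{hash_hits} hits in #{hash_seconds.round(4)}s"
```

Both runs find the same hits; only the cost per lookup differs, and that gap grows with the number of search values.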

As @tadman suggested, put this script in a file and run it with ruby myscript.rb.
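
As a minimal sketch of what such a file could look like, the version below takes the two paths from the command line instead of hard-coding them. The method name each_matching_line and the argument order are assumptions made for this example, not part of the original answer.

```ruby
# myscript.rb
# Assumed usage: ruby myscript.rb Input.txt ValuesToSearch.txt

# Yields each line of input_path that contains, as a whitespace-separated
# word, any value listed in values_path. Returns an Enumerator if no block
# is given.
def each_matching_line(input_path, values_path)
  return enum_for(:each_matching_line, input_path, values_path) unless block_given?

  # Hash of wanted values for constant-time average lookups.
  wanted = File.read(values_path).split.each_with_object({}) { |w, h| h[w] = true }

  # File.foreach reads one line at a time, so the large file is never
  # held in memory as a whole.
  File.foreach(input_path) do |line|
    yield line if line.split.any? { |word| wanted[word] }
  end
end

# Only run when executed directly with exactly two arguments.
if __FILE__ == $PROGRAM_NAME && ARGV.size == 2
  each_matching_line(ARGV[0], ARGV[1]) { |line| puts line }
end
```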

First let's create the two files.

VTS_FName = "ValuesToSearch.txt"
vts_data = <<-_
1401176
1401148
1401149
1401151
_
File.write(VTS_FName, vts_data)
  #=> 32

IT_FName = "Input.txt"
it_data = <<-_
ID       HM    PRAO  LN  AC
1401144  851    2    45   32
1401145  6D2    4    45   32
1401146  B33    1    45   32
1401147  EEC    9    45   32
1401148  730    1    45   32
1401149  C08    3    45   32
1401150  B91    4    45   32
1401151  978    1    45   32
1401152  6A9    0    45   32
_
File.write(IT_FName, it_data)
  #=> 289

The key to efficiency here is to load the contents of VTS_FName into a set rather than an array, so that each membership test takes constant time on average instead of scanning every value.

require 'set'

vts_set = File.readlines(VTS_FName).map(&:chomp).to_set
File.foreach(IT_FName) { |line| puts line if vts_set.include?(line[/\d+/]) }
1401148  730    1    45   32
1401149  C08    3    45   32
1401151  978    1    45   32
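
Note that line[/\d+/] returns the first run of digits on the line, so this approach relies on the ID being the first digit run. The sketch below shows what the expression extracts for a few lines; the helper name first_digit_run and the last sample line are hypothetical, added only for illustration.

```ruby
# Hypothetical helper wrapping the expression used above.
def first_digit_run(line)
  line[/\d+/]
end

p first_digit_run("1401148  730    1    45   32\n")  #=> "1401148"
p first_digit_run("1401145  6D2    4    45   32\n")  #=> "1401145"

# The header has no digits, so the result is nil and never matches the set.
p first_digit_run("ID       HM    PRAO  LN  AC\n")   #=> nil

# Made-up line: a search value in a later column is not seen, because
# only the first digit run ("9999999") is extracted.
p first_digit_run("9999999  1401148\n")              #=> "9999999"
```

This differs from the Hash-based answer, which tests every whitespace-separated word on the line; which behaviour is wanted depends on whether the values can appear outside the ID column.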

To save the matching lines rather than print them, use the following (after creating vts_set).

File.foreach(IT_FName).with_object([]) { |line, arr|
  arr << line.chomp if vts_set.include?(line[/\d+/]) }
  #=> ["1401148  730    1    45   32",
  #    "1401149  C08    3    45   32",
  #    "1401151  978    1    45   32"]
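
With 9.5 million input lines, it may be preferable to stream the matches to a file rather than collect them in an array. The following is a sketch under that assumption; the method name write_matches and the output path are hypothetical, not part of the original answer.

```ruby
require 'set'

# Hypothetical helper: copies every line of input_path whose first digit run
# is listed in values_path to output_path, and returns the number of lines
# written.
def write_matches(input_path, values_path, output_path)
  wanted = File.readlines(values_path, chomp: true).to_set
  count = 0
  File.open(output_path, "w") do |out|
    File.foreach(input_path) do |line|
      next unless wanted.include?(line[/\d+/])
      out.write(line)  # line still ends with its newline
      count += 1
    end
  end
  count
end
```

It could be called as write_matches("Input.txt", "ValuesToSearch.txt", "Matches.txt"), where Matches.txt is an assumed output name.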
