简体   繁体   中英

Improve Efficiency in Array comparison in Ruby

Hi I am working on Ruby /cucumber and have an requirement to develop a comparison module/program to compare two files.

Below are the requirements

  1. The project is a migration project . Data from one application is moved to another

  2. Need to compare the data from the existing application against the new ones.

Solution :

I have developed a comparison engine in Ruby for the above requirement.

a) Get the data, de duplicated and sorted from both the DB's b) Put the data in a text file with "||" as delimiter c) Use the key columns (number) that provides a unique record in the db to compare the two files

For ex File1 has 1,2,3,4,5,6 and file2 has 1,2,3,4,5,7 and the columns 1,2,3,4,5 are key columns. I use these key columns and compare 6 and 7 which results in a fail.

Issue :

The major issue we are facing here is if the mismatches are more than 70% for 100,000 records or more the comparison time is large. If the mismatches are less than 40% then comparison time is ok.

Diff and Diff -LCS will not work in this case because we need key columns to arrive at accurate data comparison between two applications.

Is there any other method to efficiently reduce the time if the mismatches are more thatn 70% for 100,000 records or more.

Thanks

Let's say you have this extract in your 2 files:

# File 1
id | 1 | 2 | 3
--------------
 1 | A | B | C
 2 | B | A | C

# File 2
id | 1 | 2 | 3
--------------
 8 | A | B | C
 9 | B | B | B

And we do the following function, using a Hash (direct access):

def compare(data_1, data_2)
  headers = data_1.shift
  if headers.size != data_2.shift.size
    return "Headers are not the same!"
  end

  hash = {}
  number_of_columns = headers.size
  data_1.map do |row|
    key = ''
    number_of_columns.times do |index|
      key << row[index].to_s
    end
    hash[key] ||= row
  end

  data_2.each do |row|
    key = ''
    number_of_columns.times do |index|
      key << row[index].to_s
    end
    if hash[key].nil?
      # present in file 1 but not in file 2
    else
      # present in both files
    end
  end
end

# usage
data_file_1 = your_method_to_read_the_file(file_1)
data_file_2 = your_method_to_read_the_file(file_2)

compare(data_file_1, data_file_2)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM