Hi, I am working with Ruby/Cucumber and have a requirement to develop a comparison module to compare two files.
The requirements are below.
This is a migration project: data is moved from one application to another.
We need to compare the data from the existing application against the data in the new one.
Solution:
I have developed a comparison engine in Ruby for the above requirement.
a) Get the data, de-duplicated and sorted, from both DBs
b) Put the data in a text file with "||" as the delimiter
c) Use the key columns (by number) that identify a unique record in the DB to compare the two files
For example, File1 has 1,2,3,4,5,6 and File2 has 1,2,3,4,5,7, and columns 1,2,3,4,5 are the key columns. I match the rows on these key columns and then compare 6 against 7, which results in a fail.
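To make the example concrete, here is a minimal sketch of splitting one "||"-delimited line into its key part and its remaining columns. The helper name split_record and the assumption that the key columns are the first key_count columns are illustrative only; the real engine may select key columns differently.

```ruby
# Hypothetical helper: split one "||"-delimited line into key and non-key parts.
# Assumption: the first key_count columns form the unique key.
def split_record(line, key_count)
  fields = line.chomp.split('||', -1) # -1 keeps trailing empty fields
  [fields.first(key_count), fields.drop(key_count)]
end

key_1, rest_1 = split_record("1||2||3||4||5||6", 5)
key_2, rest_2 = split_record("1||2||3||4||5||7", 5)

same_record = key_1 == key_2   # keys match, so both lines describe the same record
same_values = rest_1 == rest_2 # non-key columns differ (6 vs 7), so this is a fail
```

Here same_record is true while same_values is false, which is the "fail" case described above.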
Issue:
The major issue we are facing is that when more than 70% of the records mismatch, for 100,000 records or more, the comparison takes a long time. When fewer than 40% mismatch, the comparison time is acceptable.
diff and diff-lcs will not work in this case because we need the key columns to arrive at an accurate data comparison between the two applications.
Is there any other method to reduce the comparison time when more than 70% of the records mismatch for 100,000 records or more?
Thanks
Let's say you have this extract in your 2 files:
# File 1
id | 1 | 2 | 3
--------------
1 | A | B | C
2 | B | A | C
# File 2
id | 1 | 2 | 3
--------------
8 | A | B | C
9 | B | B | B
We can then compare them with the following function, which uses a Hash for direct access:
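def compare(data_1, data_2)
  # Remove the header row from each data set before comparing rows.
  headers = data_1.shift
  if headers.size != data_2.shift.size
    return "Headers are not the same!"
  end

  hash = {}
  number_of_columns = headers.size

  # Index every row of file 1 by its key.
  # The key is built from every column; restrict it to the key columns
  # if only those identify a record.
  data_1.each do |row|
    # Joining with a separator avoids collisions such as ["1", "23"] vs ["12", "3"].
    key = row.first(number_of_columns).join('||')
    hash[key] ||= row
  end

  # Look up each row of file 2 in the index (one hash access per row).
  data_2.each do |row|
    key = row.first(number_of_columns).join('||')
    if hash[key].nil?
      # present in file 2 but not in file 1
    else
      # present in both files
    end
  end
end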
# usage
data_file_1 = your_method_to_read_the_file(file_1)
data_file_2 = your_method_to_read_the_file(file_2)
compare(data_file_1, data_file_2)
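The function above builds its lookup key from every column, whereas the question matches rows on a set of key columns and then compares the remaining ones. A minimal sketch of that variant follows. The name compare_by_key, the key_count parameter, and the assumption that the key columns come first in each row are all illustrative and not part of the original code.

```ruby
# Sketch only: compare_by_key and key_count are illustrative names.
# Assumption: each row is an Array and its first key_count entries are the key.
def compare_by_key(rows_1, rows_2, key_count)
  # Index file 1: key columns => remaining columns.
  index = {}
  rows_1.each { |row| index[row.first(key_count)] = row.drop(key_count) }

  result = { matched: [], mismatched: [], only_in_1: [], only_in_2: [] }

  rows_2.each do |row|
    key = row.first(key_count)
    if index.key?(key)
      # Delete the entry so that whatever is left at the end exists only in file 1.
      values_1 = index.delete(key)
      bucket = values_1 == row.drop(key_count) ? :matched : :mismatched
      result[bucket] << key
    else
      result[:only_in_2] << key
    end
  end

  result[:only_in_1] = index.keys
  result
end

rows_1 = [%w[1 2 3 4 5 6], %w[2 2 3 4 5 9], %w[3 3 3 3 3 3]]
rows_2 = [%w[1 2 3 4 5 7], %w[2 2 3 4 5 9], %w[4 4 4 4 4 4]]
report = compare_by_key(rows_1, rows_2, 5)
```

Each row costs one hash lookup whether it matches or not, so the running time should depend on the number of rows rather than on the share of mismatches.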