Improving the efficiency of array comparison in Ruby

Question

Hi, I am working with Ruby/Cucumber and have a requirement to develop a comparison module to compare two files.

Below are the requirements:

The project is a migration project: data from one application is moved to another.
We need to compare the data from the existing application against the data in the new one.

Solution:

I have developed a comparison engine in Ruby for the above requirement.

a) Get the data, de-duplicated and sorted, from both DBs
b) Put the data in a text file with "||" as the delimiter
c) Use the key columns (by number) that identify a unique record in the DB to compare the two files

For example, File1 has 1,2,3,4,5,6 and File2 has 1,2,3,4,5,7, and columns 1,2,3,4,5 are the key columns. I match the records on these key columns and compare 6 and 7, which results in a fail.
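The key-column idea in the example can be sketched as follows. This is only an illustration: `split_record` and `key_count` are hypothetical names, and it assumes the key columns are the leading columns of each "||"-delimited line.

```ruby
# Hypothetical helper: split one "||"-delimited record into [key, values].
# key_count is the number of leading columns that form the unique key
# (an assumption; the original engine selects key columns by number).
def split_record(line, key_count)
  fields = line.chomp.split('||', -1) # -1 keeps trailing empty fields
  [fields.first(key_count), fields.drop(key_count)]
end

key_1, rest_1 = split_record("1||2||3||4||5||6\n", 5)
key_2, rest_2 = split_record("1||2||3||4||5||7\n", 5)

same_record = (key_1 == key_2)   # the keys identify the same record
same_data   = (rest_1 == rest_2) # the non-key values differ, so this is a fail
puts "same record: #{same_record}, same data: #{same_data}"
```

Keeping the key as an Array of fields, rather than one concatenated string, avoids accidental collisions such as "1" + "23" versus "12" + "3".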

Issue:

The major issue we are facing is that when more than 70% of the records mismatch, for 100,000 records or more, the comparison time is large. If the mismatches are below 40%, the comparison time is OK.

Diff and Diff-LCS will not work in this case because we need key columns to arrive at an accurate data comparison between the two applications.

Is there any other method to efficiently reduce the time when the mismatches exceed 70% for 100,000 records or more?

Thanks

Answer 1

Let's say you have this extract in your 2 files:

# File 1
id | 1 | 2 | 3
--------------
 1 | A | B | C
 2 | B | A | C

# File 2
id | 1 | 2 | 3
--------------
 8 | A | B | C
 9 | B | B | B

And we use the following function, built on a Hash (direct access):

def compare(data_1, data_2)
  # The first row of each data set is the header row.
  headers = data_1.shift
  if headers.size != data_2.shift.size
    return "Headers are not the same!"
  end

  # Index every row of file 1 for direct lookup. The key here is built
  # from all columns; restrict it to the key columns if needed.
  hash = {}
  number_of_columns = headers.size
  data_1.each do |row|
    # An Array key avoids collisions that string concatenation can cause.
    key = row.first(number_of_columns).map(&:to_s)
    hash[key] ||= row
  end

  data_2.each do |row|
    key = row.first(number_of_columns).map(&:to_s)
    if hash[key].nil?
      # present in file 2 but not in file 1
    else
      # present in both files
    end
  end
end

# usage
data_file_1 = your_method_to_read_the_file(file_1)
data_file_2 = your_method_to_read_the_file(file_2)

compare(data_file_1, data_file_2)
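
To tie this back to the key columns in the question, here is a minimal sketch of the same Hash approach that also reports which records differ. The name `compare_by_key`, the `key_count` parameter and the result categories are assumptions made for illustration, and it assumes rows are already parsed into arrays with the key columns first. Each row is looked up once, so the work should grow with the total number of rows rather than with the share of mismatches.

```ruby
# Hypothetical sketch: classify records by their key columns.
def compare_by_key(rows_1, rows_2, key_count)
  # Index file 1 by its key columns; an Array works as a Hash key.
  index = {}
  rows_1.each { |row| index[row.first(key_count)] = row }

  result = { matched: [], mismatched: [], only_in_1: [], only_in_2: [] }
  rows_2.each do |row|
    key = row.first(key_count)
    other = index.delete(key) # delete so leftovers are file-1-only rows
    if other.nil?
      result[:only_in_2] << key
    elsif other == row
      result[:matched] << key
    else
      result[:mismatched] << key
    end
  end
  result[:only_in_1] = index.keys
  result
end

# Sample data (made up for this example), key columns are the first five.
rows_1 = [%w[1 2 3 4 5 6], %w[2 2 3 4 5 9]]
rows_2 = [%w[1 2 3 4 5 7], %w[3 2 3 4 5 9]]
result = compare_by_key(rows_1, rows_2, 5)
p result
```

Deleting each key from the index as it is found means whatever remains at the end exists only in file 1, so no second pass over file 1 is needed.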
