
The most efficient way to compare two huge arrays of objects

I want to compare two huge arrays. I read both arrays in batches (10 objects at a time from each array). Once both arrays have been read completely, I want to have the following data: (1) the intersection of the two arrays, (2) the objects that exist only in the first array, and (3) the objects that exist only in the second array. What is the best practice for doing this?

A small-scale example:

let arr1 = [obj1, obj2, obj3, obj4, obj5, obj6, obj7];

let arr2 = [obj7, obj2, obj5, obj1, obj9, obj8];

Then I read the two arrays in batches (two elements at a time):

First loop

-> obj2 is mutual

-> obj1 exists in arr1 only

-> obj7 exists in arr2 only

The issue is that this is not the final result. I only get the correct result after looping over both arrays completely:

The mutual objects are obj1, obj2, obj5, obj7

The objects in arr1 only are obj3, obj4, obj6

The objects in arr2 only are obj8, obj9

Note: I have to read the arrays in batches because they are too big.

To compare your arrays efficiently, you need to organize them in some way. This is true whether or not the arrays are too big to fit into memory.

Conventionally, there are two choices: either sort the objects in each array and compare them in order, or hash the objects in each array and compare them with a hash map.
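The sort-and-compare option can be sketched in Python as a single merge-style pass over two sorted streams. This is a minimal sketch, not a definitive implementation: the function name `compare_sorted` and the labels `"both"`, `"first_only"`, and `"second_only"` are made up for illustration, and it assumes each stream is already sorted, contains no duplicates, and yields items that support `==` and `<` (for example a sortable key or ID extracted from each object).

```python
from typing import Iterable, Iterator, Tuple


def compare_sorted(a: Iterable, b: Iterable) -> Iterator[Tuple[str, object]]:
    """Classify items of two sorted, duplicate-free streams in one pass."""
    end = object()  # sentinel marking an exhausted stream
    it_a, it_b = iter(a), iter(b)
    x, y = next(it_a, end), next(it_b, end)
    while x is not end and y is not end:
        if x == y:
            yield ("both", x)
            x, y = next(it_a, end), next(it_b, end)
        elif x < y:
            # x is smaller than everything left in b, so b cannot contain it
            yield ("first_only", x)
            x = next(it_a, end)
        else:
            yield ("second_only", y)
            y = next(it_b, end)
    # At most one stream still has items; they are unmatched by construction.
    while x is not end:
        yield ("first_only", x)
        x = next(it_a, end)
    while y is not end:
        yield ("second_only", y)
        y = next(it_b, end)


if __name__ == "__main__":
    arr1 = sorted(["obj1", "obj2", "obj3", "obj4", "obj5", "obj6", "obj7"])
    arr2 = sorted(["obj7", "obj2", "obj5", "obj1", "obj9", "obj8"])
    for label, item in compare_sorted(arr1, arr2):
        print(label, item)
```

Because the function only ever holds one item from each stream, the inputs can be generators that read batches from disk or from a database cursor.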

Each method has techniques to handle data too big to fit in memory. For sorting, there are "external" sorting algorithms that are not limited by memory size, and the comparison itself is a simple streaming pass over the sorted data. For hashing, you can partition the data (according to hash) into bins small enough to process in memory.
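One common form of external sorting can be sketched as follows: sort fixed-size chunks in memory, spill each sorted chunk to a temporary file, then merge the sorted runs lazily. This is only a sketch under stated assumptions: the name `external_sort` and the `chunk_size` parameter are hypothetical, and each item is assumed to be a string without embedded newlines (for example a serialized object or its key).

```python
import heapq
import itertools
import tempfile
from typing import IO, Iterable, Iterator, List


def _read_run(run: IO[str]) -> Iterator[str]:
    """Yield the items of one sorted run, without the trailing newline."""
    for line in run:
        yield line.rstrip("\n")


def external_sort(items: Iterable[str], chunk_size: int = 100_000) -> Iterator[str]:
    """Yield items in sorted order while holding about chunk_size items in memory."""
    runs: List[IO[str]] = []
    source = iter(items)
    try:
        while True:
            # Sort one chunk in memory and spill it to a temporary file.
            chunk = sorted(itertools.islice(source, chunk_size))
            if not chunk:
                break
            run = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
            run.writelines(item + "\n" for item in chunk)
            run.seek(0)
            runs.append(run)
        # heapq.merge reads the sorted runs lazily, one item per run at a time.
        yield from heapq.merge(*(_read_run(run) for run in runs))
    finally:
        for run in runs:
            run.close()


if __name__ == "__main__":
    data = ["obj7", "obj2", "obj5", "obj1", "obj9", "obj8"]
    print(list(external_sort(data, chunk_size=2)))
```

Every run keeps one file handle open during the merge, so a very small `chunk_size` relative to the input could run into the operating system's open-file limit; a production version would merge runs in several passes.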


As an example, consider this Python-like pseudocode for hash-binning your data items:

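# split data into N bins, one file per bin
files = []
for i in range(N):
    files.append(open_for_write("{filename}_bin{i}"))
for item in read_items(open_for_read(filename)):
    # equal items have equal hashes, so they always land in the same bin
    bin = item.hash() % N
    write_item(item, files[bin])
for f in files:
    close(f)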
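A runnable version of the binning step might look like the sketch below. It rests on assumptions that are not in the original answer: each input file stores one item per line, equal objects always serialize to identical lines (for example via `json.dumps(obj, sort_keys=True)`), and the names `stable_hash` and `split_into_bins` are hypothetical. It uses `zlib.crc32` instead of Python's built-in `hash()`, because string hashes from `hash()` are randomized per process by default and the two input files may be binned in separate runs.

```python
import zlib
from contextlib import ExitStack
from typing import List


def stable_hash(item: str) -> int:
    """Hash that is identical across processes and runs."""
    return zlib.crc32(item.encode("utf-8"))


def split_into_bins(filename: str, n_bins: int) -> List[str]:
    """Partition the lines of filename into n_bins files named {filename}_bin{i}."""
    bin_paths = [f"{filename}_bin{i}" for i in range(n_bins)]
    with ExitStack() as stack:  # closes every opened file on exit
        bins = [stack.enter_context(open(p, "w", encoding="utf-8")) for p in bin_paths]
        src = stack.enter_context(open(filename, encoding="utf-8"))
        for line in src:
            item = line.rstrip("\n")
            if item:  # skip blank lines
                bins[stable_hash(item) % n_bins].write(item + "\n")
    return bin_paths
```

Choose `n_bins` so that one bin of the first input fits comfortably in memory; with a reasonably uniform hash, each bin holds roughly 1/`n_bins` of the items.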

You can do this for both of your input files, then process them bin by bin:

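# compare by bin; this writes only the items common to both inputs
outfile = open_for_write(out_filename)
for i in range(N):
    # load bin i of the first input into an in-memory set
    items = set()
    for item in read_items(open_for_read("{in_filename_1}_bin{i}")):
        items.add(item)
    # stream bin i of the second input and test membership
    for item in read_items(open_for_read("{in_filename_2}_bin{i}")):
        if item in items:
            write_item(item, outfile)
close(outfile)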
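The per-bin comparison can be extended to produce all three result sets the question asks for, not just the intersection. The sketch below assumes bin files that follow the `{filename}_bin{i}` naming used above, one item per line, no duplicate items within an input, and identical binning for both inputs; the names `compare_bin` and `compare_all` and the result labels are made up for illustration.

```python
from typing import Iterator, Tuple


def compare_bin(path_1: str, path_2: str) -> Iterator[Tuple[str, str]]:
    """Classify the items of one pair of matching bin files."""
    # Only the first input's bin is held in memory.
    with open(path_1, encoding="utf-8") as f1:
        first = {line.rstrip("\n") for line in f1}
    matched = set()
    with open(path_2, encoding="utf-8") as f2:
        for line in f2:
            item = line.rstrip("\n")
            if item in first:
                matched.add(item)
                yield ("both", item)
            else:
                yield ("second_only", item)
    # Whatever the second input never matched exists only in the first.
    for item in first - matched:
        yield ("first_only", item)


def compare_all(prefix_1: str, prefix_2: str, n_bins: int) -> Iterator[Tuple[str, str]]:
    """Compare two binned inputs bin by bin; equal items share a bin index."""
    for i in range(n_bins):
        yield from compare_bin(f"{prefix_1}_bin{i}", f"{prefix_2}_bin{i}")
```

Since equal items always hash to the same bin, an item that is unmatched within its bin is unmatched overall, so each bin's output is already final and can be written out immediately.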

