
Find Common Elements in Two Big Data Sets in a Reasonable Time

I have two Spark dataframes, DFa and DFb, with the same schema: ('country', 'id', 'price', 'name').

  • DFa has around 610 million rows,
  • DFb has 3000 million rows.

Now I want to find all rows from DFa and DFb that have the same id, where an id looks like "A6195A55-ACB4-48DD-9E57-5EAF6A056C80".

It's a SQL inner join, but when I run a Spark SQL inner join, one task gets killed because its container uses too much memory, which causes a Java heap memory error. My cluster has limited resources, so tuning the YARN and Spark configuration is not a feasible option.

Is there any other solution to deal with this? A non-Spark solution is also acceptable if the runtime is reasonable.

More generally, can anyone suggest algorithms and solutions for finding common elements in two very large datasets?

First, compute 64-bit hashes of your ids. Comparing the hashes will be a lot faster than comparing the string ids.
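
As a minimal sketch of the hashing step: the answer does not name a hash function, so the use of BLAKE2b truncated to 8 bytes and the helper name `hash64` are assumptions made for illustration only.

```python
import hashlib

def hash64(id_str: str) -> int:
    """Map an id string to an unsigned 64-bit integer.

    BLAKE2b with an 8-byte digest is an assumed choice; any well-mixed
    64-bit hash would serve the same purpose.
    """
    digest = hashlib.blake2b(id_str.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

example_id = "A6195A55-ACB4-48DD-9E57-5EAF6A056C80"
h = hash64(example_id)
print(0 <= h < 2**64)  # the hash always fits in 64 bits
```

Comparing two such integers is a single machine-word comparison, whereas comparing two 36-character id strings touches many more bytes.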

My basic idea is:

  • Build a hash table from DFa.
  • As you compute the hashes for DFb, do a lookup in the table. If there's nothing there, drop the entry (no match). If you get a hit, compare the actual ids to make sure it isn't a false positive.
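
The two steps above can be sketched in plain Python. This is only an illustration under assumptions: rows are modeled as dicts with an "id" key, `hash64` and `hash_join` are hypothetical names, and the whole table is assumed to fit in memory on one machine.

```python
import hashlib
from collections import defaultdict

def hash64(id_str: str) -> int:
    # Assumed hash choice: BLAKE2b truncated to 64 bits.
    digest = hashlib.blake2b(id_str.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def hash_join(rows_a, rows_b):
    """Yield (row_a, row_b) pairs whose ids are equal."""
    # Step 1: build a hash table from the smaller input (DFa).
    table = defaultdict(list)
    for row_a in rows_a:
        table[hash64(row_a["id"])].append(row_a)

    # Step 2: probe the table while streaming over the larger input (DFb).
    for row_b in rows_b:
        for row_a in table.get(hash64(row_b["id"]), ()):
            # A hash hit may be a collision, so compare the real ids.
            if row_a["id"] == row_b["id"]:
                yield row_a, row_b

dfa = [{"id": "x", "price": 1}, {"id": "y", "price": 2}]
dfb = [{"id": "y", "price": 3}, {"id": "z", "price": 4}]
print(list(hash_join(dfa, dfb)))
```

Python's `dict` already hashes its keys internally, so the explicit `hash64` call is there only to mirror the description; in a lower-level language the precomputed hash would be the table key directly.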

The complexity is O(N). Without knowing how many overlaps you expect, this is the best you can do, since in the worst case everything matches and you have to output everything.

The naive implementation would use about 6 GB of RAM for the table (assuming 80% occupancy and a flat hash table), but you can do better. Since we already have the hash, we only need to know whether it exists. So you only need one bit to mark that, which reduces memory usage a lot (you need 64x less memory per entry, but you need to lower the occupancy). However, this is not a common data structure, so you'll need to implement it yourself.
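
The 6 GB figure is consistent with 610 million entries × 8 bytes per 64-bit hash ÷ 0.8 occupancy ≈ 6.1 GB. The one-bit-per-slot idea could look roughly like the sketch below; the class name `HashBitSet` and the choice of `hash % num_bits` as the slot index are assumptions, not something the answer specifies.

```python
class HashBitSet:
    """Presence-only table: one bit per slot instead of a stored 64-bit hash."""

    def __init__(self, num_bits: int):
        self.num_bits = num_bits
        self.bits = bytearray((num_bits + 7) // 8)  # 8 slots per byte

    def add(self, h: int) -> None:
        i = h % self.num_bits          # slot index derived from the hash
        self.bits[i >> 3] |= 1 << (i & 7)

    def __contains__(self, h: int) -> bool:
        i = h % self.num_bits
        return bool(self.bits[i >> 3] & (1 << (i & 7)))

seen = HashBitSet(64)
seen.add(5)
print(5 in seen)   # True: this hash was added
print(6 in seen)   # False: its slot is still empty
print(69 in seen)  # True: 69 % 64 == 5, a false positive from a shared slot
```

Because two different hashes can land on the same slot, a set bit only means "possibly present". That is why the occupancy has to be kept low and why hits still need to be verified against the real ids.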

But there's something even better and even more compact: a Bloom filter. It will introduce some more false positives, but we had to double-check anyway because we didn't trust the hash, so that's not a big downside. The best part is that you should be able to find existing libraries for it.

So, put together, it looks like this:

  • Compute hashes from DFa and build a Bloom filter.
  • Compute hashes from DFb and check them against the Bloom filter. If you get a match, look up the id in DFa to make sure it's a real match, and add it to the result.
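
The following is a small self-contained sketch of that pipeline, not a production implementation. The `BloomFilter` class, the `common_rows` helper, the 1% target false-positive rate, and the double-hashing scheme built on BLAKE2b are all assumptions for illustration; in practice an existing library would be the more sensible choice.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        n, p = expected_items, false_positive_rate
        self.num_bits = max(8, math.ceil(-n * math.log(p) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(key))

def common_rows(rows_a, rows_b):
    """Return (row_a, row_b) pairs with equal ids, using a Bloom pre-filter."""
    bloom = BloomFilter(expected_items=max(1, len(rows_a)))
    for row in rows_a:
        bloom.add(row["id"])
    # Most non-matching DFb rows are dropped here without touching DFa.
    candidates = [row for row in rows_b if row["id"] in bloom]
    # Verify the survivors against the real DFa ids to remove false positives.
    candidate_ids = {row["id"] for row in candidates}
    a_by_id = {}
    for row in rows_a:
        if row["id"] in candidate_ids:
            a_by_id.setdefault(row["id"], []).append(row)
    return [(a, b) for b in candidates for a in a_by_id.get(b["id"], ())]

dfa = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
dfb = [{"id": "b"}, {"id": "c"}, {"id": "d"}]
print(common_rows(dfa, dfb))
```

With these sizing formulas, a 1% false-positive rate costs about 9.6 bits per element, so a filter over 610 million ids would need roughly 0.7 GB, well below the 6 GB estimated for the flat hash table.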

This is a typical use case in any big data environment. You can use a map-side join, where the smaller table is cached and broadcast to all the executors.
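
To illustrate the idea without a cluster, here is a plain-Python simulation of a map-side join. The function name `map_side_join` and the list-of-partitions layout are hypothetical. In PySpark itself the corresponding hint is `pyspark.sql.functions.broadcast`, used as `DFb.join(broadcast(DFa), "id")`, and it only helps when the smaller table fits in each executor's memory.

```python
def map_side_join(small_rows, big_partitions):
    """Join every partition of the big table against one shared lookup."""
    # "Broadcast" step: build the lookup once; each partition reads the same copy.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row["id"], []).append(row)

    def join_partition(partition):
        # Runs independently per partition, so the big table is never shuffled.
        for big_row in partition:
            for small_row in lookup.get(big_row["id"], ()):
                yield small_row, big_row

    return [pair for partition in big_partitions
            for pair in join_partition(partition)]

dfa = [{"id": "x", "name": "small-x"}]
dfb_partitions = [
    [{"id": "x", "name": "big-x"}, {"id": "y", "name": "big-y"}],
    [{"id": "z", "name": "big-z"}],
]
print(map_side_join(dfa, dfb_partitions))
```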

You can read more about broadcast joins here: Broadcast-Joins


