Find Common Elements in Two Big Data Sets in a Reasonable Time
I have two Spark dataframes, DFa and DFb, with the same schema: ('country', 'id', 'price', 'name').
Now I want to find all rows in DFa and DFb that have the same id, where an id looks like "A6195A55-ACB4-48DD-9E57-5EAF6A056C80".
This is a SQL inner join, but when I run a Spark SQL inner join, a task gets killed because the container uses too much memory, causing a Java heap out-of-memory error. My cluster has limited resources, so tuning the YARN and Spark configuration is not feasible.
Is there any other way to deal with this? A non-Spark solution is also acceptable if the runtime is reasonable.
More generally, can anyone suggest algorithms and solutions for finding common elements in two very large datasets?
First compute 64-bit hashes of your ids. Comparisons will be much faster on the hashes than on the string ids.
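A minimal sketch of this hashing step in plain Python (assuming a standalone script rather than Spark; `blake2b` with an 8-byte digest gives a well-distributed 64-bit value — inside Spark itself, `pyspark.sql.functions.xxhash64` serves the same purpose):

```python
import hashlib

def hash_id(id_str: str) -> int:
    """Map a string id to a 64-bit integer hash."""
    digest = hashlib.blake2b(id_str.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

h = hash_id("A6195A55-ACB4-48DD-9E57-5EAF6A056C80")
assert 0 <= h < 2**64   # fits in 64 bits, so it compares as a machine word
```

Comparing two 8-byte integers is a single machine-word comparison, versus a character-by-character comparison of 36-character UUID strings.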
My basic idea is:
The complexity is O(N). Without knowing how many overlaps to expect, this is the best you can do, since you might have to output everything because it all matches.
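The O(N) idea can be sketched in plain Python (a hypothetical `find_common` over two iterables of hashed ids: build a hash set from one side in one pass, then stream the other side through it in a second pass):

```python
def find_common(ids_a, ids_b):
    """O(N): build a set from one side, probe it with the other side."""
    seen = set(ids_a)                        # one pass over DFa's ids
    return [i for i in ids_b if i in seen]   # one pass over DFb's ids

common = find_common(["a1", "b2", "c3"], ["b2", "d4", "a1"])
# common == ["b2", "a1"]  (order follows ids_b)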
A naive implementation would use about 6 GB of RAM for the table (assuming 80% occupancy and a flat hash table), but you can do better. Since we already have the hash, we only need to know whether it exists. So you only need one bit per slot to mark that, which reduces memory usage a lot (64x less memory per entry, though you need to lower the occupancy). However, this is not a common data structure, so you'll need to implement it yourself.
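A minimal sketch of such a one-bit-per-slot table in Python, using a `bytearray` as the bit array (the size below is illustrative, not tuned; like any lossy table, slot collisions produce false positives, which is why the final answer still gets verified against the real ids):

```python
class BitTable:
    """Flat hash table storing a single presence bit per slot."""

    def __init__(self, num_slots: int):
        self.num_slots = num_slots
        self.bits = bytearray((num_slots + 7) // 8)  # 1 bit per slot

    def add(self, h: int) -> None:
        slot = h % self.num_slots
        self.bits[slot // 8] |= 1 << (slot % 8)

    def __contains__(self, h: int) -> bool:
        slot = h % self.num_slots
        return bool(self.bits[slot // 8] & (1 << (slot % 8)))

table = BitTable(1 << 20)    # ~128 KB of bits for a million slots
table.add(0xDEADBEEF)
assert 0xDEADBEEF in table   # collisions can also answer True (false positive)
```

At one bit per slot, a billion slots cost about 128 MB instead of the ~6 GB a flat table of 64-bit hashes would need.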
But there's something even better, something even more compact: a Bloom filter. This will introduce some more false positives, but we had to double-check anyway because we didn't trust the hash, so it's not a big downside. The best part is that libraries for it should already be available.
So, everything together, it looks like this:
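A plain-Python sketch of the combined pipeline (hash the ids, fill a Bloom filter from one side, stream the other side through it, then verify the surviving candidates exactly). The two-hash `BloomFilter` here is a toy stand-in for a real library, and the sizes are illustrative:

```python
import hashlib

def hash64(s: str, seed: int = 0) -> int:
    """64-bit hash of a string, varied by a small seed."""
    data = seed.to_bytes(1, "big") + s.encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

class BloomFilter:
    """Toy Bloom filter with two hash functions."""

    def __init__(self, num_bits: int):
        self.num_bits = num_bits
        self.bits = bytearray((num_bits + 7) // 8)

    def _slots(self, s: str):
        return (hash64(s, seed) % self.num_bits for seed in (0, 1))

    def add(self, s: str) -> None:
        for slot in self._slots(s):
            self.bits[slot // 8] |= 1 << (slot % 8)

    def __contains__(self, s: str) -> bool:
        return all(self.bits[slot // 8] & (1 << (slot % 8))
                   for slot in self._slots(s))

def common_ids(ids_a, ids_b, num_bits=1 << 24):
    bloom = BloomFilter(num_bits)
    for i in ids_a:                                  # pass 1: fill from DFa
        bloom.add(i)
    candidates = [i for i in ids_b if i in bloom]    # pass 2: stream DFb
    # Final exact check removes the Bloom filter's false positives.
    # (Shown here with an in-memory set; at real scale you would instead
    # join only the small candidate list back against DFa.)
    exact = set(ids_a)
    return [i for i in candidates if i in exact]
```

The Bloom filter does the heavy first pass in a few hundred megabytes; only the (hopefully small) candidate list needs the expensive exact comparison.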
This is a typical use case in any big data environment. You can use a map-side join, where you cache the smaller table and broadcast it to all the executors.
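The essence of a map-side join can be sketched in plain Python: load the smaller table into an in-memory dict keyed by id, then stream the larger table against it without any shuffle. (In PySpark itself the equivalent is the broadcast hint, e.g. `DFa.join(broadcast(DFb), "id")` with `pyspark.sql.functions.broadcast`.)

```python
def map_side_join(big_rows, small_rows):
    """Inner join two row streams on 'id', keeping the small side in memory."""
    small_by_id = {}                        # the 'broadcast' copy
    for row in small_rows:
        small_by_id.setdefault(row["id"], []).append(row)
    for row in big_rows:                    # single streaming pass, no shuffle
        for match in small_by_id.get(row["id"], []):
            yield row, match

pairs = list(map_side_join(
    [{"id": "a1", "price": 10}, {"id": "b2", "price": 20}],
    [{"id": "b2", "name": "widget"}],
))
# pairs == [({"id": "b2", "price": 20}, {"id": "b2", "name": "widget"})]
```

This only works when one side fits in each executor's memory, but it avoids the shuffle that blew the heap in the question's inner join.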
You can read more about broadcast joins here.