Find Common Elements in Two Big Data Sets in a Reasonable Time
I have two Spark dataframes, DFa and DFb, with the same schema: ('country', 'id', 'price', 'name').
Now I want to find all rows in DFa and DFb that have the same id, where an id looks like "A6195A55-ACB4-48DD-9E57-5EAF6A056C80".
This is a SQL inner join, but when I run a Spark SQL inner join, a task gets killed because the container uses too much memory, causing a Java heap out-of-memory error. My cluster has limited resources, so tuning the YARN and Spark configuration is not feasible.
Is there any other way to deal with this? A non-Spark solution is also acceptable if the runtime is reasonable.
More generally, can anyone suggest algorithms and solutions for finding common elements in two very large datasets?
First compute 64-bit hashes of your ids. Comparisons will be much faster on the hashes than on the string ids.
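A minimal sketch of this hashing step in plain Python (assuming a standalone script rather than Spark; `blake2b` with an 8-byte digest gives a well-distributed 64-bit value — inside Spark itself, `pyspark.sql.functions.xxhash64` serves the same purpose):

```python
import hashlib

def hash_id(id_str: str) -> int:
    """Map a string id to a 64-bit integer hash."""
    digest = hashlib.blake2b(id_str.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

h = hash_id("A6195A55-ACB4-48DD-9E57-5EAF6A056C80")
assert 0 <= h < 2**64   # fits in 64 bits, so it compares as a machine word
```

Comparing two 8-byte integers is a single machine-word comparison, versus a character-by-character comparison of 36-character UUID strings.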
My basic idea is:
The complexity is O(N). Without knowing how many overlaps to expect, this is the best you can do, since you might have to output everything because it all matches.
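The O(N) idea can be sketched in plain Python (a hypothetical `find_common` over two iterables of hashed ids: build a hash set from one side in one pass, then stream the other side through it in a second pass):

```python
def find_common(ids_a, ids_b):
    """O(N): build a set from one side, probe it with the other side."""
    seen = set(ids_a)                        # one pass over DFa's ids
    return [i for i in ids_b if i in seen]   # one pass over DFb's ids

common = find_common(["a1", "b2", "c3"], ["b2", "d4", "a1"])
# common == ["b2", "a1"]  (order follows ids_b)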
A naive implementation would use about 6 GB of RAM for the table (assuming 80% occupancy and a flat hash table), but you can do better. Since we already have the hash, we only need to know whether it exists. So you only need one bit per slot to mark that, which reduces memory usage a lot (64x less memory per entry, though you need to lower the occupancy). However, this is not a common data structure, so you'll need to implement it yourself.
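A minimal sketch of such a one-bit-per-slot table in Python, using a `bytearray` as the bit array (the size below is illustrative, not tuned; like any lossy table, slot collisions produce false positives, which is why the final answer still gets verified against the real ids):

```python
class BitTable:
    """Flat hash table storing a single presence bit per slot."""

    def __init__(self, num_slots: int):
        self.num_slots = num_slots
        self.bits = bytearray((num_slots + 7) // 8)  # 1 bit per slot

    def add(self, h: int) -> None:
        slot = h % self.num_slots
        self.bits[slot // 8] |= 1 << (slot % 8)

    def __contains__(self, h: int) -> bool:
        slot = h % self.num_slots
        return bool(self.bits[slot // 8] & (1 << (slot % 8)))

table = BitTable(1 << 20)    # ~128 KB of bits for a million slots
table.add(0xDEADBEEF)
assert 0xDEADBEEF in table   # collisions can also answer True (false positive)
```

At one bit per slot, a billion slots cost about 128 MB instead of the ~6 GB a flat table of 64-bit hashes would need.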
But there's something even better, something even more compact: a Bloom filter. This will introduce some more false positives, but we had to double-check anyway because we didn't trust the hash, so it's not a big downside. The best part is that libraries for it should already be available.
So, everything together, it looks like this:
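A plain-Python sketch of the combined pipeline (hash the ids, fill a Bloom filter from one side, stream the other side through it, then verify the surviving candidates exactly). The two-hash `BloomFilter` here is a toy stand-in for a real library, and the sizes are illustrative:

```python
import hashlib

def hash64(s: str, seed: int = 0) -> int:
    """64-bit hash of a string, varied by a small seed."""
    data = seed.to_bytes(1, "big") + s.encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

class BloomFilter:
    """Toy Bloom filter with two hash functions."""

    def __init__(self, num_bits: int):
        self.num_bits = num_bits
        self.bits = bytearray((num_bits + 7) // 8)

    def _slots(self, s: str):
        return (hash64(s, seed) % self.num_bits for seed in (0, 1))

    def add(self, s: str) -> None:
        for slot in self._slots(s):
            self.bits[slot // 8] |= 1 << (slot % 8)

    def __contains__(self, s: str) -> bool:
        return all(self.bits[slot // 8] & (1 << (slot % 8))
                   for slot in self._slots(s))

def common_ids(ids_a, ids_b, num_bits=1 << 24):
    bloom = BloomFilter(num_bits)
    for i in ids_a:                                  # pass 1: fill from DFa
        bloom.add(i)
    candidates = [i for i in ids_b if i in bloom]    # pass 2: stream DFb
    # Final exact check removes the Bloom filter's false positives.
    # (Shown here with an in-memory set; at real scale you would instead
    # join only the small candidate list back against DFa.)
    exact = set(ids_a)
    return [i for i in candidates if i in exact]
```

The Bloom filter does the heavy first pass in a few hundred megabytes; only the (hopefully small) candidate list needs the expensive exact comparison.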
This is a typical use case in any big data environment. You can use a map-side join, where you cache the smaller table and broadcast it to all the executors.
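The essence of a map-side join can be sketched in plain Python: load the smaller table into an in-memory dict keyed by id, then stream the larger table against it without any shuffle. (In PySpark itself the equivalent is the broadcast hint, e.g. `DFa.join(broadcast(DFb), "id")` with `pyspark.sql.functions.broadcast`.)

```python
def map_side_join(big_rows, small_rows):
    """Inner join two row streams on 'id', keeping the small side in memory."""
    small_by_id = {}                        # the 'broadcast' copy
    for row in small_rows:
        small_by_id.setdefault(row["id"], []).append(row)
    for row in big_rows:                    # single streaming pass, no shuffle
        for match in small_by_id.get(row["id"], []):
            yield row, match

pairs = list(map_side_join(
    [{"id": "a1", "price": 10}, {"id": "b2", "price": 20}],
    [{"id": "b2", "name": "widget"}],
))
# pairs == [({"id": "b2", "price": 20}, {"id": "b2", "name": "widget"})]
```

This only works when one side fits in each executor's memory, but it avoids the shuffle that blew the heap in the question's inner join.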
You can read more about broadcast joins here.