Spark: Searching for matches in large DataSets
I need to count how many values of one of the columns of df1 are present in one of the columns of df2 (I just need the number of matched values).

I wouldn't be asking this question if efficiency wasn't such a big concern:

df1 contains 100,000,000+ records
df2 contains 1,000,000,000+ records
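
To pin down what "number of matched values" means here, a minimal non-Spark illustration in plain Python (the column stand-ins are made up; this assumes duplicates count once, as a set intersection would):

```python
# Toy stand-ins for the two columns (the real ones have 100M+ / 1B+ rows).
col1 = [1, 2, 2, 3, 5]   # values from the df1 column
col2 = [2, 3, 4, 4, 6]   # values from the df2 column

# Count distinct values present in both columns.
matched = len(set(col1) & set(col2))
print(matched)  # 2 and 3 are shared -> 2
```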
Just an off-the-top-of-my-head idea for the case that intersection won't cut it:
For the datatype that is contained in the columns, find two hash functions h1, h2 such that

- h1 produces hashes roughly uniformly between 0 and N,
- h2 produces hashes roughly uniformly between 0 and M,

where M * N is approximately 1B, e.g. M = 10k, N = 100k. Then:
- map each entry x from the column of df1 to (h1(x), x),
- map each entry x from the column of df2 to (h1(x), x),
- group by h1 into buckets of x's,
- join on h1 (that's gonna be the nasty shuffle),
- then locally, for each pair of buckets (b1, b2) that came from df1 and df2 and had the same h1 hash code, do essentially the same:
  - compute h2 for all x's from b1 and from b2,
  - group by the hash code h2,
  - compare the remaining small sub-sub-buckets by converting everything to sets (toSet) and computing the intersection directly.

Everything that remains after the intersection is present in both df1 and df2, so compute the size and sum the results across all partitions.
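
The steps above can be sketched in plain Python (no Spark; the toy N and M and the hashlib-based hash functions are my own assumptions, not from the original answer):

```python
import hashlib
from collections import defaultdict

N = 8   # number of first-level buckets (toy value; ~100k in the text)
M = 4   # number of second-level buckets (toy value; ~10k in the text)

def h1(x):
    # First-level hash, roughly uniform in [0, N).
    return int(hashlib.sha256(repr(x).encode()).hexdigest(), 16) % N

def h2(x):
    # Second-level hash, roughly uniform in [0, M).
    return int(hashlib.md5(repr(x).encode()).hexdigest(), 16) % M

def matched_count(col1, col2):
    # Stage 1: group each column's values into N buckets by h1.
    # (In Spark, this grouping/join on h1 would be the nasty shuffle.)
    b1 = defaultdict(set)
    b2 = defaultdict(set)
    for x in col1:
        b1[h1(x)].add(x)
    for x in col2:
        b2[h1(x)].add(x)

    total = 0
    # For each pair of buckets that share an h1 code...
    for k in b1.keys() & b2.keys():
        # Stage 2: split each bucket further by h2, then intersect the
        # small sub-sub-buckets directly as sets.
        s1 = defaultdict(set)
        s2 = defaultdict(set)
        for x in b1[k]:
            s1[h2(x)].add(x)
        for x in b2[k]:
            s2[h2(x)].add(x)
        for j in s1.keys() & s2.keys():
            total += len(s1[j] & s2[j])
    return total

print(matched_count([1, 2, 2, 3, 5], [2, 3, 4, 4, 6]))  # -> 2
```

In Spark terms, each `matched_count` inner loop would run locally on one partition, and only the per-partition totals would be summed at the end.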
The idea is to select N small enough so that the buckets with ~M entries still comfortably fit on a single node, but at the same time to prevent the whole application from dying on the first shuffle, which would otherwise try to find out where everything is by sending every key to everyone else. For example, using SHA-256 as the "hash code" for h1 wouldn't help much, because the keys would be essentially unique, so you might as well take the original data directly and try to do a shuffle with that. However, if you restrict N to some reasonably small number, e.g. 10k, you obtain a rough approximation of where what is, so that you can then regroup the buckets and start the second stage with h2.
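
The SHA-256 point can be seen in plain Python (helper names are mine, not from the answer): the full digest is essentially unique per value, while the same digest reduced mod N gives a bounded number of bucket ids.

```python
import hashlib

def sha_key(x):
    # Full SHA-256 digest: essentially unique per value, so grouping by
    # it buys nothing over shuffling the raw data.
    return hashlib.sha256(repr(x).encode()).hexdigest()

def h1(x, N=10_000):
    # Same digest reduced mod N: at most N distinct bucket ids, giving a
    # rough approximation of "where what is".
    return int(hashlib.sha256(repr(x).encode()).hexdigest(), 16) % N

values = range(100_000)
print(len({sha_key(v) for v in values}))  # 100000 distinct keys
print(len({h1(v) for v in values}))       # at most 10000 bucket ids
```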
Essentially it's just a random guess, I didn't test it. It could well be that the built-in intersection is smarter than anything I could possibly come up with.