Spark: Searching for matches in large DataSets
I need to count how many values of one of the columns of df1 are present in one of the columns of df2 (I just need the number of matched values).

I wouldn't be asking this question if efficiency wasn't such a big concern:

df1 contains 100,000,000+ records
df2 contains 1,000,000,000+ records
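
To pin down what "number of matched values" means here, a minimal non-Spark illustration in plain Python (the column stand-ins are made up; this assumes duplicates count once, as a set intersection would):

```python
# Toy stand-ins for the two columns (the real ones have 100M+ / 1B+ rows).
col1 = [1, 2, 2, 3, 5]   # values from the df1 column
col2 = [2, 3, 4, 4, 6]   # values from the df2 column

# Count distinct values present in both columns.
matched = len(set(col1) & set(col2))
print(matched)  # 2 and 3 are shared -> 2
```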
Just an off-the-top-of-my-head idea for the case that intersection won't cut it:
For the datatype that is contained in the columns, find two hash functions h1, h2 such that

- h1 produces hashes roughly uniformly between 0 and N,
- h2 produces hashes roughly uniformly between 0 and M,

where M * N is approximately 1B, e.g. M = 10k, N = 100k. Then:
- map each entry x from the column of df1 to (h1(x), x),
- map each entry x from the column of df2 to (h1(x), x),
- group by h1 into buckets of x's,
- join on h1 (that's gonna be the nasty shuffle),
- then locally, for each pair of buckets (b1, b2) that came from df1 and df2 and had the same h1 hash code, do essentially the same:
  - compute h2 for all x's from b1 and from b2,
  - group by the hash code h2,
  - compare the remaining small sub-sub-buckets by converting everything to sets (toSet) and computing the intersection directly.

Everything that remains after the intersection is present in both df1 and df2, so compute the size and sum the results across all partitions.
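
The steps above can be sketched in plain Python (no Spark; the toy N and M and the hashlib-based hash functions are my own assumptions, not from the original answer):

```python
import hashlib
from collections import defaultdict

N = 8   # number of first-level buckets (toy value; ~100k in the text)
M = 4   # number of second-level buckets (toy value; ~10k in the text)

def h1(x):
    # First-level hash, roughly uniform in [0, N).
    return int(hashlib.sha256(repr(x).encode()).hexdigest(), 16) % N

def h2(x):
    # Second-level hash, roughly uniform in [0, M).
    return int(hashlib.md5(repr(x).encode()).hexdigest(), 16) % M

def matched_count(col1, col2):
    # Stage 1: group each column's values into N buckets by h1.
    # (In Spark, this grouping/join on h1 would be the nasty shuffle.)
    b1 = defaultdict(set)
    b2 = defaultdict(set)
    for x in col1:
        b1[h1(x)].add(x)
    for x in col2:
        b2[h1(x)].add(x)

    total = 0
    # For each pair of buckets that share an h1 code...
    for k in b1.keys() & b2.keys():
        # Stage 2: split each bucket further by h2, then intersect the
        # small sub-sub-buckets directly as sets.
        s1 = defaultdict(set)
        s2 = defaultdict(set)
        for x in b1[k]:
            s1[h2(x)].add(x)
        for x in b2[k]:
            s2[h2(x)].add(x)
        for j in s1.keys() & s2.keys():
            total += len(s1[j] & s2[j])
    return total

print(matched_count([1, 2, 2, 3, 5], [2, 3, 4, 4, 6]))  # -> 2
```

In Spark terms, each `matched_count` inner loop would run locally on one partition, and only the per-partition totals would be summed at the end.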
The idea is to select N small enough so that the buckets with ~M entries still comfortably fit on a single node, but at the same time to prevent the whole application from dying on the first shuffle, which would otherwise try to find out where everything is by sending every key to everyone else. For example, using SHA-256 as the "hash code" for h1 wouldn't help much, because the keys would be essentially unique, so you might as well take the original data directly and try to do a shuffle with that. However, if you restrict N to some reasonably small number, e.g. 10k, you obtain a rough approximation of where what is, so that you can then regroup the buckets and start the second stage with h2.
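
The SHA-256 point can be seen in plain Python (helper names are mine, not from the answer): the full digest is essentially unique per value, while the same digest reduced mod N gives a bounded number of bucket ids.

```python
import hashlib

def sha_key(x):
    # Full SHA-256 digest: essentially unique per value, so grouping by
    # it buys nothing over shuffling the raw data.
    return hashlib.sha256(repr(x).encode()).hexdigest()

def h1(x, N=10_000):
    # Same digest reduced mod N: at most N distinct bucket ids, giving a
    # rough approximation of "where what is".
    return int(hashlib.sha256(repr(x).encode()).hexdigest(), 16) % N

values = range(100_000)
print(len({sha_key(v) for v in values}))  # 100000 distinct keys
print(len({h1(v) for v in values}))       # at most 10000 bucket ids
```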
Essentially it's just a random guess, I didn't test it. It could well be that the built-in intersection is smarter than anything I could possibly come up with.