Spark: Efficient mass lookup in pair RDDs
In Apache Spark I have two RDDs. The first, data : RDD[(K,V)], contains data in key-value form. The second, pairs : RDD[(K,K)], contains a set of interesting key pairs of this data.
How can I efficiently construct an RDD pairsWithData : RDD[((K,K),(V,V))], such that it contains all the elements from pairs as the key-tuple and their corresponding values (from data) as the value-tuple?
Some properties of the data:
- The keys in data are unique.
- All entries in pairs are unique.
- For all pairs (k1,k2) in pairs it is guaranteed that k1 <= k2.
- The size of pairs is only a constant factor times the size of data: |pairs| = O(|data|).
- Current data sizes: |data| ~ 10^8, |pairs| ~ 10^10.
Here is some example code in Scala:
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._

// This kind of shows the idea, but fails at runtime.
def massPairLookup1(keyPairs : RDD[(Int, Int)], data : RDD[(Int, String)]) = {
  keyPairs map {case (k1,k2) =>
    val v1 : String = data lookup k1 head;
    val v2 : String = data lookup k2 head;
    ((k1, k2), (v1,v2))
  }
}

// Works but is O(|data|^2)
def massPairLookup2(keyPairs : RDD[(Int, Int)], data : RDD[(Int, String)]) = {
  // Construct all possible pairs of values
  val cartesianData = data cartesian data map {case((k1,v1),(k2,v2)) => ((k1,k2),(v1,v2))}
  // Select only the values whose keys are in keyPairs
  keyPairs map {(_,0)} join cartesianData mapValues {_._2}
}

// Example function that finds pairs of keys
// Runs in O(|data|) in real life, but cannot maintain the values
def relevantPairs(data : RDD[(Int, String)]) = {
  val keys = data map (_._1)
  keys cartesian keys filter {case (x,y) => x*y == 12 && x < y}
}

// Example run
val data = sc parallelize(1 to 12) map (x => (x, "Number " + x))
val pairs = relevantPairs(data)
val pairsWithData = massPairLookup2(pairs, data)

// Prints (order may vary):
// ((1,12),(Number 1,Number 12))
// ((2,6),(Number 2,Number 6))
// ((3,4),(Number 3,Number 4))
pairsWithData.foreach(println)
Attempt 1
First I tried just using the lookup function on data, but that throws a runtime error when executed. It seems like self is null in PairRDDFunctions.
In addition, I am not sure about the performance of lookup. The documentation says: "This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to." This sounds like n lookups take O(n*|partition|) time at best, which I suspect could be optimized.
Attempt 2
This attempt works, but I create |data|^2 pairs, which will kill performance. I do not expect Spark to be able to optimize that away.
Your lookup 1 doesn't work because you cannot perform RDD operations inside workers (that is, inside another transformation).
In lookup 2, I don't think it's necessary to perform the full cartesian product...
You can do it like this:
val firstjoin = pairs.map({case (k1,k2) => (k1, (k1,k2))})
  .join(data)
  .map({case (_, ((k1, k2), v1)) => ((k1, k2), v1)})
val result = firstjoin.map({case ((k1,k2),v1) => (k2, ((k1,k2),v1))})
  .join(data)
  .map({case(_, (((k1,k2), v1), v2))=>((k1, k2), (v1, v2))})
Or in a more dense form:
val firstjoin = pairs.map(x => (x._1, x)).join(data).map(_._2)
val result = firstjoin.map({case (x,y) => (x._2, (x,y))})
  .join(data).map({case(x, (y, z))=>(y._1, (y._2, z))})
I don't think you can do it more efficiently, but I might be wrong...
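For reference, the two joins above can be wrapped in a function with the same signature as massPairLookup2 from the question. This is only a sketch: the name massPairLookup3 is made up here, and the Int/String types are copied from the question's example rather than being generic.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._ // pair RDD implicits, as in the question

// Hypothetical name; same inputs as massPairLookup2 in the question.
def massPairLookup3(keyPairs: RDD[(Int, Int)],
                    data: RDD[(Int, String)]): RDD[((Int, Int), (String, String))] = {
  // Key every pair by k1 and join with data to pick up v1,
  // then re-key the result by k2 for the second join.
  val withFirst = keyPairs
    .map { case (k1, k2) => (k1, (k1, k2)) }
    .join(data)
    .map { case (_, (pair, v1)) => (pair._2, (pair, v1)) }

  // Join with data again to pick up v2 and restore the pair as the key.
  withFirst
    .join(data)
    .map { case (_, ((pair, v1), v2)) => (pair, (v1, v2)) }
}
```

With the example data from the question, massPairLookup3(pairs, data) should yield the same three entries as massPairLookup2, without building the |data|^2 cartesian product. Each join still shuffles the pairs, so how well this scales to 10^10 pairs is something to measure rather than assume.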