How to join a random RDD to another RDD?
I have an RDD of strings (but it could be anything, really) that I would like to inner-join with an RDD of random normals. I know this can be solved with a .zipWithIndex on both RDDs, but that doesn't seem like it will scale well. Is there a way to initialize a random RDD with data from another RDD, or another method that would be faster? Here is what I've done with .zipWithIndex:
import org.apache.spark.mllib.random.RandomRDDs
import org.apache.spark.rdd.RDD

val numExamples = 10 // number of rows in RDD
val maNum = 7
val commonStdDev = 0.1 // common standard deviation 1/10, makes variance = 0.01

val normalVectorRDD = RandomRDDs.normalVectorRDD(sc, numRows = numExamples, numCols = maNum)
val rescaledNormals = normalVectorRDD.map{myVec => myVec.toArray.map(x => x * commonStdDev)}
  .zipWithIndex
  .map{case (key, value) => (value, key)}

val otherRDD = sc.textFile(otherFilepath)
  .zipWithIndex
  .map{case (key, value) => (value, key)}

val joinedRDD = otherRDD.join(rescaledNormals).map{case (key, (other, dArray)) => (other, dArray)}
In general I wouldn't worry about zipWithIndex. While it requires an additional action, it belongs to the relatively cheap operations. join, however, is a different thing.
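For reference, zipWithIndex on an RDD mirrors the method of the same name on Scala collections: it pairs each element with its position, and the snippets above then swap the pair so the index becomes the join key. A local sketch of that key construction:

```scala
// Pair each element with its index, then swap so the index is the key,
// exactly as the .zipWithIndex.map{case (key, value) => (value, key)} steps do.
val keyed = List("a", "b", "c").zipWithIndex.map { case (value, idx) => (idx, value) }
// keyed == List((0, "a"), (1, "b"), (2, "c"))
```

Building the same index-based keys on both RDDs is what makes the subsequent join line up rows one-to-one.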
Since the vector content doesn't depend on the value from the otherRDD, it makes more sense to generate it in place. All you have to do is mimic the RandomRDDs logic:
import org.apache.spark.mllib.random.StandardNormalGenerator
import org.apache.spark.ml.linalg.DenseVector // or org.apache.spark.mllib.linalg.DenseVector

val vectorSize = 42
val stdDev = 0.1
val seed = scala.util.Random.nextLong // or set manually

// Derive a deterministic seed for each partition
val random = new scala.util.Random(seed)
val seeds = (0 until otherRDD.getNumPartitions).map(
  i => i -> random.nextLong
).toMap

otherRDD.mapPartitionsWithIndex((i, iter) => {
  val generator = new StandardNormalGenerator()
  generator.setSeed(seeds(i))
  iter.map(x =>
    (x, new DenseVector(Array.fill(vectorSize)(generator.nextValue() * stdDev)))
  )
})
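The key property of this scheme is that the output is reproducible: for a fixed top-level seed, each partition always receives the same derived seed and therefore generates the same vectors. Below is a Spark-free sketch of that idea, using plain Scala collections to stand in for partitions and scala.util.Random#nextGaussian in place of mllib's StandardNormalGenerator (the sizes and names here are illustrative, not from the original):

```scala
val vectorSize = 3
val stdDev = 0.1
val topSeed = 42L

// Mimics the per-partition seeding above: one derived seed per "partition",
// then a fresh generator per partition seeded from the seed map.
def vectorsFor(partitions: Seq[Seq[String]]): Seq[(String, Array[Double])] = {
  val random = new scala.util.Random(topSeed)
  val seeds = partitions.indices.map(i => i -> random.nextLong).toMap
  partitions.zipWithIndex.flatMap { case (part, i) =>
    val gen = new scala.util.Random(seeds(i)) // stands in for StandardNormalGenerator
    part.map(x => (x, Array.fill(vectorSize)(gen.nextGaussian() * stdDev)))
  }
}

val parts = Seq(Seq("a", "b"), Seq("c"))
val first = vectorsFor(parts)
val second = vectorsFor(parts)
// first and second contain identical vectors: same seeds, same output
```

Because every vector is produced where its key already lives, no shuffle is needed, unlike the zipWithIndex-plus-join approach.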