
Apache Spark's RDD[Vector] Immutability issue

I know that RDDs are immutable and therefore their values cannot be changed, but I see the following behaviour:

I wrote an implementation of the FuzzyCMeans algorithm ( https://github.com/salexln/FinalProject_FCM ) and now I'm testing it, so I run the following example:

import org.apache.spark.mllib.clustering.FuzzyCMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("/home/development/myPrjects/R/butterfly/butterfly.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
> parsedData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[2] at map at <console>:31

val numClusters = 2
val numIterations = 20


parsedData.foreach{ point => println(point) }
> [0.0,-8.0]
[-3.0,-2.0]
[-3.0,0.0]
[-3.0,2.0]
[-2.0,-1.0]
[-2.0,0.0]
[-2.0,1.0]
[-1.0,0.0]
[0.0,0.0]
[1.0,0.0]
[2.0,-1.0]
[2.0,0.0]
[2.0,1.0]
[3.0,-2.0]
[3.0,0.0]
[3.0,2.0]
[0.0,8.0] 

val clusters = FuzzyCMeans.train(parsedData, numClusters, numIterations)
parsedData.foreach{ point => println(point) }
> [0.0,-0.4803333185624595]
[-0.1811743096972924,-0.12078287313152826]
[-0.06638890786148487,0.0]
[-0.04005925925925929,0.02670617283950619]
[-0.12193263222069807,-0.060966316110349035]
[-0.0512,0.0]
[NaN,NaN]
[-0.049382716049382706,0.0]
[NaN,NaN]
[0.006830134553650707,0.0]
[0.05120000000000002,-0.02560000000000001]
[0.04755220304297078,0.0]
[0.06581619798335057,0.03290809899167529]
[0.12010867103812725,-0.0800724473587515]
[0.10946638900458144,0.0]
[0.14814814814814817,0.09876543209876545]
[0.0,0.49119985188436205] 

But how can it be that my method changes the immutable RDD?

BTW, the signature of the train method is the following:

train(data: RDD[Vector], clusters: Int, maxIterations: Int)

What you are doing is precisely described in the docs:

Printing elements of an RDD

Another common idiom is attempting to print out the elements of an RDD using rdd.foreach(println) or rdd.map(println). On a single machine, this will generate the expected output and print all the RDD's elements. However, in cluster mode, the output to stdout being called by the executors is now writing to the executor's stdout instead, not the one on the driver, so stdout on the driver won't show these! To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use take(): rdd.take(100).foreach(println).
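
For example, a minimal sketch of driver-side printing, assuming the same parsedData RDD as in the question (collect() and take() are standard RDD methods):

// Brings the entire RDD to the driver before printing, so every element
// is shown on the driver's stdout; this can exhaust driver memory for
// large RDDs:
parsedData.collect().foreach(println)

// Safer for large RDDs: fetch only the first few elements to the driver.
parsedData.take(5).foreach(println)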

So, as data can migrate between nodes, the same output of foreach is not guaranteed. The RDD is immutable, but you should extract the data in an appropriate way, as you don't have the whole RDD at your node.


Another possible issue (not in your case, as you're using an immutable vector) is using mutable data inside the Point itself, which is completely incorrect, so you'd lose all guarantees - the RDD itself would still be immutable, however.

In order for the RDD to be completely immutable, its content should be immutable as well:

scala> val m = Array.fill(2, 2)(0)
m: Array[Array[Int]] = Array(Array(0, 0), Array(0, 0))

scala> val rdd = sc.parallelize(m)
rdd: org.apache.spark.rdd.RDD[Array[Int]] = ParallelCollectionRDD[1] at parallelize at <console>:23

scala> rdd.collect()
res6: Array[Array[Int]] = Array(Array(0, 0), Array(0, 0))

scala> m(0)(1) = 2

scala> rdd.collect()
res8: Array[Array[Int]] = Array(Array(0, 2), Array(0, 0)) 

So, because the Array is mutable, I could change it, and therefore the RDD was updated with the new data.
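
By contrast, a minimal sketch with immutable content (a hypothetical REPL session, not from the original answer, using immutable List instead of Array):

scala> val m = List(List(0, 0), List(0, 0))
m: List[List[Int]] = List(List(0, 0), List(0, 0))

scala> val rdd = sc.parallelize(m)
rdd: org.apache.spark.rdd.RDD[List[Int]] = ParallelCollectionRDD[2] at parallelize at <console>:23

scala> // List is immutable: m(0)(1) = 2 does not even compile,
scala> // so the data backing the RDD cannot be changed underneath it.

scala> rdd.collect()
res0: Array[List[Int]] = Array(List(0, 0), List(0, 0))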
