
How to assign values into a Breeze matrix in flatMap (Scala, Spark)?

I want to initialize a matrix using data inside flatMap. This is my data:

-4,0,1.0 ### horrible . not-work install dozen scanner umax ofcourse . tech-support everytime call . fresh install work error . crummy product crummy tech-support crummy experience .
2,1,1.0 ### scanner run . grant product run windows . live fact driver windows lose performance . setup program alert support promptly quits . amazon . website product package requirement listing compatible windows .
1,2,1.0 ### conversion kit spare battery total better stick versionand radio blow nimh charger battery . combination operation size nimh battery . motorola kit . rechargable battery available flashlight camera game toy .
-4,3,1.0 ### recieive part autowinder catch keep place sudden break . hold listen music winder wind . extremely frustrated fix pull little hard snap half . flush drain .

And this is my code:

val spark_context = new SparkContext(conf)
val data = spark_context.textFile(Input)
val Gama = DenseMatrix.zeros[Double](4, 2)
var gmmainit = data.flatMap(line => {
  val tuple = line.split("###")
  val ss = tuple(0)
  val re = """^(-?\d+)\s*,\s*(\d+)\s*,\s*(\d+).*$""".r
  val re(n1, n2, n3) = ss // pattern match and extract values

  if (n1.toInt >= 0) {
    Gama(n2.toInt, 0) += 1
  }
  if (n1.toInt < 0) {
    Gama(n2.toInt, 1) += 1
  }
})

println(Gama)

But it doesn't change the Gama matrix.

How can I modify my code to solve this problem?

You can't modify variables in your distributed functions. Well, you can, but the variable is only modified in that process. Remember that Spark is distributed. So you need to return a value that can be flattened (I don't know DenseMatrix well enough to say exactly what is needed here). You might be able to create a custom accumulator to accomplish this, though, provided the update operation is associative and commutative.
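One way to read "return a value" is to fold each line into a matrix that Spark itself carries through the job, for example with `RDD.aggregate`. The following is only a sketch under assumptions: the object name `GamaAggregate`, the local master, and the input path taken from `args(0)` are illustrative and not from the question; the 4x2 shape and the regex are copied from it.

```scala
import breeze.linalg.DenseMatrix
import org.apache.spark.{SparkConf, SparkContext}

object GamaAggregate {
  // Same pattern as in the question: signed label, row index, weight.
  private val Re = """^(-?\d+)\s*,\s*(\d+)\s*,\s*(\d+).*$""".r

  // Folds one line into the partial matrix owned by the current task.
  def seqOp(m: DenseMatrix[Double], line: String): DenseMatrix[Double] = {
    line match {
      case Re(n1, n2, _) => m(n2.toInt, if (n1.toInt >= 0) 0 else 1) += 1.0
      case _             => // leave m unchanged for lines that do not match
    }
    m
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local master and an input path given as the first argument.
    val conf = new SparkConf().setAppName("gama-aggregate").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val data = sc.textFile(args(0))

    // Each task starts from its own copy of the zero matrix; the partial
    // matrices are then summed element-wise into the final result.
    val gama = data.aggregate(DenseMatrix.zeros[Double](4, 2))(seqOp, _ + _)

    println(gama)
    sc.stop()
  }
}
```

The result comes back to the driver as the return value of `aggregate`, so nothing on the driver is mutated from inside a task. Element-wise addition is associative and commutative, which is what makes the merge order irrelevant.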

First of all, your code won't even compile. If you take a look at the flatMap signature:

flatMap[U](f: T => TraversableOnce[U])

you'll see that it maps from T to TraversableOnce[U]. Because the update method of DenseMatrix returns Unit, the function you pass has type String => Unit, and Unit is not a TraversableOnce.
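To make the type requirement concrete, here is a small sketch on plain Scala collections (no Spark or Breeze needed) of a function with a shape that flatMap accepts: it returns a list of `((row, column), 1)` pairs instead of Unit. The object name `FlatMapShape` and the helper `cell` are hypothetical, and the input strings are shortened prefixes of the sample lines from the question.

```scala
object FlatMapShape {
  // Same pattern as in the question.
  private val Re = """^(-?\d+)\s*,\s*(\d+)\s*,\s*(\d+).*$""".r

  // String => List[...]: a collection result, unlike String => Unit.
  def cell(line: String): List[((Int, Int), Int)] = line match {
    case Re(n1, n2, _) => List(((n2.toInt, if (n1.toInt >= 0) 0 else 1), 1))
    case _             => Nil // malformed lines contribute nothing
  }

  def main(args: Array[String]): Unit = {
    val lines = List(
      "-4,0,1.0 ### horrible",
      "2,1,1.0 ### scanner run",
      "1,2,1.0 ### conversion kit",
      "-4,3,1.0 ### recieive part")

    // Flatten the per-line pairs, then sum the counts per (row, column) cell.
    val counts = lines
      .flatMap(cell)
      .groupBy(_._1)
      .map { case (key, pairs) => key -> pairs.map(_._2).sum }

    counts.toSeq.sortBy(_._1).foreach(println)
  }
}
```

On an RDD the same function could be passed to `flatMap`, with the per-cell sums presumably done by `reduceByKey(_ + _)` rather than `groupBy`; the counts would then be written into the matrix on the driver.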

Moreover, as Justin already explained, each partition gets its own local copy of the variables referenced in the closure, and only that copy is modified.

One way you can solve this problem is something like this:

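import breeze.linalg.DenseMatrix

val gmmainit = data.mapPartitions(iter => {
  val re = """^(-?\d+)\s*,\s*(\d+)\s*,\s*(\d+).*$""".r
  // One local matrix per partition, so no shared state is mutated
  val gama = DenseMatrix.zeros[Double](4, 2)
  iter.foreach {
    // Column 0 counts non-negative n1, column 1 counts negative n1
    case re(n1, n2, n3) => gama(n2.toInt, if (n1.toInt >= 0) 0 else 1) += 1
    case _ => // skip lines that do not match
  }
  Iterator(gama)
}).reduce(_ + _) // element-wise sum of the per-partition matrices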
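The per-partition idea can be checked locally without a cluster. The sketch below simulates it with plain Scala: nested arrays stand in for DenseMatrix, and two hand-made "partitions" hold shortened prefixes of the sample lines. The names `PerPartitionCounts`, `countPartition` and `add` are hypothetical, and the split into two partitions is an assumption chosen for illustration.

```scala
object PerPartitionCounts {
  // Same pattern as in the question.
  private val Re = """^(-?\d+)\s*,\s*(\d+)\s*,\s*(\d+).*$""".r

  type Matrix = Array[Array[Double]] // stand-in for DenseMatrix[Double]

  // Mirrors the mapPartitions body: one private 4x2 matrix per partition.
  def countPartition(lines: Iterator[String]): Matrix = {
    val m = Array.fill(4, 2)(0.0)
    lines.foreach {
      case Re(n1, n2, _) => m(n2.toInt)(if (n1.toInt >= 0) 0 else 1) += 1
      case _             => // ignore lines that do not match
    }
    m
  }

  // Mirrors reduce(_ + _): element-wise sum of two partial results.
  def add(a: Matrix, b: Matrix): Matrix =
    a.zip(b).map { case (ra, rb) => ra.zip(rb).map { case (x, y) => x + y } }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(
      Seq("-4,0,1.0 ### horrible", "2,1,1.0 ### scanner run"),
      Seq("1,2,1.0 ### conversion kit", "-4,3,1.0 ### recieive part"))

    val gama = partitions.map(p => countPartition(p.iterator)).reduce(add)
    gama.foreach(row => println(row.mkString(" ")))
  }
}
```

Because each partial matrix is built independently and the merge is a plain sum, the final result does not depend on how the lines are split across partitions.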
