
Scala Spark - using RDD with mllib

I have data in the form of RDD[(List[Double], List[Double])], for example:

// sc is assumed to be an existing SparkContext
val sampleData = sc.parallelize(Seq(
    (List(1.1, 1.2, 1.3), List(1.1, 1.5, 1.2)),
    (List(3.0, 3.3, 3.3), List(3.1, 3.2, 3.6))
))

I would like to call Statistics.corr(a, b), where a comes from the first List[Double] and b from the second List[Double].

The result I would like is two correlation values from corr(): one for (1.1, 1.2, 1.3) and (1.1, 1.5, 1.2), and one for (3.0, 3.3, 3.3) and (3.1, 3.2, 3.6).

My attempted solution is:

Statistics.corr(sampleData.flatMap(_._1), sampleData.flatMap(_._2))

This gives me a single correlation for (1.1, 1.2, 1.3, 3.0, 3.3, 3.3) and (1.1, 1.5, 1.2, 3.1, 3.2, 3.6), which is not what I want.

This calls for map, not flatMap, since you want to keep the rows of the RDD separate.

Unfortunately, I'm not yet aware of a serializable correlation function that will operate on two List[Double]s. The first place I checked was the Pearson correlation from Apache Commons, but it's not serializable. You may have to write your own function (but I'd spend some more effort looking first). Once you have a correlation function, you'll use it as follows:

sampleData.map(x => correlation(x._1,x._2))

This will still be an RDD, and apart from their order its elements carry no reference to the rows they came from, so you may want to pass the original data along (or, at least, whatever id each row used to have).
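If you do end up writing your own function, here is a minimal sketch of what it could look like. Correlation, pearson and withSource are hypothetical names of my own, not part of Spark or mllib. The code is plain Scala with no Spark dependency, and it assumes both lists are non-empty and of equal length.

```scala
// Hypothetical helper, not part of Spark or mllib.
object Correlation {

  /** Pearson correlation coefficient of two equal-length lists. */
  def pearson(xs: List[Double], ys: List[Double]): Double = {
    require(xs.nonEmpty && xs.length == ys.length,
      "inputs must be non-empty and of equal length")
    val n = xs.length
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    // Deviations of each value from its list's mean
    val dx = xs.map(_ - meanX)
    val dy = ys.map(_ - meanY)
    // Sum of cross products and sums of squares
    val cov = dx.zip(dy).map { case (a, b) => a * b }.sum
    val ssX = dx.map(d => d * d).sum
    val ssY = dy.map(d => d * d).sum
    // Evaluates to NaN when either list is constant (zero variance)
    cov / math.sqrt(ssX * ssY)
  }

  /** Pairs a row with its correlation so the result can be traced back to its input. */
  def withSource(row: (List[Double], List[Double])): ((List[Double], List[Double]), Double) =
    (row, pearson(row._1, row._2))
}
```

On the RDD from the question, sampleData.map(x => Correlation.pearson(x._1, x._2)) should give one value per row, and sampleData.map(Correlation.withSource) keeps each original pair of lists next to its result. Since the helper lives in a top-level object and only uses standard collections, I would not expect it to cause serialization trouble, but I have not tested it on a cluster.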
