How to calculate the mean of each pair in an RDD consisting of (Key, [Value]) pairs in Spark?
I'm very new to both Scala and Spark, so please forgive me if I'm going about this completely wrong. After taking in a CSV file, filtering, and mapping, I have an RDD that is a bunch of (String, Double) pairs.
(b2aff711,-0.00510)
(ae095138,0.20321)
(etc.)
When I use .groupByKey() on the RDD,
val grouped = rdd1.groupByKey()
I get an RDD with a bunch of (String, [Double]) pairs. (I don't know what CompactBuffer means; maybe it could be causing my issue?)
(32540b03,CompactBuffer(-0.00699, 0.256023))
(a93dec11,CompactBuffer(0.00624))
(32cc6532,CompactBuffer(0.02337, -0.05223, -0.03591))
(etc.)
Once they are grouped, I am trying to take the mean and the standard deviation. I want to simply use the .mean() and .sampleStdev() methods. When I try to create a new RDD of the means,
val mean = grouped.mean()
an error is returned:
Error:(51, 22) value mean is not a member of org.apache.spark.rdd.RDD[(String, Iterable[Double])]
val mean = grouped.mean()
I have imported org.apache.spark.SparkContext._. I also tried using .sampleStdev(), .sum(), and .stats() with the same results. Whatever the problem is, it appears to be affecting all of the numeric RDD operations.
Let's consider the following:
val data = List(("32540b03",-0.00699), ("a93dec11",0.00624),
("32cc6532",0.02337) , ("32540b03",0.256023),
("32cc6532",-0.03591),("32cc6532",-0.03591))
val rdd = sc.parallelize(data.toSeq).groupByKey().sortByKey()
One way to compute the mean for each pair is the following. You need to define an average method:
def average[T](ts: Iterable[T])(implicit num: Numeric[T]) = {
  num.toDouble(ts.sum) / ts.size
}
You can apply this method to the RDD as follows:
val avgs = rdd.map(x => (x._1, average(x._2)))
You can check:
avgs.take(3)
and this is the result:
res4: Array[(String, Double)] = Array((32540b03,0.1245165), (32cc6532,-0.016149999999999998), (a93dec11,0.00624))
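Since the question also asks about the standard deviation, one option (a sketch, reusing the grouped rdd from above and Spark's built-in org.apache.spark.util.StatCounter summary-statistics utility) is to summarize each group in one pass:
import org.apache.spark.util.StatCounter
// StatCounter aggregates count, mean, variance, etc. from a collection of Doubles
val stats = rdd.mapValues(vs => StatCounter(vs))
// Extract whichever summaries you need per key
val meansAndStdevs = stats.mapValues(s => (s.mean, s.sampleStdev))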
The efficient way would be to use reduceByKey instead of groupByKey.
val result = sc.parallelize(data)
  .map { case (key, value) => (key, (value, 1)) }
  .reduceByKey { case ((value1, count1), (value2, count2)) =>
    (value1 + value2, count1 + count2) }
  .mapValues { case (value, count) => value.toDouble / count.toDouble }
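As a sanity check (a sketch; the output order may vary since collect() gathers from distributed partitions):
result.collect()
// e.g. Array((32540b03,0.1245165), (32cc6532,-0.016149999999999998), (a93dec11,0.00624))
The main benefit is that reduceByKey combines the (sum, count) pairs on each partition before shuffling, whereas groupByKey ships every individual value across the network.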
On the other hand, the problem in your solution is that grouped is an RDD of objects of the form (String, Iterable[Double]) (just as the error says). You can, for example, calculate the mean of an RDD of Ints or Doubles, but what would be the mean of an RDD of pairs?
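To see the distinction concretely, flattening the grouped values back into a plain RDD[Double] does make .mean() compile (a sketch, assuming grouped from the question; note this yields the overall mean of all values, not the per-key means):
// RDD[(String, Iterable[Double])] -> RDD[Double], which has the numeric operations
val overallMean = grouped.flatMap { case (_, values) => values }.mean()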
Here is a full program without a custom function:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("means").setMaster("local[*]")
val sc = new SparkContext(conf)
val data = List(("Lily", 23), ("Lily", 50),
  ("Tom", 66), ("Tom", 21), ("Tom", 69),
  ("Max", 11), ("Max", 24))
val rdd = sc.parallelize(data)
// Pair each value with a count of 1, sum both per key, then divide
val counts = rdd.map(item => (item._1, (1, item._2.toDouble)))
val countSums = counts.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
val keyMeans = countSums.mapValues(avgCount => avgCount._2 / avgCount._1)
for ((key, mean) <- keyMeans.collect()) println(key + " " + mean)
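For reference, running this locally should print the following (order may vary across runs):
Lily 36.5
Tom 52.0
Max 17.5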