
Spark dataframe reduceByKey-like operation

I have a Spark DataFrame with the following data (I use spark-csv to load the data in):

key,value
1,10
2,12
3,0
1,20
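
For reference, I load the data roughly like this (a sketch; the file path data.csv and the app name are placeholders):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="reducebykey-example")  # hypothetical app name
sqlContext = SQLContext(sc)

# Requires the spark-csv package (Spark 1.x); "data.csv" is a hypothetical path.
df = sqlContext.read \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("data.csv")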

Is there anything similar to the Spark RDD reduceByKey which can return a Spark DataFrame like the following (basically, summing up the values for the same key)?

key,value
1,30
2,12
3,0

(I can transform the data to an RDD and do a reduceByKey operation, but is there a more Spark-DataFrame-API way to do this?)

If you don't care about column names you can use groupBy followed by sum:

df.groupBy($"key").sum("value")

Otherwise it is better to replace sum with agg:

df.groupBy($"key").agg(sum($"value").alias("value"))

Finally, you can use raw SQL:

df.registerTempTable("df")
sqlContext.sql("SELECT key, SUM(value) AS value FROM df GROUP BY key")
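
(On Spark 2.x the same SQL works, but registerTempTable is deprecated in favour of createOrReplaceTempView and the query is run through a SparkSession; a sketch assuming a session named spark:)

df.createOrReplaceTempView("df")
spark.sql("SELECT key, SUM(value) AS value FROM df GROUP BY key")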

See also DataFrame / Dataset groupBy behaviour/optimization

How about this? I agree this still converts to an RDD and then back to a DataFrame.

df.select('key','value').map(lambda x: x).reduceByKey(lambda a,b: a+b).toDF(['key','value'])

I think user goks missed out on some part in the code; it's not tested code.

.map should have been used to convert the RDD to a pair RDD, e.g. .map(lambda x: (x, 1)).reduceByKey(...).

reduceByKey is not available on a single-value RDD or a regular RDD, only on a pair RDD.
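
A minimal corrected sketch for the summing asked in the question (a guess at what was intended; each Row becomes a (key, value) tuple before reduceByKey, and toDF needs an active SQLContext/SparkSession):

# Sum values per key via a pair RDD, then convert back to a DataFrame.
result = (df.select('key', 'value')
            .rdd
            .map(lambda r: (r[0], r[1]))
            .reduceByKey(lambda a, b: a + b)
            .toDF(['key', 'value']))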

Thx
