
Spark dataframe reduceByKey-like operation

I have a Spark DataFrame with the following data (I use spark-csv to load the data in):

key,value
1,10
2,12
3,0
1,20
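
For reference, I load the data roughly like this (a sketch; the file path data.csv and the app name are placeholders):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="reducebykey-example")  # hypothetical app name
sqlContext = SQLContext(sc)

# Requires the spark-csv package (Spark 1.x); "data.csv" is a hypothetical path.
df = sqlContext.read \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("data.csv")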

Is there anything similar to the Spark RDD reduceByKey which can return a Spark DataFrame like the following (basically, summing up the values for the same key)?

key,value
1,30
2,12
3,0

(I can transform the data to an RDD and do a reduceByKey operation, but is there a more Spark-DataFrame-API way to do this?)

If you don't care about column names you can use groupBy followed by sum:

df.groupBy($"key").sum("value")

Otherwise it is better to replace sum with agg:

df.groupBy($"key").agg(sum($"value").alias("value"))

Finally, you can use raw SQL:

df.registerTempTable("df")
sqlContext.sql("SELECT key, SUM(value) AS value FROM df GROUP BY key")
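
(On Spark 2.x the same SQL works, but registerTempTable is deprecated in favour of createOrReplaceTempView and the query is run through a SparkSession; a sketch assuming a session named spark:)

df.createOrReplaceTempView("df")
spark.sql("SELECT key, SUM(value) AS value FROM df GROUP BY key")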

See also DataFrame / Dataset groupBy behaviour/optimization

How about this? I agree this still converts to an RDD and then back to a DataFrame.

df.select('key','value').map(lambda x: x).reduceByKey(lambda a,b: a+b).toDF(['key','value'])

I think user goks missed out on some part in the code; it's not tested code.

.map should have been used to convert the RDD to a pair RDD, e.g. .map(lambda x: (x, 1)).reduceByKey(...).

reduceByKey is not available on a single-value RDD or a regular RDD, only on a pair RDD.
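
A minimal corrected sketch for the summing asked in the question (a guess at what was intended; each Row becomes a (key, value) tuple before reduceByKey, and toDF needs an active SQLContext/SparkSession):

# Sum values per key via a pair RDD, then convert back to a DataFrame.
result = (df.select('key', 'value')
            .rdd
            .map(lambda r: (r[0], r[1]))
            .reduceByKey(lambda a, b: a + b)
            .toDF(['key', 'value']))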

Thx
