Spark - Sum of row values
I have the following DataFrame:
January | February | March
-----------------------------
10 | 10 | 10
20 | 20 | 20
50 | 50 | 50
I'm trying to add a column to this which is the sum of the values of each row.
January | February | March | TOTAL
----------------------------------
10 | 10 | 10 | 30
20 | 20 | 20 | 60
50 | 50 | 50 | 150
As far as I can see, all the built-in aggregate functions seem to operate on single columns. How do I use values across columns on a per-row basis (using Scala)?
I've gotten as far as
val newDf: DataFrame = df.select(colsToSum.map(col):_*).foreach ...
You were very close with this:
val newDf: DataFrame = df.select(colsToSum.map(col):_*).foreach ...
Instead, try this:
import org.apache.spark.sql.functions.col

val newDf = df.select(colsToSum.map(col).reduce((c1, c2) => c1 + c2) as "sum")
I think this is the best of the answers, because it is as fast as the answer with the hard-coded SQL query, and as convenient as the one that uses the UDF. It's the best of both worlds -- and I didn't even add a full line of code!
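The one-liner above works because Spark's Column overloads +, so reduce folds the list of columns into a single expression rather than computing numbers eagerly. Here is a minimal plain-Python sketch of that fold, using a toy stand-in class (illustrative only, not the real pyspark.sql.Column):

```python
from functools import reduce

# Toy stand-in for Spark's Column type: `+` builds up an expression
# string instead of adding numbers, mimicking how Column.+ returns a
# new Column expression to be evaluated later.
class Col:
    def __init__(self, expr):
        self.expr = expr

    def __add__(self, other):
        return Col(f"({self.expr} + {other.expr})")

cols = [Col("January"), Col("February"), Col("March")]
total = reduce(lambda c1, c2: c1 + c2, cols)
print(total.expr)  # ((January + February) + March)
```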
Alternatively, using Hugo's approach and example, you can create a UDF that receives any number of columns and sums them all.
from functools import reduce
from pyspark.sql.functions import udf

def superSum(*cols):
    return reduce(lambda a, b: a + b, cols)

add = udf(superSum)
df.withColumn('total', add(*[df[x] for x in df.columns])).show()
+-------+--------+-----+-----+
|January|February|March|total|
+-------+--------+-----+-----+
| 10| 10| 10| 30|
| 20| 20| 20| 60|
+-------+--------+-----+-----+
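Since the body of superSum is plain Python, the fold can be sanity-checked without a Spark session, using the example rows above:

```python
from functools import reduce

def superSum(*cols):
    # Left-to-right fold with +, same as the UDF body above.
    return reduce(lambda a, b: a + b, cols)

print(superSum(10, 10, 10))  # 30
print(superSum(20, 20, 20))  # 60
```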
This code is in Python, but it can be easily translated:
# First we create a RDD in order to create a dataFrame:
rdd = sc.parallelize([(10, 10,10), (20, 20,20)])
df = rdd.toDF(['January', 'February', 'March'])
df.show()
# Here, we create a new column called 'TOTAL' which has results
# from add operation of columns df.January, df.February and df.March
df.withColumn('TOTAL', df.January + df.February + df.March).show()
Output:
+-------+--------+-----+
|January|February|March|
+-------+--------+-----+
| 10| 10| 10|
| 20| 20| 20|
+-------+--------+-----+
+-------+--------+-----+-----+
|January|February|March|TOTAL|
+-------+--------+-----+-----+
| 10| 10| 10| 30|
| 20| 20| 20| 60|
+-------+--------+-----+-----+
You can also create a User Defined Function if you want; here is a link to the Spark documentation: UserDefinedFunction (udf)
Working Scala example with dynamic column selection:
import sqlContext.implicits._
import org.apache.spark.sql.functions.col
val rdd = sc.parallelize(Seq((10, 10, 10), (20, 20, 20)))
val df = rdd.toDF("January", "February", "March")
df.show()
+-------+--------+-----+
|January|February|March|
+-------+--------+-----+
| 10| 10| 10|
| 20| 20| 20|
+-------+--------+-----+
val sumDF = df.withColumn("TOTAL", df.columns.map(c => col(c)).reduce((c1, c2) => c1 + c2))
sumDF.show()
+-------+--------+-----+-----+
|January|February|March|TOTAL|
+-------+--------+-----+-----+
| 10| 10| 10| 30|
| 20| 20| 20| 60|
+-------+--------+-----+-----+
You can use expr(). In Scala:
import org.apache.spark.sql.functions.expr

df.withColumn("TOTAL", expr("January+February+March"))
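Because expr() parses an ordinary SQL expression string, the hard-coded column list can be made dynamic by joining the column names yourself. A sketch in Python for illustration (`cols` here is a hypothetical stand-in for df.columns):

```python
# Build the SQL expression string for expr() dynamically.
# `cols` stands in for df.columns; example values only.
cols = ["January", "February", "March"]
sql_expr = "+".join(cols)
print(sql_expr)  # January+February+March
```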