PySpark - Sum a column in dataframe and return results as int

I have a pyspark dataframe with a column of numbers. I need to sum that column and then have the result returned as an int in a python variable.

df = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)],["Letter", "Number"])

I do the following to sum the column.

df.groupBy().sum()

But I get a dataframe back.

+-----------+
|sum(Number)|
+-----------+
|        130|
+-----------+

I would like 130 returned as an int, stored in a variable to be used elsewhere in the program.

result = 130

I think the simplest way:

df.groupBy().sum().collect()

will return a list. In your example:

In [9]: df.groupBy().sum().collect()[0][0]
Out[9]: 130

The simplest way, really:

df.groupBy().sum().collect()

But it is a very slow operation: avoid groupByKey; you should use RDD and reduceByKey instead:

df.rdd.map(lambda x: (1,x[1])).reduceByKey(lambda x,y: x + y).collect()[0][1]

I tried it on a bigger dataset and measured the processing time:

RDD and reduceByKey: 2.23 s

GroupByKey: 30.5 s
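
For reference, a rough sketch of how such a comparison could be timed (the dataset behind the 2.23 s / 30.5 s figures is the answerer's own; spark.range below is just a stand-in):

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
big_df = spark.range(10000000).withColumnRenamed("id", "Number")  # stand-in dataset

start = time.perf_counter()
total = big_df.rdd.map(lambda x: (1, x[0])).reduceByKey(lambda a, b: a + b).collect()[0][1]
print("reduceByKey:", time.perf_counter() - start, "s, sum =", total)

start = time.perf_counter()
total = big_df.groupBy().sum().collect()[0][0]
print("groupBy().sum():", time.perf_counter() - start, "s, sum =", total)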

If you want a specific column:

import pyspark.sql.functions as F     

df.agg(F.sum("my_column")).collect()[0][0]

This is another way you can do this, using agg and collect:

sum_number = df.agg({"Number":"sum"}).collect()[0]

result = sum_number["sum(Number)"]
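
If relying on the auto-generated "sum(Number)" column name feels fragile, a small variation (my own addition, not part of the original answer) is to attach an alias to the aggregate:

import pyspark.sql.functions as F

# Alias the aggregate so the result column has a predictable name.
result = df.agg(F.sum("Number").alias("total")).collect()[0]["total"]
# result == 130, a plain Python int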

Similar to other answers, but without the use of a groupby or agg. I just select the column in question, sum it, collect it, and then grab the first two indices to return an int. The only reason I chose this over the accepted answer is that I am new to pyspark and was confused that the 'Number' column was not explicitly summed in the accepted answer. If I had to come back after some time and try to understand what was happening, syntax such as the below would be easier for me to follow.

import pyspark.sql.functions as f   

df.select(f.sum('Number')).collect()[0][0]

You can also try using the first() function. It returns the first row from the dataframe, and you can access the values of the respective columns using indices.

df.groupBy().sum().first()[0]

In your case, the result is a dataframe with a single row and column, so the above snippet works.

Select the column as an RDD, abuse keys() to get the value in the Row (or use .map(lambda x: x[0])), then use the RDD sum:

df.select("Number").rdd.keys().sum()

SQL sum using selectExpr:

df.selectExpr("sum(Number)").first()[0]
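
An equivalent way to run the same SQL aggregate, assuming you register the dataframe as a temporary view first (this variant is my own sketch, not part of the original answer):

# Register the dataframe as a view and query it with spark.sql.
df.createOrReplaceTempView("numbers")
spark.sql("SELECT SUM(Number) AS total FROM numbers").first()["total"]  # 130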

The following should work:

df.groupBy().sum().rdd.map(lambda x: x[0]).collect()

Sometimes when you read a csv file into a pyspark Dataframe, a numeric column may end up as string type (e.g. '23'). In that case you should use pyspark.sql.functions.sum to get the result as an int, not Python's built-in sum().

import pyspark.sql.functions as F                                                    
df.groupBy().agg(F.sum('Number')).show()
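
As a minimal sketch (assuming the column really did come in as strings), you can also cast before aggregating and then pull the value out as before:

import pyspark.sql.functions as F

# Hypothetical dataframe where Number was read in as strings.
str_df = spark.createDataFrame([("A", "20"), ("B", "30"), ("D", "80")], ["Letter", "Number"])

# Cast to int first, then aggregate and extract the plain Python value.
result = str_df.agg(F.sum(F.col("Number").cast("int"))).collect()[0][0]
print(result)  # 130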
