PySpark - Sum a column in dataframe and return results as int

I have a PySpark dataframe with a column of numbers. I need to sum that column and then have the result returned as an int in a Python variable.

df = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)],["Letter", "Number"])

I do the following to sum the column.


But I get a dataframe back.

+-----------+
|sum(Number)|
+-----------+
|        130|
+-----------+

I would like 130 returned as an int stored in a variable to be used elsewhere in the program.

result = 130

I think the simplest way:

df.groupBy().sum().collect()

will return a list. In your example:

In [9]: df.groupBy().sum().collect()[0][0]
Out[9]: 130

The simplest way, really:


But it is a very slow operation: avoid groupByKey; you should use RDD and reduceByKey:

df.rdd.map(lambda x: (1,x[1])).reduceByKey(lambda x,y: x + y).collect()[0][1]

I tried on a bigger dataset and measured the processing time:

RDD and reduceByKey: 2.23 s

GroupByKey: 30.5 s

If you want a specific column:

import pyspark.sql.functions as F     


This is another way you can do this, using agg and collect:

sum_number = df.agg({"Number":"sum"}).collect()[0]

result = sum_number["sum(Number)"]

Similar to other answers, but without the use of a groupby or agg. I just select the column in question, sum it, collect it, and then index into the result twice to return an int. The only reason I chose this over the accepted answer is that I am new to PySpark and was confused that the 'Number' column was not explicitly summed in the accepted answer. If I had to come back after some time and try to understand what was happening, syntax such as the below would be easier for me to follow.

import pyspark.sql.functions as f   


You can also try using the first() function. It returns the first row from the dataframe, and you can access the values of the respective columns using indices.


In your case, the result is a dataframe with a single row and column, so the above snippet works.

Select the column as an RDD, abuse keys() to get the value in each Row (or use .map(lambda x: x[0])), then use the RDD sum:


SQL sum using selectExpr:



df.groupBy().sum().rdd.map(lambda x: x[0]).collect()

Sometimes, when you read a CSV file into a PySpark DataFrame, a numeric column may come in as a string type (for example '23'). In that case you should use pyspark.sql.functions.sum, not sum(), to get the result as an int:

import pyspark.sql.functions as F                                                    
