aggregate function Count usage with groupBy in Spark
I'm trying to perform multiple operations in a single line of code in PySpark, and I'm not sure whether that's possible in my case.
My intention is to avoid having to save the output as a new dataframe.
My current code is rather simple:
encodeUDF = udf(encode_time, StringType())

new_log_df.cache().withColumn('timePeriod', encodeUDF(col('START_TIME'))) \
    .groupBy('timePeriod') \
    .agg(
        mean('DOWNSTREAM_SIZE').alias("Mean"),
        stddev('DOWNSTREAM_SIZE').alias("Stddev")
    ) \
    .show(20, False)
My intention is to add count() after using groupBy, to get the count of records matching each value of the timePeriod column, printed/shown as output.
When trying to use groupBy(..).count().agg(..) I get exceptions.
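For what it's worth, here is a minimal sketch of why that chain raises (assuming the same query shape as above): groupBy(..).count() already performs the aggregation and returns a plain DataFrame containing only the grouping column and a count column, so a subsequent .agg(..) can no longer resolve DOWNSTREAM_SIZE.

counted = new_log_df.groupBy('timePeriod').count()
counted.printSchema()
# root
#  |-- timePeriod: string (nullable = true)
#  |-- count: long (nullable = false)

# The next call fails with an AnalysisException ("cannot resolve
# 'DOWNSTREAM_SIZE'") because that column no longer exists:
# counted.agg(mean('DOWNSTREAM_SIZE'))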
Is there any way to achieve both the count() and agg().show() prints, without splitting the code into two commands, e.g.:

new_log_df.withColumn(..).groupBy(..).count()
new_log_df.withColumn(..).groupBy(..).agg(..).show()
Or better yet, to get a merged output in the agg.show() output: an extra column stating the counted number of records matching the row's value. E.g.:
timePeriod | Mean | Stddev | Num Of Records
X | 10 | 20 | 315
count() can be used inside agg(), as the groupBy expression is the same.
import pyspark.sql.functions as func

new_log_df.cache().withColumn("timePeriod", encodeUDF(new_log_df["START_TIME"])) \
    .groupBy("timePeriod") \
    .agg(
        func.mean("DOWNSTREAM_SIZE").alias("Mean"),
        func.stddev("DOWNSTREAM_SIZE").alias("Stddev"),
        func.count(func.lit(1)).alias("Num Of Records")
    ) \
    .show(20, False)
PySpark SQL functions doc
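As a side note (a variation of mine, not from the original answer): func.count("*") should behave the same way here, counting all rows in each group regardless of nulls:

func.count("*").alias("Num Of Records")  # equivalent row count per group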
Scala version:

import org.apache.spark.sql.functions._ // for count()

new_log_df.cache().withColumn("timePeriod", encodeUDF(col("START_TIME")))
  .groupBy("timePeriod")
  .agg(
    mean("DOWNSTREAM_SIZE").alias("Mean"),
    stddev("DOWNSTREAM_SIZE").alias("Stddev"),
    count(lit(1)).alias("Num Of Records")
  )
  .show(20, false)
count(lit(1)) simply counts every row in each group; it yields the same result as count("timePeriod") here, since count(col) skips nulls and timePeriod is never null.
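A minimal, self-contained sketch of that difference (toy data and column names are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import count, lit

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data: one null in the 'period' column.
df = spark.createDataFrame([("X", 10), (None, 20), ("X", 30)], ["period", "size"])

df.agg(
    count(lit(1)).alias("all_rows"),    # 3 -- counts every row
    count("period").alias("non_null")   # 2 -- skips the null
).show()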
Java version:

import static org.apache.spark.sql.functions.*;

new_log_df.cache().withColumn("timePeriod", encodeUDF(col("START_TIME")))
    .groupBy("timePeriod")
    .agg(
        mean("DOWNSTREAM_SIZE").alias("Mean"),
        stddev("DOWNSTREAM_SIZE").alias("Stddev"),
        count(lit(1)).alias("Num Of Records")
    )
    .show(20, false);