
aggregate function Count usage with groupBy in Spark

I'm trying to make multiple operations in one line of code in pySpark, and not sure if that's possible for my case.

My intention is not having to save the output as a new dataframe.

My current code is rather simple:

from pyspark.sql.functions import udf, col, mean, stddev
from pyspark.sql.types import StringType

# encode_time maps a START_TIME value to a time-period label
encodeUDF = udf(encode_time, StringType())

new_log_df.cache().withColumn('timePeriod', encodeUDF(col('START_TIME'))) \
  .groupBy('timePeriod') \
  .agg(
    mean('DOWNSTREAM_SIZE').alias("Mean"),
    stddev('DOWNSTREAM_SIZE').alias("Stddev")
  ) \
  .show(20, False)

And my intention is to add count() after using groupBy, to get the count of records matching each value of the timePeriod column, printed/shown as output.

When trying to use groupBy(..).count().agg(..) I get exceptions.

Is there any way to achieve both the count() and the agg().show() prints, without splitting the code into two lines of commands, e.g.:

new_log_df.withColumn(..).groupBy(..).count()
new_log_df.withColumn(..).groupBy(..).agg(..).show()

Or, better yet, getting a merged output in the agg().show() output: an extra column stating the counted number of records matching the row's value, e.g.:

timePeriod | Mean | Stddev | Num Of Records
    X      | 10   |   20   |    315

count() can be used inside agg(), since the groupBy expression is the same. (Most likely the exceptions you see with groupBy(..).count().agg(..) come from count() already returning an aggregated DataFrame with only the grouping column and a count column, so the later agg() can no longer resolve DOWNSTREAM_SIZE.)

With Python

import pyspark.sql.functions as func

new_log_df.cache().withColumn("timePeriod", encodeUDF(new_log_df["START_TIME"])) \
  .groupBy("timePeriod") \
  .agg(
     func.mean("DOWNSTREAM_SIZE").alias("Mean"),
     func.stddev("DOWNSTREAM_SIZE").alias("Stddev"),
     func.count(func.lit(1)).alias("Num Of Records")
   ) \
  .show(20, False)

pySpark SQL functions doc
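
For reference, here is a minimal, self-contained sketch of the same pattern on toy data (the SparkSession setup, the toy rows and the column values are assumptions for illustration, not the asker's real schema), showing that the count lands in the same output row as the other aggregates:

# Minimal sketch on assumed toy data (Spark 2.x+ SparkSession API)
from pyspark.sql import SparkSession
import pyspark.sql.functions as func

spark = SparkSession.builder.master("local[1]").appName("groupby-count-demo").getOrCreate()

# toy rows standing in for new_log_df after the timePeriod column has been added
df = spark.createDataFrame(
    [("X", 10.0), ("X", 30.0), ("Y", 5.0)],
    ["timePeriod", "DOWNSTREAM_SIZE"],
)

df.groupBy("timePeriod") \
  .agg(
      func.mean("DOWNSTREAM_SIZE").alias("Mean"),
      func.stddev("DOWNSTREAM_SIZE").alias("Stddev"),
      func.count(func.lit(1)).alias("Num Of Records")
  ) \
  .show(20, False)

With these toy values, each timePeriod row carries its mean, stddev and record count in one table, which is the merged output asked for above.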

With Scala

import org.apache.spark.sql.functions._ // for mean, stddev, count, lit

new_log_df.cache().withColumn("timePeriod", encodeUDF(col("START_TIME"))) 
  .groupBy("timePeriod")
  .agg(
     mean("DOWNSTREAM_SIZE").alias("Mean"), 
     stddev("DOWNSTREAM_SIZE").alias("Stddev"),
     count(lit(1)).alias("Num Of Records")
   )
  .show(20, false)

count(lit(1)) counts every record in the group, because the literal 1 is never null; here that yields the same result as count("timePeriod"), since the grouping column has no nulls.
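
To make that distinction concrete, here is a small sketch (shown in pySpark for brevity, on assumed toy data that is not part of the original answer): count(lit(1)) counts every row in a group, whereas count on a specific column skips nulls, so the two only coincide when the counted column has no nulls.

from pyspark.sql import SparkSession
import pyspark.sql.functions as func

spark = SparkSession.builder.getOrCreate()

# assumed toy data: one null value in group "X"
df = spark.createDataFrame(
    [("X", 1.0), ("X", None), ("Y", 2.0)],
    ["timePeriod", "val"],
)

df.groupBy("timePeriod").agg(
    func.count(func.lit(1)).alias("all_rows"),  # counts every row: X -> 2, Y -> 1
    func.count("val").alias("non_null_vals")    # skips nulls: X -> 1, Y -> 1
).show()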

With Java

import static org.apache.spark.sql.functions.*;

// assuming encodeUDF is a UserDefinedFunction; apply(...) invokes it on the column
new_log_df.cache().withColumn("timePeriod", encodeUDF.apply(col("START_TIME")))
  .groupBy("timePeriod")
  .agg(
     mean("DOWNSTREAM_SIZE").alias("Mean"),
     stddev("DOWNSTREAM_SIZE").alias("Stddev"),
     count(lit(1)).alias("Num Of Records")
   )
  .show(20, false);
