aggregate function Count usage with groupBy in Spark
I'm trying to perform multiple operations in a single line of code in PySpark, and I'm not sure whether that's possible in my case.
My intention is to avoid having to save the output as a new dataframe.
My current code is rather simple:
encodeUDF = udf(encode_time, StringType())

new_log_df.cache().withColumn('timePeriod', encodeUDF(col('START_TIME'))) \
    .groupBy('timePeriod') \
    .agg(
        mean('DOWNSTREAM_SIZE').alias("Mean"),
        stddev('DOWNSTREAM_SIZE').alias("Stddev")
    ) \
    .show(20, False)
My intention is to add count() after using groupBy, to get the count of records matching each value of the timePeriod column, printed/shown as output.
When trying to use groupBy(..).count().agg(..) I get exceptions.
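For what it's worth, here is a minimal sketch of why that chain raises (assuming the same query shape as above): groupBy(..).count() already performs the aggregation and returns a plain DataFrame containing only the grouping column and a count column, so a subsequent .agg(..) can no longer resolve DOWNSTREAM_SIZE.

counted = new_log_df.groupBy('timePeriod').count()
counted.printSchema()
# root
#  |-- timePeriod: string (nullable = true)
#  |-- count: long (nullable = false)

# The next call fails with an AnalysisException ("cannot resolve
# 'DOWNSTREAM_SIZE'") because that column no longer exists:
# counted.agg(mean('DOWNSTREAM_SIZE'))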
Is there any way to achieve both the count() and agg().show() prints, without splitting the code into two commands, e.g.:

new_log_df.withColumn(..).groupBy(..).count()
new_log_df.withColumn(..).groupBy(..).agg(..).show()
Or better yet, to get a merged output in the agg.show() output: an extra column stating the counted number of records matching the row's value. E.g.:
timePeriod | Mean | Stddev | Num Of Records
X | 10 | 20 | 315
count() can be used inside agg(), as the groupBy expression is the same.
import pyspark.sql.functions as func

new_log_df.cache().withColumn("timePeriod", encodeUDF(new_log_df["START_TIME"])) \
    .groupBy("timePeriod") \
    .agg(
        func.mean("DOWNSTREAM_SIZE").alias("Mean"),
        func.stddev("DOWNSTREAM_SIZE").alias("Stddev"),
        func.count(func.lit(1)).alias("Num Of Records")
    ) \
    .show(20, False)
PySpark SQL functions doc
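As a side note (a variation of mine, not from the original answer): func.count("*") should behave the same way here, counting all rows in each group regardless of nulls:

func.count("*").alias("Num Of Records")  # equivalent row count per group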
Scala version:

import org.apache.spark.sql.functions._ // for count()

new_log_df.cache().withColumn("timePeriod", encodeUDF(col("START_TIME")))
  .groupBy("timePeriod")
  .agg(
    mean("DOWNSTREAM_SIZE").alias("Mean"),
    stddev("DOWNSTREAM_SIZE").alias("Stddev"),
    count(lit(1)).alias("Num Of Records")
  )
  .show(20, false)
count(lit(1)) simply counts every row in each group; it yields the same result as count("timePeriod") here, since count(col) skips nulls and timePeriod is never null.
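A minimal, self-contained sketch of that difference (toy data and column names are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import count, lit

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data: one null in the 'period' column.
df = spark.createDataFrame([("X", 10), (None, 20), ("X", 30)], ["period", "size"])

df.agg(
    count(lit(1)).alias("all_rows"),    # 3 -- counts every row
    count("period").alias("non_null")   # 2 -- skips the null
).show()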
Java version:

import static org.apache.spark.sql.functions.*;

new_log_df.cache().withColumn("timePeriod", encodeUDF(col("START_TIME")))
    .groupBy("timePeriod")
    .agg(
        mean("DOWNSTREAM_SIZE").alias("Mean"),
        stddev("DOWNSTREAM_SIZE").alias("Stddev"),
        count(lit(1)).alias("Num Of Records")
    )
    .show(20, false);