Spark Scala - How do I iterate rows in dataframe, and add calculated values as new columns of the data frame
I have a DataFrame with two columns, "date" and "value". How do I add two new columns, "value_mean" and "value_sd", where "value_mean" is the mean of "value" over the past 10 days (including the current date given in "date") and "value_sd" is the standard deviation of "value" over the same 10-day window?
Spark SQL provides various DataFrame aggregate functions such as avg, mean, sum, etc. You just need to apply them to a DataFrame column via the Spark SQL Column API.
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
Create a private method for the standard deviation:
private def stddev(col: Column): Column = sqrt(avg(col * col) - avg(col) * avg(col))
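This helper computes the population standard deviation via the identity Var(x) = E[x²] − E[x]². As a side note (an assumption about your Spark version, not part of the original answer), Spark 1.6+ ships built-in aggregates that could replace the hand-rolled helper:

```scala
import org.apache.spark.sql.functions.{stddev_pop, stddev_samp}

// Built-in equivalents of the helper above:
// stddev_pop  -> population standard deviation (matches the E[x^2] - E[x]^2 formula)
// stddev_samp -> sample standard deviation (n - 1 in the denominator)
val sdPop  = stddev_pop(df.col("value")).as("value_sd")
val sdSamp = stddev_samp(df.col("value")).as("value_sd_sample")
```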
Now you can create SQL Columns for the average and the standard deviation:
val value_sd: org.apache.spark.sql.Column = stddev(df.col("value")).as("value_sd")
val value_mean: org.apache.spark.sql.Column = avg(df.col("value")).as("value_mean")
Filter your DataFrame to the last 10 days (or whatever range you need):
val filterDF = df.filter("") // put your filter condition here
Now you can apply the aggregate functions to filterDF:
filterDF.agg(value_sd, value_mean).show
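Note that `agg` collapses the whole filtered DataFrame into a single summary row. If, as the question asks, you want "value_mean" and "value_sd" attached to every row, computed over a trailing 10-day window per row, a window-function sketch could look like the following (assumptions: Spark 1.6+, a "date" column castable to a timestamp, and the window expressed in seconds for `rangeBetween`):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Trailing 10-day window, inclusive of the current row's date.
// rangeBetween works on the long (seconds) value of the ordering column.
val tenDaysInSeconds = 10L * 24 * 60 * 60

// No partitionBy here, so Spark will warn that all rows move to one
// partition; add a partition key if your data has one.
val w = Window
  .orderBy(col("date").cast("timestamp").cast("long"))
  .rangeBetween(-tenDaysInSeconds, 0)

val result = df
  .withColumn("value_mean", avg(col("value")).over(w))
  .withColumn("value_sd", stddev_pop(col("value")).over(w))
```

This adds the two calculated columns row by row, which matches the original question more closely than a global aggregate.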