Computing the mean and standard deviation with Spark / Scala

Question

I have a DataFrame:

+------------------+
|             speed|
+------------------+
|               0.0|
|               0.0|
|               0.0|
|               0.0|
|               0.0|
|               0.0|
| 3.851015222867941|
| 4.456657435740331|
|               0.0|
|               NaN|
|               0.0|
|               0.0|
|               NaN|
|               0.0|
|               0.0|
| 5.424094717765175|
|1.5781185921913181|
|2.6695439462433033|
| 17.43513658955467|
| 5.440912941359523|
|11.507138536880484|
|12.895677610360089|
| 9.930875909722456|
+------------------+

I want to compute the mean and standard deviation of the speed column. When I run this code:

dataframe_final.select("speed").orderBy("id").agg(avg("speed")).show(1000)

I get:

+----------+
|avg(speed)|
+----------+
|       NaN|
+----------+

Where does the problem come from? Is there a way to fix it?

Thanks

Answer 1

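Your dataset contains NaN (not a number) values. Any arithmetic involving NaN yields NaN, so the average of the whole column comes out as NaN too.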
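To see why a single NaN is enough to spoil the average, here is a minimal sketch in plain Scala (no Spark involved). The object name `NaNPropagation` and the sample values are made up for illustration and are not taken from the question's data.

```scala
object NaNPropagation {
  // Arithmetic mean of a sequence of doubles.
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Hypothetical sample with one NaN entry.
  val speeds: Seq[Double] = Seq(0.0, 2.0, Double.NaN, 4.0)

  // The sum includes NaN, so the sum and therefore the mean are NaN.
  val rawMean: Double = mean(speeds)

  // Dropping the NaN entries first gives a well-defined mean.
  val cleanMean: Double = mean(speeds.filterNot(_.isNaN))

  def main(args: Array[String]): Unit = {
    println(s"raw mean:   $rawMean")
    println(s"clean mean: $cleanMean")
  }
}
```

Spark's avg behaves the same way on a double column: it skips nulls, but a NaN is an ordinary double value and takes part in the sum.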

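Either filter them out. Note that in Spark NaN is not the same as null, so the filter has to use isnan rather than isNotNull:

// needs: import org.apache.spark.sql.functions.{avg, isnan} and import spark.implicits._
dataframe_final
  .filter(!isnan($"speed"))   // isNotNull would keep the NaN rows
  .select("speed")
  .agg(avg("speed"))          // sorting before a global aggregate has no effect, so orderBy is dropped
  .show(1000)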

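Or replace them with 0 using the fill function (the zeros then count as observations, so the average is lower than with filtering):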

dataframe_final
  .select("speed")
  .na.fill(0)
  .agg(avg("speed"))
  .show(1000)

Also, you are trying to aggregate the Vitesse column rather than speed.
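The question also asks for the standard deviation, which the snippets above do not compute. A minimal sketch of getting both values in one pass, assuming a local SparkSession and a small made-up DataFrame standing in for dataframe_final (the object name `SpeedStats` and the helper `meanAndStddev` are hypothetical). It also assumes at least two non-NaN rows remain after filtering; otherwise the aggregates may come back as null.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col, isnan, stddev}

object SpeedStats {
  // Returns (mean, sample standard deviation) of the "speed" column,
  // ignoring NaN rows. avg and stddev skip nulls on their own.
  def meanAndStddev(df: DataFrame): (Double, Double) = {
    val row = df
      .filter(!isnan(col("speed")))
      .agg(avg("speed"), stddev("speed"))
      .first()
    (row.getDouble(0), row.getDouble(1))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("speed-stats")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for dataframe_final.
    val df = Seq(0.0, 2.0, Double.NaN, 4.0).toDF("speed")

    val (mean, sd) = meanAndStddev(df)
    println(s"mean = $mean, stddev = $sd")

    spark.stop()
  }
}
```

stddev is an alias for stddev_samp (the sample standard deviation, dividing by n - 1); use stddev_pop instead if the population standard deviation is wanted.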

Answer 2

We can also register the DataFrame as a temporary view with createOrReplaceTempView and then use Spark SQL to compute the average of the speed column:

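// createOrReplaceTempView takes the view name as a string and returns Unit
dataframe_final.createOrReplaceTempView("tableview")
// isnan removes the NaN rows; IS NOT NULL alone would keep them
val query = "SELECT avg(speed) FROM tableview WHERE speed IS NOT NULL AND NOT isnan(speed)"
spark.sql(query).show()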
