Spark Scala: "overloaded method value select with alternatives" when trying to get the max value

df.show
+-------+-----+----------+--------+------------+----+
|     id|  val|      date|    time|         use|flag|
+-------+-----+----------+--------+------------+----+
|8200732|    1|2015-01-06|11:48:30|30065.221532|   0|
|8200733|    1|2015-01-06|11:48:40|30065.225763|   0|
|8200734|    1|2015-01-06|11:48:50|30065.229994|   0|
|8200735|    1|2015-01-06|11:49:00|30065.234225|   0|

I am trying to get the average use for each date value. Here is what I tried:

 df.select("date",max($"use")).show()
<console>:26: error: overloaded method value select with alternatives:
  [U1, U2](c1: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U1], c2: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U2])org.apache.spark.sql.Dataset[(U1, U2)] <and>
  (col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
  (cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
 cannot be applied to (String, org.apache.spark.sql.Column)

I am not sure what I am doing wrong. I have tried to rewrite this many times, but each time I get an error. I can get the max value of use on its own, but getting the max value of use for each date is causing me issues.

I cannot use Spark SQL or PySpark for this.

That's because your call doesn't match any of the overloads of the select method on DataFrame. The call you're making is:

df.select("date",max($"use")).show()

Notice that "date" is a String literal, while max($"use") is a Column, and no overload of select accepts a mix of the two. Use the date column instead of the literal string:

// notice the $ before date here
df.select($"date",max($"use")).show()
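
To make the overload mismatch concrete, here is a minimal sketch, assuming a local SparkSession and a small stand-in DataFrame with the question's date and use columns. The object name SelectOverloads and the two sample rows are made up for illustration. Each select call passes arguments of one kind only, so it matches one of the overloads listed in the error message.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max}

// Hypothetical demo object; not part of the original question or answer.
object SelectOverloads {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("select-overloads")
      .getOrCreate()
    import spark.implicits._ // enables the $"..." syntax and toDF

    // Stand-in data with only the two columns that matter here.
    val df = Seq(
      ("2015-01-06", 30065.221532),
      ("2015-01-06", 30065.225763)
    ).toDF("date", "use")

    // Matches (col: String, cols: String*): every argument is a column name.
    df.select("date", "use").show()

    // Matches (cols: Column*): every argument is a Column.
    df.select($"date", $"use").show()
    df.select(col("date"), col("use")).show()

    // A lone aggregate Column also matches (cols: Column*)
    // and reduces the whole DataFrame to a single row.
    df.select(max($"use")).show()

    // Does not compile: a String mixed with a Column matches no overload.
    // df.select("date", max($"use"))

    spark.stop()
  }
}
```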

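Note that this only fixes the overload error: Spark's analyzer still rejects a select that mixes a plain column with an aggregate when there is no groupBy. To get the average use for each date value, group by date and aggregate: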

df.groupBy("date").agg(mean("use")).show()

To get the max, as in the question:

df.groupBy("date").agg(max("use")).show()
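
Both aggregates can also be computed in one pass. The following is a minimal end-to-end sketch, assuming Spark 2.x or later running locally. The object name PerDateStats, the helper perDateStats, the output column names avg_use and max_use, and the 2015-01-07 date on the third row are all made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{max, mean}

// Hypothetical demo object; not part of the original question or answer.
object PerDateStats {
  // Returns one row per date with the mean and the max of "use".
  // The aliases avg_use and max_use are illustrative names.
  def perDateStats(df: DataFrame): DataFrame =
    df.groupBy("date")
      .agg(mean("use").as("avg_use"), max("use").as("max_use"))
      .orderBy("date")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("per-date-stats")
      .getOrCreate()
    import spark.implicits._ // enables toDF on a Seq of tuples

    // Values taken from the question's table; the second date is
    // invented so that the grouping produces more than one row.
    val df = Seq(
      (8200732, "2015-01-06", 30065.221532),
      (8200733, "2015-01-06", 30065.225763),
      (8200734, "2015-01-07", 30065.229994)
    ).toDF("id", "date", "use")

    perDateStats(df).show()

    spark.stop()
  }
}
```

Passing several aggregate columns to a single agg call keeps everything in the DataFrame API, so it needs neither Spark SQL nor PySpark.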
