From Column to Array Scala Spark

I am trying to apply a function to a Column in Scala, but I am running into some difficulties.

I get this error:

found   : org.apache.spark.sql.Column
required: Array[Double]

Is there a way to convert a Column to an Array? Thank you.

Update:

Thank you very much for your answer. I think I am getting closer to what I am trying to achieve. Here is a little more context.

Here is the code:

object Targa_Indicators_Full {

  def get_quantile(variable: Array[Double], perc: Double): Double = {
    val sorted_vec: Array[Double] = variable.sorted
    val pos: Double = Math.round(perc * variable.length) - 1
    val quant: Double = sorted_vec(pos.toInt)
    quant
  }

  def main(args: Array[String]): Unit = {

    val get_quantileUDF = udf(get_quantile _)

    val plate_speed =
      trips_df.groupBy($"plate").agg(
        sum($"time_elapsed").alias("time"),
        sum($"space").alias("distance"),
        stddev_samp($"distance" / $"time_elapsed").alias("sd_speed"),
        get_quantileUDF($"distance" / $"time_elapsed", .75).alias("Quant_speed")
      ).withColumn("speed", $"distance" / $"time")

  }
}

Now I get this error:

type mismatch;
[error]  found   : Double(0.75)
[error]  required: org.apache.spark.sql.Column
[error]  get_quantileUDF($"distanza"/$"tempo_intermedio",.75).alias("IQR_speed")
                                                         ^
[error] one error found

What can I do? Thanks.

You cannot apply an ordinary Scala function directly to a DataFrame column. You have to convert your existing function into a UDF first; Spark lets you define your own user-defined functions (UDFs) for this.

For example, suppose you have a DataFrame with an array column:

scala> val df=sc.parallelize((1 to 100).toList.grouped(5).toList).toDF("value")
df: org.apache.spark.sql.DataFrame = [value: array<int>]

You have defined a function that you want to apply to the array column:

def convert( arr:Seq[Int] ) : String = {
  arr.mkString(",")
}

You have to convert it to a UDF before applying it to the column:

val convertUDF = udf(convert _)

Then you can apply the UDF to the column:

df.withColumn("new_col", convertUDF(col("value")))
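
The same idea can be carried over to the quantile function from the question. What follows is only a sketch under a few assumptions, not a definitive implementation. A UDF accepts Column arguments only, so the constant 0.75 is wrapped with lit(...). A UDF is also evaluated once per row rather than once per group, so the per-trip speeds are first gathered into an array column with collect_list, and the UDF is applied after the aggregation. Spark passes an array<double> column to a Scala UDF as a Seq, so the parameter type is Seq[Double] instead of Array[Double]. The object name, the sample rows, the local SparkSession and the index clamping are hypothetical additions for illustration. The per-trip speed is taken as space / time_elapsed, which is a guess at the intent of the original expression.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, lit, sum, udf}

object QuantileUdfSketch {

  // Same rule as get_quantile in the question, but on Seq[Double]:
  // Spark hands an array<double> column to a Scala UDF as a Seq.
  // Assumes a non-empty input; an empty Seq would still throw.
  def quantile(values: Seq[Double], perc: Double): Double = {
    val sorted = values.sorted
    val pos = math.round(perc * sorted.length).toInt - 1
    // Clamp the index (an addition to the original logic) so that a very
    // small perc does not produce index -1.
    sorted(math.max(0, math.min(sorted.length - 1, pos)))
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("quantile-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for trips_df, with made-up rows.
    val tripsDf = Seq(
      ("AA111", 10.0, 100.0),
      ("AA111", 20.0, 300.0),
      ("AA111", 5.0, 40.0),
      ("BB222", 8.0, 96.0)
    ).toDF("plate", "time_elapsed", "space")

    val quantileUDF = udf(quantile _)

    val plateSpeed = tripsDf
      .groupBy($"plate")
      .agg(
        sum($"time_elapsed").alias("time"),
        sum($"space").alias("distance"),
        // Gather the per-trip speeds of each plate into one array column.
        collect_list($"space" / $"time_elapsed").alias("speeds"))
      // Both UDF arguments must be Columns, hence lit(0.75).
      .withColumn("Quant_speed", quantileUDF($"speeds", lit(0.75)))
      .withColumn("speed", $"distance" / $"time")
      .drop("speeds")

    plateSpeed.show()
    spark.stop()
  }
}
```

Note that collect_list materializes every speed of a group in memory on one executor. For very large groups, Spark SQL's built-in percentile_approx aggregate may be a better fit, at the cost of an approximate result.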
