Scala + Spark + DataFrame: exception when I try to dynamically cast a column and assign a sort order

I want to sort my DataFrame by selected columns, casting each one from StringType to a preferred type and applying a preferred sort order. But even a simple cast of a single column fails with the exception below. Here is the sample code.

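    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.types.IntegerType

    val conf = new SparkConf().setAppName("Sparkify").setMaster("local[*]")
    val sparkContext = new SparkContext(conf)
    val sqlContext = new SQLContext(sparkContext)
    // Every column is read as a string because no schema is given.
    var df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("example-data.csv")
    // val colsToSort = List("age")
    // This is the line that throws the exception.
    df = df.sort(df.col("age").cast(IntegerType).desc)
    df.show()
    sparkContext.stop()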

The CSV looks like this:

+-----+---+---+
| name|sex|age|
+-----+---+---+
|Alice|  f| 34|
|  Bob|  m| 63|
|Alice|  f| 14|
|  Bob|  m|  6|
+-----+---+---+

The detailed exception stack trace:

Exception in thread "main" org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: unresolvedalias(cast(cast(age#2 as decimal(20,0)) as int))
    at org.apache.spark.sql.catalyst.analysis.UnresolvedAlias.dataType(unresolved.scala:295)
    at org.apache.spark.sql.catalyst.expressions.SortOrder.dataType(SortOrder.scala:49)
    at org.apache.spark.sql.catalyst.expressions.SortOrder.checkInputDataTypes(SortOrder.scala:42)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:62)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)

What am I doing wrong? Or what is the best way to dynamically build a sort expression over multiple columns, with type casting and ordering?

Can anyone help me with this, please?

You can't create the new column inside the sort call; you have to create it before the sort. Try something like this instead:

df.withColumn("age", $"age".cast(IntegerType)).sort($"age".desc)

If you want to use a variable for the column name, try this:

val colName = "age"
df.withColumn(colName, col(colName).cast(IntegerType)).sort(col(colName).desc)

Note that, unless you are joining two tables that share a column name, you can generally leave off the df. in front of col(...). You could even create the sort column before you create the df, and it works fine:

val sortCol = col("age").desc
val df = ...
df.sort(sortCol)

In that way you can easily apply the same sort to multiple DataFrames.
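As a rough sketch of that reuse, the snippet below builds the ordering once and applies it to several DataFrames. The object name SharedSort, the value byAgeDescThenName, and the helper sortAll are hypothetical names for illustration, and the sketch assumes every DataFrame passed in already has a numeric age column and a name column.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Hypothetical helper object, not part of Spark.
object SharedSort {
  // Built once, without reference to any particular DataFrame.
  // Assumes "age" is already numeric; otherwise the order is lexicographic.
  val byAgeDescThenName: Seq[Column] = Seq(col("age").desc, col("name").asc)

  // Apply the same ordering to every DataFrame that has these columns.
  def sortAll(dfs: Seq[DataFrame]): Seq[DataFrame] =
    dfs.map(_.sort(byAgeDescThenName: _*))
}
```

Because the columns are only resolved when sort is called, the same Seq[Column] can be passed to any DataFrame whose schema contains the referenced names.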

If you want to cast more than one column, you could just do:

df.withColumn(...).withColumn(...).withColumn(...)

Where each withColumn call applies one cast expression. Or you could do:

df.select($"age".cast(IntegerType), $"otherCol".cast(DoubleType), ...)
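To tie this back to the original question, here is a minimal sketch of one way to drive both the casts and the sort order from a list, following the cast-first-then-sort approach above. SortSpec, DynamicSort, and castAndSort are hypothetical names introduced for illustration, not Spark APIs; only withColumn, cast, asc, desc, and sort come from Spark.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DataType

// Hypothetical description of one sort key: column name, target type, direction.
final case class SortSpec(name: String, dataType: DataType, ascending: Boolean = true)

// Hypothetical helper object, not part of Spark.
object DynamicSort {
  // Cast every listed column first, then sort by the already-cast columns.
  def castAndSort(df: DataFrame, specs: Seq[SortSpec]): DataFrame =
    if (specs.isEmpty) df // nothing to cast or sort by
    else {
      // Step 1: replace each column with its cast version, one withColumn per spec.
      val casted = specs.foldLeft(df) { (acc, s) =>
        acc.withColumn(s.name, col(s.name).cast(s.dataType))
      }
      // Step 2: build one sort expression per spec, in priority order.
      val sortCols = specs.map { s =>
        if (s.ascending) col(s.name).asc else col(s.name).desc
      }
      // sort accepts a variable number of Column arguments.
      casted.sort(sortCols: _*)
    }
}
```

With the sample data, a call such as DynamicSort.castAndSort(df, Seq(SortSpec("age", IntegerType, ascending = false))) would cast age to an integer and sort it in descending order. The foldLeft is simply the chained withColumn calls written as a loop, so the list of columns can come from configuration or user input.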
