Scala + Spark + DataFrame exception when I try to dynamically cast a column and assign sorting order
I want to sort my DataFrame using selected columns, casting them from StringType to a preferred type with a preferred sort order. But even a simple cast of a single column doesn't work and throws the exception below. I am providing the sample code here:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.IntegerType

val conf = new SparkConf().setAppName("Sparkify").setMaster("local[*]")
val sparkContext = new SparkContext(conf)
val sqlContext = new SQLContext(sparkContext)
var df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("example-data.csv")
// val colsToSort = List("age")
df = df.sort(df.col("age").cast(IntegerType).desc)
df.show()
sparkContext.stop()
The simple CSV looks like this:
+-----+---+---+
| name|sex|age|
+-----+---+---+
|Alice| f| 34|
| Bob| m| 63|
|Alice| f| 14|
| Bob| m| 6|
+-----+---+---+
The detailed exception stack:
Exception in thread "main" org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: unresolvedalias(cast(cast(age#2 as decimal(20,0)) as int))
at org.apache.spark.sql.catalyst.analysis.UnresolvedAlias.dataType(unresolved.scala:295)
at org.apache.spark.sql.catalyst.expressions.SortOrder.dataType(SortOrder.scala:49)
at org.apache.spark.sql.catalyst.expressions.SortOrder.checkInputDataTypes(SortOrder.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:62)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
What am I doing wrong? Or what is the best way to dynamically declare a sorting expression based on multiple columns, with type casting and ordering? Can anyone help me with this, please?
You can't create the new column within the sort function -- you have to do it before the sort. Try something like this instead:
df.withColumn("age", $"age".cast(IntegerType)).sort($"age".desc)
If you want to use a variable for the column name, try this:
val colName = "age"
df.withColumn(colName, col(colName).cast(IntegerType)).sort(col(colName).desc)
Note that, unless you are joining two tables with a same-named column, you can generally leave off the df. in front of col(...). You could even create the sort column before you create the df, and it works fine:
val sortCol = col("age").desc
val df = ...
df.sort(sortCol)
In that way you can easily apply the same sort to multiple DataFrames.
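As a minimal sketch of that reuse (the two DataFrame names here are made up for illustration; each is assumed to have an age column):

val sortCol = col("age").cast(IntegerType).desc  // one shared sort expression
val sortedUsers  = usersDf.sort(sortCol)         // applied to one DataFrame
val sortedEvents = eventsDf.sort(sortCol)        // reused on another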
If you want to cast more than one column, you could just do:
df.withColumn(...).withColumn(...).withColumn(...)
Where each of the withColumn calls is a cast expression. Or you could do:
df.select($"age".cast(IntegerType), $"otherCol".cast(DoubleType), ...)