Scala Spark DataFrame SQL withColumn - how to use a function(x: String) for transformations

Question

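My goal is to add a column to an existing DataFrame and populate it with a transformation of a column that already exists in that DataFrame.

All the examples I have found use withColumn to add the column and when().otherwise() for the transformation.

I would like to use a defined function(x: String) with a match/case expression, which would let me use String functions and apply more complex transformations.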

Sample DataFrame

val etldf = Seq(   
            ("Total, 20 to 24 years            "),
            ("Men, 20 to 24 years              "),
            ("Women, 20 to 24 years            ")).toDF("A")

A simple transformation can be applied with when().otherwise(). I can nest these, but that soon gets messy.

val newcol = when($"A".contains("Men"), "Male").
  otherwise(when($"A".contains("Women"), "Female").
  otherwise("Both"))
val newdf = etldf.withColumn("NewCol", newcol)      
newdf.select("A","NewCol").show(100, false)

The output is as follows:

+---------------------------------+------+
|A                                |NewCol|
+---------------------------------+------+
|Total, 20 to 24 years            |Both  |
|Men, 20 to 24 years              |Male  |
|Women, 20 to 24 years            |Female|
+---------------------------------+------+

But let's say I want a slightly more complex transformation:

val newcol = when($"A".contains("Total") && $"A".contains("years"), $"A".indexOf("to").toString())

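It doesn't like this, because indexOf is a String function and not a member of ColumnName.

What I really want to do is define a function that can implement very complex transformations and pass it to withColumn():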

 def AtoNewCol( A : String): String = A match {
   case a if a.contains("Men") => "Male"
   case a if a.contains("Women") => "Female"
   case a if a.contains("Total") && a.contains("years") => a.indexOf("to").toString()
   case other => "Both"
 }
 AtoNewCol("Total, 20 to 24 years            ")

This outputs 10 (the position of "to").

But I face the same type mismatch, because withColumn() works with a ColumnName object while my function takes a String:

scala> val newdf = etldf.withColumn("NewCol", AtoNewCol($"A"))
<console>:33: error: type mismatch;
found   : org.apache.spark.sql.ColumnName
required: String
val newdf = etldf.withColumn("NewCol", AtoNewCol($"A"))
                                                    ^

If I change the signature to AtoNewCol(A: org.apache.spark.sql.ColumnName), I hit the same problem inside the implementation:

scala>  def AtoNewCol( A : org.apache.spark.sql.ColumnName): String = A match {
 |     case a if a.contains("Men") => "Male"
 |     case a if a.contains("Women") => "Female"
 |     case a if a.contains("Total") && a.contains("years") => a.indexOf("to").toString()
 |     case other => "Both"
 |   }
<console>:30: error: type mismatch;
found   : org.apache.spark.sql.Column
required: Boolean
       case a if a.contains("Men") => "Male"
                           ^
.
.
.
etc.

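I am hoping there is a syntax that binds the value of the column to the function.

Or maybe there is some function other than withColumn() that allows more complex functions to be defined for transformations.

Open to all suggestions.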

Answer 1

You just need a udf function

import org.apache.spark.sql.functions._
def AtoNewCol = udf(( A : String) => A match {
  case a if a.contains("Men") => "Male"
  case a if a.contains("Women") => "Female"
  case a if a.contains("Total") && a.contains("years") => a.indexOf("to").toString()
  case other => "Both"
})

etldf.withColumn("NewCol", AtoNewCol($"A")).show(false)

You should get

+---------------------------------+------+
|A                                |NewCol|
+---------------------------------+------+
|Total, 20 to 24 years            |10    |
|Men, 20 to 24 years              |Male  |
|Women, 20 to 24 years            |Female|
+---------------------------------+------+

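A udf function works row by row: the transformation is applied to the primitive value in each row (here a String), rather than to the column as a whole as with the other built-in functions.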
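
Because the body of a udf runs on ordinary Scala values, the row-level logic can be written and checked as a plain function before it is wrapped. The sketch below is one way to do that. The name aToNewColSafe is hypothetical, and mapping a null cell to "Both" is an assumption made here for illustration, not something the question specifies; without such a guard, a null in column A would likely surface as a NullPointerException inside the udf.

```scala
// Plain Scala version of the row-level logic; it needs no Spark to run.
// aToNewColSafe is a hypothetical name, not part of the original question.
def aToNewColSafe(a: String): String = Option(a) match {
  // Assumption: a null cell is treated like the fallback case.
  case None => "Both"
  case Some(s) if s.contains("Men") => "Male"
  case Some(s) if s.contains("Women") => "Female"
  // indexOf is zero-based and case-sensitive, so the "To" in "Total" is skipped.
  case Some(s) if s.contains("Total") && s.contains("years") => s.indexOf("to").toString
  case Some(_) => "Both"
}
```

Wrapping it as udf(aToNewColSafe _) should then give a function that accepts $"A" in withColumn, in the same way as the udf shown above.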

Answer 2

You need to create a UDF for this; you can try the following. I am using the function you defined.

def AtoNewCol = udf((A: String) => {
  A match {
    case a if a.contains("Men") => "Male"
    case a if a.contains("Women") => "Female"
    case a if a.contains("Total") && a.contains("years") => a.indexOf("to").toString
    case other => "Both"
  }
})

etldf.withColumn("NewCol", AtoNewCol($"A")).show(false)

//    output
//    +---------------------------------+------+
//    |A                                |NewCol|
//    +---------------------------------+------+
//    |Total, 20 to 24 years            |10    | 
//    |Men, 20 to 24 years              |Male  |
//    |Women, 20 to 24 years            |Female|
//    +---------------------------------+------+
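
For comparison, this particular mapping can probably also be expressed with built-in column functions only, which avoids a udf where the logic allows it. The sketch below assumes a local SparkSession created just for illustration (in spark-shell, spark already exists) and uses instr from org.apache.spark.sql.functions as a stand-in for String.indexOf. instr is 1-based and returns 0 when the substring is absent, hence the subtraction.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, instr, when}

// Local session for illustration only; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("withColumn-sketch")
  .getOrCreate()
import spark.implicits._

val etldf = Seq(
  "Total, 20 to 24 years            ",
  "Men, 20 to 24 years              ",
  "Women, 20 to 24 years            ").toDF("A")

// Chained when() calls replace the nested otherwise(when(...)) form.
// instr is 1-based, so subtract 1 to mirror the zero-based String.indexOf.
val newcol = when(col("A").contains("Men"), "Male")
  .when(col("A").contains("Women"), "Female")
  .when(col("A").contains("Total") && col("A").contains("years"),
    (instr(col("A"), "to") - 1).cast("string"))
  .otherwise("Both")

val newdf = etldf.withColumn("NewCol", newcol)
newdf.show(false)
```

For transformations that have no built-in equivalent, the udf approach in the answers remains the more general route.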

2 answers

Answer 1
3 2018-04-03 04:37:23

Answer 2
2 Accepted 2018-04-03 04:38:17
