How do I add a new column to my DataFrame so that its values are populated by some other function in Scala?

Question

myFunc(Row): String = {
    //process row
    //returns string
}
appendNewCol(inputDF : DataFrame) : DataFrame ={
    inputDF.withColumn("newcol",myFunc(Row))
    inputDF
}

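But in my case no new column is created. My myFunc passes the row to a knowledgebasesession session object and returns a string after the rules have fired. Can I do this? If not, what is the correct way? Thanks in advance.

I have seen many StackOverflow solutions that use expr(), sqlfunc(col(udf(x)) and other techniques, but here my newcol is not derived directly from an existing column.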

Answer 1

Using a DataFrame:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._ // needed for .toDF outside spark-shell

val myFunc = (r: Row) => {r.getAs[String]("col1") + "xyz"} // example transformation

val testDf = spark.sparkContext.parallelize(Seq(
      (1, "abc"), (2, "def"), (3, "ghi"))).toDF("id", "col1")

testDf.show
    
val rddRes = testDf
        .rdd
        .map{x => 
          val y = myFunc (x)
          Row.fromSeq (x.toSeq ++ Seq(y) )
        }

// Extend the original schema with the new string column
val newSchema = StructType(testDf.schema.fields ++ Array(StructField("col2", dataType = StringType, nullable = false)))

spark.createDataFrame(rddRes, newSchema).show

Result:

+---+----+
| id|col1|
+---+----+
|  1| abc|
|  2| def|
|  3| ghi|
+---+----+

+---+----+------+
| id|col1|  col2|
+---+----+------+
|  1| abc|abcxyz|
|  2| def|defxyz|
|  3| ghi|ghixyz|
+---+----+------+
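If you would rather stay in the DataFrame API, one option is to pack all columns into a struct and pass that to a UDF, which then receives the whole row as a Row. The following is a minimal sketch, not necessarily what fits your rules engine: the local SparkSession and the body of myFunc are placeholders for illustration, and it assumes whatever myFunc captures can be serialized to the executors. It also shows why the code in the question adds nothing: withColumn returns a new DataFrame, so that result has to be returned rather than inputDF.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.{col, struct, udf}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("row-udf-sketch").getOrCreate()
import spark.implicits._

// Placeholder for the real Row => String logic.
val myFunc = (r: Row) => r.getAs[String]("col1") + "xyz"

// Wrap the function as a UDF that takes a single struct argument.
val myFuncUdf = udf(myFunc)

def appendNewCol(inputDF: DataFrame): DataFrame =
  // withColumn does not modify inputDF; its result is the value to return.
  inputDF.withColumn("newcol", myFuncUdf(struct(inputDF.columns.map(col): _*)))

val testDf = Seq((1, "abc"), (2, "def"), (3, "ghi")).toDF("id", "col1")
val result = appendNewCol(testDf)
result.show()
```

The struct carries the column names, so inside the UDF fields can still be looked up by name with getAs.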

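Using a Dataset:

import org.apache.spark.sql.Dataset
import spark.implicits._ // encoders for the case classes and .toDS

case class TestData(id: Int, col1: String)
case class TransformedData(id: Int, col1: String, col2: String)

val test: Dataset[TestData] = List(TestData(1, "abc"), TestData(2, "def"), TestData(3, "ghi")).toDS

// The val has a different name from the case class so the constructor call below is unambiguous
val transformed: Dataset[TransformedData] = test
  .map { x: TestData =>
     val newCol = x.col1 + "xyz" // example transformation
     TransformedData(x.id, x.col1, newCol)
  }

transformed.show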

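As you can see, the Dataset version is more readable and gives you strong typing. I don't know which Spark version you are on, so I have provided both solutions. If you are using Spark >= 1.6, though, you should take a look at Datasets. Playing with RDDs is fun, but it can quickly lead to longer job run times and many other problems you cannot foresee.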
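Since the question mentions a session object, here is a hedged sketch of the RDD approach that creates that object once per partition with mapPartitions, which is a common way to handle objects that cannot be serialized. RuleSession, fire and close are hypothetical stand-ins invented for this example and do not correspond to any real rules-engine API; the local SparkSession is likewise only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// HYPOTHETICAL stand-in for a knowledge-base session; not a real library class.
class RuleSession {
  def fire(r: Row): String = r.getAs[String]("col1") + "xyz"
  def close(): Unit = ()
}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("per-partition-session").getOrCreate()
import spark.implicits._

def appendNewCol(inputDF: DataFrame): DataFrame = {
  val rows = inputDF.rdd.mapPartitions { iter =>
    // One session per partition, created on the executor, so it is never serialized.
    val session = new RuleSession()
    // Materialise the partition so the session can be closed afterwards.
    val processed = iter.map(r => Row.fromSeq(r.toSeq :+ session.fire(r))).toList
    session.close()
    processed.iterator
  }
  // The rule result may be missing, so the new column is declared nullable here.
  val schema = StructType(inputDF.schema.fields :+ StructField("newcol", StringType, nullable = true))
  inputDF.sparkSession.createDataFrame(rows, schema)
}

val withNewCol = appendNewCol(Seq((1, "abc"), (2, "def")).toDF("id", "col1"))
withNewCol.show()
```

Collecting each partition into a list keeps it in memory until the session is closed; for very large partitions you may prefer to keep the iterator lazy and close the session some other way.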

Solution 1
0  Accepted  2021-03-03 04:38:38
