How to add a new column to my DataFrame such that values of the new column are populated by some other function in Scala?
def myFunc(row: Row): String = {
  // process row
  // returns a string
}

def appendNewCol(inputDF: DataFrame): DataFrame = {
  inputDF.withColumn("newcol", myFunc(Row)) // the DataFrame returned by withColumn is discarded
  inputDF // the original, unmodified DataFrame is returned
}
But no new column got created in my case. My myFunc passes the row to a knowledge-base session object, which returns a string after firing its rules. Can I do it this way? If not, what is the right way? Thanks in advance.
I saw many StackOverflow solutions using expr(), sqlfunc(col(udf(x))), and other techniques, but here my newcol is not derived directly from an existing column.
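For reference, the attempt above fails for two reasons: withColumn expects a Column, not the plain String that myFunc returns, and the result of withColumn is discarded while the untouched inputDF is returned. One way to keep a whole-row function and still use withColumn is to pack all columns into a struct, which a UDF then receives as a Row. The following is a minimal sketch assuming Spark 2.x (the udf/struct behavior may differ on other versions); myFunc here is a placeholder for the real knowledge-base call:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, struct, udf}

// Placeholder for the rules-engine call; returns one String per row
val myFunc = (r: Row) => r.mkString("|")

val myUdf = udf(myFunc) // UDF whose single argument is the packed row

def appendNewCol(inputDF: DataFrame): DataFrame =
  // struct(...) bundles every column so the UDF sees the entire row
  inputDF.withColumn("newcol", myUdf(struct(inputDF.columns.map(col): _*)))

The answer below shows two further approaches that avoid UDFs entirely.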
With DataFrame:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._ // needed for .toDF on the parallelized Seq

val myFunc = (r: Row) => { r.getAs[String]("col1") + "xyz" } // example transformation

val testDf = spark.sparkContext.parallelize(Seq(
  (1, "abc"), (2, "def"), (3, "ghi"))).toDF("id", "col1")
testDf.show

// Drop to the RDD API, apply myFunc to each Row, and append its result
val rddRes = testDf
  .rdd
  .map { x =>
    val y = myFunc(x)
    Row.fromSeq(x.toSeq ++ Seq(y))
  }

// Extend the original schema with the new column, then rebuild the DataFrame
val newSchema = StructType(testDf.schema.fields ++ Array(StructField("col2", dataType = StringType, nullable = false)))
spark.sqlContext.createDataFrame(rddRes, newSchema).show
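If the output columns are known at compile time, a shorter variant (a sketch under the same assumptions, reusing testDf and myFunc from above) avoids building the schema by hand by mapping each Row to a tuple and letting toDF infer the types:

// Requires import spark.implicits._ for .toDF on an RDD of tuples
val resDf = testDf.rdd
  .map(x => (x.getAs[Int]("id"), x.getAs[String]("col1"), myFunc(x)))
  .toDF("id", "col1", "col2")
resDf.show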
Results:
+---+----+
| id|col1|
+---+----+
| 1| abc|
| 2| def|
| 3| ghi|
+---+----+
+---+----+------+
| id|col1| col2|
+---+----+------+
| 1| abc|abcxyz|
| 2| def|defxyz|
| 3| ghi|ghixyz|
+---+----+------+
With Dataset:
import org.apache.spark.sql.Dataset
import spark.implicits._ // provides encoders for the case classes below

case class TestData(id: Int, col1: String)
case class TransformedData(id: Int, col1: String, col2: String)

val test: Dataset[TestData] = List(TestData(1, "abc"), TestData(2, "def"), TestData(3, "ghi")).toDS

// Typed map: build the output record directly, new column included
val transformedData: Dataset[TransformedData] = test
  .map { x: TestData =>
    val newCol = x.col1 + "xyz"
    TransformedData(x.id, x.col1, newCol)
  }

transformedData.show
As you can see, the Dataset version is more readable and provides strong typing. Since I'm unaware of your Spark version, I'm providing both solutions here. However, if you're using Spark >= 1.6, you should look into Datasets. Playing with the RDD API is fun, but it can quickly devolve into longer job runs and a host of other issues you won't foresee.
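One more practical note for the asker's case: whatever object fires the rules has to reach the executors, so it must either be serializable or be created on the executor side. A common pattern for the latter, sketched here with KnowledgeBaseSession as a hypothetical stand-in for the real rules engine, is to build the session once per partition:

// KnowledgeBaseSession is a hypothetical stand-in for the asker's rules engine;
// creating it inside mapPartitions avoids serializing it from the driver.
val withRules: Dataset[TransformedData] = test.mapPartitions { rows =>
  val session = KnowledgeBaseSession.create() // assumption: one session per partition
  rows.map(x => TransformedData(x.id, x.col1, session.fire(x))) // assumption: fire returns a String
}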