How do I add a new column to my DataFrame so that its values are filled in by some other function in Scala?

Question

def myFunc(row: Row): String = {
    // process row
    // returns string
}
def appendNewCol(inputDF: DataFrame): DataFrame = {
    inputDF.withColumn("newcol", myFunc(Row))
    inputDF
}

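But in my case no new column is created. My myFunc passes the row to a knowledgebasesession session object and returns a string after the rules have fired. Can I do it this way? If not, what is the right approach? Thanks in advance.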

I have seen many StackOverflow solutions that use expr(), sqlfunc(col(udf(x)) and other techniques, but here my newcol is not derived directly from an existing column.

Answer 1

Using a DataFrame:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._ // needed for toDF outside spark-shell

// Example transformation: any Row => String function works here
val myFunc = (r: Row) => r.getAs[String]("col1") + "xyz"

val testDf = spark.sparkContext.parallelize(Seq(
  (1, "abc"), (2, "def"), (3, "ghi"))).toDF("id", "col1")

testDf.show()

// Drop to the RDD, apply myFunc to each row, and append its result as an extra field
val rddRes = testDf.rdd.map { x =>
  val y = myFunc(x)
  Row.fromSeq(x.toSeq :+ y)
}

// Extend the original schema with the new string column
val newSchema = StructType(testDf.schema.fields :+ StructField("col2", StringType, nullable = false))

spark.createDataFrame(rddRes, newSchema).show()

Result:

+---+----+
| id|col1|
+---+----+
|  1| abc|
|  2| def|
|  3| ghi|
+---+----+

+---+----+------+
| id|col1|  col2|
+---+----+------+
|  1| abc|abcxyz|
|  2| def|defxyz|
|  3| ghi|ghixyz|
+---+----+------+
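
The question mentions that myFunc hands each row to a session object. If that object is expensive to build or cannot be serialized, one possible variation of the RDD approach above is to create it once per partition with mapPartitions. The sketch below is only an illustration: RuleSession and its fire method are hypothetical stand-ins, not part of any real rules library, and the local SparkSession is an assumption made so the snippet is self-contained.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical stand-in for a rule session that is costly to create or not serializable.
class RuleSession {
  def fire(row: Row): String = row.getAs[String]("col1") + "xyz"
}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("per-partition-session").getOrCreate()
import spark.implicits._

val testDf = Seq((1, "abc"), (2, "def"), (3, "ghi")).toDF("id", "col1")

val rddRes = testDf.rdd.mapPartitions { rows =>
  // Created on the executor, once per partition, so it is never shipped from the driver
  val session = new RuleSession()
  rows.map(row => Row.fromSeq(row.toSeq :+ session.fire(row)))
}

// nullable = true is the cautious choice when the function might return null
val newSchema = StructType(testDf.schema.fields :+ StructField("col2", StringType, nullable = true))
val withCol2 = spark.createDataFrame(rddRes, newSchema)
withCol2.show()
```

The iterator returned by rows.map is lazy, so a real session that needs explicit cleanup would have to be released only after the partition has been fully consumed.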

Using a Dataset:

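import org.apache.spark.sql.Dataset
import spark.implicits._ // needed for toDS and the case class encoders

case class TestData(id: Int, col1: String)
case class TransformedData(id: Int, col1: String, col2: String)

val test: Dataset[TestData] = List(TestData(1, "abc"), TestData(2, "def"), TestData(3, "ghi")).toDS()

// The value gets its own name so it does not shadow the TransformedData companion object
val transformed: Dataset[TransformedData] = test
  .map { x: TestData =>
    val newCol = x.col1 + "xyz"
    TransformedData(x.id, x.col1, newCol)
  }

transformed.show()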

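As you can see, the Dataset version is more readable and gives you strong typing. Since I don't know your Spark version, I am providing both solutions here. However, if you are on Spark >= 1.6, you should look at Datasets. Playing with RDDs is fun, but it can quickly lead to longer job runtimes and many other problems you cannot foresee.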
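
A note on why the code in the question produces no new column: DataFrames are immutable, so withColumn returns a new DataFrame, and appendNewCol discards it by returning inputDF. If staying in the DataFrame API is preferred, one option not used in the answer above is to wrap the row-level function in a UDF and pass it every column packed into a struct. This is a minimal sketch, assuming a Spark version that accepts Row as a UDF input; the myFunc body and the local SparkSession are placeholders for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.{col, struct, udf}

// Placeholder for the asker's row-level function.
val myFunc: Row => String = row => row.getAs[String]("col1") + "xyz"

// Wrap the function so it can be used as a column expression
val myFuncUdf = udf(myFunc)

def appendNewCol(inputDF: DataFrame): DataFrame = {
  // Pack all columns into one struct so the UDF receives the whole row
  val wholeRow = struct(inputDF.columns.map(col): _*)
  // Return the DataFrame produced by withColumn, not the untouched input
  inputDF.withColumn("newcol", myFuncUdf(wholeRow))
}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("append-new-col").getOrCreate()
import spark.implicits._

val testDf = Seq((1, "abc"), (2, "def"), (3, "ghi")).toDF("id", "col1")
val result = appendNewCol(testDf)
result.show()
```

Anything the function closes over is serialized and sent to the executors, so a session object captured this way would need to be serializable.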

Solution 1
0 Accepted 2021-03-03 04:38:38
