How to add a new column to my DataFrame such that values of new column are populated by some other function in scala?

def myFunc(row: Row): String = {
    // process row
    // returns a string
}

def appendNewCol(inputDF: DataFrame): DataFrame = {
    inputDF.withColumn("newcol", myFunc(Row)) // result of withColumn is discarded
    inputDF
}

But no new column got created in my case. My myFunc passes the row to a knowledge base session object, which returns a string after firing rules. Can I do it this way? If not, what is the right way? Thanks in advance.

I saw many StackOverflow solutions using expr(), sqlfunc(col()), udf(x), and other techniques, but here my newcol is not derived directly from an existing column.
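One common way to feed a *whole row* (rather than a single existing column) into a function is to pack all columns into a struct and pass that to a UDF; the UDF then receives a `Row`. The sketch below assumes this pattern; the suffix-appending lambda is a placeholder standing in for the rule-engine call, not the asker's actual logic.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, struct, udf}

object RowUdfExample {
  // Pure placeholder for the real row-processing logic (hypothetical).
  val addSuffix: String => String = _ + "xyz"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[1]")
      .appName("row-udf")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "abc"), (2, "def")).toDF("id", "col1")

    // struct(col("*")) bundles every column into one struct-typed column,
    // so the UDF sees the full row, not just one column.
    val myFunc = udf((r: Row) => addSuffix(r.getAs[String]("col1")))

    df.withColumn("newcol", myFunc(struct(col("*")))).show()
    spark.stop()
  }
}
```

This keeps the transformation inside the DataFrame API (no schema surgery on an RDD), at the cost of serializing the row into the UDF.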

With DataFrame:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._ // needed for .toDF

val myFunc = (r: Row) => {r.getAs[String]("col1") + "xyz"} // example transformation

val testDf = spark.sparkContext.parallelize(Seq(
      (1, "abc"), (2, "def"), (3, "ghi"))).toDF("id", "col1")

testDf.show
    
val rddRes = testDf
        .rdd
        .map{x => 
          val y = myFunc (x)
          Row.fromSeq (x.toSeq ++ Seq(y) )
        }

val newSchema = StructType(testDf.schema.fields ++ Array(StructField("col2", dataType = StringType, nullable = false)))

spark.sqlContext.createDataFrame(rddRes, newSchema).show

Results:

+---+----+
| id|col1|
+---+----+
|  1| abc|
|  2| def|
|  3| ghi|
+---+----+

+---+----+------+
| id|col1|  col2|
+---+----+------+
|  1| abc|abcxyz|
|  2| def|defxyz|
|  3| ghi|ghixyz|
+---+----+------+

With Dataset:

import org.apache.spark.sql.Dataset

case class TestData(id: Int, col1: String)
case class TransformedData(id: Int, col1: String, col2: String)

val testDs: Dataset[TestData] = List(TestData(1, "abc"), TestData(2, "def"), TestData(3, "ghi")).toDS

val transformedDs: Dataset[TransformedData] = testDs
  .map { x: TestData =>
     val newCol = x.col1 + "xyz"
     TransformedData(x.id, x.col1, newCol)
  }

transformedDs.show

As you can see, Datasets are more readable, and they provide strong typing. Since I'm unaware of your Spark version, I'm providing both solutions here. However, if you're using Spark v>=1.6, you should look into Datasets. Playing with RDDs is fun, but it can quickly devolve into longer job runs and a host of other issues that you won't foresee.
