How to add a new column to my DataFrame such that values of the new column are populated by some other function in Scala?
def myFunc(row: Row): String = {
  // process row
  // returns a string
}

def appendNewCol(inputDF: DataFrame): DataFrame = {
  inputDF.withColumn("newcol", myFunc(Row)) // the DataFrame returned by withColumn is discarded
  inputDF // the original, unmodified DataFrame is returned
}
But no new column got created in my case. My myFunc passes the row to a knowledge-base session object, which returns a string after firing its rules. Can I do it this way? If not, what is the right way? Thanks in advance.
I saw many StackOverflow solutions using expr(), sqlfunc(col(udf(x))), and other techniques, but here my newcol is not derived directly from an existing column.
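For reference, the attempt above fails for two reasons: withColumn expects a Column, not the plain String that myFunc returns, and the result of withColumn is discarded while the untouched inputDF is returned. One way to keep a whole-row function and still use withColumn is to pack all columns into a struct, which a UDF then receives as a Row. The following is a minimal sketch assuming Spark 2.x (the udf/struct behavior may differ on other versions); myFunc here is a placeholder for the real knowledge-base call:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, struct, udf}

// Placeholder for the rules-engine call; returns one String per row
val myFunc = (r: Row) => r.mkString("|")

val myUdf = udf(myFunc) // UDF whose single argument is the packed row

def appendNewCol(inputDF: DataFrame): DataFrame =
  // struct(...) bundles every column so the UDF sees the entire row
  inputDF.withColumn("newcol", myUdf(struct(inputDF.columns.map(col): _*)))

The answer below shows two further approaches that avoid UDFs entirely.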
With DataFrame:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._ // needed for .toDF on the parallelized Seq

val myFunc = (r: Row) => { r.getAs[String]("col1") + "xyz" } // example transformation

val testDf = spark.sparkContext.parallelize(Seq(
  (1, "abc"), (2, "def"), (3, "ghi"))).toDF("id", "col1")
testDf.show

// Drop to the RDD API, apply myFunc to each Row, and append its result
val rddRes = testDf
  .rdd
  .map { x =>
    val y = myFunc(x)
    Row.fromSeq(x.toSeq ++ Seq(y))
  }

// Extend the original schema with the new column, then rebuild the DataFrame
val newSchema = StructType(testDf.schema.fields ++ Array(StructField("col2", dataType = StringType, nullable = false)))
spark.sqlContext.createDataFrame(rddRes, newSchema).show
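If the output columns are known at compile time, a shorter variant (a sketch under the same assumptions, reusing testDf and myFunc from above) avoids building the schema by hand by mapping each Row to a tuple and letting toDF infer the types:

// Requires import spark.implicits._ for .toDF on an RDD of tuples
val resDf = testDf.rdd
  .map(x => (x.getAs[Int]("id"), x.getAs[String]("col1"), myFunc(x)))
  .toDF("id", "col1", "col2")
resDf.show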
Results:
+---+----+
| id|col1|
+---+----+
| 1| abc|
| 2| def|
| 3| ghi|
+---+----+
+---+----+------+
| id|col1| col2|
+---+----+------+
| 1| abc|abcxyz|
| 2| def|defxyz|
| 3| ghi|ghixyz|
+---+----+------+
With Dataset:
import org.apache.spark.sql.Dataset
import spark.implicits._ // provides encoders for the case classes below

case class TestData(id: Int, col1: String)
case class TransformedData(id: Int, col1: String, col2: String)

val test: Dataset[TestData] = List(TestData(1, "abc"), TestData(2, "def"), TestData(3, "ghi")).toDS

// Typed map: build the output record directly, new column included
val transformedData: Dataset[TransformedData] = test
  .map { x: TestData =>
    val newCol = x.col1 + "xyz"
    TransformedData(x.id, x.col1, newCol)
  }

transformedData.show
As you can see, the Dataset version is more readable and provides strong typing. Since I'm unaware of your Spark version, I'm providing both solutions here. However, if you're using Spark >= 1.6, you should look into Datasets. Playing with the RDD API is fun, but it can quickly devolve into longer job runs and a host of other issues you won't foresee.
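One more practical note for the asker's case: whatever object fires the rules has to reach the executors, so it must either be serializable or be created on the executor side. A common pattern for the latter, sketched here with KnowledgeBaseSession as a hypothetical stand-in for the real rules engine, is to build the session once per partition:

// KnowledgeBaseSession is a hypothetical stand-in for the asker's rules engine;
// creating it inside mapPartitions avoids serializing it from the driver.
val withRules: Dataset[TransformedData] = test.mapPartitions { rows =>
  val session = KnowledgeBaseSession.create() // assumption: one session per partition
  rows.map(x => TransformedData(x.id, x.col1, session.fire(x))) // assumption: fire returns a String
}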