Replacing null values with 0 after spark dataframe left outer join

I have two dataframes called left and right.

scala> left.printSchema
|-- user_uid: double (nullable = true)
|-- labelVal: double (nullable = true)
|-- probability_score: double (nullable = true)

scala> right.printSchema
|-- user_uid: double (nullable = false)
|-- real_labelVal: double (nullable = false)

Then, I join them to get the joined DataFrame. It is a left outer join. Anyone interested in the natjoin function can find it here:

https://gist.github.com/anonymous/f02bd79528ac75f57ae8

scala> val joinedData = natjoin(predictionDataFrame, labeledObservedDataFrame, "left_outer")

scala> joinedData.printSchema
|-- user_uid: double (nullable = true)
|-- labelVal: double (nullable = true)
|-- probability_score: double (nullable = true)
|-- real_labelVal: double (nullable = false)

Since it is a left outer join, the real_labelVal column has nulls when user_uid is not present in right.

scala> val realLabelVal = joinedData.select("real_labelval").distinct.collect
realLabelVal: Array[org.apache.spark.sql.Row] = Array([0.0], [null])

I want to replace the null values in the real_labelVal column with 1.0.

Currently I do the following:

  1. I find the index of the real_labelval column and use the spark.sql.Row API to set the nulls to 1.0. (This gives me an RDD[Row].)
  2. Then I apply the schema of the joined dataframe to get the cleaned dataframe.

The code is as follows:

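 import org.apache.spark.sql.Row

 val real_labelval_index = 3
 // Copy the row's values, overwrite the null label with 1.0, and rebuild the Row.
 def replaceNull(row: Row): Row = {
     val rowArray = row.toSeq.toArray
     rowArray(real_labelval_index) = 1.0
     Row.fromSeq(rowArray)
 }

 // On Spark 2.x and later, use joinedData.rdd.map(...) to get an RDD[Row] here.
 val cleanRowRDD = joinedData.map(row => if (row.isNullAt(real_labelval_index)) replaceNull(row) else row)
 val cleanJoined = sqlContext.createDataFrame(cleanRowRDD, joinedData.schema)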

Is there an elegant or efficient way to do this?

Googling hasn't helped much. Thanks in advance.


joinedData.na.fill(1.0, Seq("real_labelval"))
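
For anyone who wants to try this end to end, here is a minimal sketch, assuming Spark 2.x or later and a local SparkSession. The sample rows are made up for illustration, and a plain join on user_uid stands in for the question's natjoin helper; only the column names come from the question. It also shows coalesce as one possible alternative that should give the same result.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit}

// Assumption: a local session for experimenting; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("fill-nulls-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for the question's left and right dataframes.
val left = Seq((1.0, 0.0, 0.9), (2.0, 1.0, 0.4)).toDF("user_uid", "labelVal", "probability_score")
val right = Seq((1.0, 0.0)).toDF("user_uid", "real_labelVal")

// Left outer join on user_uid; user 2.0 has no match in right, so real_labelVal is null there.
val joinedData = left.join(right, Seq("user_uid"), "left_outer")

// Replace nulls with 1.0 in the named column only; other columns are left untouched.
val cleanJoined = joinedData.na.fill(1.0, Seq("real_labelVal"))

// Alternative: express the same replacement as a column expression.
val viaCoalesce = joinedData.withColumn("real_labelVal", coalesce(col("real_labelVal"), lit(1.0)))

cleanJoined.show()
```

Unlike the Row-based approach in the question, this does not depend on the column's position in the schema, and it stays within the DataFrame API instead of going through an RDD.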
