How to replace NULL with 0 in a left outer join in a Spark DataFrame (v1.6)
I am working with Spark v1.6. I have the following two DataFrames, and I want to convert the nulls to 0 in my left outer join result set. Any suggestions?
val x: Array[Int] = Array(1,2,3)
val df_sample_x = sc.parallelize(x).toDF("x")
val y: Array[Int] = Array(3,4,5)
val df_sample_y = sc.parallelize(y).toDF("y")
val df_sample_join = df_sample_x
.join(df_sample_y,df_sample_x("x") === df_sample_y("y"),"left_outer")
scala> df_sample_join.show
x | y
--------
1 | null
2 | null
3 | 3
But I want the result set to be displayed as:
scala> df_sample_join.show
x | y
--------
1 | 0
2 | 0
3 | 3
Just use na.fill:
df.na.fill(0, Seq("y"))
Try:
val withReplacedNull = df_sample_join.withColumn("y", coalesce('y, lit(0)))
Tested on:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{coalesce, lit}
import org.apache.spark.sql.types._
val list = List(Row("a", null), Row("b", null), Row("c", 1))
val rdd = sc.parallelize(list)
// "y" must be declared nullable (true) because the rows contain nulls
val schema = StructType(
StructField("text", StringType, false) ::
StructField("y", IntegerType, true) :: Nil)
val df = sqlContext.createDataFrame(rdd, schema)
val df1 = df.withColumn("y", coalesce('y, lit(0)))
df1.show()
You can fix your existing DataFrame like this:
import org.apache.spark.sql.functions.{when,lit}
val correctedDf = df_sample_join.withColumn("y", when($"y".isNull, lit(0)).otherwise($"y"))
Although T. Gawęda's answer also works, I think this is more readable.
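All three approaches (na.fill, coalesce, when/otherwise) implement the same fallback semantics. A minimal pure-Scala sketch (no Spark required; the names are illustrative) models the left outer join as an Option lookup and the null replacement as getOrElse(0):

```scala
object NullToZeroSketch extends App {
  val xs = Seq(1, 2, 3)
  val ys = Seq(3, 4, 5)

  // Left outer join: keep every x, pair it with a matching y if one exists
  val joined: Seq[(Int, Option[Int])] = xs.map(x => (x, ys.find(_ == x)))

  // coalesce('y, lit(0)), na.fill(0, Seq("y")), and
  // when($"y".isNull, lit(0)).otherwise($"y") all reduce to getOrElse(0) here
  val filled: Seq[(Int, Int)] = joined.map { case (x, y) => (x, y.getOrElse(0)) }

  println(filled) // List((1,0), (2,0), (3,3))
}
```

The unmatched rows (1 and 2) come back as None instead of null, and the fallback fills them with 0, matching the desired `df_sample_join.show` output above.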