Concatenating columns in a PySpark dataframe with null values
Data:
Name1           Name2           Name3 (Expected)
RR Industries   null            RR Industries
RR Industries   RR Industries   RR IndustriesRR Industries
Code:
.withColumn("Name3", F.concat(F.trim(F.col("Name1")), F.trim(F.col("Name2"))))
Actual result: the values with nulls get dropped. I want the output to look like the Name3 (Expected) column above.
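For context, F.concat returns null as soon as any of its inputs is null, which is what makes the concatenated values disappear. A minimal sketch reproducing this behaviour (the two-row DataFrame is illustrative, built from the data above):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("RR Industries", None), ("RR Industries", "RR Industries")],
    ["Name1", "Name2"])

# F.concat propagates null: if any argument is null, the whole result is null
df.select(F.concat(F.trim("Name1"), F.trim("Name2")).alias("Name3")).show(truncate=False)
# +--------------------------+
# |Name3                     |
# +--------------------------+
# |null                      |
# |RR IndustriesRR Industries|
# +--------------------------+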
I think the problem appears after joining the tables. The name column exists in both df2 and df3, and before the join neither contains null values.
Problem: after the join, since PySpark does not drop the common columns, we end up with two name1 columns, one from each table. I tried replacing the nulls with an empty string; it doesn't work and throws an error.
How do I replace the null values with an empty string after joining the tables?
df = df1 \
    .join(df2, "code", how='left') \
    .join(df3, "id", how='left') \
    .join(df4, "id", how='left') \
    .withColumn('name1', F.when(df2['name1'].isNull(), '').otherwise(df2['name1'])) \
    .withColumn('name1', F.when(df3['name1'].isNull(), '').otherwise(df3['name1'])) \
    .withColumn("Name1", F.concat(F.trim(df2['name1']), F.trim(df3['name1'])))
Try this - it should port to Python with minimal changes.
import org.apache.spark.sql.functions._
import spark.implicits._  // needed for .toDS() outside spark-shell

val data =
  """
    |Name1 | Name2
    |RR Industries |
    |RR Industries | RR Industries
  """.stripMargin

val stringDS = data.split(System.lineSeparator())
  .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
  .toSeq.toDS()

val df = spark.read
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .csv(stringDS)
df.show(false)
df.printSchema()
/**
* +-------------+-------------+
* |Name1 |Name2 |
* +-------------+-------------+
* |RR Industries|null |
* |RR Industries|RR Industries|
* +-------------+-------------+
*
* root
* |-- Name1: string (nullable = true)
* |-- Name2: string (nullable = true)
*/
df.withColumn("Name3(Expected)", concat_ws("", df.columns.map(col).map(c => coalesce(c, lit(""))): _*))
.show(false)
/**
* +-------------+-------------+--------------------------+
* |Name1 |Name2 |Name3(Expected) |
* +-------------+-------------+--------------------------+
* |RR Industries|null |RR Industries |
* |RR Industries|RR Industries|RR IndustriesRR Industries|
* +-------------+-------------+--------------------------+
*/
df.withColumn("Name3(Expected)", concat_ws("", df.columns.map(col): _*))
.show(false)
/**
* +-------------+-------------+--------------------------+
* |Name1 |Name2 |Name3(Expected) |
* +-------------+-------------+--------------------------+
* |RR Industries|null |RR Industries |
* |RR Industries|RR Industries|RR IndustriesRR Industries|
* +-------------+-------------+--------------------------+
*/
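The coalesce-plus-concat_ws line above ports to Python almost verbatim. A minimal PySpark sketch of the same idea, assuming df is the same two-column DataFrame:

from pyspark.sql import functions as F

# Null-proof every column, then join them with an empty separator
df.withColumn(
    "Name3(Expected)",
    F.concat_ws("", *[F.coalesce(F.col(c), F.lit("")) for c in df.columns])
).show(truncate=False)

As the second variant above shows, concat_ws already skips nulls, so the coalesce is strictly optional.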
You can try this approach in PySpark:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName('practice') \
    .getOrCreate()
sc = spark.sparkContext

df = sc.parallelize([
    ("RR Industries", None),
    ("RR Industries", "RR Industries")
]).toDF(["Name1", "Name2"])

# concat_ws skips nulls, so no replacement with '' is needed
df.withColumn("Name3", F.concat_ws("", F.col("Name1"), F.col("Name2"))).show(truncate=False)
+-------------+-------------+--------------------------+
|Name1 |Name2 |Name3 |
+-------------+-------------+--------------------------+
|RR Industries|null |RR Industries |
|RR Industries|RR Industries|RR IndustriesRR Industries|
+-------------+-------------+--------------------------+
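This works because concat_ws skips null arguments entirely, whereas concat returns null as soon as any argument is null, which is exactly the behaviour described in the question.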