Concatenating columns in a PySpark dataframe with null values
Data:
Name1           Name2           Name3 (Expected)
RR Industries   null            RR Industries
RR Industries   RR Industries   RR IndustriesRR Industries
Code:
.withColumn("Name3", F.concat(F.trim(F.col("Name1")), F.trim(F.col("Name2"))))
Actual result: the values with nulls get dropped. I want the output to look like the Name3 (Expected) column above.
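For context, F.concat returns null as soon as any of its inputs is null, which is what makes the concatenated values disappear. A minimal sketch reproducing this behaviour (the two-row DataFrame is illustrative, built from the data above):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("RR Industries", None), ("RR Industries", "RR Industries")],
    ["Name1", "Name2"])

# F.concat propagates null: if any argument is null, the whole result is null
df.select(F.concat(F.trim("Name1"), F.trim("Name2")).alias("Name3")).show(truncate=False)
# +--------------------------+
# |Name3                     |
# +--------------------------+
# |null                      |
# |RR IndustriesRR Industries|
# +--------------------------+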
I think the problem appears after joining the tables. The name column exists in both df2 and df3, and before the join neither contains null values.
Problem: after the join, since PySpark does not drop the common columns, we end up with two name1 columns, one from each table. I tried replacing the nulls with an empty string; it doesn't work and throws an error.
How do I replace the null values with an empty string after joining the tables?
df = df1 \
    .join(df2, "code", how='left') \
    .join(df3, "id", how='left') \
    .join(df4, "id", how='left') \
    .withColumn('name1', F.when(df2['name1'].isNull(), '').otherwise(df2['name1'])) \
    .withColumn('name1', F.when(df3['name1'].isNull(), '').otherwise(df3['name1'])) \
    .withColumn("Name1", F.concat(F.trim(df2['name1']), F.trim(df3['name1'])))
Try this - it should port to Python with minimal changes.
import org.apache.spark.sql.functions._
import spark.implicits._  // needed for .toDS() outside spark-shell

val data =
  """
    |Name1 | Name2
    |RR Industries |
    |RR Industries | RR Industries
  """.stripMargin

val stringDS = data.split(System.lineSeparator())
  .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
  .toSeq.toDS()

val df = spark.read
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .csv(stringDS)
df.show(false)
df.printSchema()
/**
* +-------------+-------------+
* |Name1 |Name2 |
* +-------------+-------------+
* |RR Industries|null |
* |RR Industries|RR Industries|
* +-------------+-------------+
*
* root
* |-- Name1: string (nullable = true)
* |-- Name2: string (nullable = true)
*/
df.withColumn("Name3(Expected)", concat_ws("", df.columns.map(col).map(c => coalesce(c, lit(""))): _*))
.show(false)
/**
* +-------------+-------------+--------------------------+
* |Name1 |Name2 |Name3(Expected) |
* +-------------+-------------+--------------------------+
* |RR Industries|null |RR Industries |
* |RR Industries|RR Industries|RR IndustriesRR Industries|
* +-------------+-------------+--------------------------+
*/
df.withColumn("Name3(Expected)", concat_ws("", df.columns.map(col): _*))
.show(false)
/**
* +-------------+-------------+--------------------------+
* |Name1 |Name2 |Name3(Expected) |
* +-------------+-------------+--------------------------+
* |RR Industries|null |RR Industries |
* |RR Industries|RR Industries|RR IndustriesRR Industries|
* +-------------+-------------+--------------------------+
*/
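The coalesce-plus-concat_ws line above ports to Python almost verbatim. A minimal PySpark sketch of the same idea, assuming df is the same two-column DataFrame:

from pyspark.sql import functions as F

# Null-proof every column, then join them with an empty separator
df.withColumn(
    "Name3(Expected)",
    F.concat_ws("", *[F.coalesce(F.col(c), F.lit("")) for c in df.columns])
).show(truncate=False)

As the second variant above shows, concat_ws already skips nulls, so the coalesce is strictly optional.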
You can try this approach in PySpark:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName('practice') \
    .getOrCreate()
sc = spark.sparkContext

df = sc.parallelize([
    ("RR Industries", None),
    ("RR Industries", "RR Industries")
]).toDF(["Name1", "Name2"])

# concat_ws skips nulls, so no replacement with '' is needed
df.withColumn("Name3", F.concat_ws("", F.col("Name1"), F.col("Name2"))).show(truncate=False)
+-------------+-------------+--------------------------+
|Name1 |Name2 |Name3 |
+-------------+-------------+--------------------------+
|RR Industries|null |RR Industries |
|RR Industries|RR Industries|RR IndustriesRR Industries|
+-------------+-------------+--------------------------+
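This works because concat_ws skips null arguments entirely, whereas concat returns null as soon as any argument is null, which is exactly the behaviour described in the question.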