
Pyspark dataframe row-wise null columns list

I have a Spark dataframe and I want to create a new column that contains the names of the columns that are null in each row. For example:

The original dataframe is:

+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
|62.45| null|62.49|
|56.45| null| null|
| null| null| null|
| 56.4|57.32|48.39|
+-----+-----+-----+

The final dataframe with the result column is:

+-----+-----+-----+-------------------+
|col_1|col_2|col_3|             Result|
+-----+-----+-----+-------------------+
|62.45| null|62.49|              col_2|
|56.45| null| null|       col_2, col_3|
| null| null| null|col_1, col_2, col_3|
| 56.4|57.32|48.39|                   |
+-----+-----+-----+-------------------+

I know how to get the number of null columns, but I'm looking for the row-wise column names, which can differ in each row. Any guidance will be appreciated.

from pyspark.sql.functions import array, col, expr, lit, when

(df.withColumn('Result', array(*[when(col(c).isNull(), lit(c)) for c in df.columns]))  # list the null columns per row
   .withColumn('Result', expr("filter(Result, x -> x is not null)"))  # exclude nulls from the list
   .show())
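Note the question shows a comma-separated string rather than an array. If that exact format is needed, array_join (available since Spark 2.4) can collapse the filtered array; a small sketch, where `listed` is a stand-in name for the dataframe produced by the snippet above:

from pyspark.sql.functions import array_join

# `listed` is assumed to hold the filtered 'Result' array of null column
# names; array_join concatenates the elements with the given delimiter,
# matching the string format shown in the question.
result = listed.withColumn('Result', array_join('Result', ', '))
result.show()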

The easiest way is collecting each null column with the column function isNull, then adding them into an array with the SQL function array:

from pyspark.sql import functions as F

(df
    .withColumn('Result', F.array(
        F.when(F.col('col_1').isNull(), 'col_1'),
        F.when(F.col('col_2').isNull(), 'col_2'),
        F.when(F.col('col_3').isNull(), 'col_3'),
    ))
    .show()
)

# +-----+-----+-----+---------------------+
# |col_1|col_2|col_3|Result               |
# +-----+-----+-----+---------------------+
# |62.45|null |62.49|[null, col_2, null]  |
# |56.45|null |null |[null, col_2, col_3] |
# |null |null |null |[col_1, col_2, col_3]|
# |56.4 |57.32|48.39|[null, null, null]   |
# +-----+-----+-----+---------------------+

You obviously want to get rid of the nulls, so applying filter would do the job (note: F.filter is only available since Spark 3.1.0):

(df
    .withColumn('Result', F.array(
        F.when(F.col('col_1').isNull(), 'col_1'),
        F.when(F.col('col_2').isNull(), 'col_2'),
        F.when(F.col('col_3').isNull(), 'col_3'),
    ))
    .withColumn('Result', F.filter(F.col('Result'), lambda c: c.isNotNull()))
    .show(10, False)
)

# +-----+-----+-----+---------------------+
# |col_1|col_2|col_3|Result               |
# +-----+-----+-----+---------------------+
# |62.45|null |62.49|[col_2]              |
# |56.45|null |null |[col_2, col_3]       |
# |null |null |null |[col_1, col_2, col_3]|
# |56.4 |57.32|48.39|[]                   |
# +-----+-----+-----+---------------------+
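On Spark versions before 3.1.0, where F.filter is not available, the same higher-order filter can still be reached through a SQL expression, as the first snippet above does. A minimal sketch, with `arr_df` as a hypothetical name for the dataframe that already has the unfiltered array column:

from pyspark.sql import functions as F

# Pre-3.1 alternative: the SQL higher-order function filter() is usable
# through expr() from Spark 2.4 onward. `arr_df` is assumed to carry the
# unfiltered 'Result' array column built by F.array above.
filtered = arr_df.withColumn('Result', F.expr("filter(Result, x -> x is not null)"))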

Finally, you can also improve the code's flexibility by not hardcoding the columns and using a comprehension over df.columns instead, and still get the same result:

(df
    .withColumn('Result', F.array(
        *[F.when(F.col(c).isNull(), c) for c in df.columns]
    ))
    .withColumn('Result', F.filter(F.col('Result'), lambda c: c.isNotNull()))
    .show(10, False)
)
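For completeness, a self-contained sketch that reproduces the example end to end; the local SparkSession setup and sample data are assumptions added here for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Sample data matching the question (None becomes null in the dataframe).
df = spark.createDataFrame(
    [(62.45, None, 62.49), (56.45, None, None),
     (None, None, None), (56.4, 57.32, 48.39)],
    ['col_1', 'col_2', 'col_3'])

result = (df
    .withColumn('Result', F.array(*[F.when(F.col(c).isNull(), c) for c in df.columns]))
    .withColumn('Result', F.filter(F.col('Result'), lambda c: c.isNotNull()))  # Spark 3.1+
    .withColumn('Result', F.array_join('Result', ', ')))  # string form, as in the question

result.show(10, False)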
