Pyspark dataframe row-wise null columns list
I have a Spark DataFrame and I want to create a new column that contains the names of the columns that are null in each row. For example:
The original DataFrame is:
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
|62.45|null |62.49|
|56.45|null |null |
|null |null |null |
| 56.4|57.32|48.39|
+-----+-----+-----+
The final DataFrame with the result column is:
+-----+-----+-----+------------------+
|col_1|col_2|col_3|            Result|
+-----+-----+-----+------------------+
|62.45|null |62.49|             col_2|
|56.45|null |null |      col_2, col_3|
|null |null |null |col_1, col_2, col_3|
| 56.4|57.32|48.39|                  |
+-----+-----+-----+------------------+
I know how to get the number of null columns, but I am looking for the row-wise column names, which can be different in each row. Any guidance will be appreciated.
from pyspark.sql.functions import array, col, expr, lit, when

(df.withColumn('Result', array(*[when(col(c).isNull(), lit(c)) for c in df.columns]))  # lists columns with null
 .withColumn('Result', expr("filter(Result, x -> x is not null)"))  # excludes nulls from the list
 .show())
The easiest way is collecting each null column with the Column method isNull, then adding them into an array with the SQL function array:
from pyspark.sql import functions as F

(df
    .withColumn('Result', F.array(
        F.when(F.col('col_1').isNull(), 'col_1'),
        F.when(F.col('col_2').isNull(), 'col_2'),
        F.when(F.col('col_3').isNull(), 'col_3'),
    ))
    .show()
)
# +-----+-----+-----+---------------------+
# |col_1|col_2|col_3|Result |
# +-----+-----+-----+---------------------+
# |62.45|null |62.49|[null, col_2, null] |
# |56.45|null |null |[null, col_2, col_3] |
# |null |null |null |[col_1, col_2, col_3]|
# |56.4 |57.32|48.39|[null, null, null] |
# +-----+-----+-----+---------------------+
You obviously want to get rid of the nulls, so applying filter will do the job (note: F.filter is only available since Spark 3.1.0):
(df
.withColumn('Result', F.array(
F.when(F.col('col_1').isNull(), 'col_1'),
F.when(F.col('col_2').isNull(), 'col_2'),
F.when(F.col('col_3').isNull(), 'col_3'),
))
.withColumn('Result', F.filter(F.col('Result'), lambda c: c.isNotNull()))
.show(10, False)
)
# +-----+-----+-----+---------------------+
# |col_1|col_2|col_3|Result |
# +-----+-----+-----+---------------------+
# |62.45|null |62.49|[col_2] |
# |56.45|null |null |[col_2, col_3] |
# |null |null |null |[col_1, col_2, col_3]|
# |56.4 |57.32|48.39|[] |
# +-----+-----+-----+---------------------+
Finally, you can also improve the code's flexibility by not hardcoding the columns but using a for comprehension over df.columns instead, and still get the same result:
(df
.withColumn('Result', F.array(
*[F.when(F.col(c).isNull(), c) for c in df.columns]
))
.withColumn('Result', F.filter(F.col('Result'), lambda c: c.isNotNull()))
.show(10, False)
)