How to Find Columns Collection with Not Null Values in PySpark
I have a PySpark DataFrame with n columns (Column_1, Column_2, ... Column_n). I need to add one more column holding a comma-separated collection of column names.
Condition: only when two or more columns have values should the collection column be filled with their comma-separated names. For example, below is sample data with three columns.
| column_1 | column_2 | column_3 | col collections            |
|----------|----------|----------|----------------------------|
| -        | -        | -        | -                          |
| 1        | -        | -        | -                          |
| -        | 1        | -        | -                          |
| -        | -        | 1        | -                          |
| 1        | 1        | -        | column_1,column_2          |
| 1        | 1        | 1        | column_1,column_2,column_3 |
| 1        | -        | -        | -                          |
| -        | 1        | 1        | column_2,column_3          |
Here is one solution.
import pandas as pd
from pyspark.sql.functions import concat_ws, udf
from pyspark.sql.types import StringType
pandas_df = pd.DataFrame({
    'column_1': [None, '1', None, None, '1', '1', '1'],
    'column_2': [None, None, '1', None, '1', '1', None],
    'column_3': [None, None, None, '1', None, '1', None]
})
df = spark.createDataFrame(pandas_df)
df.show()
# +--------+--------+--------+
# |column_1|column_2|column_3|
# +--------+--------+--------+
# | null| null| null|
# | 1| null| null|
# | null| 1| null|
# | null| null| 1|
# | 1| 1| null|
# | 1| 1| 1|
# | 1| null| null|
# +--------+--------+--------+
def non_null_to_column_name(name):
    # Build a UDF that maps a non-null cell to its column name; nulls stay null.
    return udf(lambda value: None if value is None else name, StringType())

# Keep the joined string only if it contains a comma, i.e. at least two names.
atleast_two_udf = udf(lambda s: None if (s is None) or (',' not in s) else s,
                      StringType())

cols = []
for name in df.columns:
    f = non_null_to_column_name(name)
    cols += [f(df[name])]

# concat_ws skips nulls, so only the names of populated columns are joined.
df = df.withColumn('collection', atleast_two_udf(concat_ws(',', *cols)))
df.show()
# +--------+--------+--------+--------------------+
# |column_1|column_2|column_3| collection|
# +--------+--------+--------+--------------------+
# | null| null| null| null|
# | 1| null| null| null|
# | null| 1| null| null|
# | null| null| 1| null|
# | 1| 1| null| column_1,column_2|
# | 1| 1| 1|column_1,column_2...|
# | 1| null| null| null|
# +--------+--------+--------+--------------------+