
How to Find Columns Collection with Not Null Values in PySpark

I have a PySpark DataFrame with n columns (Column_1, Column_2 ... Column_n). I have to add one more column holding a comma-separated collection of column names.

Condition: if two or more columns have a value, fill the collection column with those column names as comma-separated values. For example, below is the data for three columns.

+----------+----------+----------+----------------------------+
| column_1 | column_2 | column_3 | col collections            |
+----------+----------+----------+----------------------------+
|    -     |    -     |    -     | -                          |
|    1     |    -     |    -     | -                          |
|    -     |    1     |    -     | -                          |
|    -     |    -     |    1     | -                          |
|    1     |    1     |    -     | column_1,column_2          |
|    1     |    1     |    1     | column_1,column_2,column_3 |
|    1     |    -     |    -     | -                          |
|    -     |    1     |    1     | column_2,column_3          |
+----------+----------+----------+----------------------------+

Here is one solution.

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Build a small example DataFrame; None marks the empty cells.
pandas_df = pd.DataFrame({
    'column_1': [None, '1', None, None, '1', '1', '1'],
    'column_2': [None, None, '1', None, '1', '1', None],
    'column_3': [None, None, None, '1', None, '1', None]
})

df = spark.createDataFrame(pandas_df)
df.show()
# +--------+--------+--------+
# |column_1|column_2|column_3|
# +--------+--------+--------+
# |    null|    null|    null|
# |       1|    null|    null|
# |    null|       1|    null|
# |    null|    null|       1|
# |       1|       1|    null|
# |       1|       1|       1|
# |       1|    null|    null|
# +--------+--------+--------+


# For each column, build a UDF that maps a non-null value to the
# column's name and a null value to null.
def non_null_to_column_name(name):
    return udf(lambda value: None if value is None else name, StringType())

# Keep the joined string only if it contains a comma, i.e. at least
# two column names made it into the collection.
atleast_two_udf = udf(lambda s: None if (s is None) or (',' not in s) else s,
                      StringType())

cols = []
for name in df.columns:
    f = non_null_to_column_name(name)
    cols += [f(df[name])]

# concat_ws skips nulls, so only the names of populated columns remain.
df = df.withColumn('collection', atleast_two_udf(concat_ws(',', *cols)))
df.show()
# +--------+--------+--------+--------------------+
# |column_1|column_2|column_3|          collection|
# +--------+--------+--------+--------------------+
# |    null|    null|    null|                null|
# |       1|    null|    null|                null|
# |    null|       1|    null|                null|
# |    null|    null|       1|                null|
# |       1|       1|    null|   column_1,column_2|
# |       1|       1|       1|column_1,column_2...|
# |       1|    null|    null|                null|
# +--------+--------+--------+--------------------+
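Note that the last collection value is only cut off by show()'s default 20-character truncation; df.show(truncate=False) prints column_1,column_2,column_3 in full.

For what it's worth, the same result can be obtained without Python UDFs, which avoids serializing every row through Python. Below is a minimal UDF-free sketch using only built-in functions (when, concat_ws, split, size) against the original three columns; it is an alternative sketch of the same idea, not the code above.

from pyspark.sql import functions as F

# Replace each non-null cell with its column name; null cells stay null.
name_cols = [F.when(F.col(c).isNotNull(), F.lit(c))
             for c in ['column_1', 'column_2', 'column_3']]

# concat_ws skips nulls, leaving only the names of populated columns.
joined = F.concat_ws(',', *name_cols)

# Keep the string only when it names at least two columns.
df = df.withColumn('collection',
                   F.when(F.size(F.split(joined, ',')) >= 2, joined))
df.show(truncate=False)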
