简体   繁体   中英

Pyspark. Create A new column based on multiple columns and multiple rows

I have a dataframe below:

id feature1 feature2 feature3 feature4
1 a 0 0 0
1 b 0 0 0
1 c 0 0 0
2 a 1 0 0
2 b 0 0 0
2 c 0 0 0
3 b 0 0 0
3 c 0 1 0

I need to get a new column and get the dataframe below:

id feature1 feature2 feature3 feature4 NewColumn
1 a 0 0 0 1
1 b 0 0 0 1
1 c 0 0 0 1
2 a 1 0 0 0
2 b 0 0 0 0
2 c 0 0 0 0
3 b 0 0 0 0
3 c 0 1 0 0

For the case when id=1 , newColumn should be 1 because all rows with feature2 , feature3 and feature4 are 0. However, for the case when id==2 NewColumn should be 0 because feature2 is 1 when feature1=a in this case. Same when id==3 , NewColumn should be 0 because feature3 contains 1. Basically, I need to consider all ids individually and check feature2 , feature3 and feature4 columns if they contain 1. How to implement it efficiently in Pyspark?

Since it needs to check values across rows, you can use collect_list of array of feature2 , feature3 and feature4 over window partitioned by column id , flatten the array, check if the array contains any value of 1 using exists , then cast the negated boolean value to integer. (for Spark version >= 2.4)

from pyspark.sql.functions import expr

df = df.withColumn('NewColumn', 
                   expr("""cast(
                             not exists(
                               flatten(
                                 collect_list(array(feature2,feature3,feature4)) over (partition by id)),
                               v -> v = 1)
                             as int)"""))
df.show()

# +---+--------+--------+--------+--------+---------+
# | id|feature1|feature2|feature3|feature4|NewColumn|
# +---+--------+--------+--------+--------+---------+
# |  1|       a|       0|       0|       0|        1|
# |  1|       b|       0|       0|       0|        1|
# |  1|       c|       0|       0|       0|        1|
# |  3|       b|       0|       0|       0|        0|
# |  3|       c|       0|       1|       0|        0|
# |  2|       a|       1|       0|       0|        0|
# |  2|       b|       0|       0|       0|        0|
# |  2|       c|       0|       0|       0|        0|
# +---+--------+--------+--------+--------+---------+

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM