
PySpark equivalent to pandas .isin()

I have the following PySpark DataFrame:

data = [
    ('foo',),
    ('baz',),
    ('bar',),
    ('qux',)
]
df = spark.createDataFrame(data, ['group'])

Now I want to create a new column number that is 0 if group is in the list zeros = ['baz', 'qux'], 1 if it is in ones = ['foo'], and 2 otherwise. In pandas I'd use .isin(), but I don't understand how to solve this in PySpark.
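For reference, the pandas version of this mapping could be written with .isin() and numpy's np.select, which evaluates conditions in order much like a chained when(). This is a sketch; the DataFrame pdf here is a local stand-in for the data in the question:

```python
import numpy as np
import pandas as pd

zeros = ['baz', 'qux']
ones = ['foo']

pdf = pd.DataFrame({'group': ['foo', 'baz', 'bar', 'qux']})

# np.select checks the conditions in order and falls back to the default,
# mirroring a when().when().otherwise() chain
pdf['number'] = np.select(
    [pdf['group'].isin(zeros), pdf['group'].isin(ones)],
    [0, 1],
    default=2,
)
print(pdf['number'].tolist())  # [1, 0, 2, 0]
```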

Here is what I've tried, but it does not work:

df.withColumn("number",
    func.when(func.col("group") == array(*[lit(x) for x in ones]), 1)
        .otherwise(2))

You can also use isin in PySpark. See the syntax below:

import pyspark.sql.functions as F

zeros = ['baz', 'qux']
ones = ['foo']

df2 = df.withColumn('number',
    F.when(F.col('group').isin(zeros), 0)
     .when(F.col('group').isin(ones), 1)
     .otherwise(2)
)

df2.show()
+-----+------+
|group|number|
+-----+------+
|  foo|     1|
|  baz|     0|
|  bar|     2|
|  qux|     0|
+-----+------+
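If the mapping grows beyond a couple of lists, the same logic can also be expressed as a plain Python function. Note that the built-in when()/isin() chain above is generally preferable to a UDF for performance; the following is only a sketch, and the UDF wrapping shown in the comments is an assumption about how you might hook it into Spark:

```python
zeros = ['baz', 'qux']
ones = ['foo']

def label(group):
    # Mirrors the when()/otherwise() chain: check zeros first, then ones, else 2
    if group in zeros:
        return 0
    if group in ones:
        return 1
    return 2

# In Spark this could be registered as a UDF, e.g.:
#   import pyspark.sql.functions as F
#   from pyspark.sql.types import IntegerType
#   df.withColumn('number', F.udf(label, IntegerType())(F.col('group')))

print([label(g) for g in ['foo', 'baz', 'bar', 'qux']])  # [1, 0, 2, 0]
```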
