
PySpark equivalent to pandas .isin()

I have the following PySpark DataFrame:

data = [
    ('foo',),
    ('baz',),
    ('bar',),
    ('qux',)
]
df = spark.createDataFrame(data, ['group'])

Now I want to create a new column number that is 0 if group is in the list zeros = ['baz', 'qux'], 1 if it is in ones = ['foo'], and 2 otherwise. In pandas I'd use .isin(), but I don't understand how to solve this in PySpark.
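For reference, the pandas version of this mapping could be written with .isin() and numpy's np.select, which evaluates conditions in order much like a chained when(). This is a sketch; the DataFrame pdf here is a local stand-in for the data in the question:

```python
import numpy as np
import pandas as pd

zeros = ['baz', 'qux']
ones = ['foo']

pdf = pd.DataFrame({'group': ['foo', 'baz', 'bar', 'qux']})

# np.select checks the conditions in order and falls back to the default,
# mirroring a when().when().otherwise() chain
pdf['number'] = np.select(
    [pdf['group'].isin(zeros), pdf['group'].isin(ones)],
    [0, 1],
    default=2,
)
print(pdf['number'].tolist())  # [1, 0, 2, 0]
```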

Here is what I've tried, but it does not work:

df.withColumn("number",
    func.when(func.col("group") == array(*[lit(x) for x in ones]), 1)
        .otherwise(2))

You can also use isin in PySpark. See the syntax below:

import pyspark.sql.functions as F

zeros = ['baz', 'qux']
ones = ['foo']

df2 = df.withColumn('number',
    F.when(F.col('group').isin(zeros), 0)
     .when(F.col('group').isin(ones), 1)
     .otherwise(2)
)

df2.show()
+-----+------+
|group|number|
+-----+------+
|  foo|     1|
|  baz|     0|
|  bar|     2|
|  qux|     0|
+-----+------+
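If the mapping grows beyond a couple of lists, the same logic can also be expressed as a plain Python function. Note that the built-in when()/isin() chain above is generally preferable to a UDF for performance; the following is only a sketch, and the UDF wrapping shown in the comments is an assumption about how you might hook it into Spark:

```python
zeros = ['baz', 'qux']
ones = ['foo']

def label(group):
    # Mirrors the when()/otherwise() chain: check zeros first, then ones, else 2
    if group in zeros:
        return 0
    if group in ones:
        return 1
    return 2

# In Spark this could be registered as a UDF, e.g.:
#   import pyspark.sql.functions as F
#   from pyspark.sql.types import IntegerType
#   df.withColumn('number', F.udf(label, IntegerType())(F.col('group')))

print([label(g) for g in ['foo', 'baz', 'bar', 'qux']])  # [1, 0, 2, 0]
```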
