
How to filter row by row in Spark DataFrame?

I have a Spark DataFrame like this:

 code             list_code
 1002             [1005, 1006, 1007, ....]
 1005             [1005, 1009, 1101, ....]

How can I filter rows where code is contained in list_code using PySpark? The comparison has to happen row by row, so the usual approach doesn't work:

df.filter((df.code.isin(df.list_code)))

Use array_contains as suggested in the comments:

import pyspark.sql.functions as F

df2 = df.filter(F.array_contains(F.col('list_code'), F.col('code')))
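
For completeness, here is a minimal self-contained sketch (assuming Spark 3.x, where the Python array_contains wrapper accepts a Column as the value) that rebuilds the sample data from the question and applies the filter:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question
df = spark.createDataFrame(
    [(1002, [1005, 1006, 1007]), (1005, [1005, 1009, 1101])],
    ['code', 'list_code'],
)

# Keep only rows whose code appears in that row's list_code array
df.filter(F.array_contains(F.col('list_code'), F.col('code'))).show()

+----+------------------+
|code|         list_code|
+----+------------------+
|1005|[1005, 1009, 1101]|
+----+------------------+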

isin() works in PySpark when the argument is a Python list, not a column. Check this:

df = spark.sql(""" with t1 as (
 select 1002 code, array(1005, 1006, 1007) list_code union all
 select 1005 code, array(1005, 1009, 1101) list_code
 ) select code, list_code from t1
 """)
df.show()


+----+------------------+
|code|         list_code|
+----+------------------+
|1002|[1005, 1006, 1007]|
|1005|[1005, 1009, 1101]|
+----+------------------+

in_arr = [2002, 3002, 1002]

df.filter(df.code.isin(in_arr)).show()

+----+------------------+
|code|         list_code|
+----+------------------+
|1002|[1005, 1006, 1007]|
+----+------------------+

If you want to compare one column with another column, use the array_contains() function:

df.createOrReplaceTempView("df")
spark.sql("  select code, list_code from df where array_contains(list_code, code) ").show()

+----+------------------+
|code|         list_code|
+----+------------------+
|1005|[1005, 1009, 1101]|
+----+------------------+
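
On older PySpark versions, where passing a Column as the second argument of F.array_contains may not be supported, one sketch of a workaround is to push the column-to-column comparison into a SQL expression with F.expr, the same predicate used in the SQL query above:

import pyspark.sql.functions as F

# Same array_contains(list_code, code) predicate, usable from the DataFrame API
df.filter(F.expr("array_contains(list_code, code)")).show()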
