
How to get the index of one column's value within another ArrayType() column in PySpark?

I'm using Spark 2.4.
I have an ArrayType(StringType()) column and a StringType() column in a Spark dataframe. For each row, I need to find the position of the StringType() value within the ArrayType(StringType()) column.

Sample Input:

+---------------+---------+
|arrayCol       |stringCol|
+---------------+---------+
|['a', 'b', 'c']|'b'      |
+---------------+---------+
|['a', 'b', 'c']|'d'      |
+---------------+---------+

Sample Output:

+---------------+---------+-----+
|arrayCol       |stringCol|Index|
+---------------+---------+-----+
|['a', 'b', 'c']|'b'      |2    |
+---------------+---------+-----+
|['a', 'b', 'c']|'d'      |null |
+---------------+---------+-----+

I tried array_position, passing the string column as the second argument, but it fails with a "Column is not iterable" error.
I also tried combining expr, transform, and array_position, but I'm wondering whether there is a solution that doesn't need expr.
Thanks :)

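Use array_position inside expr. In Spark 2.4 the Python function array_position(col, value) expects a literal as its second argument, so passing a Column raises "Column is not iterable". Inside an expr string both arguments can be column names. array_position is 1-based and returns 0 when the value is not found, so the 0 has to be mapped to null.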

Example:

df.show()
#+---------+---------+
#| arrayCol|stringCol|
#+---------+---------+
#|[a, b, c]|        b|
#|[a, b, c]|        d|
#+---------+---------+

from pyspark.sql.functions import expr

# array_position is 1-based and returns 0 when the value is absent, so map 0 to null
df.withColumn(
    "Index",
    expr("if(array_position(arrayCol, stringCol) = 0, null, array_position(arrayCol, stringCol))")
).show()
#+---------+---------+-----+
#| arrayCol|stringCol|Index|
#+---------+---------+-----+
#|[a, b, c]|        b|    2|
#|[a, b, c]|        d| null|
#+---------+---------+-----+
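
If expr really has to be avoided, one possible alternative is a Python UDF. The following is only a sketch: position_or_none and with_index_column are hypothetical helper names, not part of PySpark, and a Python UDF is usually slower than the built-in array_position because each row is passed through the Python worker.

```python
from typing import List, Optional


def position_or_none(values: Optional[List[str]], target: Optional[str]) -> Optional[int]:
    """Return the 1-based position of target in values, or None if it is absent."""
    if values is None or target is None:
        return None
    try:
        # list.index is 0-based; add 1 to match Spark's 1-based array positions
        return values.index(target) + 1
    except ValueError:
        # target is not in the list
        return None


def with_index_column(df, array_col="arrayCol", string_col="stringCol", out_col="Index"):
    """Add out_col to a Spark DataFrame using a Python UDF instead of expr.

    Hypothetical helper; the default column names are taken from the question.
    """
    # Imported lazily so position_or_none can be used and tested without Spark.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import IntegerType

    position_udf = udf(position_or_none, IntegerType())
    return df.withColumn(out_col, position_udf(col(array_col), col(string_col)))
```

Calling with_index_column(df).show() on the dataframe above should give the same Index column as the expr version, with null where stringCol does not occur in arrayCol.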
