
pyspark dataframes join by iterable column

I would like to join two pyspark dataframes based on multiple columns.

tab1:

id      name (string, size=3)  val (long int)
6725    fnc                    5219
8576    fnc                    829
9192    sct                    72912
782     sct                    1022

tab2:

name (string, size=6)     val (array of long int)
fnceda                    [11, 25, 5219]
fncytfd                   [71, 829, 320]
sctvbd                    [357, 72912, 508]
sctgsd                    [796, 52, 67]

I need to get a new table such that

  the "name" in "tab1" matches the first 3 letters of the "name" in "tab2",
  and the "val" in "tab1" appears in the "val" array of "tab2".
  All rows that do not satisfy both conditions need to be removed.


Expected result:

id      name (string, size=3)  val (long int)
6725    fnc                    5219
8576    fnc                    829
9192    sct                    72912
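
To pin down the intended rule independently of Spark, here is a minimal plain-Python sketch over the sample rows above. The names `tab1_rows`, `tab2_rows`, and `matches` are made up for illustration, and this is only a reference for the expected result, not how the join would run on a cluster.

```python
# Sample rows copied from tab1 and tab2 above.
tab1_rows = [
    (6725, "fnc", 5219),
    (8576, "fnc", 829),
    (9192, "sct", 72912),
    (782, "sct", 1022),
]
tab2_rows = [
    ("fnceda", [11, 25, 5219]),
    ("fncytfd", [71, 829, 320]),
    ("sctvbd", [357, 72912, 508]),
    ("sctgsd", [796, 52, 67]),
]


def matches(row1, row2):
    """Return True when both conditions from the question hold for a row pair."""
    _id, name1, val1 = row1
    name2, vals2 = row2
    return name1 == name2[:3] and val1 in vals2


# Keep a tab1 row if at least one tab2 row satisfies both conditions.
expected = [r1 for r1 in tab1_rows if any(matches(r1, r2) for r2 in tab2_rows)]

for row in expected:
    print(row)
```

The row with id 782 is dropped because 1022 does not appear in any "sct" array.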
 

My code:

 tab1.join(tab2,
           tab1["name"] == F.substring(tab2["name"], 1, 3)
           & F.array_contains(tab2["val"], tab1["val"]),
           "inner"
          )

Got error:

     Column is not iterable 

It seems that an array column cannot be used as a join condition?

Thanks

This can be accomplished in 3 steps.

Step 1: Create a new column in tab2 that holds the first three characters of name.

from pyspark.sql.functions import substring, explode
# substring positions are 1-based
tab2_df = tab2_df.withColumn('new_name', substring('name', 1, 3))

Step 2: Explode tab2.val so that each row holds a single long value instead of an array of longs.

tab2_df = tab2_df.withColumn('value', explode('val'))

Step 3: Join tab1 and tab2, comparing name with new_name and val with value.

tab3_df = tab1_df.join(tab2_df, [(tab1_df.name == tab2_df.new_name) & (tab1_df.val == tab2_df.value)], how="inner")
# display() is a Databricks notebook helper; elsewhere use tab3_df.show()
display(tab3_df)
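
The three steps can be traced without a Spark session. Here is a rough plain-Python sketch, assuming the sample rows from the question; `exploded` and `joined` are illustrative names, and list comprehensions stand in for `explode` and the equi-join.

```python
# Sample rows copied from the question.
tab1_rows = [
    (6725, "fnc", 5219),
    (8576, "fnc", 829),
    (9192, "sct", 72912),
    (782, "sct", 1022),
]
tab2_rows = [
    ("fnceda", [11, 25, 5219]),
    ("fncytfd", [71, 829, 320]),
    ("sctvbd", [357, 72912, 508]),
    ("sctgsd", [796, 52, 67]),
]

# Steps 1 and 2: one (new_name, value, name) tuple per array element,
# mimicking substring followed by explode.
exploded = [(name[:3], value, name) for name, values in tab2_rows for value in values]

# Step 3: keep pairs where name == new_name and val == value.
joined = [
    (row_id, name, val, full_name)
    for row_id, name, val in tab1_rows
    for new_name, value, full_name in exploded
    if name == new_name and val == value
]

for row in joined:
    print(row)
```

One consequence of this shape: if an array held the same value twice, the exploded table would contain that pair twice and the join would emit the matching tab1 row twice, so deduplication may be needed in that case.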

You need to wrap your first condition in parentheses, because in Python & binds more tightly than ==. Then you'll be fine:

df.join(df2, (df['name']==F.substring(df2['name'], 1, 3)) & F.array_contains(df2['val'], df['val']), 'inner').show()

+----+----+-----+-------+-----------------+
|  id|name|  val|   name|              val|
+----+----+-----+-------+-----------------+
|6725| fnc| 5219| fnceda|   [11, 25, 5219]|
|8576| fnc|  829|fncytfd|   [71, 829, 320]|
|9192| sct|72912| sctvbd|[357, 72912, 508]|
+----+----+-----+-------+-----------------+
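
To see why the parentheses matter, here is a small sketch that needs no Spark at all. `Expr` is a hypothetical toy class (not part of PySpark) that only records how Python groups `==` and `&` when both are overloaded, as they are on Column objects.

```python
class Expr:
    """Toy stand-in for a column expression; records operator grouping as text."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        return Expr(f"({self.text} == {other.text})")

    def __and__(self, other):
        return Expr(f"({self.text} & {other.text})")


a, b, c = Expr("a"), Expr("b"), Expr("c")

# Without parentheses, & is applied first, so this is a == (b & c).
unparenthesized = (a == b & c).text

# With parentheses, the comparison is built first and then combined.
parenthesized = ((a == b) & c).text

print(unparenthesized)
print(parenthesized)
```

In the join above, `a` plays the role of `df['name']`, `b` the `F.substring(...)` call, and `c` the `F.array_contains(...)` call.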
