pyspark dataframes join by iterable column

I would like to join two pyspark dataframes on multiple columns.

tab1:

id      name (string, size=3)  val (long int)
6725    fnc                    5219
8576    fnc                    829
9192    sct                    72912
782     sct                    1022

tab2:

name (string, size=6)     val (array of long int)
fnceda                    [11, 25, 5219]
fncytfd                   [71, 829, 320]
sctvbd                    [357, 72912, 508]
sctgsd                    [796, 52, 67]

I need to get a new table such that

  the "name" in "tab1" matches the first 3 letters of the "name" in "tab2",
  and the "val" in "tab1" appears in the "val" array of "tab2".
  All other rows, which do not satisfy both conditions, need to be removed.


id      name (string, size=3)  val (long int)
6725    fnc                    5219
8576    fnc                    829
9192    sct                    72912
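
To pin down the two conditions, here is a plain-Python sketch (no Spark involved) over the sample rows above. The function name `matching_rows` and the tuple layout are assumptions made for illustration only. It keeps each tab1 row at most once, which matches the expected table but is closer to a semi join than to an inner join.

```python
# Plain-Python model of the desired result; the data is copied from the tables above.
tab1 = [
    (6725, "fnc", 5219),
    (8576, "fnc", 829),
    (9192, "sct", 72912),
    (782, "sct", 1022),
]
tab2 = [
    ("fnceda", [11, 25, 5219]),
    ("fncytfd", [71, 829, 320]),
    ("sctvbd", [357, 72912, 508]),
    ("sctgsd", [796, 52, 67]),
]


def matching_rows(left, right):
    """Keep left rows whose name equals the first 3 letters of some right
    name and whose val occurs in that same right row's array."""
    kept = []
    for row_id, name, val in left:
        if any(r_name[:3] == name and val in r_vals for r_name, r_vals in right):
            kept.append((row_id, name, val))
    return kept


result = matching_rows(tab1, tab2)
print(result)
```

Row 782 is dropped because 1022 does not occur in either "sct..." array.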
 

My code:

from pyspark.sql import functions as F

tab1.join(tab2,
          tab1["name"] == F.substring(tab2["name"], 1, 3)
          & F.array_contains(tab2["val"], tab1["val"]),
          "inner"
         )

Got error:

     Column is not iterable 

It seems that an array column cannot be used as a join condition?

Thanks

This can be accomplished in 3 steps.

Step 1: Create a new column in tab2 that holds the first three characters of name.

from pyspark.sql.functions import substring, explode

# substring positions are 1-based: take the first 3 characters of name
tab2_df = tab2_df.withColumn('new_name', substring('name', 1, 3))

Step 2: Explode tab2.val so that each row holds a single long value instead of an array of longs.

tab2_df = tab2_df.withColumn('value', explode('val'))

Step 3: Perform a join between tab1 and tab2 by comparing name with new_name and val with value.

tab3_df = tab1_df.join(tab2_df, (tab1_df.name == tab2_df.new_name) & (tab1_df.val == tab2_df.value), how="inner")
tab3_df.show()  # or display(tab3_df) in a Databricks notebook
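
The idea behind the three steps is that exploding the array turns the membership test into a plain equality on (prefix, value). A plain-Python model of that idea, with no Spark involved, might look like the sketch below; the helper name `explode_rows` and the dictionary index are assumptions for illustration and say nothing about how Spark executes the join.

```python
from collections import defaultdict

# Sample data copied from the question.
tab1 = [
    (6725, "fnc", 5219),
    (8576, "fnc", 829),
    (9192, "sct", 72912),
    (782, "sct", 1022),
]
tab2 = [
    ("fnceda", [11, 25, 5219]),
    ("fncytfd", [71, 829, 320]),
    ("sctvbd", [357, 72912, 508]),
    ("sctgsd", [796, 52, 67]),
]


def explode_rows(rows):
    """Steps 1 and 2: emit one (name, 3-letter prefix, value) per array element."""
    for name, values in rows:
        for value in values:
            yield name, name[:3], value


# Step 3: an equality join on (prefix, value), modelled with a dictionary lookup.
index = defaultdict(list)
for full_name, prefix, value in explode_rows(tab2):
    index[(prefix, value)].append(full_name)

joined = [
    (row_id, name, val, full_name)
    for row_id, name, val in tab1
    for full_name in index.get((name, val), [])
]
print(joined)
```

One consequence worth keeping in mind: if an array contained the same value twice, exploding would produce two rows for it and the match would appear twice in the joined result.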

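You need to wrap your first condition in parentheses, then you'll be fine. In Python, & binds more tightly than ==, so without the parentheses the condition is parsed as name == (substring(...) & array_contains(...)).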

df.join(df2, (df['name']==F.substring(df2['name'], 1, 3)) & F.array_contains(df2['val'], df['val']), 'inner').show()

+----+----+-----+-------+-----------------+
|  id|name|  val|   name|              val|
+----+----+-----+-------+-----------------+
|6725| fnc| 5219| fnceda|   [11, 25, 5219]|
|8576| fnc|  829|fncytfd|   [71, 829, 320]|
|9192| sct|72912| sctvbd|[357, 72912, 508]|
+----+----+-----+-------+-----------------+
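
The effect of the parentheses can be checked without Spark by asking Python how it parses the two forms. This is a small sketch using the standard `ast` module; the names `a`, `b`, `c` are placeholders for the three column expressions, and `top_level_node` is a helper made up for this illustration.

```python
import ast


def top_level_node(expr):
    """Return the type name of the outermost node Python builds for expr."""
    return type(ast.parse(expr, mode="eval").body).__name__


unparenthesized = top_level_node("a == b & c")
parenthesized = top_level_node("(a == b) & c")

# Without parentheses the outermost operation is the comparison,
# i.e. a == (b & c); with them it is the bitwise AND, i.e. (a == b) & c.
print(unparenthesized)
print(parenthesized)
```

The first form has a comparison at the top with `b & c` as its right-hand side, which is not the intended "both conditions hold" expression.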
