PySpark join with more conditions
I am trying to left-join two DataFrames on "item", with some extra conditions.
If df2 has no row for an item, I want to use the df1 "item" itself as the equivalent item. If df2 has the item but its "equivalent_item" is null (e.g. kiwi), the equivalent item should stay null so that I can drop that row later.
df1:
name item
jack rice
hari banana
mala apples
kin kiwi
Mike salt
fall sugar
yedy pasta
vall fruits
df2:
item equivalent_item
rice basmathi
banana delmonte
apples fuji apple
kiwi
pasta barello
Expected Output:
name items equivalent_item
jack rice basmathi
hari banana delmonte
mala apples fuji apple
kin kiwi
Mike salt salt
fall sugar sugar
yedy pasta barello
vall fruits fruits
I had to do it like this:

    def equivalent_name(name):
        # Map each item to its equivalent; unknown items map to themselves.
        if name == 'rice':
            return 'basmathi'
        elif name == 'banana':
            return 'delmonte'
        elif name == 'apples':
            return 'fuji apple'
        elif name == 'pasta':
            return 'barello'
        else:
            return name

    # Apply the mapping to the item column (the function maps item names).
    df1['equivalent_item'] = df1['item'].apply(equivalent_name)
Do a left join using df.join():
df1.join(df2, ["item"], "left")
If the join columns have different names in the two DataFrames, use a join expression instead:
df1.join(df2, df1["item_1"] == df2["item_2"], "left")
The result then contains both the item_1 and item_2 columns; you can drop whichever one you do not need.