
pyspark join with more conditions

I am trying to join two DataFrames with a "left" join on the "item" column, with some extra conditions.

If df2 doesn't have an "equivalent_item" for an item, then I want to use df1's "item" itself. If df2's "equivalent_item" is null (e.g. kiwi), then the equivalent item should be null, and later I can drop that row.

df1:

name     item
jack     rice
hari     banana
mala     apples
kin      kiwi
Mike     salt
fall     sugar
yedy     pasta
vall     fruits   

df2:

item     equivalent_item
rice      basmathi
banana    delmonte 
apples    fuji apple
kiwi 
pasta     barello

Expected Output:

name     item         equivalent_item
jack     rice         basmathi
hari     banana       delmonte
mala     apples       fuji apple
kin      kiwi
Mike     salt         salt
fall     sugar        sugar
yedy     pasta        barello
vall     fruits       fruits  

So far I have had to hard-code it like below:

def equivalent_name(name):
    # Hard-coded mapping from item to its equivalent;
    # unmapped items fall back to the item itself.
    if name == 'rice':
        return 'basmathi'
    elif name == 'banana':
        return 'delmonte'
    elif name == 'apples':
        return 'fuji apple'
    elif name == 'pasta':
        return 'barello'
    else:
        return name

df1['equivalent_item'] = df1['item'].apply(equivalent_name)
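
As an aside, df1['item'].apply(...) is pandas syntax; a PySpark DataFrame column has no .apply, so the same hard-coded mapping would need a UDF. A minimal sketch, assuming df1 is a Spark DataFrame:

from pyspark.sql import functions as F

# Wrap the Python function above in a Spark UDF (returns StringType by default).
equivalent_name_udf = F.udf(equivalent_name)

# Apply it to the "item" column to build the equivalent_item column.
df1 = df1.withColumn("equivalent_item", equivalent_name_udf(F.col("item")))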

Do a left join using df.join():

df1.join(df2, ["item"], "left")

In case the join columns have different names in the two DataFrames, then use:

df1.join(df2, df1["item_1"] == df2["item_2"], "left")

This will result in a DataFrame with both item_1 and item_2 columns; you can drop the one that is not required.
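
For example, with the hypothetical column names item_1 and item_2:

joined = df1.join(df2, df1["item_1"] == df2["item_2"], "left")

# Drop the duplicate join column that came from df2.
joined = joined.drop(df2["item_2"])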
