简体   繁体   中英

how to merge pandas dataframe with different lengths

I have a pandas dataframe like following..

df_fav_dish 

item_id buyer_id  dish_count dish_name
  121      261           2      Null
  126      261           3      Null
  131      261           7      Null
  132      261           6      Null
  133      261           2      Null
  135      261           2      Null
  139      309           2      Null
  140      261           2      Null
  142      261           2      Null
  143      153           3      Null
  145       64           2      Null
  148      261           2      Null
  155      261           2      Null
  156       64           2      Null
  163      261           2      Null

length of above dataframe is 34. And I have another dataframe like following..

 data

 item_id      item_name
     121        Paneer 
     126       Chicken 
     131        Prawns
     132        Mutton
     133        Curd
     139       Mocktail
     140       Cocktail
     142       Biryani  
     143      Thai Curry
     145      Red Curry
     148        Fish
     155      Lobster
      69        Fish Curry  
      67       Butter
      31       Bread      
      59      Egg Curry

length of above dataframe is 322 .This data frame contains almost 300 item_id and corresponding item names Now I want to join this two dataframes on item_id. Two dataframes are of different lengths. I am doing following in python.

df_fav_dish.merge(data[['item_name','item_id']],how='left',on='item_id')

But it gives me many rows. I just want to add item_name to the first data frame from second dataframe where both the item_id equal to each other

Desired output is

item_id buyer_id  dish_count dish_name  item_name
  121      261           2      Null     paneer
  126      261           3      Null     Chicken
  131      261           7      Null     prawns
  132      261           6      Null     Mutton
  133      261           2      Null     Curd
  135      261           2      Null     
  139      309           2      Null     Mocktail
  140      261           2      Null     Cocktail
  142      261           2      Null     Biryani
  143      153           3      Null     Thai Curry
  145       64           2      Null     Red Curry
  148      261           2      Null     Fish
  155      261           2      Null     Lobster
  156       64           2      Null     
  163      261           2      Null     

Your column item_id in dataframe data contains duplicity, so:

If no duplicity:

print data
   item_id item_name
0      121    Paneer
1      140   Chicken
2      131    Prawns

print df_fav_dish 
   item_id  buyer_id  dish_count dish_name
0      139       309           2      Null
1      140       261           2      Null
2      142       261           2      Null
3      143       153           3      Null

print df_fav_dish.merge(data[['item_name','item_id']],how='left',on='item_id')
   item_id  buyer_id  dish_count dish_name item_name
0      139       309           2      Null       NaN
1      140       261           2      Null   Chicken
2      142       261           2      Null       NaN
3      143       153           3      Null       NaN

With duplicity all duplicity rows are joined:

print data
   item_id item_name
0      140    Paneer
1      140   Chicken
2      140    Prawns

print df_fav_dish 
   item_id  buyer_id  dish_count dish_name
0      139       309           2      Null
1      140       261           2      Null
2      142       261           2      Null
3      143       153           3      Null

print df_fav_dish.merge(data[['item_name','item_id']],how='left',on='item_id')
   item_id  buyer_id  dish_count dish_name item_name
0      139       309           2      Null       NaN
1      140       261           2      Null    Paneer
2      140       261           2      Null   Chicken
3      140       261           2      Null    Prawns
4      142       261           2      Null       NaN
5      143       153           3      Null       NaN

So you can drop_duplicates :

# Drop duplicates except for the first occurrence
print df.drop_duplicates(subset='item_id', keep='first')
   item_id  buyer_id  dish_count dish_name item_name
0      139       309           2      Null       NaN
1      140       261           2      Null    Paneer
4      142       261           2      Null       NaN
5      143       153           3      Null       NaN

# Drop duplicates except for the last occurrence
print df.drop_duplicates(subset='item_id', keep='last')
   item_id  buyer_id  dish_count dish_name item_name
0      139       309           2      Null       NaN
3      140       261           2      Null    Prawns
4      142       261           2      Null       NaN
5      143       153           3      Null       NaN

# Drop all duplicates
print df.drop_duplicates(subset='item_id', keep=False)
   item_id  buyer_id  dish_count dish_name item_name
0      139       309           2      Null       NaN
4      142       261           2      Null       NaN
5      143       153           3      Null       NaN
df_new = pd.merge(df_fav_dish,data[['item_name','item_id']],left_on='item_id',right_on='item_id',how='left')

It's kind off the same code as yours.

I assume each item_id has an unique item_name? If not, thats why you're getting more rows than the lengt off the 'df_fav_dish' dataframe..

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM