I have 2 data frames. I want to subset df_1 based on df_2 so that the rows in the resulting data frame correspond to the rows in df_2. Here are two example data frames:
import pandas as pd

df_1 = pd.DataFrame({
    "ID": ["Lemon", "Banana", "Apple", "Cherry", "Tomato", "Blueberry", "Avocado", "Lime"],
    "Color": ["Yellow", "Yellow", "Red", "Red", "Red", "Blue", "Green", "Green"]})
df_2 = pd.DataFrame({"Color": ["Red", "Blue", "Yellow", "Green", "Red", "Yellow"]})
My desired output is df_3, where the "Color" column is the same as in df_2:
df_3 = pd.DataFrame({
    "ID": ["Apple", "Blueberry", "Lemon", "Avocado", "Cherry", "Banana"],
    "Color": ["Red", "Blue", "Yellow", "Green", "Red", "Yellow"]})
When I merge df_1 and df_2, I get duplicated rows because most of the rows in df_2 have multiple matches in df_1.
merged = df_2.merge(df_1, how="left", on="Color")
Dropping duplicates works properly for the "Yellow" color because it has a 2:2 ratio of values in df_2 and options in df_1, but it doesn't work properly for "Red" or "Green" because they have a 2:3 ratio and a 1:2 ratio respectively, resulting in extra rows.
no_duplicates = merged.drop_duplicates(subset = "ID")
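To make the mismatch concrete, here is a small self-contained check of the row counts. The example frames are repeated so the snippet runs on its own; the counts in the comments follow from multiplying the per-color occurrences in each frame.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "ID": ["Lemon", "Banana", "Apple", "Cherry", "Tomato", "Blueberry", "Avocado", "Lime"],
    "Color": ["Yellow", "Yellow", "Red", "Red", "Red", "Blue", "Green", "Green"]})
df_2 = pd.DataFrame({"Color": ["Red", "Blue", "Yellow", "Green", "Red", "Yellow"]})

# Every df_2 row is paired with every df_1 row of the same color.
merged = df_2.merge(df_1, how="left", on="Color")
no_duplicates = merged.drop_duplicates(subset="ID")

print(len(df_2))           # 6 rows wanted
print(len(merged))         # 13 rows: Red 2*3 + Blue 1*1 + Yellow 2*2 + Green 1*2
print(len(no_duplicates))  # 8 rows: one per distinct ID, still 2 too many
```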
Is there a way to subset df_1 where the first occurrence of "Red" in df_2 pulls out the first occurrence of "Red" in df_1, the second occurrence of "Red" in df_2 pulls out the second occurrence of "Red" in df_1, etc.? I would rather not use a loop unless I have no other choice. Thank you.
Try adding an indicator column to both df_1 and df_2 with groupby cumcount, so that each row also carries its position within its Color group:
df_1['i'] = df_1.groupby('Color').cumcount()
df_2['i'] = df_2.groupby('Color').cumcount()
df_1:
          ID   Color  i
0      Lemon  Yellow  0
1     Banana  Yellow  1
2      Apple     Red  0
3     Cherry     Red  1
4     Tomato     Red  2
5  Blueberry    Blue  0
6    Avocado   Green  0
7       Lime   Green  1
df_2:
    Color  i
0     Red  0
1    Blue  0
2  Yellow  0
3   Green  0
4     Red  1
5  Yellow  1
Then merge on both the indicator and the Color, then drop the indicator column:
merged_df = df_1.merge(df_2, how='right', on=['Color', 'i']).drop('i', axis=1)
merged_df:
          ID   Color
0      Apple     Red
1  Blueberry    Blue
2      Lemon  Yellow
3    Avocado   Green
4     Cherry     Red
5     Banana  Yellow
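For anyone who wants to run the steps above end to end, here is a minimal self-contained sketch. It uses assign to build temporary copies (the names left and right are just illustrative), so df_1 and df_2 do not gain the extra column, and it compares the result with the df_3 from the question.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "ID": ["Lemon", "Banana", "Apple", "Cherry", "Tomato", "Blueberry", "Avocado", "Lime"],
    "Color": ["Yellow", "Yellow", "Red", "Red", "Red", "Blue", "Green", "Green"]})
df_2 = pd.DataFrame({"Color": ["Red", "Blue", "Yellow", "Green", "Red", "Yellow"]})
df_3 = pd.DataFrame({
    "ID": ["Apple", "Blueberry", "Lemon", "Avocado", "Cherry", "Banana"],
    "Color": ["Red", "Blue", "Yellow", "Green", "Red", "Yellow"]})

# cumcount numbers the rows within each Color group: 0, 1, 2, ...
left = df_1.assign(i=df_1.groupby("Color").cumcount())
right = df_2.assign(i=df_2.groupby("Color").cumcount())

# A right merge keeps the row order of df_2; the helper column is dropped afterwards.
merged_df = left.merge(right, how="right", on=["Color", "i"]).drop(columns="i")

# Compare column by column with the desired output.
print(merged_df["ID"].tolist() == df_3["ID"].tolist())
print(merged_df["Color"].tolist() == df_3["Color"].tolist())
```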
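Alternatively, pass the cumcount Series directly to merge (this leaves df_1 and df_2 unaffected). The unnamed Series key appears in the result as a column named key_1, which is dropped at the end: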
merged_df = df_1.merge(
    df_2, how='right',
    left_on=['Color', df_1.groupby('Color').cumcount()],
    right_on=['Color', df_2.groupby('Color').cumcount()]
).drop('key_1', axis=1)
merged_df:
          ID   Color
0      Apple     Red
1  Blueberry    Blue
2      Lemon  Yellow
3    Avocado   Green
4     Cherry     Red
5     Banana  Yellow
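If this is needed in more than one place, the idea could be wrapped in a small function. The sketch below is only one possible shape: match_by_occurrence and the temporary column name _occ are made up for illustration and are not part of pandas. It also shows what happens when the ordering frame asks for more occurrences of a value than the source frame has: because this is a right merge, the unmatched row is kept and its ID is missing (NaN).

```python
import pandas as pd

def match_by_occurrence(source, order, key):
    """Hypothetical helper: pair the n-th `key` value in `order`
    with the n-th matching row of `source`, keeping `order`'s row order."""
    # Number the occurrences of each key value in both frames (on copies).
    left = source.assign(_occ=source.groupby(key).cumcount())
    right = order.assign(_occ=order.groupby(key).cumcount())
    merged = left.merge(right, how="right", on=[key, "_occ"])
    return merged.drop(columns="_occ")

df_1 = pd.DataFrame({
    "ID": ["Lemon", "Banana", "Apple", "Cherry", "Tomato", "Blueberry", "Avocado", "Lime"],
    "Color": ["Yellow", "Yellow", "Red", "Red", "Red", "Blue", "Green", "Green"]})

# df_1 has only one "Blue" row, so the second "Blue" request cannot be matched.
too_many_blue = pd.DataFrame({"Color": ["Blue", "Red", "Blue"]})
result = match_by_occurrence(df_1, too_many_blue, "Color")
print(result)
```

Rows with a missing ID can then be dropped with result.dropna(subset=["ID"]) or kept as a signal that the source ran out of matches, depending on what the caller needs.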