How can I subset a data frame for unique rows using repeating values from a column in another data frame in Python?

Question

I have 2 data frames. I want to subset df_1 based on df_2 so that the rows in the resulting data frame correspond to the rows in df_2. Here are two example data frames:

import pandas as pd

df_1 = pd.DataFrame({
    "ID": ["Lemon","Banana","Apple","Cherry","Tomato","Blueberry","Avocado","Lime"],
    "Color": ["Yellow","Yellow","Red","Red","Red","Blue","Green","Green"]})

df_2 = pd.DataFrame({"Color": ["Red","Blue","Yellow","Green","Red","Yellow"]})

My desired output is df_3, where the "Color" column is the same as in df_2:

df_3 = pd.DataFrame({
    "ID": ["Apple","Blueberry","Lemon","Avocado","Cherry","Banana"], 
    "Color": ["Red","Blue","Yellow","Green","Red","Yellow"]})

When I merge df_1 and df_2, I get duplicated rows because most of the rows in df_2 have multiple matches in df_1.

merged = df_2.merge(df_1, how="left", on="Color")

Dropping duplicates works properly for the "Yellow" color because it has a 2:2 ratio of values in df_2 and options in df_1, but it doesn't work properly for "Red" or "Green" because they have a 2:3 ratio and a 1:2 ratio respectively, resulting in extra rows.

no_duplicates = merged.drop_duplicates(subset="ID")
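
To make the mismatch concrete, here is a small self-contained check of the row counts, using only the example frames and the two calls shown above. It is a sketch to illustrate the problem, not a fix:

```python
import pandas as pd

df_1 = pd.DataFrame({
    "ID": ["Lemon", "Banana", "Apple", "Cherry", "Tomato", "Blueberry", "Avocado", "Lime"],
    "Color": ["Yellow", "Yellow", "Red", "Red", "Red", "Blue", "Green", "Green"],
})
df_2 = pd.DataFrame({"Color": ["Red", "Blue", "Yellow", "Green", "Red", "Yellow"]})

# Every row of df_2 is paired with every df_1 row of the same Color.
merged = df_2.merge(df_1, how="left", on="Color")

# Keeps the first row for each ID, so every fruit in df_1 survives once.
no_duplicates = merged.drop_duplicates(subset="ID")

print(len(df_2), len(merged), len(no_duplicates))  # 6 13 8
```

The merge produces 13 rows, and dropping duplicates by "ID" still leaves 8 (one per fruit) rather than the 6 rows of df_2.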

Is there a way to subset df_1 where the first occurrence of "Red" in df_2 pulls out the first occurrence of "Red" in df_1, the second occurrence of "Red" in df_2 pulls out the second occurrence of "Red" in df_1, etc.? I would rather not use a loop unless I have no other choice. Thank you.

Answer 1

Try adding a counter column to both df_1 and df_2 with groupby cumcount, so that each row also records its position among the rows that share its Color:

df_1['i'] = df_1.groupby('Color').cumcount()
df_2['i'] = df_2.groupby('Color').cumcount()

df_1 :

          ID   Color  i
0      Lemon  Yellow  0
1     Banana  Yellow  1
2      Apple     Red  0
3     Cherry     Red  1
4     Tomato     Red  2
5  Blueberry    Blue  0
6    Avocado   Green  0
7       Lime   Green  1

df_2 :

    Color  i
0     Red  0
1    Blue  0
2  Yellow  0
3   Green  0
4     Red  1
5  Yellow  1

Then merge on both Color and the counter column, and drop the counter afterwards. Using how='right' keeps the rows in the order of df_2:

merged_df = df_1.merge(df_2, how='right', on=['Color', 'i']).drop('i', axis=1)

merged_df :

          ID   Color
0      Apple     Red
1  Blueberry    Blue
2      Lemon  Yellow
3    Avocado   Green
4     Cherry     Red
5     Banana  Yellow
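
For a version that runs end to end, here is a minimal sketch of the same idea. It uses assign so that the counter column is never written into df_1 or df_2; the names left, right and result are only illustrative and not part of the original answer:

```python
import pandas as pd

df_1 = pd.DataFrame({
    "ID": ["Lemon", "Banana", "Apple", "Cherry", "Tomato", "Blueberry", "Avocado", "Lime"],
    "Color": ["Yellow", "Yellow", "Red", "Red", "Red", "Blue", "Green", "Green"],
})
df_2 = pd.DataFrame({"Color": ["Red", "Blue", "Yellow", "Green", "Red", "Yellow"]})

# Number each row within its Color group: 0 for the first "Red", 1 for the second, ...
left = df_1.assign(i=df_1.groupby("Color").cumcount())
right = df_2.assign(i=df_2.groupby("Color").cumcount())

# Match the n-th occurrence of a Color in df_2 with the n-th occurrence in df_1.
result = left.merge(right, how="right", on=["Color", "i"]).drop(columns="i")

print(result)
```

Because this is a right join, a color that occurs more often in df_2 than in df_1 would still produce a row, with NaN in "ID".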

Alternatively, pass the cumcount Series directly to merge (this leaves df_1 and df_2 unmodified). The unnamed key shows up in the result as key_1, which is then dropped:

merged_df = df_1.merge(
    df_2, how='right',
    left_on=['Color', df_1.groupby('Color').cumcount()],
    right_on=['Color', df_2.groupby('Color').cumcount()]
).drop('key_1', axis=1)

merged_df :

          ID   Color
0      Apple     Red
1  Blueberry    Blue
2      Lemon  Yellow
3    Avocado   Green
4     Cherry     Red
5     Banana  Yellow

Answer 1: accepted, 2021-06-23 03:17:19
