Performing a merge in Python when I don't want the values to repeat

Hi, this is a follow-up to one of my previous questions, "How do I perform a vlookup equivalent operation on my dataframe with some additional conditions".

As in the other question, my first dataframe is

import pandas as pd

# renamed from `list`/`list2` so the built-in `list` is not shadowed
products = ['Computer', 'AA', 'Monitor', 'BB', 'Printer1', 'BB', 'Desk', 'AA', 'Printer2', 'DD', 'Desk', 'BB']
numbers = [1500, 232, 300, 2323, 150, 2323, 250, 2323, 23, 34, 45, 56]
df = pd.DataFrame(products, columns=['product'])
df['number'] = numbers

What if my second dataframe has multiple values for, say, 'AA', as shown below?

list_n = ['AA','AA','BB','BB','CC','DD']
list_n2 = ['Y','N','N','Y','N','Y']

df2 = pd.DataFrame(list_n,columns=['product'])
df2['to_add'] = list_n2

This is how it looks:

  product to_add
0      AA      Y
1      AA      N
2      BB      N
3      BB      Y
4      CC      N
5      DD      Y

When I perform pd.merge(df, df2, on="product", how="left"), I get this:

     product  number to_add
0   Computer    1500    NaN
1         AA     232      Y
2         AA     232      N
3    Monitor     300    NaN
4         BB    2323      N
5         BB    2323      Y
6   Printer1     150    NaN
7         BB    2323      N
8         BB    2323      Y
9       Desk     250    NaN
10        AA    2323      Y
11        AA    2323      N
12  Printer2      23    NaN
13        DD      34      Y
14      Desk      45    NaN
15        BB      56      N
16        BB      56      Y

As you can see, there are now multiple rows for AA and BB. I just want the first value (or any one of the values) for 'AA' (and 'BB') to be pulled across, without altering the row order of the dataframe. In short, I don't want multiple rows per product. To clarify, my df2 has over 6000 rows and I don't know which entries are duplicated.

So the answer should look something like this:

     product  number to_add
0   Computer    1500    NaN
1         AA     232      Y
2    Monitor     300    NaN
3         BB    2323      N
4   Printer1     150    NaN
5         BB    2323      N
6       Desk     250    NaN
7         AA    2323      Y
8   Printer2      23    NaN
9         DD      34      Y
10      Desk      45    NaN
11        BB      56      N

Use:

df_m = pd.merge(df, df2, on="product", how="left")

m = df_m["product"].isin(df2["product"]) & df_m["product"].eq(df_m["product"].shift())
df_m = df_m[~m].reset_index(drop=True)
print(df_m)

This prints:

     product  number to_add
0   Computer    1500    NaN
1         AA     232      Y
2    Monitor     300    NaN
3         BB    2323      N
4   Printer1     150    NaN
5         BB    2323      N
6       Desk     250    NaN
7         AA    2323      Y
8   Printer2      23    NaN
9         DD      34      Y
10      Desk      45    NaN
11        BB      56      N
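
Note that the mask above relies on the duplicated matches sitting next to each other after the merge, so it would also drop a genuine row if df itself had the same product on two consecutive rows. An alternative, sketched here as one possible approach rather than the method from the answer above, is to de-duplicate df2 before merging so that each product can match at most once. The names dup_keys, lookup and result are introduced only for this illustration, and keeping the first occurrence is an assumption based on the question's "first value (or one of the values)":

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["Computer", "AA", "Monitor", "BB", "Printer1", "BB",
                "Desk", "AA", "Printer2", "DD", "Desk", "BB"],
    "number": [1500, 232, 300, 2323, 150, 2323, 250, 2323, 23, 34, 45, 56],
})
df2 = pd.DataFrame({
    "product": ["AA", "AA", "BB", "BB", "CC", "DD"],
    "to_add": ["Y", "N", "N", "Y", "N", "Y"],
})

# Products that occur more than once in the lookup table (for inspection only).
dup_keys = df2.loc[df2["product"].duplicated(), "product"].unique()
print(dup_keys)

# Keep only the first row per product, so every key is unique on the right side.
lookup = df2.drop_duplicates(subset="product", keep="first")

# A left merge keeps the rows of df in their original order;
# validate="many_to_one" raises if the right side still has duplicate keys.
result = df.merge(lookup, on="product", how="left", validate="many_to_one")
print(result)
```

Because the right-hand keys are unique, the result has exactly one row per row of df, and no filtering is needed afterwards.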
