
How do I add rows from one DataFrame to another when they are not already present, using pandas?

I have two different Excel files, which I read using pd.read_excel. The first is a kind of master file with many columns; only the relevant ones are shown here (df1):

Company Name                                              Excel Company ID
0                                    cleverbridge AG      IQ109133656
1  BT España, Compañía de Servicios Globales de T...        IQ3806173
2                                   Technoserv Group       IQ40333012
3                                    Blue Media S.A.       IQ50008102
4            zeb.rolfes.schierenbeck.associates gmbh       IQ30413992

The second is an output Excel file that looks like this (df2):

company_id          found_keywords  no_of_url                                       company_name
0  IQ137156215      insurance         15                         Zühlke Technology Group AG
1    IQ3806173      insurance         15  BT España, Compañía de Servicios Globales de T...
2   IQ40333012      insurance          4                                   Technoserv Group
3   IQ51614192      insurance         15                             Octo Telematics S.p.A.

I want this output file (df2) to also include the company ID and company name of every row in df1 whose company ID and company name are not already in df2. Something like this (df2):

company_id found_keywords  no_of_url                                       company_name
0  IQ137156215      insurance         15                         Zühlke Technology Group AG
1    IQ3806173      insurance         15  BT España, Compañía de Servicios Globales de T...
2   IQ40333012      insurance          4                                   Technoserv Group
3   IQ51614192      insurance         15                             Octo Telematics S.p.A.
4   IQ30413992            NaN        NaN            zeb.rolfes.schierenbeck.associates gmbh

I tried several ways of achieving this, using pd.merge as well as np.where, and I even tried reindexing based on columns, but nothing worked. What exactly do I need to do to get the expected result? Thanks!

EDIT:

Using pd.merge:

df2.merge(df, right_on='company_id', left_on='Excel Company ID', how='outer')

This gave an output with [220 rows x 31 columns].

Your expected output is unclear. If you use pd.merge with how='outer' and indicator=True, you get:

df1 = df1.rename(columns={'Company Name': 'company_name', 'Excel Company ID': 'company_id'})
out = df2.merge(df1, on=['company_id', 'company_name'], how='outer', indicator=True)

Output:

>>> out
    company_id found_keywords  no_of_url                                       company_name      _merge
0  IQ137156215      insurance       15.0                         Zühlke Technology Group AG   left_only
1    IQ3806173      insurance       15.0  BT España, Compañía de Servicios Globales de T...        both
2   IQ40333012      insurance        4.0                                   Technoserv Group        both
3   IQ51614192      insurance       15.0                             Octo Telematics S.p.A.   left_only
4  IQ109133656            NaN        NaN                                    cleverbridge AG  right_only
5   IQ50008102            NaN        NaN                                    Blue Media S.A.  right_only
6   IQ30413992            NaN        NaN            zeb.rolfes.schierenbeck.associates gmbh  right_only

Check the last column, _merge. A value of right_only means that the company_id and company_name pair exists only in df1, i.e. it was not found in df2.
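
To go one step further and actually append the missing rows to df2, a minimal sketch could look like the following. It assumes the sample data from the question (the frames are rebuilt by hand here rather than read from Excel) and that "not present" means the exact (company_id, company_name) pair is absent from df2. The names keys, master, flags, missing and result are only illustrative. Note that this appends all three right_only rows, not just the single row shown in the question's expected output, and that the order of the appended rows may vary with the pandas version.

```python
import pandas as pd

# Sample data rebuilt from the question (names are truncated exactly as shown there).
df1 = pd.DataFrame({
    "Company Name": [
        "cleverbridge AG",
        "BT España, Compañía de Servicios Globales de T...",
        "Technoserv Group",
        "Blue Media S.A.",
        "zeb.rolfes.schierenbeck.associates gmbh",
    ],
    "Excel Company ID": [
        "IQ109133656", "IQ3806173", "IQ40333012", "IQ50008102", "IQ30413992",
    ],
})
df2 = pd.DataFrame({
    "company_id": ["IQ137156215", "IQ3806173", "IQ40333012", "IQ51614192"],
    "found_keywords": ["insurance"] * 4,
    "no_of_url": [15, 15, 4, 15],
    "company_name": [
        "Zühlke Technology Group AG",
        "BT España, Compañía de Servicios Globales de T...",
        "Technoserv Group",
        "Octo Telematics S.p.A.",
    ],
})

keys = ["company_id", "company_name"]

# Keep only the two key columns of the master file, renamed to match df2.
master = df1.rename(
    columns={"Company Name": "company_name", "Excel Company ID": "company_id"}
)[keys]

# Outer merge on the keys alone; _merge tells us where each pair came from.
flags = df2[keys].merge(master, on=keys, how="outer", indicator=True)

# Pairs that exist only in the master file.
missing = flags.loc[flags["_merge"] == "right_only", keys]

# Append them to df2; columns absent from `missing` are filled with NaN.
result = pd.concat([df2, missing], ignore_index=True)
print(result)
```

Selecting only the key columns of the master file before merging also avoids pulling its many other columns into the result, which is one likely reason the attempt in the question ended up with 31 columns.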
