How do I merge column values from one dataframe to another if they are not present in another using pandas
I have two different Excel files, which I read using pd.read_excel. The first Excel file is a kind of master file with a lot of columns. Showing only the columns that are relevant: df1
   Company Name                                       Excel Company ID
0  cleverbridge AG                                    IQ109133656
1  BT España, Compañía de Servicios Globales de T...  IQ3806173
2  Technoserv Group                                   IQ40333012
3  Blue Media S.A.                                    IQ50008102
4  zeb.rolfes.schierenbeck.associates gmbh            IQ30413992
The second Excel file is basically an output file, which looks like this: df2
   company_id   found_keywords  no_of_url  company_name
0  IQ137156215  insurance       15         Zühlke Technology Group AG
1  IQ3806173    insurance       15         BT España, Compañía de Servicios Globales de T...
2  IQ40333012   insurance       4          Technoserv Group
3  IQ51614192   insurance       15         Octo Telematics S.p.A.
I want this output file (df2) to also include the company ID and company name of every row in df1 whose company ID and company name are not already part of df2. Something like this: df2
   company_id   found_keywords  no_of_url  company_name
0  IQ137156215  insurance       15         Zühlke Technology Group AG
1  IQ3806173    insurance       15         BT España, Compañía de Servicios Globales de T...
2  IQ40333012   insurance       4          Technoserv Group
3  IQ51614192   insurance       15         Octo Telematics S.p.A.
4  IQ30413992   NaN             NaN        zeb.rolfes.schierenbeck.associates gmbh
I tried several ways of achieving this using pd.merge as well as np.where. I even tried reindexing based on columns, but nothing worked. What exactly do I need to do so that it works as expected? Please help me out. Thanks!
EDIT:
Using pd.merge:
df2.merge(df, right_on='company_id', left_on='Excel Company ID', how='outer')
which gave an output with [220 rows x 31 columns].
Your expected output is unclear. If you use pd.merge with how='outer' and indicator=True, you will have:
df1 = df1.rename(columns={'Company Name': 'company_name', 'Excel Company ID': 'company_id'})
out = df2.merge(df1, on=['company_id', 'company_name'], how='outer', indicator=True)
Output:
>>> out
   company_id   found_keywords  no_of_url  company_name                                       _merge
0  IQ137156215  insurance       15.0       Zühlke Technology Group AG                         left_only
1  IQ3806173    insurance       15.0       BT España, Compañía de Servicios Globales de T...  both
2  IQ40333012   insurance       4.0        Technoserv Group                                   both
3  IQ51614192   insurance       15.0       Octo Telematics S.p.A.                             left_only
4  IQ109133656  NaN             NaN        cleverbridge AG                                    right_only
5  IQ50008102   NaN             NaN        Blue Media S.A.                                    right_only
6  IQ30413992   NaN             NaN        zeb.rolfes.schierenbeck.associates gmbh            right_only
Check the last column, _merge. If a row is marked right_only, its company_id and company_name are not found in df2.
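If the goal is to append the missing companies to df2, one possible follow-up is to keep only the right_only rows and concatenate them onto df2. The sketch below is a minimal illustration, not a definitive solution: the sample frames are rebuilt by hand from the tables in the question, and the names master, merged, missing and result are hypothetical. It also assumes that only the two relevant master columns should be carried over, which keeps the other master columns out of the result.

```python
import pandas as pd

# Sample data rebuilt from the question (truncated names kept exactly as shown).
df1 = pd.DataFrame({
    "Company Name": [
        "cleverbridge AG",
        "BT España, Compañía de Servicios Globales de T...",
        "Technoserv Group",
        "Blue Media S.A.",
        "zeb.rolfes.schierenbeck.associates gmbh",
    ],
    "Excel Company ID": [
        "IQ109133656", "IQ3806173", "IQ40333012", "IQ50008102", "IQ30413992",
    ],
})
df2 = pd.DataFrame({
    "company_id": ["IQ137156215", "IQ3806173", "IQ40333012", "IQ51614192"],
    "found_keywords": ["insurance"] * 4,
    "no_of_url": [15, 15, 4, 15],
    "company_name": [
        "Zühlke Technology Group AG",
        "BT España, Compañía de Servicios Globales de T...",
        "Technoserv Group",
        "Octo Telematics S.p.A.",
    ],
})

# Assumption: only these two master columns are needed in the output.
master = df1[["Excel Company ID", "Company Name"]].rename(
    columns={"Excel Company ID": "company_id", "Company Name": "company_name"}
)

merged = df2.merge(
    master, on=["company_id", "company_name"], how="outer", indicator=True
)

# Rows found only in the master file are the ones missing from df2.
missing = merged.loc[merged["_merge"] == "right_only"].drop(columns="_merge")

# Append them after the existing df2 rows; found_keywords and no_of_url
# stay NaN for the appended rows.
result = pd.concat([df2, missing], ignore_index=True)
print(result)
```

With the sample data this appends all three right_only companies, whereas the example in the question shows only one of them, so the filter on missing may need adjusting depending on which rows are actually wanted.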