Compare a column in one dataframe with many columns in another dataframe pandas
I have two dataframes:
df1:
ID name1
0 '' 'company-1'
1 '' 'company2'
2 '' 'company 3'
df2:
ID name2 name3 name4
0 '1' 'company1' 'company.1' 'company-1'
1 '2' 'company2' 'company.2' 'company-2'
I want to compare df1['name1'] to the name columns in df2 and put the matching ID from df2 into the ID column of df1.
I did this:
for i in range(len(df1)):
    for j in range(len(df2)):
        # .loc[i, 'ID'] avoids chained assignment; assumes df1 has the default integer index
        if df1.iloc[i]['name1'] == df2.iloc[j]['name2']:
            df1.loc[i, 'ID'] = df2.iloc[j]['ID']
            break
        elif df1.iloc[i]['name1'] == df2.iloc[j]['name3']:
            df1.loc[i, 'ID'] = df2.iloc[j]['ID']
            break
        elif df1.iloc[i]['name1'] == df2.iloc[j]['name4']:
            df1.loc[i, 'ID'] = df2.iloc[j]['ID']
            break
        else:
            df1.loc[i, 'ID'] = ''
Expected result would be:
ID name1
0 '1' 'company-1'
1 '2' 'company2'
2 '' 'company 3'
It works, but it's extremely inefficient and can take hours. Can you please help me?
I'm sorry if the question doesn't meet the required criteria. It's my first time posting here. I would also love some tips on that.
This can be tackled in many ways. You can use a row-wise apply, convert the second frame into a mapping/lookup table (Python dict), or try joining the two frames. Here's an example of the latter:
import pandas as pd
# The given input data
data_1 = {"ID": ["", "", ""], "name1": ["company-1", "company2", "company 3"]}
data_2 = {"ID": ["1", "2"], "name2": ["company1", "company2"], "name3": ["company.1", "company.2"],
          "name4": ["company-1", "company-2"]}
df_1 = pd.DataFrame(data_1)
df_2 = pd.DataFrame(data_2)
# Changing the second frame into "long format" and only keeping the "ID" and "potential_matches" variables
unpivoted: pd.DataFrame = df_2.melt("ID", value_name="potential_matches")[["ID", "potential_matches"]]
# Merging and tidying up
expected = (df_1
            .merge(unpivoted, how="left", left_on=["name1"], right_on=["potential_matches"])
            .drop(columns=["ID_x", "potential_matches"])
            .rename(columns={"ID_y": "ID"})[["ID", "name1"]])
print(expected)
If performance is still a problem, you can try matching name1 on a multi-index of name2, name3, name4.
| ID | name1 |
|---|---|
| 1 | company-1 |
| 2 | company2 |
| nan | company 3 |
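For comparison, the mapping/lookup-table alternative mentioned above might look roughly like the sketch below. It is not part of the original answer: the names `long_names` and `name_to_id` are made up for illustration, and it assumes that when the same name appears under more than one ID, keeping the first occurrence is acceptable. It also fills unmatched rows with an empty string so the result matches the expected output in the question.

```python
import pandas as pd

# Same input data as in the question
df_1 = pd.DataFrame({"ID": ["", "", ""], "name1": ["company-1", "company2", "company 3"]})
df_2 = pd.DataFrame({
    "ID": ["1", "2"],
    "name2": ["company1", "company2"],
    "name3": ["company.1", "company.2"],
    "name4": ["company-1", "company-2"],
})

# Stack name2/name3/name4 into a single "name" column next to their ID
long_names = df_2.melt(id_vars="ID", value_name="name")

# Assumption: if a name occurs under several IDs, the first occurrence wins
long_names = long_names.drop_duplicates(subset="name", keep="first")

# Plain Python dict used as the lookup table: name -> ID
name_to_id = dict(zip(long_names["name"], long_names["ID"]))

# Look up every name1 in one vectorized pass; unmatched names get ""
df_1["ID"] = df_1["name1"].map(name_to_id).fillna("")
print(df_1)
```

Because each lookup is a dictionary access, this avoids the nested loop over both frames. Dropping duplicates first also sidesteps a side effect of the merge approach, where a name that matches several rows of the second frame would produce several output rows.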