Comparing one column in a dataframe with many columns in another dataframe (pandas)

Question

I have two dataframes:

df1:

     ID       name1
0    ''    'company-1'
1    ''    'company2'
2    ''    'company 3'

df2:

     ID      name2       name3        name4
0    '1'   'company1'  'company.1'  'company-1'
1    '2'   'company2'  'company.2'  'company-2'

I want to compare df1['name1'] to the name columns in df2 and put the matching ID from df2 into the ID column of df1.

I did this:

for i in range(len(df1)):
    for j in range(len(df2)):
        # Assign with .loc so the write reaches df1 itself rather than a temporary copy
        if df1.iloc[i]['name1'] == df2.iloc[j]['name2']:
            df1.loc[i, 'ID'] = df2.iloc[j]['ID']
            break
        elif df1.iloc[i]['name1'] == df2.iloc[j]['name3']:
            df1.loc[i, 'ID'] = df2.iloc[j]['ID']
            break
        elif df1.iloc[i]['name1'] == df2.iloc[j]['name4']:
            df1.loc[i, 'ID'] = df2.iloc[j]['ID']
            break
        else:
            df1.loc[i, 'ID'] = ''

The expected result would be:

     ID       name1
0    '1'    'company-1'
1    '2'    'company2'
2    ''    'company 3'

It works, but it's extremely inefficient and can take hours. Can you please help me?

I'm sorry if the question doesn't meet the required criteria. It's my first time posting here. I would also appreciate any tips on that.

Answer 1

This can be tackled in many ways. You can use a row-wise apply, convert the second frame into a mapping/lookup table (a Python dict), or try joining the two frames. Here's an example of the latter:

import pandas as pd

# The given input data
data_1 = {"ID": ["", "", ""], "name1": ["company-1", "company2", "company 3"]}
data_2 = {"ID"   : ["1", "2"], "name2": ["company1", "company2"], "name3": ["company.1", "company.2"],
          "name4": ["company-1", "company-2"]}

df_1 = pd.DataFrame(data_1)
df_2 = pd.DataFrame(data_2)

# Changing the second frame into "long format" and only keeping the "ID" and "potential_matches" variables
unpivoted: pd.DataFrame = df_2.melt("ID", value_name="potential_matches")[["ID", "potential_matches"]]

# Merging and tidying up
expected = (df_1
            .merge(unpivoted, how="left", left_on=["name1"], right_on=["potential_matches"])
            .drop(columns=["ID_x", "potential_matches"])
            .rename(columns={"ID_y": "ID"})[["ID", "name1"]])

print(expected)

If performance is still a problem, you can try matching name1 on a multi-index of name2, name3, name4.

Output

ID     name1
1      company-1
2      company2
nan    company 3
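
The mapping/lookup-table option mentioned at the start of this answer could look roughly like the sketch below. It is only an illustration, not a tested solution for your full data: the names long_form and lookup are made up here, and it assumes each name in df_2 belongs to a single ID (if a name appears under several IDs, the last one seen wins). Unmatched rows are filled with an empty string to mirror the expected result in the question.

```python
import pandas as pd

# Same sample data as in the question
df_1 = pd.DataFrame({"ID": ["", "", ""],
                     "name1": ["company-1", "company2", "company 3"]})
df_2 = pd.DataFrame({"ID": ["1", "2"],
                     "name2": ["company1", "company2"],
                     "name3": ["company.1", "company.2"],
                     "name4": ["company-1", "company-2"]})

# Stack all name columns of df_2 into one column, keeping the ID beside each name
long_form = df_2.melt(id_vars="ID", value_name="name")

# Build a plain dict: name -> ID (hypothetical helper name "lookup")
lookup = dict(zip(long_form["name"], long_form["ID"]))

# Look up every name1 in the dict; names without a match become NaN, then ""
df_1["ID"] = df_1["name1"].map(lookup).fillna("")

print(df_1)
```

This replaces the nested loops with one dict lookup per row of df_1. Whether it beats the merge on your data is something you would need to measure.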
