Best way to perform multiple Pandas lookups between two DataFrames

Question

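I am trying to perform a simple lookup between two DataFrames in Pandas. I have a main master DataFrame (left) and a lookup DataFrame (right). I want to join them on matching integer codes and return the item title from item_df.

I have seen a solution based on key-value pairs, but it seems cumbersome. My approach is to merge the DataFrames using col3 and name as the key columns and keep the value I want from the right frame, namely title. I then drop the key column I joined on (name), so that only the value remains. Now suppose I want to do this several times with my own manual naming convention. To do that, I use rename to name the merged value. I then repeat the merge and rename the next joined column to something like second_title (see the example below).

Is there a less tedious way to perform this repeated operation, without constantly dropping the extra merged column and renaming the new column between each merge step?

Example code below: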

import pandas as pd

master_dict: dict = {'col1': [3,4,8,10], 'col2': [5,6,9,10], 'col3': [50,55,59,60]}
master_df: pd.DataFrame = pd.DataFrame(master_dict)
item_dict: dict = {'name': [55,59,50,5,6,7], 'title': ['p1','p2','p3','p4','p5','p6']}
item_df: pd.DataFrame = pd.DataFrame(item_dict)
    
print(master_df.head())
   col1  col2  col3
0     3     5    50
1     4     6    55
2     8     9    59
3    10    10    60
print(item_df.head())
   name title
0    55    p1
1    59    p2
2    50    p3
3     5    p4
4     6    p5

# merge on col3 and name
combined_df = pd.merge(master_df, item_df, how = 'left', left_on = 'col3', right_on = 'name')
# rename title to "first_title"
combined_df.rename(columns = {'title':'first_title'}, inplace = True)
combined_df.drop(columns = ['name'], inplace = True) # remove 'name' column that was joined in from right frame
# repeat operation for "second_title"
combined_df = pd.merge(combined_df, item_df, how = 'left', left_on = 'col2', right_on = 'name')
combined_df.rename(columns = {'title': 'second_title'}, inplace = True)
combined_df.drop(columns = ['name'], inplace = True)
print(combined_df.head())
   col1  col2  col3 first_title second_title
0     3     5    50          p3           p4
1     4     6    55          p1           p5
2     8     9    59          p2          NaN
3    10    10    60         NaN          NaN

Answer 1

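We can use your key:value mapping together with the map function:

The code below builds two dictionaries of name: title pairs from item_df, one for the rows whose name appears in master_df's col3 and one for the rows whose name appears in col2.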

col3 = dict(zip(*(value for _, value in
                  item_df[item_df.name.isin(master_df.col3)].items()))
           )

col2 = dict(zip(*(value for _, value in
                 item_df[item_df.name.isin(master_df.col2)].items()))
           )


col3
{55: 'p1', 59: 'p2', 50: 'p3'}

col2
{5: 'p4', 6: 'p5'}

Next, use assign to create the first_title and second_title columns:

master_df.assign(
    first_title=master_df.col3.map(col3),
    second_title=master_df.col2.map(col2)
    )



   col1  col2  col3 first_title second_title
0     3     5    50          p3           p4
1     4     6    55          p1           p5
2     8     9    59          p2          NaN
3    10    10    60         NaN          NaN

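Update

I considered your comment about using a single dictionary, and it turns out this can be done with a Series. This greatly reduces the bloated code I shared earlier. Here we convert item_df to a Series indexed by name (note that the name item_df is rebound to that Series) and map it onto each relevant column: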

item_df = item_df.set_index("name").loc[:, "title"]

item_df

name
55    p1
59    p2
50    p3
5     p4
6     p5
7     p6
Name: title, dtype: object
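One caveat with this lookup Series: mapping generally relies on the index (the name codes) being unique, which holds for the sample data but may not hold for a real lookup table. The sketch below is not part of the original answer; it uses a hypothetical item table containing a repeated code and assumes that keeping the first title per code is acceptable.

```python
import pandas as pd

# Hypothetical lookup table in which the code 55 appears twice.
item_df = pd.DataFrame({'name': [55, 59, 50, 55],
                        'title': ['p1', 'p2', 'p3', 'p1-duplicate']})

# Assumption: the first title seen for each code wins.
lookup = (item_df.drop_duplicates(subset='name', keep='first')
                 .set_index('name')['title'])

# Confirm the index is unique before using the Series for mapping.
print(lookup.index.is_unique)

# Codes missing from the lookup (60 here) map to NaN.
codes = pd.Series([50, 55, 60], name='col3')
mapped = codes.map(lookup)
print(mapped)
```

If duplicates should instead be treated as a data error, checking lookup.index.is_unique and raising your own exception may be the safer choice.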

Now use the assign function to create your specific columns:

master_df.assign(first_title=master_df.col3.map(item_df), 
                 second_title=master_df.col2.map(item_df)
                 )

   col1  col2  col3 first_title second_title
0     3     5    50          p3           p4
1     4     6    55          p1           p5
2     8     9    59          p2          NaN
3    10    10    60         NaN          NaN

Much simpler and more straightforward.
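If the lookup has to be repeated for many columns, the same idea can be driven by a small specification instead of writing one keyword argument per column. This is only a sketch that builds on the approach above: the title_columns dictionary and the lookup variable are hypothetical names introduced for illustration, and the data is the sample from the question.

```python
import pandas as pd

master_df = pd.DataFrame({'col1': [3, 4, 8, 10],
                          'col2': [5, 6, 9, 10],
                          'col3': [50, 55, 59, 60]})
item_df = pd.DataFrame({'name': [55, 59, 50, 5, 6, 7],
                        'title': ['p1', 'p2', 'p3', 'p4', 'p5', 'p6']})

# Lookup Series: index is the integer code, values are the titles.
lookup = item_df.set_index('name')['title']

# Hypothetical spec: new column name -> master_df column holding the codes.
title_columns = {'first_title': 'col3', 'second_title': 'col2'}

# Build every new column in one assign call; unmatched codes become NaN.
result = master_df.assign(
    **{new: master_df[src].map(lookup) for new, src in title_columns.items()}
)
print(result)
```

Adding a third lookup then only requires one more entry in title_columns, with no merge, drop, or rename step.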

Solution 1
1, accepted, 2020-11-28 00:36:17
