Join two CSV files based on column contents in Pandas

Question

I have two large CSV files. Sample data is shown below:

df1 = 
Index    Fruit   Vegetable    
    0    Mango   Spinach
    1    Berry   Carrot
    2    Banana  Cabbage

df2 = 
Index   Unit        Price
   0    Mango       30
   1    Artichoke   45
   2    Banana      12
   3    Berry       10
   4    Cabbage     25
   5    Rice        40
   6    Spinach     34
   7    Carrot      08
   8    Lentil      12
   9    Pot         32

I want to create the following dataframe:

df3 = 
Index    Fruit   Price      Vegetable    Price   
    0    Mango   30         Spinach      34
    1    Berry   10         Carrot       08   
    2    Banana  12         Cabbage      25

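I want to compare the prices of the two units in each row of df1. If the prices are within $5 of each other, I want to output those rows to a separate dataframe, like this: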

df4 = 
Index    Fruit   Price      Vegetable    Price   
    0    Mango   30         Spinach      34
    1    Berry   10         Carrot       08

What is a general way to achieve this? Thanks in advance.

Answer 1

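You can use replace to build a dataframe of prices from df2, then join to attach it to the original data.

Note that duplicate column names are discouraged, so the price columns get a _Price suffix here: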

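# lookup table mapping each unit name to its price (print it to see what it does)
item_prices = dict(zip(df2.Unit, df2.Price))

# replace every name in df1 by its price, suffix the columns, and join back to df1
out = df1.join(df1.replace(item_prices).add_suffix('_Price')).sort_index(axis=1)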

Output:

        Fruit  Fruit_Price Vegetable  Vegetable_Price
Index                                                
0       Mango           30   Spinach               34
1       Berry           10    Carrot                8
2      Banana           12   Cabbage               25
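
As a self-contained illustration of the lookup idea above, here is a minimal sketch that rebuilds the sample data inline and uses Series.map instead of replace. This swaps in a different technique from the one in the answer: with map, a name that is missing from df2 becomes NaN instead of staying as its original string. The inline construction of df1 and df2 (with Index as the row index) is an assumption about how the CSV files are loaded.

```python
import pandas as pd

# Sample data from the question, built inline (assumption: Index is the row index).
df1 = pd.DataFrame(
    {"Fruit": ["Mango", "Berry", "Banana"],
     "Vegetable": ["Spinach", "Carrot", "Cabbage"]}
).rename_axis("Index")
df2 = pd.DataFrame(
    {"Unit": ["Mango", "Artichoke", "Banana", "Berry", "Cabbage",
              "Rice", "Spinach", "Carrot", "Lentil", "Pot"],
     "Price": [30, 45, 12, 10, 25, 40, 34, 8, 12, 32]}
)

# Lookup table: unit name -> price.
item_prices = dict(zip(df2["Unit"], df2["Price"]))

# Map every column of df1 through the lookup table (unmatched names become NaN).
prices = df1.apply(lambda col: col.map(item_prices)).add_suffix("_Price")

# Attach the price columns and sort so each price sits next to its item column.
out = df1.join(prices).sort_index(axis=1)
print(out)
```

Whether NaN for unmatched names is preferable to keeping the original string depends on the data; for the sample above both approaches yield the same prices.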

For the second part of the question, you need boolean indexing:

out[abs(out['Fruit_Price'] - out['Vegetable_Price']) < 5]

Or query:

out.query('abs(Fruit_Price-Vegetable_Price)<5')

Output:

       Fruit  Fruit_Price Vegetable  Vegetable_Price
Index                                               
0      Mango           30   Spinach               34
1      Berry           10    Carrot                8
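
The question says "within $5", which leaves open whether a difference of exactly 5 should count. The sketch below shows both readings side by side; the names threshold, df4_strict and df4_inclusive are hypothetical, and the out frame is rebuilt inline from the output shown earlier so the block runs on its own.

```python
import pandas as pd

# Rebuild the joined frame from the output shown above (assumed shape).
out = pd.DataFrame(
    {"Fruit": ["Mango", "Berry", "Banana"],
     "Fruit_Price": [30, 10, 12],
     "Vegetable": ["Spinach", "Carrot", "Cabbage"],
     "Vegetable_Price": [34, 8, 25]}
).rename_axis("Index")

threshold = 5  # hypothetical name for the allowed price gap

# Absolute price difference per row.
diff = (out["Fruit_Price"] - out["Vegetable_Price"]).abs()

df4_strict = out[diff < threshold]      # a gap of exactly 5 is excluded
df4_inclusive = out[diff <= threshold]  # a gap of exactly 5 is included
print(df4_strict)
```

For the sample data the two masks select the same rows, since no pair differs by exactly 5.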

Answer 2

You can use a double merge:

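import pandas as pd

# look up the price of each fruit and of each vegetable separately
fruit = df1[['Fruit']].merge(df2.rename(columns={'Unit': 'Fruit'}), on='Fruit')
veggie = df1[['Vegetable']].merge(df2.rename(columns={'Unit': 'Vegetable'}), on='Vegetable')

# put the two results side by side
df3 = pd.concat([fruit, veggie], axis=1)
print(df3)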

# Output:
    Fruit  Price Vegetable  Price
0   Mango     30   Spinach     34
1   Berry     10    Carrot      8
2  Banana     12   Cabbage     25
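
One thing to watch with this pattern: the default inner merge drops any row whose name is missing from df2, so the two halves could end up with different lengths and be misaligned by concat. A minimal sketch of a more defensive variant, assuming the Unit values in df2 are unique, uses how='left' and gives the price columns distinct names. The names Fruit_Price and Vegetable_Price and the inline sample data are assumptions for illustration.

```python
import pandas as pd

# Sample data from the question, built inline (assumption about the CSV contents).
df1 = pd.DataFrame(
    {"Fruit": ["Mango", "Berry", "Banana"],
     "Vegetable": ["Spinach", "Carrot", "Cabbage"]}
)
df2 = pd.DataFrame(
    {"Unit": ["Mango", "Artichoke", "Banana", "Berry", "Cabbage",
              "Rice", "Spinach", "Carrot", "Lentil", "Pot"],
     "Price": [30, 45, 12, 10, 25, 40, 34, 8, 12, 32]}
)

# A left merge keeps every row of df1, in order; unmatched names get NaN prices.
fruit = df1[["Fruit"]].merge(
    df2.rename(columns={"Unit": "Fruit", "Price": "Fruit_Price"}),
    on="Fruit", how="left",
)
veggie = df1[["Vegetable"]].merge(
    df2.rename(columns={"Unit": "Vegetable", "Price": "Vegetable_Price"}),
    on="Vegetable", how="left",
)

# Both halves now have one row per row of df1, so they line up side by side.
df3 = pd.concat([fruit, veggie], axis=1)
print(df3)
```

With distinct column names, the price difference can be computed directly as df3['Fruit_Price'] - df3['Vegetable_Price'].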

Then:

import numpy as np

# df3 has two columns named 'Price'; subtract one from the other row by row
df4 = df3[np.abs(np.subtract(*df3['Price'].values.T)) <= 5]
print(df4)

# Output:
   Fruit  Price Vegetable  Price
0  Mango     30   Spinach     34
1  Berry     10    Carrot      8

Answer 3

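A generic alternative (which can handle any number of categories) is to reshape before (with melt) and after (with pivot). This has the advantage of creating a MultiIndex that makes it very easy to identify the price of each category explicitly: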

out = (df1.melt(id_vars='Index', value_name='Unit')
          .merge(df2.drop(columns='Index'), on='Unit')
          .pivot(index='Index', columns='variable', values=['Unit', 'Price'])
       )

Output:

            Unit           Price          
variable   Fruit Vegetable Fruit Vegetable
Index                                     
0          Mango   Spinach    30        34
1          Berry    Carrot    10         8
2         Banana   Cabbage    12        25

Subset the rows where the difference is ≤ 5:

out[out['Price'].diff(axis=1).abs().le(5).any(axis=1)]

Output:

           Unit           Price          
variable  Fruit Vegetable Fruit Vegetable
Index                                    
0         Mango   Spinach    30        34
1         Berry    Carrot    10         8
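
For reference, the whole reshape pipeline can be put together as the following self-contained sketch. It assumes Index is an ordinary column of df1 (as melt requires here) and that every name in df1 has a price in df2. The cast with astype(int) is an addition for illustration: pivoting text and numbers together can leave the Price block with a generic object dtype, and the cast would fail if any price were missing.

```python
import pandas as pd

# Sample data from the question (assumption: Index is a regular column of df1).
df1 = pd.DataFrame(
    {"Index": [0, 1, 2],
     "Fruit": ["Mango", "Berry", "Banana"],
     "Vegetable": ["Spinach", "Carrot", "Cabbage"]}
)
df2 = pd.DataFrame(
    {"Unit": ["Mango", "Artichoke", "Banana", "Berry", "Cabbage",
              "Rice", "Spinach", "Carrot", "Lentil", "Pot"],
     "Price": [30, 45, 12, 10, 25, 40, 34, 8, 12, 32]}
)

# Long format: one row per (Index, category, unit name).
long_df = df1.melt(id_vars="Index", value_name="Unit")

# Attach prices, then pivot back to one row per Index with MultiIndex columns.
out = (long_df.merge(df2, on="Unit")
              .pivot(index="Index", columns="variable", values=["Unit", "Price"]))

# Work on a numeric copy of the Price block.
prices = out["Price"].astype(int)

# Keep the rows whose two prices differ by at most 5.
mask = (prices["Fruit"] - prices["Vegetable"]).abs().le(5)
df4 = out[mask]
print(df4)
```

With more than two categories, a spread such as prices.max(axis=1) - prices.min(axis=1) may be a more natural criterion than a pairwise difference.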
