
How to join two dataframes for which column values are within a certain range, for multiple columns, using pandas?

Referring to the same question asked here (join dataframes for single column), I now want to extend it to two more columns, for example:

df1:

price_start  price_end  year_start  year_end  score
         10         50        2001      2005     20
         60        100        2001      2005     50
         10         50        2006      2010     30

df2:

Price  year
   10  2001
   70  2002
   50  2010

Now I want to map the score from df1 onto the corresponding rows of df2.

Expected output:

price  year  score
   10  2001     20
   70  2002     50
   50  2010     30
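
For reference, the two example frames above can be constructed as follows (a minimal setup sketch, assuming plain integer columns, so the snippets below run as-is):

import pandas as pd

df1 = pd.DataFrame({
    'price_start': [10, 60, 10],
    'price_end':   [50, 100, 50],
    'year_start':  [2001, 2001, 2006],
    'year_end':    [2005, 2005, 2010],
    'score':       [20, 50, 30],
})

df2 = pd.DataFrame({
    'Price': [10, 70, 50],
    'year':  [2001, 2002, 2010],
})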

Solution 1: Simple solution for small datasets

For a small dataset, you can cross join df1 and df2 via .merge(), then filter with .query() using the conditions that Price is within the price range and year is within the year range, as follows:

(df1.merge(df2, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)

If your Pandas version is older than 1.2.0 (released December 2020) and does not support merging with how='cross', you can use:

(df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)

Result:

   Price  year  score
0     10  2001     20
4     70  2002     50
8     50  2010     30
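
The surviving index labels (0, 4, 8) come from the cross join. If you prefer a clean 0..n-1 index, you can chain .reset_index(drop=True) at the end; a small optional tweak, not part of the answer above:

(df1.merge(df2, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
    .reset_index(drop=True)
)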

Solution 2: Numpy solution for large datasets

For large datasets where performance is a concern, you can use numpy broadcasting (instead of cross join and filter) to speed up the execution time.

We look for rows where the Price in df2 is within the price range of df1, and the year in df2 is within the year range of df1:

import numpy as np

d2_P = df2.Price.values
d2_Y = df2.year.values

d1_PS = df1.price_start.values
d1_PE = df1.price_end.values
d1_YS = df1.year_start.values
d1_YE = df1.year_end.values

# i indexes matching rows of df2, j indexes matching rows of df1
i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))

pd.DataFrame(
    np.column_stack([df1.values[j], df2.values[i]]),
    columns=df1.columns.append(df2.columns)
)[['Price', 'year', 'score']]

Result:

   Price  year  score
0     10  2001     20
1     70  2002     50
2     50  2010     30
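
If you plan to reuse the broadcasting approach, it can be wrapped in a small helper; this is only a sketch, with the function name and signature chosen for illustration, and it assumes the column layout used in this question. Note that the boolean mask has shape (len(points), len(ranges)), so memory grows with the product of the two frame sizes.

import numpy as np
import pandas as pd

def range_join(ranges, points):
    # ranges: columns price_start, price_end, year_start, year_end, score (like df1)
    # points: columns Price, year (like df2)
    p = points['Price'].to_numpy()[:, None]
    y = points['year'].to_numpy()[:, None]
    mask = ((p >= ranges['price_start'].to_numpy()) & (p <= ranges['price_end'].to_numpy()) &
            (y >= ranges['year_start'].to_numpy()) & (y <= ranges['year_end'].to_numpy()))
    i, j = np.where(mask)  # i -> rows of points, j -> rows of ranges
    return pd.DataFrame({
        'Price': points['Price'].to_numpy()[i],
        'year':  points['year'].to_numpy()[i],
        'score': ranges['score'].to_numpy()[j],
    })

# Usage: range_join(df1, df2)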

Performance comparison

Part 1: Comparing the original datasets with 3 rows each:

Solution 1:

%%timeit
(df1.merge(df2, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)

5.91 ms ± 87.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Solution 2:

%%timeit
d2_P = df2.Price.values
d2_Y = df2.year.values

d1_PS = df1.price_start.values
d1_PE = df1.price_end.values
d1_YS = df1.year_start.values
d1_YE = df1.year_end.values

i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))

pd.DataFrame(
    np.column_stack([df1.values[j], df2.values[i]]),
    columns=df1.columns.append(df2.columns)
)[['Price', 'year', 'score']]

703 µs ± 9.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Benchmark summary: 5.91 ms vs 703 µs, i.e. 8.4x faster

Part 2: Comparing datasets with 3,000 and 30,000 rows:

Data setup:

df1a = pd.concat([df1] * 1000, ignore_index=True)
df2a = pd.concat([df2] * 10000, ignore_index=True)

Solution 1:

%%timeit
(df1a.merge(df2a, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)

27.5 s ± 3.24 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution 2:

%%timeit
d2_P = df2a.Price.values
d2_Y = df2a.year.values

d1_PS = df1a.price_start.values
d1_PE = df1a.price_end.values
d1_YS = df1a.year_start.values
d1_YE = df1a.year_end.values

i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))

pd.DataFrame(
    np.column_stack([df1a.values[j], df2a.values[i]]),
    columns=df1a.columns.append(df2a.columns)
)[['Price', 'year', 'score']]

3.83 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Benchmark summary: 27.5 s vs 3.83 s, i.e. 7.2x faster

Another option is to use conditional_join from pyjanitor, which is also efficient for range joins and performs better than a naive cross join:

# pip install pyjanitor
# you can also install the dev version for the latest
# including the ability to use numba for faster performance
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git

import janitor
import pandas as pd

(df1
.conditional_join(
    df2, 
    ('price_start', 'Price', '<='), 
    ('price_end', 'Price', '>='), 
    ('year_start', 'year', '<='), 
    ('year_end', 'year', '>='))
.loc(axis=1)['Price','year','score']
)
   Price  year  score
0     10  2001     20
1     70  2002     50
2     50  2010     30

With the dev version, you can also select columns:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git

import janitor
import pandas as pd

(df1
.conditional_join(
    df2, 
    ('price_start', 'Price', '<='), 
    ('price_end', 'Price', '>='), 
    ('year_start', 'year', '<='), 
    ('year_end', 'year', '>='),
    use_numba = False,
    right_columns = ['Price', 'year'],
    df_columns = 'score')
)
   score  Price  year
0     20     10  2001
1     50     70  2002
2     30     50  2010

For the dev version, if numba is installed, you can turn on use_numba for even better performance.
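
As a sketch of that switch (assuming the dev version of pyjanitor and a working numba installation), the only change to the call above is the flag:

(df1
.conditional_join(
    df2,
    ('price_start', 'Price', '<='),
    ('price_end', 'Price', '>='),
    ('year_start', 'year', '<='),
    ('year_end', 'year', '>='),
    use_numba = True,          # assumes numba is installed
    right_columns = ['Price', 'year'],
    df_columns = 'score')
)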
