
Pandas - merging dataframes conditionally on multiple columns

I have 2 dataframes, and I want to take a column from one and create a new column in the second based on the values in multiple (other) columns.

First dataframe (df1):

df1 = pd.DataFrame({'cond': np.repeat([1,2], 5),
                    'point': np.tile(np.arange(1,6), 2),
                    'value1': np.random.rand(10),
                    'unused1': np.random.rand(10)})

   cond  point   unused1    value1
0     1      1  0.923699  0.103046
1     1      2  0.046528  0.188408
2     1      3  0.677052  0.481349
3     1      4  0.464000  0.807454
4     1      5  0.180575  0.962032
5     2      1  0.941624  0.437961
6     2      2  0.489738  0.026166
7     2      3  0.739453  0.109630
8     2      4  0.338997  0.415101
9     2      5  0.310235  0.660748

and the second (df2):

df2 = pd.DataFrame({'cond': np.repeat([1,2], 10),
                    'point': np.tile(np.arange(1,6), 4),
                    'value2': np.random.rand(20)})

    cond  point    value2
0      1      1  0.990252
1      1      2  0.534813
2      1      3  0.407325
3      1      4  0.969288
4      1      5  0.085832
5      1      1  0.922026
6      1      2  0.567615
7      1      3  0.174402
8      1      4  0.469556
9      1      5  0.511182
10     2      1  0.219902
11     2      2  0.761498
12     2      3  0.406981
13     2      4  0.551322
14     2      5  0.727761
15     2      1  0.075048
16     2      2  0.159903
17     2      3  0.726013
18     2      4  0.848213
19     2      5  0.284404

df1['value1'] contains a value for each combination of cond and point.

I want to create a new column (new_column) in df2 containing the values from df1['value1'], taken from the rows where cond and point match across the two dataframes.

So my desired output looks like this:

    cond  point    value2  new_column
0      1      1  0.990252    0.103046
1      1      2  0.534813    0.188408
2      1      3  0.407325    0.481349
3      1      4  0.969288    0.807454
4      1      5  0.085832    0.962032
5      1      1  0.922026    0.103046
6      1      2  0.567615    0.188408
7      1      3  0.174402    0.481349
8      1      4  0.469556    0.807454
9      1      5  0.511182    0.962032
10     2      1  0.219902    0.437961
11     2      2  0.761498    0.026166
12     2      3  0.406981    0.109630
13     2      4  0.551322    0.415101
14     2      5  0.727761    0.660748
15     2      1  0.075048    0.437961
16     2      2  0.159903    0.026166
17     2      3  0.726013    0.109630
18     2      4  0.848213    0.415101
19     2      5  0.284404    0.660748

In this example I could just use tile/repeat, but in reality df1['value1'] doesn't fit so neatly into the other dataframe, so I need to do it by matching on the cond and point columns.

I've tried merging them, but 1) the numbers don't seem to match up and 2) I don't want to bring over any unused columns from df1:

df1.merge(df2, left_on=['cond', 'point'], right_on=['cond', 'point'])

What's the correct way to add this new column without having to iterate through the two dataframes?

Option 1
For grace and speed with pure pandas, we can use lookup.
This will produce the same output as all the other options, shown below.

The idea is to represent the lookup data as a 2-D array and look up values with the row/column indices.

d1 = df1.set_index(['cond', 'point']).value1.unstack()
df2.assign(new_column=d1.lookup(df2.cond, df2.point))
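One caveat, assuming a recent pandas version: DataFrame.lookup was deprecated in pandas 1.2 and removed in later releases, so on newer versions the same 2-D lookup can be sketched with get_indexer and plain numpy indexing (an adaptation, not part of the original answer):

d1 = df1.set_index(['cond', 'point']).value1.unstack()
# translate each row's cond/point into positions in d1's index/columns,
# then index the underlying 2-D array with those positions
rows = d1.index.get_indexer(df2.cond)
cols = d1.columns.get_indexer(df2.point)
df2.assign(new_column=d1.to_numpy()[rows, cols])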

Option 2
We can do the same thing with numpy to improve performance, provided the values are laid out the same way they are in df1. This is very fast!

a = df1.value1.values.reshape(2, -1)
df2.assign(new_column=a[df2.cond.values - 1, df2.point.values - 1])
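Note that the reshape relies on cond and point being contiguous integers starting at 1 and on df1 being sorted by cond, then point. A hedged sketch that drops those assumptions by factorizing the keys first (it still assumes df1 has one row per (cond, point) pair and that every pair in df2 also appears in df1):

import numpy as np
import pandas as pd

# build a dense (cond x point) table of value1 from whatever codes appear in df1
cond_codes, cond_uniques = pd.factorize(df1.cond)
point_codes, point_uniques = pd.factorize(df1.point)
a = np.full((len(cond_uniques), len(point_uniques)), np.nan)
a[cond_codes, point_codes] = df1.value1.values

# translate df2's keys into positions in that table, then index it
rows = pd.Index(cond_uniques).get_indexer(df2.cond)
cols = pd.Index(point_uniques).get_indexer(df2.point)
df2.assign(new_column=a[rows, cols])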

Option 3
The canonical answer is to use merge with a left join.
But we'll need to prep df1 a bit to nail the output.

d1 = df1[['cond', 'point', 'value1']].rename(columns={'value1': 'new_column'})
df2.merge(d1, 'left')
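For clarity, the same call with the defaults spelled out; when on is omitted, merge joins on the intersection of the column names, which here is exactly cond and point:

# equivalent to df2.merge(d1, 'left'), with the join keys and join type explicit
df2.merge(d1, how='left', on=['cond', 'point'])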

Option 4
I thought this one was fun: build a mapping dictionary and a series to map over.
Good for small data, not so good for large data. See the timings below.

c1 = df1.cond.values.tolist()
p1 = df1.point.values.tolist()
v1 = df1.value1.values.tolist()
m = {(c, p): v for c, p, v in zip(c1, p1, v1)}

c2 = df2.cond.values.tolist()
p2 = df2.point.values.tolist()
i2 = df2.index.values.tolist()
s2 = pd.Series({i: (c, p) for i, c, p in zip(i2, c2, p2)})

df2.assign(new_column=s2.map(m))
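A terser way to build the same mapping, assuming each (cond, point) pair appears only once in df1 (a sketch of the same idea, not a separate option):

# dict keyed by (cond, point) tuples, built straight from an indexed Series
m = df1.set_index(['cond', 'point']).value1.to_dict()
keys = pd.Series(list(zip(df2.cond, df2.point)), index=df2.index)
df2.assign(new_column=keys.map(m))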

OUTPUT

    cond  point    value2  new_column
0      1      1  0.990252    0.103046
1      1      2  0.534813    0.188408
2      1      3  0.407325    0.481349
3      1      4  0.969288    0.807454
4      1      5  0.085832    0.962032
5      1      1  0.922026    0.103046
6      1      2  0.567615    0.188408
7      1      3  0.174402    0.481349
8      1      4  0.469556    0.807454
9      1      5  0.511182    0.962032
10     2      1  0.219902    0.437961
11     2      2  0.761498    0.026166
12     2      3  0.406981    0.109630
13     2      4  0.551322    0.415101
14     2      5  0.727761    0.660748
15     2      1  0.075048    0.437961
16     2      2  0.159903    0.026166
17     2      3  0.726013    0.109630
18     2      4  0.848213    0.415101
19     2      5  0.284404    0.660748

Timing
small data

%%timeit 
a = df1.value1.values.reshape(2, -1)
df2.assign(new_column=a[df2.cond.values - 1, df2.point.values - 1])
1000 loops, best of 3: 304 µs per loop

%%timeit
d1 = df1.set_index(['cond', 'point']).value1.unstack()
df2.assign(new_column=d1.lookup(df2.cond, df2.point))
100 loops, best of 3: 1.8 ms per loop

%%timeit
c1 = df1.cond.values.tolist()
p1 = df1.point.values.tolist()
v1 = df1.value1.values.tolist()
m = {(c, p): v for c, p, v in zip(c1, p1, v1)}

c2 = df2.cond.values.tolist()
p2 = df2.point.values.tolist()
i2 = df2.index.values.tolist()
s2 = pd.Series({i: (c, p) for i, c, p in zip(i2, c2, p2)})

df2.assign(new_column=s2.map(m))
1000 loops, best of 3: 719 µs per loop

%%timeit
d1 = df1[['cond', 'point', 'value1']].rename(columns={'value1': 'new_column'})
df2.merge(d1, 'left')
100 loops, best of 3: 2.04 ms per loop

%%timeit
df = pd.merge(df2, df1.drop('unused1', axis=1), 'left')
df.rename(columns={'value1': 'new_column'})
100 loops, best of 3: 2.01 ms per loop

%%timeit
df = df2.join(df1.drop('unused1', axis=1).set_index(['cond', 'point']), on=['cond', 'point'])
df.rename(columns={'value1': 'new_column'})
100 loops, best of 3: 2.15 ms per loop

large data

df2 = pd.concat([df2] * 10000, ignore_index=True)

%%timeit 
a = df1.value1.values.reshape(2, -1)
df2.assign(new_column=a[df2.cond.values - 1, df2.point.values - 1])
1000 loops, best of 3: 1.93 ms per loop

%%timeit
d1 = df1.set_index(['cond', 'point']).value1.unstack()
df2.assign(new_column=d1.lookup(df2.cond, df2.point))
100 loops, best of 3: 5.58 ms per loop

%%timeit
c1 = df1.cond.values.tolist()
p1 = df1.point.values.tolist()
v1 = df1.value1.values.tolist()
m = {(c, p): v for c, p, v in zip(c1, p1, v1)}

c2 = df2.cond.values.tolist()
p2 = df2.point.values.tolist()
i2 = df2.index.values.tolist()
s2 = pd.Series({i: (c, p) for i, c, p in zip(i2, c2, p2)})

df2.assign(new_column=s2.map(m))
10 loops, best of 3: 135 ms per loop

%%timeit
d1 = df1[['cond', 'point', 'value1']].rename(columns={'value1': 'new_column'})
df2.merge(d1, 'left')
100 loops, best of 3: 13.4 ms per loop

%%timeit
df = pd.merge(df2, df1.drop('unused1', axis=1), 'left')
df.rename(columns={'value1': 'new_column'})
10 loops, best of 3: 19.8 ms per loop

%%timeit
df = df2.join(df1.drop('unused1', axis=1).set_index(['cond', 'point']), on=['cond', 'point'])
df.rename(columns={'value1': 'new_column'})
100 loops, best of 3: 18.2 ms per loop

You can use merge with a left join, drop to remove the unused1 column, and finally rename the column:

Note: the on parameter can be omitted when the only columns shared by both DataFrames are the join columns. If they share more column names, add on=['cond', 'point'] (an explicit version is shown after the output below).

df = pd.merge(df2, df1.drop('unused1', axis=1), 'left')
df = df.rename(columns={'value1': 'new_column'})
print (df)
    cond  point    value2  new_column
0      1      1  0.990252    0.103046
1      1      2  0.534813    0.188408
2      1      3  0.407325    0.481349
3      1      4  0.969288    0.807454
4      1      5  0.085832    0.962032
5      1      1  0.922026    0.103046
6      1      2  0.567615    0.188408
7      1      3  0.174402    0.481349
8      1      4  0.469556    0.807454
9      1      5  0.511182    0.962032
10     2      1  0.219902    0.437961
11     2      2  0.761498    0.026166
12     2      3  0.406981    0.109630
13     2      4  0.551322    0.415101
14     2      5  0.727761    0.660748
15     2      1  0.075048    0.437961
16     2      2  0.159903    0.026166
17     2      3  0.726013    0.109630
18     2      4  0.848213    0.415101
19     2      5  0.284404    0.660748
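The same merge with the join keys and join type written out explicitly, as mentioned in the note above:

df = pd.merge(df2, df1.drop('unused1', axis=1), on=['cond', 'point'], how='left')
df = df.rename(columns={'value1': 'new_column'})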

Another solution uses join (which defaults to a left join) with set_index + drop; a variant that renames before joining is shown after the output:

df = df2.join(df1.drop('unused1', axis=1).set_index(['cond', 'point']), on=['cond', 'point'])
df = df.rename(columns={'value1': 'new_column'})
print (df)
    cond  point    value2  new_column
0      1      1  0.990252    0.103046
1      1      2  0.534813    0.188408
2      1      3  0.407325    0.481349
3      1      4  0.969288    0.807454
4      1      5  0.085832    0.962032
5      1      1  0.922026    0.103046
6      1      2  0.567615    0.188408
7      1      3  0.174402    0.481349
8      1      4  0.469556    0.807454
9      1      5  0.511182    0.962032
10     2      1  0.219902    0.437961
11     2      2  0.761498    0.026166
12     2      3  0.406981    0.109630
13     2      4  0.551322    0.415101
14     2      5  0.727761    0.660748
15     2      1  0.075048    0.437961
16     2      2  0.159903    0.026166
17     2      3  0.726013    0.109630
18     2      4  0.848213    0.415101
19     2      5  0.284404    0.660748
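A small variant of the same join that renames value1 before joining, so the separate rename step is not needed (same assumptions as above):

df = df2.join(df1.drop('unused1', axis=1)
                 .rename(columns={'value1': 'new_column'})
                 .set_index(['cond', 'point']),
              on=['cond', 'point'])
print (df)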
