How to subtract rows based on matching column in Pandas?
If I have two dataframes, like the ones in this example, created with:
import pandas as pd
from numpy.random import randint

df1 = pd.DataFrame({'A': randint(1,11,10), 'B': randint(10,100,10), 'C': randint(100,1000,10)})
df2 = pd.DataFrame({'A': randint(1,11,10), 'B': randint(10,100,10), 'C': randint(100,1000,10)})
df2 = df2.drop_duplicates(subset=['A'])
df1
A B C
0 2 96 826
1 1 64 601
2 1 27 343
3 5 65 600
4 10 68 658
5 6 81 895
6 5 73 440
7 4 54 865
8 1 24 597
9 10 66 928
df2
A B C
0 2 87 669
1 5 99 594
2 6 50 989
3 10 46 767
4 3 56 828
5 4 83 415
6 1 41 332
How can I subtract columns B (df1['B'] - df2['B']) only if the values from column A match? That way I can get a new column in df1 like:
9
23
-14
-34
22
31
-26
-29
-17
20
To get the values you want to subtract, take df1['A'] and map the values of df2['B'] to it by indexing df2['B'] with df2['A']:
df1['new'] = df1['B'] - df1['A'].map(df2.set_index('A')['B'])
The resulting output:
A B C new
0 2 96 826 9
1 1 64 601 23
2 1 27 343 -14
3 5 65 600 -34
4 10 68 658 22
5 6 81 895 31
6 5 73 440 -26
7 4 54 865 -29
8 1 24 597 -17
9 10 66 928 20
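As a reproducible check, the sketch below rebuilds both frames from the values printed in the question (instead of calling randint) and applies the same map-based lookup. The final lines use a key of 7, an assumption added here for illustration only, to show that a value of A with no match in df2 should come back as NaN rather than raise an error.

```python
import pandas as pd

# Frames rebuilt from the values printed in the question.
df1 = pd.DataFrame({
    'A': [2, 1, 1, 5, 10, 6, 5, 4, 1, 10],
    'B': [96, 64, 27, 65, 68, 81, 73, 54, 24, 66],
    'C': [826, 601, 343, 600, 658, 895, 440, 865, 597, 928],
})
df2 = pd.DataFrame({
    'A': [2, 5, 6, 10, 3, 4, 1],
    'B': [87, 99, 50, 46, 56, 83, 41],
    'C': [669, 594, 989, 767, 828, 415, 332],
})

# Lookup table: the index holds df2['A'], the values hold df2['B'].
lookup = df2.set_index('A')['B']

# For each row of df1, fetch the df2['B'] whose A matches, then subtract.
df1['new'] = df1['B'] - df1['A'].map(lookup)
print(df1)

# Illustrative assumption: 7 does not occur in df2['A'], so map() gives NaN.
unmatched = pd.Series([7]).map(lookup)
print(unmatched.isna().tolist())
```

This only works as a one-to-one lookup because df2 was de-duplicated on A; with repeated keys in df2 the mapping would be ambiguous.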
Edit
For smaller datasets, it may be slightly faster to supply a dictionary to map.
Timings on the example dataset:
%timeit df1.B - df1.A.map(df2.set_index('A').B)
%timeit df1.B - df1.A.map(dict(zip(df2.A, df2.B)))
%timeit df1.B - df1.A.map(dict(zip(df2.A.values, df2.B.values)))
1000 loops, best of 3: 718 µs per loop
1000 loops, best of 3: 492 µs per loop
1000 loops, best of 3: 459 µs per loop
For larger datasets, using the index method appears to be faster.
Larger dataset setup:
rows, a_max, b_max, c_max = 10**6, 5*10**4, 10**5, 10**5
df1 = pd.DataFrame({'A': randint(1, a_max, rows), 'B': randint(10, b_max, rows), 'C': randint(100, c_max, rows)})
df2 = pd.DataFrame({'A': randint(1, a_max, rows), 'B': randint(10, b_max, rows), 'C': randint(100, c_max, rows)})
df2 = df2.drop_duplicates(subset=['A'])
Timings on the larger dataset:
%timeit df1.B - df1.A.map(df2.set_index('A').B)
%timeit df1.B - df1.A.map(dict(zip(df2.A, df2.B)))
%timeit df1.B - df1.A.map(dict(zip(df2.A.values, df2.B.values)))
10 loops, best of 3: 114 ms per loop
10 loops, best of 3: 359 ms per loop
10 loops, best of 3: 354 ms per loop
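The %timeit magic is only available in IPython. Outside it, a rough comparison of the same three variants could be sketched with the standard-library timeit module, as below. The dataset size, the seed, and the repeat count are arbitrary assumptions made for this sketch, and absolute numbers will vary with the machine and the pandas version, so no timings are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
rows, a_max = 10_000, 500       # assumed sizes, smaller than in the answer

df1 = pd.DataFrame({'A': rng.integers(1, a_max, rows),
                    'B': rng.integers(10, 100, rows)})
df2 = pd.DataFrame({'A': rng.integers(1, a_max, rows),
                    'B': rng.integers(10, 100, rows)})
df2 = df2.drop_duplicates(subset=['A'])

# The three variants timed in the answer, wrapped as callables.
candidates = {
    'index':  lambda: df1.B - df1.A.map(df2.set_index('A').B),
    'dict':   lambda: df1.B - df1.A.map(dict(zip(df2.A, df2.B))),
    'values': lambda: df1.B - df1.A.map(dict(zip(df2.A.values, df2.B.values))),
}

# Compute each result once so the variants can be compared before timing.
results = {name: fn() for name, fn in candidates.items()}

runs = 20
for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=runs) / runs
    print(f'{name:>6}: {seconds * 1e3:.3f} ms per call')
```

Checking that the variants agree before timing them guards against comparing the speed of expressions that do different things.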
Try this:
In [61]: df1['new'] = df1.drop(columns='C').merge(df2.drop(columns='C'), on='A',
                                                  how='left', suffixes=('', '2')) \
                         .eval("new=B-B2", inplace=False)['new']
In [62]: df1
Out[62]:
A B C new
0 2 96 826 9
1 1 64 601 23
2 1 27 343 -14
3 5 65 600 -34
4 10 68 658 22
5 6 81 895 31
6 5 73 440 -26
7 4 54 865 -29
8 1 24 597 -17
9 10 66 928 20
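The same merge idea can also be written without eval. The sketch below uses the A and B values printed in the question and keeps the B2 suffix from the answer above. Selecting the two columns explicitly instead of dropping C, and passing validate='many_to_one' to check that A is unique in df2, are choices made for this sketch rather than part of the original answer.

```python
import pandas as pd

# Columns A and B rebuilt from the values printed in the question.
df1 = pd.DataFrame({
    'A': [2, 1, 1, 5, 10, 6, 5, 4, 1, 10],
    'B': [96, 64, 27, 65, 68, 81, 73, 54, 24, 66],
})
df2 = pd.DataFrame({
    'A': [2, 5, 6, 10, 3, 4, 1],
    'B': [87, 99, 50, 46, 56, 83, 41],
})

# A left merge keeps every df1 row in its original order; df2's B becomes B2.
# validate='many_to_one' raises if A is duplicated in df2.
merged = df1[['A', 'B']].merge(df2[['A', 'B']], on='A', how='left',
                               suffixes=('', '2'), validate='many_to_one')

# merge() returns a fresh 0..n-1 index, so assign by position via to_numpy().
df1['new'] = (merged['B'] - merged['B2']).to_numpy()
print(df1)
```

Assigning through to_numpy() avoids relying on index alignment, which matters if df1 does not have the default integer index.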