How to map series in different DataFrames
I have two DataFrames: one holds the bulk of a dataset, and the second holds some additional data that I obtained later.
Given the example below, I want to replace the values stored in df_main.b with the values found in df_additional.b, using the order_id column, which is present in both DataFrames, to decide which value goes where.
In [385]: df_main = pd.DataFrame({'order_id':['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7'], 'b':[1,2,3,4,5,6,7], 'c':np.random.randn(7), 'd':np.random.randn(7)})
In [386]: df_additional = pd.DataFrame({'order_id':['A1', 'A2', 'A3', 'A5', 'A6', 'A7', 'A8'], 'b':['a','b','c','d','e','f','g']})
In [387]: df_main
Out[387]:
b c d order_id
0 1 0.460474 -1.092239 A1
1 2 0.872538 1.819610 A2
2 3 -0.343626 -2.493006 A3
3 4 0.489427 0.074341 A4
4 5 -1.690572 0.162746 A5
5 6 -0.851540 0.543129 A6
6 7 -0.559258 -0.170457 A7
In [388]: df_additional
Out[388]:
b order_id
0 a A1
1 b A2
2 c A3
3 d A5
4 e A6
5 f A7
6 g A8
Notice that the values in df_main.order_id are not the same as those in df_additional.order_id.
I would like df_main.b to become np.nan for orders that are present in df_main but not in df_additional (e.g. 'A4', so df_main['b'][3] should become np.nan).
I would also like all orders that are present in df_additional but not in df_main to be ignored: nothing new should be added to df_main.
The final output should be:
>>> final_version
b c d order_id
0 a 0.460474 -1.092239 A1
1 b 0.872538 1.819610 A2
2 c -0.343626 -2.493006 A3
3 NaN 0.489427 0.074341 A4
4 d -1.690572 0.162746 A5
5 e -0.851540 0.543129 A6
6 f -0.559258 -0.170457 A7
Thanks for helping.
Edit: I have already tried np.where(), with the following results:
In [389]: df_main.b = np.where(df_main.order_id == df_additional.order_id, df_additional.b, np.nan)
In [390]: df_main
Out[390]:
b c d order_id
0 a 0.460474 -1.092239 A1
1 b 0.872538 1.819610 A2
2 c -0.343626 -2.493006 A3
3 NaN 0.489427 0.074341 A4
4 NaN -1.690572 0.162746 A5
5 NaN -0.851540 0.543129 A6
6 NaN -0.559258 -0.170457 A7
Things go fine up to a point, but the comparison appears to be made elementwise, row against row: it fails at the fourth row ('A4' != 'A5'), and from that point on every comparison fails as well.
Is it possible to use some form of isin for all order_id values in df_main, get the index of each match, and retrieve the b value at that index?
You are looking for merge:
pd.merge(df_additional, df_main, how='right', on='order_id')
#Out[13]:
# b_x order_id b_y c d
#0 a A1 1 -2.532221 0.702512
#1 b A2 2 2.550224 -0.649286
#2 c A3 3 0.737817 0.999865
#3 d A5 5 -0.484483 1.153589
#4 e A6 6 0.526035 0.335695
#5 f A7 7 -0.901915 -1.312429
#6 NaN A4 4 -0.905911 0.865345
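The merge above keeps both b columns (b_x and b_y) and changes the row order. As a minimal sketch of one way to reach the final_version asked for in the question, assuming order_id is unique in df_additional, the old b column can be dropped before a left merge. The name final_version is taken from the question; everything else is rebuilt from the question's sample data.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
df_main = pd.DataFrame({
    'order_id': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7'],
    'b': [1, 2, 3, 4, 5, 6, 7],
    'c': np.random.randn(7),
    'd': np.random.randn(7),
})
df_additional = pd.DataFrame({
    'order_id': ['A1', 'A2', 'A3', 'A5', 'A6', 'A7', 'A8'],
    'b': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
})

# Drop the old b first so the merge does not produce b_x / b_y.
# how='left' keeps every row of df_main, in its original order,
# and ignores orders that exist only in df_additional ('A8').
final_version = df_main.drop(columns='b').merge(
    df_additional, on='order_id', how='left'
)

print(final_version)
```

With how='left', an order such as 'A4' that has no match in df_additional gets NaN in b. The column order differs from the question's printout, which only affects display.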
You can use join() if you make an index from the order_id column in df_additional:
df_additional.set_index('order_id', inplace=True)
# both frames have a column named b, so join() needs a suffix to tell them apart
df_main.join(df_additional, on='order_id', how='left', rsuffix='_add')
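If the goal is to replace b rather than keep both versions side by side, a possible variation is to drop the old column before joining. This is only a sketch on the question's sample data. The names lookup and joined are illustrative, and set_index is used without inplace=True so that df_additional itself is left unchanged.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
df_main = pd.DataFrame({
    'order_id': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7'],
    'b': [1, 2, 3, 4, 5, 6, 7],
    'c': np.random.randn(7),
    'd': np.random.randn(7),
})
df_additional = pd.DataFrame({
    'order_id': ['A1', 'A2', 'A3', 'A5', 'A6', 'A7', 'A8'],
    'b': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
})

# Lookup table indexed by order_id (illustrative name).
lookup = df_additional.set_index('order_id')

# Drop the old b to avoid the column-name clash, then match the
# order_id column of df_main against the index of the lookup table.
joined = df_main.drop(columns='b').join(lookup, on='order_id', how='left')

print(joined)
```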
Or, if you can make order_id the index on both sides, a simple Series assignment works, because pandas aligns the assigned values on the index:
df_main.set_index('order_id', inplace=True)
df_additional.set_index('order_id', inplace=True)
df_main['b_add'] = df_additional['b']
If you need an example for the second case, see "10 Minutes to pandas".
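On the isin idea raised in the question: a closely related option that leaves both indexes untouched is Series.map with a lookup Series. The sketch below assumes order_id is unique in df_additional (map raises an error on a lookup Series with duplicate index values). The names b_by_order and matched are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
df_main = pd.DataFrame({
    'order_id': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7'],
    'b': [1, 2, 3, 4, 5, 6, 7],
    'c': np.random.randn(7),
    'd': np.random.randn(7),
})
df_additional = pd.DataFrame({
    'order_id': ['A1', 'A2', 'A3', 'A5', 'A6', 'A7', 'A8'],
    'b': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
})

# Lookup Series: index is order_id, values are the new b.
b_by_order = df_additional.set_index('order_id')['b']

# isin only reports which orders have a match ...
matched = df_main['order_id'].isin(df_additional['order_id'])

# ... while map looks each order_id up by value, not by row position,
# and yields NaN where there is no match (here 'A4').
df_main['b'] = df_main['order_id'].map(b_by_order)

print(df_main)
```

Because the lookup is by value, the row-position mismatch that broke the np.where() attempt does not arise, and 'A8' is never consulted, so nothing new is added to df_main.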