
How to map series in different DataFrames

I have two dataframes: one holds the bulk of a dataset, and the second holds some additional data that I obtained later.

Given the example below, I want to replace the values stored in df_main.b with the values found in df_additional.b, matching rows through the order_id column, which is present in both dataframes.

In [385]: df_main = pd.DataFrame({'order_id':['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7'], 'b':[1,2,3,4,5,6,7], 'c':np.random.randn(7), 'd':np.random.randn(7)})

In [386]: df_additional = pd.DataFrame({'order_id':['A1', 'A2', 'A3', 'A5', 'A6', 'A7', 'A8'], 'b':['a','b','c','d','e','f','g']})

In [387]: df_main
Out[387]: 
   b         c         d order_id
0  1  0.460474 -1.092239       A1
1  2  0.872538  1.819610       A2
2  3 -0.343626 -2.493006       A3
3  4  0.489427  0.074341       A4
4  5 -1.690572  0.162746       A5
5  6 -0.851540  0.543129       A6
6  7 -0.559258 -0.170457       A7

In [388]: df_additional
Out[388]: 
   b order_id
0  a       A1
1  b       A2
2  c       A3
3  d       A5
4  e       A6
5  f       A7
6  g       A8

Notice how the values in df_main.order_id are not the same as those in df_additional.order_id.

I would like df_main.b to become np.nan for orders that are present in df_main but not in df_additional (e.g. 'A4', so df_main['b'][3] should become np.nan).

Orders that are present in df_additional but not in df_main should be ignored; nothing new should be added to df_main.

The final output should be:

>>> final_version
   b            c         d order_id
0  a     0.460474 -1.092239       A1
1  b     0.872538  1.819610       A2
2  c    -0.343626 -2.493006       A3
3  NaN   0.489427  0.074341       A4
4  d    -1.690572  0.162746       A5
5  e    -0.851540  0.543129       A6
6  f    -0.559258 -0.170457       A7

Thanks for helping.

Edit: I have already tried np.where(), with the following results:

In [389]: df_main.b = np.where(df_main.order_id == df_additional.order_id, df_additional.b, np.nan)

In [390]: df_main
Out[390]: 
     b         c         d order_id
0    a  0.460474 -1.092239       A1
1    b  0.872538  1.819610       A2
2    c -0.343626 -2.493006       A3
3  NaN  0.489427  0.074341       A4
4  NaN -1.690572  0.162746       A5
5  NaN -0.851540  0.543129       A6
6  NaN -0.559258 -0.170457       A7

This works up to a point, but the comparison is made elementwise, so it fails at the first mismatch ('A4' != 'A5'), and every comparison after that fails as well. Is it possible to use some form of isin for all order_id values in df_main, get the index, and for that index retrieve the b value?

You are looking for merge:

pd.merge(df_additional, df_main, how='right', on='order_id')

#Out[13]:
#   b_x order_id  b_y         c         d
#0    a       A1    1 -2.532221  0.702512
#1    b       A2    2  2.550224 -0.649286
#2    c       A3    3  0.737817  0.999865
#3    d       A5    5 -0.484483  1.153589
#4    e       A6    6  0.526035  0.335695
#5    f       A7    7 -0.901915 -1.312429
#6  NaN       A4    4 -0.905911  0.865345
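The merge above leaves two columns (b_x and b_y) and, in the output shown, moves 'A4' to the end. As a minimal sketch of one way to get the exact layout asked for in the question, the old b column can be dropped before a left merge. The variable name final_version comes from the question; the rest is an illustration under the assumption that order_id is unique in df_additional.

```python
import numpy as np
import pandas as pd

df_main = pd.DataFrame({
    'order_id': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7'],
    'b': [1, 2, 3, 4, 5, 6, 7],
    'c': np.random.randn(7),
    'd': np.random.randn(7),
})
df_additional = pd.DataFrame({
    'order_id': ['A1', 'A2', 'A3', 'A5', 'A6', 'A7', 'A8'],
    'b': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
})

# Drop the old 'b' first so the merge does not create b_x / b_y columns.
# how='left' keeps every row of df_main in its original order and ignores
# order_ids that exist only in df_additional (here 'A8').
final_version = df_main.drop(columns='b').merge(
    df_additional, on='order_id', how='left'
)
print(final_version)
```

Rows of df_main without a match (here 'A4') end up with a missing value in b, and no new rows are added.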

You can use join() if you make an index from the order_id column in df_additional:

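df_additional.set_index('order_id', inplace=True)
# 'b' exists in both frames, so join() needs a suffix to avoid a column-overlap error
df_main.join(df_additional, on='order_id', how='left', rsuffix='_add')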

Or, if you can make order_id the index on both sides, a simple Series assignment works:

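df_main.set_index('order_id', inplace=True)
df_additional.set_index('order_id', inplace=True)
# Assignment aligns on the index: ids missing from df_additional (e.g. 'A4') become NaN, extra ids ('A8') are ignored
df_main['b_add'] = df_additional['b']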

For an example of the second case, see "10 Minutes to pandas" in the pandas documentation.
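The same index-aligned lookup can also be written without changing the index of df_main, using Series.map. This is a sketch rather than part of the original answer: the name lookup is hypothetical, and it assumes order_id is unique in df_additional (map raises an error when the lookup Series has a duplicated index).

```python
import pandas as pd

df_main = pd.DataFrame({
    'order_id': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7'],
    'b': [1, 2, 3, 4, 5, 6, 7],
})
df_additional = pd.DataFrame({
    'order_id': ['A1', 'A2', 'A3', 'A5', 'A6', 'A7', 'A8'],
    'b': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
})

# Build a lookup Series: index = order_id, values = replacement 'b'.
lookup = df_additional.set_index('order_id')['b']

# map() looks up each order_id of df_main in the lookup index.
# Ids with no match (here 'A4') become missing values, and ids that
# exist only in the lookup ('A8') are never used.
df_main['b'] = df_main['order_id'].map(lookup)
print(df_main)
```

Unlike the np.where attempt in the question, this matches on the order_id values rather than on row position, so a gap such as 'A4' does not shift the rows that follow it.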
