Replace values in dataframe from another dataframe with Pandas
I have 3 dataframes: df1, df2 and df3. I am trying to fill the NaN values of df1 with some values contained in df2. The values taken from df2 are chosen according to the output of a simple function (mul_val) that processes some data stored in df3.
I was able to get this result, but I would like to find a simpler way that also gives more readable code.
Here is what I have so far:
import pandas as pd
import numpy as np

# simple function
def mul_val(a,b):
    return a*b

# dataframe 1
data = {'Name':['PINO','PALO','TNCO' ,'TNTO','CUCO' ,'FIGO','ONGF','LABO'],
        'Id'  :[  10  ,   9  ,np.nan ,  14  ,   3   ,np.nan,   7  ,np.nan]}
df1 = pd.DataFrame(data)

# dataframe 2
infos = {'Info_a':[10,20,30,40,70,80,90,50,60,80,40,50,20,30,15,11],
         'Info_b':[10,30,30,60,10,85,99,50,70,20,30,50,20,40,16,17]}
df2 = pd.DataFrame(infos)

# dataframe 3
dic = {'Name': {0: 'FIGO', 1: 'TNCO'},
       'index': {0: [5, 6], 1: [11, 12, 13]}}
df3 = pd.DataFrame(dic)

#---------------Modify from here in the most efficient way!-----------------
for idx,row in df3.iterrows():
    store_val = []
    print(row['Name'])
    for j in row['index']:
        store_val.append([mul_val(df2['Info_a'][j],df2['Info_b'][j]),j])
    store_val = np.asarray(store_val)

    # - Identify which is the index of minimum value of the first column
    indx_min_val = np.argmin(store_val[:,0])

    # - Get the value relative number contained in the second column
    col_value = row['index'][indx_min_val]

    # Identify value to be replaced in df1
    value_to_be_replaced = df1['Id'][df1['Name']==row['Name']]

    # - Replace such value into the df1 having the same row['Name']
    df1['Id'].replace(to_replace=value_to_be_replaced,value=col_value, inplace=True)
By printing store_val at every iteration I get:
FIGO
[[6800 5]
[8910 6]]
TNCO
[[2500 11]
[ 400 12]
[1200 13]]
Let's do a simple example: considering FIGO, I identify 6800 as the minimum of 6800 and 8910. Therefore I select the number 5, which is then placed in df1. Repeating this operation for the remaining rows of df3 (in this case I have only 2 rows, but there could be many more), the final result should look like this:
In[0]: before                 In[0]: after
Out[0]:                       Out[0]:
     Id  Name                      Id  Name
0  10.0  PINO                 0  10.0  PINO
1   9.0  PALO                 1   9.0  PALO
2   NaN  TNCO     ----->      2  12.0  TNCO
3  14.0  TNTO                 3  14.0  TNTO
4   3.0  CUCO                 4   3.0  CUCO
5   NaN  FIGO     ----->      5   5.0  FIGO
6   7.0  ONGF                 6   7.0  ONGF
7   NaN  LABO                 7   NaN  LABO
Note: you can also remove the for loops if needed and use different types of containers to store the data (lists, arrays...); the important thing is that the final result is still a dataframe.
I can offer two similar options that achieve the same result as your loop in a couple of lines:
1. Using apply and fillna() (fillna is faster than combine_first by a factor of two):
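# idxmin() returns the row label of the smallest product
# (on current pandas, Series.argmin() returns a position instead)
df3['Id'] = df3.apply(lambda row: (df2.Info_a*df2.Info_b).loc[row['index']].idxmin(), axis=1)
df1 = df1.set_index('Name').fillna(df3.set_index('Name')).reset_index()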
2. Using a function (a lambda doesn't support assignment, so you have to apply a named function):
def f(row):
    # .loc replaces the removed .ix indexer; idxmin() returns the row label
    df1.loc[df1.Name==row['Name'], 'Id'] = (df2.Info_a*df2.Info_b).loc[row['index']].idxmin()

df3.apply(f, axis=1)
or a slight variant not relying on global definitions:
def f(row, df1, df2):
    # .loc replaces the removed .ix indexer; idxmin() returns the row label
    df1.loc[df1.Name==row['Name'], 'Id'] = (df2.Info_a*df2.Info_b).loc[row['index']].idxmin()

df3.apply(f, args=(df1,df2,), axis=1)
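The fillna versus combine_first comparison mentioned in option 1 can be checked with a small sketch like the one below. The frame `ids` is a hypothetical stand-in for df3 after its 'Id' column has been computed, and the helper names `with_fillna` and `with_combine_first` are made up for illustration. Timings depend on the machine and the pandas version, so the factor of two quoted above may not reproduce exactly.

```python
import timeit

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Name': ['PINO', 'PALO', 'TNCO', 'TNTO', 'CUCO', 'FIGO', 'ONGF', 'LABO'],
                    'Id': [10, 9, np.nan, 14, 3, np.nan, 7, np.nan]})

# Hypothetical stand-in for df3 once the 'Id' column has been computed
ids = pd.DataFrame({'Name': ['FIGO', 'TNCO'], 'Id': [5, 12]}).set_index('Name')
left = df1.set_index('Name')

def with_fillna():
    # Fill missing Ids from the rows of `ids` that share the same Name
    return left.fillna(ids).reset_index()

def with_combine_first():
    # combine_first aligns on the union of both indexes and may reorder rows,
    # so restore df1's original order before resetting the index
    return left.combine_first(ids).reindex(left.index).reset_index()

a = with_fillna()
b = with_combine_first()

# Both approaches should fill the same cells with the same values
same = np.allclose(a['Id'], b['Id'], equal_nan=True)

t_fillna = timeit.timeit(with_fillna, number=200)
t_combine = timeit.timeit(with_combine_first, number=200)
print(f"same result: {same}")
print(f"fillna: {t_fillna:.3f} s, combine_first: {t_combine:.3f} s (200 runs each)")
```

Both functions leave LABO as NaN, since no replacement is defined for it.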
Note that your solution, even though much more verbose, takes the least time on this small dataset (7.5 ms versus 9.5 ms for both of mine). It makes sense that the speeds are similar, since in both cases it's a matter of looping over the rows of df3.
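Since both approaches still loop over the rows of df3, one possible loop-free alternative is sketched below. It is not part of the original answer. It assumes a pandas version that provides DataFrame.explode (0.25 or later), and the names `candidates`, `best` and `result` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Name': ['PINO', 'PALO', 'TNCO', 'TNTO', 'CUCO', 'FIGO', 'ONGF', 'LABO'],
                    'Id': [10, 9, np.nan, 14, 3, np.nan, 7, np.nan]})
df2 = pd.DataFrame({'Info_a': [10, 20, 30, 40, 70, 80, 90, 50, 60, 80, 40, 50, 20, 30, 15, 11],
                    'Info_b': [10, 30, 30, 60, 10, 85, 99, 50, 70, 20, 30, 50, 20, 40, 16, 17]})
df3 = pd.DataFrame({'Name': ['FIGO', 'TNCO'],
                    'index': [[5, 6], [11, 12, 13]]})

# One row per (Name, candidate row label of df2)
candidates = df3.explode('index').reset_index(drop=True)
candidates['index'] = candidates['index'].astype(int)

# Product Info_a * Info_b for every candidate row of df2
product = df2['Info_a'] * df2['Info_b']
candidates['product'] = product.loc[candidates['index']].to_numpy()

# For each Name, keep the candidate with the smallest product
best_rows = candidates.groupby('Name')['product'].idxmin()
best = candidates.loc[best_rows].set_index('Name')['index']

# Fill only the missing Ids; names without a candidate stay NaN
result = df1.copy()
result['Id'] = result['Id'].fillna(result['Name'].map(best))
print(result)
```

With only two rows in df3 this is unlikely to be faster than the loop; it would be worth timing only when df3 has many rows.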