I have 3 dataframes: df1
, df2
, df3
. I am trying to fill NaN
values of df1
with some values contained in df2
. The values selected from df2
are also selected according to the output of a simple function ( mul_val
) who processes some data stored in df3
.
I was able to get such result but I would like to find in a simpler, easier way and more readable code.
Here is what I have so far:
import pandas as pd
import numpy as np
# simple function
def mul_val(a,b):
return a*b
# dataframe 1
data = {'Name':['PINO','PALO','TNCO' ,'TNTO','CUCO' ,'FIGO','ONGF','LABO'],
'Id' :[ 10 , 9 ,np.nan , 14 , 3 ,np.nan, 7 ,np.nan]}
df1 = pd.DataFrame(data)
# dataframe 2
infos = {'Info_a':[10,20,30,40,70,80,90,50,60,80,40,50,20,30,15,11],
'Info_b':[10,30,30,60,10,85,99,50,70,20,30,50,20,40,16,17]}
df2 = pd.DataFrame(infos)
dic = {'Name': {0: 'FIGO', 1: 'TNCO'},
'index': {0: [5, 6], 1: [11, 12, 13]}}
df3 = pd.DataFrame(dic)
#---------------Modify from here in the most efficient way!-----------------
for idx,row in df3.iterrows():
store_val = []
print(row['Name'])
for j in row['index']:
store_val.append([mul_val(df2['Info_a'][j],df2['Info_b'][j]),j])
store_val = np.asarray(store_val)
# - Identify which is the index of minimum value of the first column
indx_min_val = np.argmin(store_val[:,0])
# - Get the value relative number contained in the second column
col_value = row['index'][indx_min_val]
# Identify value to be replaced in df1
value_to_be_replaced = df1['Id'][df1['Name']==row['Name']]
# - Replace such value into the df1 having the same row['Name']
df1['Id'].replace(to_replace=value_to_be_replaced,value=col_value, inplace=True)
By printing store_val
at every iteration I get:
FIGO
[[6800 5]
[8910 6]]
TNCO
[[2500 11]
[ 400 12]
[1200 13]]
Let's do a simple example: considering FIGO
, I identify 6800
as the minimum number between 6800
and 8910
. Therefore I select the number 5
who is placed in df1
. Repeating such operation for the remaining rows of df3
(in this case I have only 2 rows but they could be a lot more), the final result should be like this:
In[0]: before In[0]: after
Out[0]: Out[0]:
Id Name Id Name
0 10.0 PINO 0 10.0 PINO
1 9.0 PALO 1 9.0 PALO
2 NaN TNCO -----> 2 12.0 TNCO
3 14.0 TNTO 3 14.0 TNTO
4 3.0 CUCO 4 3.0 CUCO
5 NaN FIGO -----> 5 5.0 FIGO
6 7.0 ONGF 6 7.0 ONGF
7 NaN LABO 7 NaN LABO
Nore: you can also remove the for loops if needed and use different type of formats to store the data (list, arrays...); the important thing is that the final result is still a dataframe.
I can offer two similar options that achieve the same result than your loop in a couple of lines:
1.Using apply and fillna()
( fillna
is faster than combine_first
by a factor of two):
df3['Id'] = df3.apply(lambda row: (df2.Info_a*df2.Info_b).loc[row['index']].argmin(), axis=1)
df1 = df1.set_index('Name').fillna(df3.set_index('Name')).reset_index()
2.Using a function (lambda doesn't support assignment, so you have to apply a func)
def f(row):
df1.ix[df1.Name==row['Name'], 'Id'] = (df2.Info_a*df2.Info_b).loc[row['index']].argmin()
df3.apply(f, axis=1)
or a slight variant not relying on global definitions:
def f(row, df1, df2):
df1.ix[df1.Name==row['Name'], 'Id'] = (df2.Info_a*df2.Info_b).loc[row['index']].argmin()
df3.apply(f, args=(df1,df2,), axis=1)
Note that your solution, even though much more verbose, will take the least amount of time with this small dataset (7.5 ms versus 9.5 ms for both of mine). It makes sense that the speed would be similar, since in both cases it's a matter of looping on the rows of df3
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.