Pandas DataFrame - Fill NaNs of columns based on values of other columns
I have a wide data frame with several years:
df = pd.DataFrame(index=pd.Index([29925, 223725, 280165, 813285, 956765], name='ID'),
                  columns=pd.Index([1991, 1992, 1993, 1994, 1995, 1996, '2010-2012'], name='Year'),
                  data=np.array([[np.nan, np.nan, 16, 17, 18, 19, np.nan],
                                 [16, 17, 18, 19, 20, 21, np.nan],
                                 [np.nan, np.nan, np.nan, np.nan, 16, 17, 31],
                                 [np.nan, 22, 23, 24, np.nan, 26, np.nan],
                                 [36, 36, 37, 38, 39, 40, 55]]))
Year    1991  1992  1993  1994  1995  1996  2010-2012
ID
29925    NaN   NaN  16.0  17.0  18.0  19.0        NaN
223725  16.0  17.0  18.0  19.0  20.0  21.0        NaN
280165   NaN   NaN   NaN   NaN  16.0  17.0       31.0
813285   NaN  22.0  23.0  24.0   NaN  26.0        NaN
956765  36.0  36.0  37.0  38.0  39.0  40.0       55.0
The values in each row are the ages of one person, and each person has a unique ID. I want to fill the NaN values of this data frame in every year of every row, based on the existing age values in that row.
For example, ID 29925 is 16 in 1993, so we know they are 15 in 1992 and 14 in 1991; we therefore want to replace the NaN for 29925 in the columns 1992 and 1991. Similarly, I want to replace the NaN in the column 2010-2012 based on the existing age values for 29925. Let's assume that 29925 is 15 years older in the 2010-2012 column than in 1996. What is the fastest way to do this for the whole data frame, i.e. for all IDs?
# imports we need later
import numpy as np
import pandas as pd
This is not a particularly efficient method, but it works. I'll leave out your last column to make things more systematic.
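If the last column is needed as well, the assumption stated in the question (the 2010-2012 age is the 1996 age plus 15) could be applied afterwards. A minimal sketch, using a hypothetical two-row frame `df_q` built from values in the question:

```python
import numpy as np
import pandas as pd

# Two rows from the question, reduced to the columns that matter here.
# df_q is a name made up for this sketch.
df_q = pd.DataFrame(
    {1996: [19.0, 17.0], "2010-2012": [np.nan, 31.0]},
    index=pd.Index([29925, 280165], name="ID"),
)

# Assumption from the question: the 2010-2012 age is the 1996 age plus 15.
# fillna only touches missing entries, so the recorded 31 is kept as-is.
df_q["2010-2012"] = df_q["2010-2012"].fillna(df_q[1996] + 15)
print(df_q)
```

This only works once the 1996 column itself has no NaN left, so it would run after the year columns are filled.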
The df:
df = pd.DataFrame(index=pd.Index([29925, 223725, 280165, 813285, 956765], name='ID'),
                  columns=pd.Index([1991, 1992, 1993, 1994, 1995, 1996], name='Year'),
                  data=np.array([[np.nan, np.nan, 16, 17, 18, 19],
                                 [16, 17, 18, 19, 20, 21],
                                 [np.nan, np.nan, np.nan, np.nan, 16, 17],
                                 [np.nan, 22, 23, 24, np.nan, 26],
                                 [35, 36, 37, 38, 39, 40]]))
Calculate the year of birth for everyone:
dob = []
years = np.asarray([int(each) for each in df.columns])  # column labels as integers
for _, row in df.iterrows():
    # year minus age gives the year of birth; NaN ages stay NaN
    dob.append(years - row.to_numpy())
or, if you are into list comprehensions:
dob = [np.asarray([int(each) for each in df.columns]) - row.to_numpy() for _, row in df.iterrows()]
Now dob looks like this:
[array([ nan, nan, 1977., 1977., 1977., 1977.]),
 array([1975., 1975., 1975., 1975., 1975., 1975.]),
array([ nan, nan, nan, nan, 1979., 1979.]),
array([ nan, 1970., 1970., 1970., nan, 1970.]),
array([1956., 1956., 1956., 1956., 1956., 1956.])]
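The same matrix of implied birth years can probably be had without a Python-level loop, since NumPy broadcasting subtracts every row of ages from the one vector of years. A sketch under that assumption; `df_demo`, `years`, `ages` and `birth_years` are names introduced here, and only two of the rows are used:

```python
import numpy as np
import pandas as pd

# Two rows of the answer's frame, enough to show the shapes involved.
df_demo = pd.DataFrame(
    [[np.nan, np.nan, 16, 17, 18, 19],
     [np.nan, 22, 23, 24, np.nan, 26]],
    index=pd.Index([29925, 813285], name="ID"),
    columns=pd.Index([1991, 1992, 1993, 1994, 1995, 1996], name="Year"),
)

years = df_demo.columns.to_numpy(dtype=float)  # shape (6,)
ages = df_demo.to_numpy(dtype=float)           # shape (2, 6)

# Broadcasting: the (6,) vector of years is reused for every row of ages.
birth_years = years - ages
print(birth_years)
```

Unknown ages stay NaN in the result, exactly as in the loop version.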
Make a simpler dob list using np.unique, after removing the NaNs:
dob_filtered=[np.unique(each[~np.isnan(each)])[0] for each in dob]
dob_filtered now looks like this:
[1977.0, 1975.0, 1979.0, 1970.0, 1956.0]
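Note that np.unique(...)[0] silently picks the smallest implied birth year when a row contains more than one. In the question's original data, ID 956765 is 36 in both 1991 and 1992, which implies 1955 and 1956. A possible way to reduce each row and flag such cases at the same time, sketched with a hand-written `birth_years` array (an assumption for illustration, not output from the code above):

```python
import numpy as np

# Implied birth years per person (rows) and year (columns); NaN = age unknown.
birth_years = np.array([
    [np.nan, np.nan, 1977.0, 1977.0],
    [1955.0, 1956.0, 1956.0, 1956.0],  # mirrors ID 956765 in the question's data
])

first = np.nanmin(birth_years, axis=1)  # the value np.unique(...)[0] would pick
last = np.nanmax(birth_years, axis=1)

# True where one row implies more than one birth year.
inconsistent = first != last
print(first, inconsistent)
```

Both np.nanmin and np.nanmax emit a RuntimeWarning for a row that is entirely NaN, so people with no recorded age at all would need separate handling.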
Attach this list to the dataframe:
df['dob']=dob_filtered
Fill in the NaNs of the df using the dob column:
for row in df.index:
    for col in df.columns[:-1]:  # every year column; the last column is dob
        df.loc[row, col] = col - df.loc[row, 'dob']
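The double loop overwrites every cell, including ages that were already known. If only the missing cells should change, one option is to build the implied ages for the whole frame at once and pass them to fillna. A sketch with hypothetical names (`df_demo`, `implied`, `filled`) and a `dob` Series typed in by hand:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    [[np.nan, np.nan, 16, 17], [np.nan, 22, 23, 24]],
    index=pd.Index([29925, 813285], name="ID"),
    columns=pd.Index([1991, 1992, 1993, 1994], name="Year"),
)
dob = pd.Series([1977.0, 1970.0], index=df_demo.index)

# Age implied by the birth year for every (ID, year) pair: year - dob.
implied = pd.DataFrame(
    df_demo.columns.to_numpy(dtype=float) - dob.to_numpy()[:, None],
    index=df_demo.index,
    columns=df_demo.columns,
)

# fillna with a DataFrame aligns on index and columns and fills only the NaNs.
filled = df_demo.fillna(implied)
print(filled)
```

Because nothing is iterated in Python, this should scale better to many IDs than the nested loop, though that has not been timed here.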
Delete the dob column (just to get back the original columns; otherwise not important):
df.drop(['dob'],axis=1)
Obtaining:
Year    1991  1992  1993  1994  1995  1996
ID
29925   14.0  15.0  16.0  17.0  18.0  19.0
223725  16.0  17.0  18.0  19.0  20.0  21.0
280165  12.0  13.0  14.0  15.0  16.0  17.0
813285  21.0  22.0  23.0  24.0  25.0  26.0
956765  35.0  36.0  37.0  38.0  39.0  40.0