How to fill missing values based on column in pandas?
I have this DataFrame in pandas:
import pandas

df = pandas.DataFrame({
    "n": ["a", "b", "c", "a", "b", "x"],
    "t": [0, 0, 0, 1, 1, 1],
    "v": [10, 20, 30, 40, 50, 60]
})
How can it be filled with missing values such that every value of column t has the same entries in column n? That is, every t value should have entries for a, b, c, x, recorded as NaN if they are missing:
n t v
a 0 10
b 0 20
c 0 30
x NaN NaN
a 1 40
b 1 50
c NaN NaN
x 1 60
plan

- get the unique values of column 'n'. We'll use this to reindex by.
- we'll apply f to each group of column 't'. Reindexing by idx will ensure we get all elements of idx represented for each group of unique 't'.
- we set the index so that we can reindex in a bit.

idx = df.n.unique()
f = lambda x: x.reindex(idx)
df.set_index('n').groupby('t', group_keys=False).apply(f).reset_index()
n t v
0 a 0.0 10.0
1 b 0.0 20.0
2 c 0.0 30.0
3 x NaN NaN
4 a 1.0 40.0
5 b 1.0 50.0
6 c NaN NaN
7 x 1.0 60.0
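To make the reindexing step concrete, here is a small sketch (on the sample frame) of what idx holds and roughly what one group looks like after f is applied; group_keys=False keeps the group label t out of the index, so reset_index() restores n as an ordinary column:

idx = df.n.unique()   # array(['a', 'b', 'c', 'x'], dtype=object)

# one group, e.g. t == 1, after set_index('n') and reindexing by idx:
#      t     v
# n
# a  1.0  40.0
# b  1.0  50.0
# c  NaN   NaN
# x  1.0  60.0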
If there are no NaN values in df beforehand, you can create a MultiIndex and then reindex; the NaN values in t are then set based on column v:
cols = ["n", "t"]
df1 = df.set_index(cols)
mux = pd.MultiIndex.from_product(df1.index.levels, names=cols)
df1 = df1.reindex(mux).sort_index(level=[1,0]).reset_index()
df1['t'] = df1['t'].mask(df1['v'].isnull())
print (df1)
n t v
0 a 0.0 10.0
1 b 0.0 20.0
2 c 0.0 30.0
3 x NaN NaN
4 a 1.0 40.0
5 b 1.0 50.0
6 c NaN NaN
7 x 1.0 60.0
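For reference, mux here is just the Cartesian product of the observed n and t labels, so reindexing by it inserts the missing (n, t) combinations as NaN rows. Roughly, from_product builds:

print(mux)
# MultiIndex([('a', 0), ('a', 1), ('b', 0), ('b', 1),
#             ('c', 0), ('c', 1), ('x', 0), ('x', 1)],
#            names=['n', 't'])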
Another solution for adding NaN is the unstack / stack method:
cols = ["n", "t"]
df1 = df.set_index(cols)['v'].unstack().stack(dropna=False)
df1 = df1.sort_index(level=[1,0]).reset_index(name='v')
df1['t'] = df1['t'].mask(df1['v'].isnull())
print (df1)
n t v
0 a 0.0 10.0
1 b 0.0 20.0
2 c 0.0 30.0
3 x NaN NaN
4 a 1.0 40.0
5 b 1.0 50.0
6 c NaN NaN
7 x 1.0 60.0
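A rough sketch of the intermediate wide frame that unstack produces here, which shows where the new NaN cells come from; dropna=False in the subsequent stack is what keeps them instead of dropping them again:

print(df.set_index(cols)['v'].unstack())
# t      0     1
# n
# a   10.0  40.0
# b   20.0  50.0
# c   30.0   NaN
# x    NaN  60.0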
But if some NaN values may already be present in v, use groupby with loc on the unique values of the n column:
df = pd.DataFrame({"n": ["a", "b", "c", "a", "b", "x"],
"t": [0, 0, 0, 1, 1, 1],
"v": [10,20,30,40,50,np.nan]})
print (df)
n t v
0 a 0 10.0
1 b 0 20.0
2 c 0 30.0
3 a 1 40.0
4 b 1 50.0
5 x 1 NaN
df1 = (df.set_index('n')
         .groupby('t', group_keys=False)
         .apply(lambda x: x.loc[df.n.unique()])
         .reset_index())
print (df1)
n t v
0 a 0.0 10.0
1 b 0.0 20.0
2 c 0.0 30.0
3 x NaN NaN
4 a 1.0 40.0
5 b 1.0 50.0
6 c NaN NaN
7 x 1.0 NaN
df1 = (df.groupby('t', group_keys=False)
         .apply(lambda x: x.set_index('n').loc[df.n.unique()])
         .reset_index())
print (df1)
n t v
0 a 0.0 10.0
1 b 0.0 20.0
2 c 0.0 30.0
3 x NaN NaN
4 a 1.0 40.0
5 b 1.0 50.0
6 c NaN NaN
7 x 1.0 NaN
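One caveat, as an aside: on newer pandas versions (1.0+), .loc with a list that includes labels missing from the index raises a KeyError instead of inserting NaN rows, so a reindex-based variant of the same idea may be needed there. A sketch:

# on pandas >= 1.0, prefer reindex over .loc with possibly-missing labels
df1 = (df.groupby('t', group_keys=False)
         .apply(lambda x: x.set_index('n').reindex(df.n.unique()))
         .reset_index())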
From what I understand, you want every value in "n" to be equally distributed among the sub-groups grouped by "t". I'm also hoping that those "n" values cannot be duplicated within these sub-groups.

Considering these assumptions to be true, pd.pivot_table seems to be a good option for this use case. Here, the values under "n" constitute the column names, "t" becomes the grouped index, and the contents of the DataFrame get filled by the values under "v". Later, stack the DataFrame while preserving NaN entries and fill its corresponding cells in "t" with the .loc accessor.
df1 = pd.pivot_table(df, "v", "t", "n", "first").stack(dropna=False).reset_index(name="v")
df1.loc[df1['v'].isnull(), "t"] = np.nan
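For illustration, this is roughly the intermediate table pd.pivot_table builds before stacking, written here with keyword arguments for readability; "first" simply picks the single value present for each (t, n) pair:

wide = pd.pivot_table(df, values="v", index="t", columns="n", aggfunc="first")
print(wide)
# n      a     b     c     x
# t
# 0   10.0  20.0  30.0   NaN
# 1   40.0  50.0   NaN  60.0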
Seems like you're building it wrong. Normally NaN values are read in automatically, or you specify them. You can manually put in NaN values via np.nan if you have import numpy as np at the top. Alternatively, pandas used to expose numpy internally, so you could get a NaN via pandas.np.nan, but that shortcut is deprecated in recent pandas versions and importing numpy directly is preferred.
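A minimal sketch of that idea, building the already-filled frame by hand with explicit np.nan entries (the values simply mirror the target layout from the question):

import numpy as np
import pandas as pd

# write the missing (n, t) combinations explicitly as np.nan
df_filled = pd.DataFrame({
    "n": ["a", "b", "c", "x", "a", "b", "c", "x"],
    "t": [0, 0, 0, np.nan, 1, 1, np.nan, 1],
    "v": [10, 20, 30, np.nan, 40, 50, np.nan, 60],
})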