
How to fill missing values based on a column in pandas?

I have this DataFrame in pandas:

import pandas as pd

df = pd.DataFrame({
    "n": ["a", "b", "c", "a", "b", "x"],
    "t": [0, 0, 0, 1, 1, 1],
    "v": [10, 20, 30, 40, 50, 60],
})

How can I add rows of missing values so that every value of column t has the same entries in column n? That is, every t value should have entries for a, b, c, x, recorded as NaN where they are missing:

   n  t   v
   a  0  10
   b  0  20
   c  0  30
   x  NaN NaN
   a  1  40
   b  1  50
   c  NaN NaN
   x  1  60

Plan

  • Set 'n' as the index so that we can reindex on it.
  • Get the unique values of column 'n' as idx; these are the labels we reindex by.
  • Apply f within each group of column 't'. Reindexing by idx ensures that every element of idx is represented in each group of unique 't'.

idx = df.n.unique()
f = lambda x: x.reindex(idx)
df.set_index('n').groupby('t', group_keys=False).apply(f).reset_index()

   n    t     v
0  a  0.0  10.0
1  b  0.0  20.0
2  c  0.0  30.0
3  x  NaN   NaN
4  a  1.0  40.0
5  b  1.0  50.0
6  c  NaN   NaN
7  x  1.0  60.0
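
To make the per-group step explicit, here is a minimal sketch that does the same reindexing with a plain loop instead of groupby(...).apply. The names pieces and out are illustrative only, and the explicit rename_axis("n") is an added safeguard so the index keeps its name.

```python
import pandas as pd

df = pd.DataFrame({
    "n": ["a", "b", "c", "a", "b", "x"],
    "t": [0, 0, 0, 1, 1, 1],
    "v": [10, 20, 30, 40, 50, 60],
})

# Labels that every group of 't' should contain.
idx = df["n"].unique()

pieces = []
for t_value, group in df.groupby("t"):
    # Reindexing adds an all-NaN row for each label missing from this group.
    piece = group.set_index("n").reindex(idx).rename_axis("n")
    pieces.append(piece)

out = pd.concat(pieces).reset_index()
print(out)
```

Because the rows added by reindex are entirely NaN, column t is NaN for them as well, which is what the question asks for.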

If df contains no NaN beforehand, you can create a MultiIndex and then reindex. The NaN values in t are then set from column v:

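cols = ["n", "t"]
df1 = df.set_index(cols)
# all combinations of the existing n and t values
mux = pd.MultiIndex.from_product(df1.index.levels, names=cols)
# sort by t first, then by n
df1 = df1.reindex(mux).sort_index(level=[1, 0]).reset_index()
# blank out t wherever v is missing
df1['t'] = df1['t'].mask(df1['v'].isnull())
print(df1)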
   n    t     v
0  a  0.0  10.0
1  b  0.0  20.0
2  c  0.0  30.0
3  x  NaN   NaN
4  a  1.0  40.0
5  b  1.0  50.0
6  c  NaN   NaN
7  x  1.0  60.0
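
The "no NaN beforehand" condition matters because t is masked wherever v is null. A small sketch of what goes wrong otherwise, using a hypothetical frame df_nan in which the existing row (x, 1) already has a missing v:

```python
import numpy as np
import pandas as pd

# Same data as the question, but the existing row (x, 1) has v = NaN.
df_nan = pd.DataFrame({
    "n": ["a", "b", "c", "a", "b", "x"],
    "t": [0, 0, 0, 1, 1, 1],
    "v": [10, 20, 30, 40, 50, np.nan],
})

cols = ["n", "t"]
full = df_nan.set_index(cols)
mux = pd.MultiIndex.from_product(full.index.levels, names=cols)
full = full.reindex(mux).sort_index(level=[1, 0]).reset_index()

# Masking on v also blanks t for the row that existed but had v = NaN.
full["t"] = full["t"].mask(full["v"].isnull())
n_blank_t = int(full["t"].isna().sum())
print(full)
print(n_blank_t)
```

Only two combinations were really absent, (x, 0) and (c, 1), yet three rows end up with t blanked, because the existing (x, 1) row cannot be told apart from an added one.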

Another way to add the NaN rows is to use unstack followed by stack:

cols = ["n", "t"]
# unstack moves t into the columns; stack(dropna=False) brings it back and keeps the NaN
df1 = df.set_index(cols)['v'].unstack().stack(dropna=False)
df1 = df1.sort_index(level=[1, 0]).reset_index(name='v')
df1['t'] = df1['t'].mask(df1['v'].isnull())
print(df1)
   n    t     v
0  a  0.0  10.0
1  b  0.0  20.0
2  c  0.0  30.0
3  x  NaN   NaN
4  a  1.0  40.0
5  b  1.0  50.0
6  c  NaN   NaN
7  x  1.0  60.0
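
The same wide-then-long round trip can also be sketched with pivot and melt, which avoids relying on the dropna argument of stack (its handling is changing in newer pandas releases). This is a swapped-in variant, not the answer's own code; the names wide and long are illustrative, and like the answer it assumes df has no NaN beforehand.

```python
import pandas as pd

df = pd.DataFrame({
    "n": ["a", "b", "c", "a", "b", "x"],
    "t": [0, 0, 0, 1, 1, 1],
    "v": [10, 20, 30, 40, 50, 60],
})

# Wide form: one row per n, one column per t, NaN where a pair is absent.
wide = df.pivot(index="n", columns="t", values="v")

# melt goes back to long form and keeps the NaN cells.
long = wide.reset_index().melt(id_vars="n", var_name="t", value_name="v")

# Restore a numeric t, then blank it for the rows that were added.
long["t"] = pd.to_numeric(long["t"]).mask(long["v"].isna())
print(long)
```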

But if df already contains some NaN values, group by t and look up the unique values of column n within each group:

import numpy as np

df = pd.DataFrame({"n": ["a", "b", "c", "a", "b", "x"],
                   "t": [0, 0, 0, 1, 1, 1],
                   "v": [10, 20, 30, 40, 50, np.nan]})
print(df)
   n  t     v
0  a  0  10.0
1  b  0  20.0
2  c  0  30.0
3  a  1  40.0
4  b  1  50.0
5  x  1   NaN

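# .loc with labels missing from a group raises KeyError in pandas >= 1.0,
# so reindex is used here; it inserts NaN rows for the missing labels.
df1 = (df.set_index('n')
         .groupby('t', group_keys=False)
         .apply(lambda x: x.reindex(df.n.unique()))
         .reset_index())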

print (df1)
   n    t     v
0  a  0.0  10.0
1  b  0.0  20.0
2  c  0.0  30.0
3  x  NaN   NaN
4  a  1.0  40.0
5  b  1.0  50.0
6  c  NaN   NaN
7  x  1.0   NaN   

# same idea, with set_index moved inside the applied function
df1 = (df.groupby('t', group_keys=False)
         .apply(lambda x: x.set_index('n').reindex(df.n.unique()))
         .reset_index())
print(df1)
   n    t     v
0  a  0.0  10.0
1  b  0.0  20.0
2  c  0.0  30.0
3  x  NaN   NaN
4  a  1.0  40.0
5  b  1.0  50.0
6  c  NaN   NaN
7  x  1.0   NaN
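
Another way to keep existing NaN values apart from added rows is a left merge against the full grid of (t, n) pairs with indicator=True. This is a sketch of a different technique than the answer above; grid, merged, added and result are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "n": ["a", "b", "c", "a", "b", "x"],
    "t": [0, 0, 0, 1, 1, 1],
    "v": [10, 20, 30, 40, 50, np.nan],
})

# Every (t, n) combination, ordered by t and then by n.
grid = pd.MultiIndex.from_product(
    [df["t"].unique(), df["n"].unique()], names=["t", "n"]
).to_frame(index=False)

# indicator=True adds a "_merge" column: "both" for rows that exist in df,
# "left_only" for combinations that had to be added.
merged = grid.merge(df, on=["t", "n"], how="left", indicator=True)
added = merged["_merge"] == "left_only"

# Blank t only for the added rows, not for existing rows whose v is NaN.
merged["t"] = merged["t"].astype("float64").mask(added)
result = merged[["n", "t", "v"]]
print(result)
```

Here the row (x, 1) keeps t = 1.0 even though its v is NaN, because the mask is driven by the merge indicator rather than by v.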

From what I understand, you want every value in "n" to appear in each of the sub-groups grouped by "t". I'm also assuming that the "n" values are not duplicated within these sub-groups.

If these assumptions hold, pd.pivot_table seems to be a good option for this use case. Here, the values under "n" become the column names, "t" becomes the grouped index, and the contents of the DF are filled with the values under "v". Then stack the DF while preserving the NaN entries, and set the corresponding cells in "t" to NaN with the .loc accessor.

df1 = pd.pivot_table(df, values="v", index="t", columns="n", aggfunc="first")
df1 = df1.stack(dropna=False).reset_index(name="v")
df1.loc[df1['v'].isnull(), "t"] = np.nan
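
Since aggfunc "first" silently keeps only the first value for a repeated (t, n) pair, it may be worth checking the no-duplicates assumption before pivoting. A minimal sketch, with has_duplicates and the deliberately duplicated frame dup as illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    "n": ["a", "b", "c", "a", "b", "x"],
    "t": [0, 0, 0, 1, 1, 1],
    "v": [10, 20, 30, 40, 50, 60],
})

# True if any (t, n) pair occurs more than once.
has_duplicates = bool(df.duplicated(subset=["t", "n"]).any())
print(has_duplicates)

# Hypothetical frame that repeats the first row, to show a failing check.
dup = pd.concat([df, df.iloc[[0]]], ignore_index=True)
dup_has_duplicates = bool(dup.duplicated(subset=["t", "n"]).any())
print(dup_has_duplicates)
```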


It seems like you're building it the wrong way. Normally NaN values are read in automatically, or you specify them yourself. You can put in NaN manually with np.nan if you have import numpy as np at the top. Older pandas versions also exposed NumPy as pandas.np, so pandas.np.nan worked there, but that alias was deprecated in pandas 1.0 and has since been removed, so prefer np.nan.


 