How to impute missing values based on other variables

Question

I have a dataframe like the one below:

import numpy as np
import pandas as pd
df = pd.DataFrame({'one': pd.Series(['a', 'b', 'c', 'd', 'aa', 'bb', np.nan, 'b', 'c', np.nan, np.nan]),
                   'two': pd.Series([10, 20, 30, 40, 50, 60, 10, 20, 30, 40, 50])})

The first column holds the variables and the second column holds the values. Each variable's value is constant and never changes.

For example, the value of 'a' is 10; whenever 'a' appears, the corresponding value is 10.

Some values are missing in the first column, e.g. NaN 10, which is a, and NaN 40, which is d. The full dataframe contains 200 variables.

The values are not continuous variables; they are discrete and unsortable.

In this case, how can we impute the missing values? The expected output should be:

Please help me with this.

Regards, Venkat.

Answer 1

I think that, in general, it is better to group and fill. We use DataFrame.groupby:

df.groupby('two').apply(lambda x: x.ffill().bfill())

It can also be done without groupby, but you have to sort by both columns:

df.sort_values(['two','one']).ffill().sort_index()

Below I show how the method proposed in another answer may fail.

Here is an example:

df=pd.DataFrame({'one':['a',np.nan,'c','d',np.nan,'c','b','b',np.nan,'a'],'two':[10,20,30,40,10,30,20,20,30,10]})
print(df)

   one  two
0    a   10
1  NaN   20
2    c   30
3    d   40
4  NaN   10
5    c   30
6    b   20
7    b   20
8  NaN   30
9    a   10

df.sort_values(['two']).ffill().sort_index()


  one  two
0   a   10
1   a   20
2   c   30
3   d   40
4   a   10
5   c   30
6   b   20
7   b   20
8   c   30
9   a   10

As you can see, the method proposed in another answer fails here (see row 1). This happens because, after sorting, a NaN can be the first entry for a given value of column 'two', in which case it is filled with the value from the preceding group.
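The failure can be reproduced and located with a short sketch like the one below. It reuses the example dataframe from this answer; the names `by_two`, `naive`, `by_both` and `mismatch` are illustrative, and `kind='stable'` is an assumption added here so that the tie order within each group is deterministic.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one': ['a', np.nan, 'c', 'd', np.nan, 'c', 'b', 'b', np.nan, 'a'],
    'two': [10, 20, 30, 40, 10, 30, 20, 20, 30, 10],
})

# Sort by 'two' only. A stable sort keeps the original row order within
# ties, so row 1 (NaN) becomes the first row of the 20 group.
by_two = df.sort_values('two', kind='stable')
print(by_two)

# Forward fill then copies the last label of the 10 group into row 1.
naive = by_two.ffill().sort_index()

# Sorting by both columns places NaN last within each 'two' group
# (na_position='last' is the default), so the fill stays inside the group.
by_both = df.sort_values(['two', 'one']).ffill().sort_index()

# Index labels where the two approaches disagree.
mismatch = naive.index[naive['one'] != by_both['one']].tolist()
print(mismatch)
```

Printing `by_two` makes the problem visible: the NaN sits directly below the last row of the previous group.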

This doesn't happen if we group first:

df.groupby('two').apply(lambda x: x.ffill().bfill())

  one  two
0   a   10
1   b   20
2   c   30
3   d   40
4   a   10
5   c   30
6   b   20
7   b   20
8   c   30
9   a   10
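Depending on the pandas version, `groupby(...).apply(...)` may add the group key to the index or emit a warning about operating on the grouping column. A variant that fills only column 'one' with `transform` tends to avoid both; this is a sketch under that assumption rather than the answer's original code, and `filled` is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one': ['a', np.nan, 'c', 'd', np.nan, 'c', 'b', 'b', np.nan, 'a'],
    'two': [10, 20, 30, 40, 10, 30, 20, 20, 30, 10],
})

filled = df.copy()
# transform returns a Series aligned with the original index, so the
# result can be assigned back without reordering or resetting the index.
# ffill then bfill covers a NaN at either end of a group.
filled['one'] = (
    df.groupby('two')['one']
      .transform(lambda s: s.ffill().bfill())
)
print(filled)
```

A group whose labels are all NaN has nothing to fill from and stays NaN.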

As I said, we can use DataFrame.sort_values, but we need to sort by both columns. I recommend this method.

df.sort_values(['two','one']).ffill().sort_index()

  one  two
0   a   10
1   b   20
2   c   30
3   d   40
4   a   10
5   c   30
6   b   20
7   b   20
8   c   30
9   a   10
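Since the question states that each variable always has the same value, another option is to build a lookup table from the rows where the label is known and map it onto the missing ones. This is a different technique from the fills above, sketched here on the question's dataframe; `known` and `lookup` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one': ['a', 'b', 'c', 'd', 'aa', 'bb', np.nan, 'b', 'c', np.nan, np.nan],
    'two': [10, 20, 30, 40, 50, 60, 10, 20, 30, 40, 50],
})

# Keep one labelled row per value of 'two' and index it by that value.
known = df.dropna(subset=['one']).drop_duplicates(subset='two')
lookup = known.set_index('two')['one']

# Map every value of 'two' to its label, and use it only where 'one' is NaN.
df['one'] = df['one'].fillna(df['two'].map(lookup))
print(df)
```

This does not depend on row order, and a value of 'two' that never appears with a label is simply left as NaN.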

Answer 2

Here it is:

df.ffill(inplace=True)

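Output (a plain forward fill ignores column 'two', so rows 6, 9 and 10 take the label of the row above rather than the one matching their value):

   one  two
0    a   10
1    b   20
2    c   30
3    d   40
4   aa   50
5   bb   60
6   bb   10
7    b   20
8    c   30
9    c   40
10   c   50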

Answer 3

Try this:

df = df.sort_values(['two']).ffill().sort_index()

Which will give you:

   one  two
0    a   10
1    b   20
2    c   30
3    d   40
4   aa   50
5   bb   60
6    a   10
7    b   20
8    c   30
9    d   40
10  aa   50
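One way to sanity-check a filled result like the one above is to confirm that every value of 'two' ends up with exactly one label in 'one'. The sketch below assumes the question's dataframe and a stable sort; `filled`, `labels_per_value` and `inconsistent` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one': ['a', 'b', 'c', 'd', 'aa', 'bb', np.nan, 'b', 'c', np.nan, np.nan],
    'two': [10, 20, 30, 40, 50, 60, 10, 20, 30, 40, 50],
})

# Fill as in this answer; kind='stable' keeps labelled rows ahead of the
# later NaN rows that share the same value of 'two'.
filled = df.sort_values('two', kind='stable').ffill().sort_index()

# Count distinct labels per value of 'two'; more than one means a label
# leaked in from a neighbouring group.
labels_per_value = filled.groupby('two')['one'].nunique()
inconsistent = labels_per_value[labels_per_value > 1].index.tolist()
print(inconsistent)
```

An empty list means the fill is consistent with the one-value-per-variable rule stated in the question.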

Answer 1
2 2019-11-02 10:50:27

Answer 2
1 2019-11-02 09:58:06

Answer 3
1 2019-11-02 10:00:25