How to impute missing values based on other variables
I have a dataframe like the one below:
import numpy as np
import pandas as pd
df = pd.DataFrame({'one': ['a', 'b', 'c', 'd', 'aa', 'bb', np.nan, 'b', 'c', np.nan, np.nan],
                   'two': [10, 20, 30, 40, 50, 60, 10, 20, 30, 40, 50]})
The first column holds the variable names and the second column holds their values. Each variable's value is constant and never changes.
For example, the value of 'a' is 10: whenever 'a' appears, the corresponding value is 10.
Some entries are missing in the first column. For example, the NaN next to 10 should be 'a' and the NaN next to 40 should be 'd'. The real dataframe contains 200 variables.
The values are not continuous; they are discrete and unsortable.
In this case, how can we impute the missing values? Expected output should be:
Please help me with this.
Regards, Venkat.
I think in general it would be better to group and fill. We use DataFrame.groupby:
df.groupby('two').apply(lambda x: x.ffill().bfill())
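Depending on the pandas version, groupby(...).apply may add the group key to the index or warn about operating on the grouping column. A minimal sketch that sidesteps this, assuming it is acceptable to fill only the 'one' column, uses transform, which returns a result aligned to the original index. The name `filled` is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one': ['a', 'b', 'c', 'd', 'aa', 'bb', np.nan, 'b', 'c', np.nan, np.nan],
    'two': [10, 20, 30, 40, 50, 60, 10, 20, 30, 40, 50],
})

# Fill 'one' inside each group of equal 'two' values.
# transform keeps the original row order and index, so no re-sorting is needed.
filled = df.assign(
    one=df.groupby('two')['one'].transform(lambda s: s.ffill().bfill())
)
print(filled)
```

A group in which every label is missing stays NaN, because there is nothing to copy from.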
It can be done without using groupby, but you have to sort by both columns:
df.sort_values(['two','one']).ffill().sort_index()
Below I show how the method proposed in another answer may fail. Here is an example:
df=pd.DataFrame({'one':['a',np.nan,'c','d',np.nan,'c','b','b',np.nan,'a'],'two':[10,20,30,40,10,30,20,20,30,10]})
print(df)
one two
0 a 10
1 NaN 20
2 c 30
3 d 40
4 NaN 10
5 c 30
6 b 20
7 b 20
8 NaN 30
9 a 10
df.sort_values(['two']).ffill().sort_index()
one two
0 a 10
1 a 20
2 c 30
3 d 40
4 a 10
5 c 30
6 b 20
7 b 20
8 c 30
9 a 10
As you can see, the method proposed in another answer fails here (see row 1). This occurs because, after sorting, a NaN can be the first row for a specific value of column 'two', so it is filled with the value from the previous group.
This doesn't happen if we group first:
df.groupby('two').apply(lambda x: x.ffill().bfill())
one two
0 a 10
1 b 20
2 c 30
3 d 40
4 a 10
5 c 30
6 b 20
7 b 20
8 c 30
9 a 10
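One way to check a filled result, sketched here under the question's assumption that each value of 'two' belongs to exactly one label, is to count the distinct labels per value. The helper name `labels_per_value` is hypothetical, the grouped fill is written with transform for illustration, and `kind='mergesort'` is passed only to make the row order of the sort-only variant deterministic.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one': ['a', np.nan, 'c', 'd', np.nan, 'c', 'b', 'b', np.nan, 'a'],
    'two': [10, 20, 30, 40, 10, 30, 20, 20, 30, 10],
})

# Variant 1: sort by 'two' only, then forward fill (stable sort for reproducibility).
sorted_only = df.sort_values(['two'], kind='mergesort').ffill().sort_index()

# Variant 2: fill inside each group of equal 'two' values.
grouped = df.assign(
    one=df.groupby('two')['one'].transform(lambda s: s.ffill().bfill())
)

def labels_per_value(frame):
    """Count the distinct 'one' labels seen for each value of 'two'."""
    return frame.groupby('two')['one'].nunique()

print(labels_per_value(sorted_only))  # a count above 1 signals a wrong fill
print(labels_per_value(grouped))
```

Any count greater than 1 means some row received a label from a different group.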
As I said, we can use DataFrame.sort_values, but we need to sort by both columns. Within each value of 'two', NaN is sorted last by default, so the forward fill takes a label from the same group. I recommend this method.
df.sort_values(['two','one']).ffill().sort_index()
one two
0 a 10
1 b 20
2 c 30
3 d 40
4 a 10
5 c 30
6 b 20
7 b 20
8 c 30
9 a 10
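Because the question states that each value always belongs to the same variable, another option is an explicit lookup table. This is only a sketch of that idea, not a method taken from the answers here; the names `known`, `lookup` and `filled` are illustrative, and it assumes every value of 'two' appears at least once with a known label.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one': ['a', 'b', 'c', 'd', 'aa', 'bb', np.nan, 'b', 'c', np.nan, np.nan],
    'two': [10, 20, 30, 40, 50, 60, 10, 20, 30, 40, 50],
})

# Build a value -> label lookup from the rows where the label is known.
known = df.dropna(subset=['one']).drop_duplicates('two')
lookup = known.set_index('two')['one']

# Fill only the missing labels by mapping 'two' through the lookup.
filled = df.assign(one=df['one'].fillna(df['two'].map(lookup)))
print(filled)
```

This does not depend on row order at all, and a value of 'two' with no known label simply stays NaN.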
Here it is:
df.ffill(inplace=True)
Output:
one two
0 a 10
1 b 20
2 c 30
3 d 40
4 aa 50
5 bb 60
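6 bb 10
7 b 20
8 c 30
9 c 40
10 c 50
Note that a plain forward fill copies the label from the previous row regardless of the value in column 'two', so rows 6, 9 and 10 do not get the expected labels.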
Try this:
df = df.sort_values(['two']).ffill().sort_index()
Which will give you:
one two
0 a 10
1 b 20
2 c 30
3 d 40
4 aa 50
5 bb 60
6 a 10
7 b 20
8 c 30
9 d 40
10 aa 50