Pandas Fillna Mode
I have a data set with a column called Native Country, which contains around 30,000 records. Some values are missing, represented by NaN, so I thought I would fill them with the mode() value. I wrote something like this:
data['Native Country'].fillna(data['Native Country'].mode(), inplace=True)
However, when I count the missing values:
for col_name in data.columns:
    print("column:", col_name, ". Missing:", sum(data[col_name].isnull()))
It still reports the same number of NaN values for the Native Country column.
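As a side note, the per-column loop above can be written more compactly. The sketch below assumes a small made-up DataFrame standing in for the real data set (the `Age` column and all values are hypothetical) and uses `DataFrame.isnull().sum()` to count missing values in every column at once.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the data set in the question
data = pd.DataFrame({
    "Native Country": ["United-States", np.nan, "Mexico", "United-States", np.nan],
    "Age": [39, 50, np.nan, 28, 37],
})

# isnull() gives a boolean frame; sum() counts the True values per column
missing_per_column = data.isnull().sum()
print(missing_per_column)
```

Running this before and after a fill makes it easy to confirm whether the NaN count actually changed.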
Just take the first element of the Series returned by mode():
data['Native Country'].fillna(data['Native Country'].mode()[0], inplace=True)
Or you can do the same with assignment:
data['Native Country'] = data['Native Country'].fillna(data['Native Country'].mode()[0])
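Note that NaN could be the mode of your DataFrame column, in which case you would be replacing NaN with another NaN. With the default dropna=True, mode() ignores NaN, so this can only happen if you pass dropna=False.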
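To see why the original call leaves NaN values behind, here is a minimal sketch on a made-up Series (the values are illustrative only). `mode()` returns a Series indexed from 0, and `fillna` with a Series aligns on index labels, so only rows whose label also appears in the mode Series can be filled.

```python
import numpy as np
import pandas as pd

# Illustrative data, not the data set from the question
s = pd.Series([np.nan, "France", np.nan, "Spain", "France"], name="Country")

mode_values = s.mode()                    # a Series with a single entry at index 0
aligned = s.fillna(mode_values)           # aligns on index: only label 0 can be filled
scalar = s.fillna(mode_values.iloc[0])    # a scalar fills every NaN

print(aligned.isna().sum())  # NaN count after filling with the Series
print(scalar.isna().sum())   # NaN count after filling with the scalar
```

If no missing value happens to sit at index label 0, filling with the Series changes nothing, which matches the behaviour described in the question.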
If we fill in the missing values with fillna(df['colX'].mode()), then because the result of mode() is a Series, only the first few rows get filled: those whose index labels match the index of the mode Series. At least, that is what happens when it is done as below:
fill_mode = lambda col: col.fillna(col.mode())
df.apply(fill_mode, axis=0)
However, by simply taking the first value of the Series, as in fillna(df['colX'].mode()[0]), I think we risk introducing unintended bias into the data. If the sample is multimodal, taking just the first mode value makes an already biased imputation method worse. For example, we would take only 0 when [0, 21, 99] are the equally most frequent values (mode() returns its values sorted, so the first is always the smallest), or fill missing values with False when True and False are equally frequent in a given column.
I don't have a clear-cut solution here. If using the mode is a necessity, assigning a random value drawn from all the modes (the equally most frequent values) could be one approach.
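The random-assignment idea could be sketched roughly as follows. This is only one possible reading of it, not an established method: the helper name `fill_with_random_mode` is hypothetical, and the sketch assumes that drawing uniformly among the tied modes with NumPy's `Generator.choice` is acceptable for the data at hand.

```python
import numpy as np
import pandas as pd

def fill_with_random_mode(col, rng):
    """Fill NaN in col with values drawn uniformly from its modes (hypothetical helper)."""
    modes = col.mode()            # NaN is ignored by default (dropna=True)
    if modes.empty:               # column holds only NaN: nothing to draw from
        return col
    mask = col.isna()
    # One independent draw per missing value
    picks = rng.choice(modes.to_numpy(), size=int(mask.sum()))
    out = col.copy()
    out.loc[mask] = picks
    return out

rng = np.random.default_rng(0)    # seeded only to make the sketch repeatable
s = pd.Series([0, 0, 21, 21, 99, 99, np.nan, np.nan, np.nan])
filled = fill_with_random_mode(s, rng)
print(filled)
```

Each missing entry gets its own draw, so the filled values are spread across the tied modes rather than all collapsing onto the smallest one.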
import numpy as np
import pandas as pd
print(pd.__version__)
1.2.0
df = pd.DataFrame({'Country': [np.nan, 'France', np.nan, 'Spain', 'France'], 'Purchased': [np.nan,'Yes', 'Yes', 'No', np.nan]})
|   | Country | Purchased |
|---|---|---|
| 0 | NaN | NaN |
| 1 | France | Yes |
| 2 | NaN | Yes |
| 3 | Spain | No |
| 4 | France | NaN |
df.fillna(df.mode()) ## only applied on first row because df.mode() returns a dataframe with one row
|   | Country | Purchased |
|---|---|---|
| 0 | France | Yes |
| 1 | France | Yes |
| 2 | NaN | Yes |
| 3 | Spain | No |
| 4 | France | NaN |
df = pd.DataFrame({'Country': [np.nan, 'France', np.nan, 'Spain', 'France'], 'Purchased': [np.nan,'Yes', 'Yes', 'No', np.nan]})
df.fillna(df.mode().iloc[0]) ## convert df to a series
|   | Country | Purchased |
|---|---|---|
| 0 | France | Yes |
| 1 | France | Yes |
| 2 | France | Yes |
| 3 | Spain | No |
| 4 | France | Yes |
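Try something like this, taking the first element of mode() so that every NaN in the column is filled rather than only the rows whose index matches:
fill_mode = lambda col: col.fillna(col.mode().iloc[0])
and apply the function to each column:
new_df = df.apply(fill_mode, axis=0)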
You can compute the mode first (or a value from any other strategy) and then fill with it:
num = data['Native Country'].mode()[0]  # take the first mode as a scalar
data['Native Country'].fillna(num, inplace=True)
or in one line, like this:
data['Native Country'].fillna(data['Native Country'].mode()[0], inplace=True)
For those who came here (as I did) to fill NAs in multiple columns, grouped by multiple columns, and ran into the problem that mode() returns nothing for a group containing only NA values:
df[['col_to_fill_NA_1','col_to_fill_NA_2']] = df.groupby(['col_to_group_by_1', 'col_to_group_by_2'], dropna=False)[['col_to_fill_NA_1','col_to_fill_NA_2']].transform(lambda x: x.fillna(x.mode()[0]) if len(x.mode()) == 1 else x)
You can fill any number of "col_to_fill_NA" columns and group by any number of "col_to_group_by" columns. The if statement fills with the mode when the group has exactly one mode and otherwise returns the group unchanged, which covers groups that contain only NAs (where mode() is empty).
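A small worked example may make the group-wise behaviour easier to follow. The sketch below uses a made-up frame with one grouping column and one column to fill; the helper name `fill_group_mode` is hypothetical. Unlike the one-liner above, it fills with the first mode even when a group has several tied modes, and only leaves a group untouched when `mode()` is empty.

```python
import numpy as np
import pandas as pd

def fill_group_mode(x):
    """Fill NaN with the group's first mode; leave all-NaN groups as they are."""
    modes = x.mode()
    return x if modes.empty else x.fillna(modes.iloc[0])

# Illustrative data: group "b" contains only missing values
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b"],
    "value": ["x", "x", np.nan, np.nan, np.nan],
})

df["value"] = df.groupby("group")["value"].transform(fill_group_mode)
print(df)
```

Checking for an empty mode Series before indexing avoids the IndexError that `x.mode()[0]` would otherwise raise for a group with no observed values.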
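Notice: technical posts on this site follow the CC BY-SA 4.0 license. If you republish, please credit this site's URL or the original source. For any questions, contact yoyou2525@163.com.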