
Pandas Fillna Mode

I have a data set with a column called Native Country that contains around 30000 records. Some values are missing, represented by NaN, so I thought I would fill them with the mode() value. I wrote something like this:

data['Native Country'].fillna(data['Native Country'].mode(), inplace=True)

However, when I count the missing values:

for col_name in data.columns: 
    print ("column:",col_name,".Missing:",sum(data[col_name].isnull()))

I still get the same number of NaN values for the column Native Country.

Just take the first element of the Series that mode() returns:

data['Native Country'].fillna(data['Native Country'].mode()[0], inplace=True)

Or you can do the same with assignment:

data['Native Country'] = data['Native Country'].fillna(data['Native Country'].mode()[0])
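To see why the original call left the count unchanged, here is a minimal sketch on a made-up five-row frame (the values are illustrative and not taken from the asker's data set). It assumes the missing entries are not at index 0: mode() returns a Series indexed from 0, and fillna aligns a Series argument on the index, so only a NaN at a matching label would be filled.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real data set has around 30000 rows.
data = pd.DataFrame(
    {"Native Country": ["US", "US", np.nan, "Mexico", np.nan]}
)
col = data["Native Country"]

mode_series = col.mode()      # a Series with index [0], not a scalar
mode_value = mode_series[0]   # the scalar 'US'

# Passing the Series: fillna aligns on the index, and index 0 is not
# missing, so nothing is filled.
missing_with_series = col.fillna(mode_series).isna().sum()

# Passing the scalar: every NaN is replaced.
missing_with_scalar = col.fillna(mode_value).isna().sum()

print(data.isna().sum())      # per-column missing counts before filling
print(missing_with_series, missing_with_scalar)
```

data.isna().sum() gives the same per-column counts as the loop in the question. In recent pandas versions, calling fillna(..., inplace=True) on a selected column may emit a chained-assignment warning, so the assignment form is probably the safer of the two.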

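Note that NaN may be the most frequent value in your column. By default mode() ignores NaN (dropna=True), but if it is called with dropna=False, NaN can come back as the mode, and in that case you would be replacing NaN with another NaN.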
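A short sketch of that edge case, on made-up values. The helper safe_mode_fill is a hypothetical name introduced here for illustration; it simply skips the fill when no mode exists, which is an assumption about the desired behaviour rather than part of the original answer.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, np.nan, "France", "France", "Spain"])

default_mode = s.mode()            # NaN is ignored, so the mode is 'France'
nan_mode = s.mode(dropna=False)    # NaN is counted and wins with 3 occurrences

# A column with no observed values has no mode at all: mode() returns an
# empty Series, so indexing it with [0] would raise.
all_missing = pd.Series([np.nan, np.nan], dtype=object)
empty_mode = all_missing.mode()

def safe_mode_fill(col):
    """Fill NaN with the first mode, or return col unchanged if there is none."""
    modes = col.mode()
    return col if modes.empty else col.fillna(modes.iloc[0])

print(default_mode.tolist(), len(empty_mode))
```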

If we fill in the missing values with fillna(df['colX'].mode()), then, since the result of mode() is a Series, only the first couple of rows are filled: those whose index labels match the index of the mode Series. At least if done as below:

fill_mode = lambda col: col.fillna(col.mode())
df.apply(fill_mode, axis=0)
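The index alignment can be made visible with a small sketch. The column below is invented for illustration and has two tied modes, so the mode Series has labels 0 and 1 and only those two rows get filled.

```python
import numpy as np
import pandas as pd

# Hypothetical column: three missing values, then 'a' and 'b' twice each.
df = pd.DataFrame({"colX": [np.nan, np.nan, np.nan, "a", "a", "b", "b"]})

modes = df["colX"].mode()   # two tied modes: index 0 -> 'a', index 1 -> 'b'

fill_mode = lambda col: col.fillna(col.mode())
result = df.apply(fill_mode, axis=0)

# Rows 0 and 1 are filled from the matching labels of the mode Series;
# row 2 has no matching label and stays NaN.
print(result["colX"].tolist())
```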

However, by simply taking the first value of the Series, as in fillna(df['colX'].mode()[0]), I think we risk introducing unintended bias in the data. If the sample is multimodal, taking just the first mode value makes an already biased imputation method worse. For example, we would take only 0 if [0, 21, 99] are the equally most frequent values, or fill missing values with False when True and False are equally frequent in a given column.

I don't have a clear-cut solution here. If using the mode is a necessity, assigning a value chosen at random from all the tied modes (the local maxima of the frequency distribution) could be one approach.
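One possible way to sketch that idea is shown below. The function name fill_with_random_mode and the use of a seeded NumPy generator are assumptions made for illustration; drawing uniformly among the tied modes is just one choice, not an established recipe.

```python
import numpy as np
import pandas as pd

def fill_with_random_mode(col, rng):
    """Replace each NaN with a value drawn uniformly from the column's modes."""
    modes = col.mode()
    if modes.empty:              # nothing observed, so there is nothing to draw
        return col
    missing = col.isna()
    filled = col.copy()
    # One independent draw per missing entry.
    filled[missing] = rng.choice(modes.to_numpy(), size=int(missing.sum()))
    return filled

rng = np.random.default_rng(seed=0)   # seeded only to make the run repeatable
col = pd.Series([0, 21, 99, 0, 21, 99, np.nan, np.nan, np.nan, np.nan])
filled = fill_with_random_mode(col, rng)
print(filled.tolist())
```

Each missing entry ends up as 0, 21 or 99, so the fill no longer always favours the smallest of the tied values.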

import numpy as np
import pandas as pd

print(pd.__version__)

1.2.0

df = pd.DataFrame({'Country': [np.nan, 'France', np.nan, 'Spain', 'France'], 'Purchased': [np.nan,'Yes', 'Yes', 'No', np.nan]})
  Country Purchased
0     NaN       NaN
1  France       Yes
2     NaN       Yes
3   Spain        No
4  France       NaN
df.fillna(df.mode())  ## only applied to the first row because df.mode() returns a DataFrame with one row
  Country Purchased
0  France       Yes
1  France       Yes
2     NaN       Yes
3   Spain        No
4  France       NaN
df = pd.DataFrame({'Country': [np.nan, 'France', np.nan, 'Spain', 'France'], 'Purchased': [np.nan,'Yes', 'Yes', 'No', np.nan]})

df.fillna(df.mode().iloc[0]) ## convert df to a series
  Country Purchased
0  France       Yes
1  France       Yes
2  France       Yes
3   Spain        No
4  France       Yes

Try something like: fill_mode = lambda col: col.fillna(col.mode()) and then apply it: new_df = df.apply(fill_mode, axis=0)

You can compute the mode first (or a fill value from any other strategy) and then pass it to fillna:

num = data['Native Country'].mode()[0]  # take the scalar; mode() itself returns a Series
data['Native Country'].fillna(num, inplace=True)

Or in one line, like this:

data['Native Country'].fillna(data['Native Country'].mode()[0], inplace=True)

For those who came here (as I did) to fill NAs in multiple columns, grouped by multiple columns, and who have the problem that mode() returns nothing for groups that contain only NA values:

cols_to_fill = ['col_to_fill_NA_1', 'col_to_fill_NA_2']
group_cols = ['col_to_group_by_1', 'col_to_group_by_2']

df[cols_to_fill] = (
    df.groupby(group_cols, dropna=False)[cols_to_fill]
      .transform(lambda x: x.fillna(x.mode()[0]) if len(x.mode()) == 1 else x)
)

You can fill any number of "col_to_fill_NA" columns and group by any number of "col_to_group_by" columns. The if expression fills with the mode when the group has exactly one mode and returns the group unchanged (still NA) when the group contains only NAs. Note that, as written, the condition len(x.mode()) == 1 also leaves groups with several tied modes unfilled.
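A reduced sketch of the same pattern on invented data, with one grouping column and one column to fill. The names region, country and fill_group_mode are hypothetical. Unlike the one-liner above, this variant fills with the first mode whenever any mode exists, which is a deliberate change to the tie handling and may not suit every case.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the 'south' group has no observed country at all.
df = pd.DataFrame({
    "region": ["north", "north", "north", "south", "south"],
    "country": ["France", "France", np.nan, np.nan, np.nan],
})

def fill_group_mode(x):
    """Fill NaN with the group's first mode; leave all-NaN groups untouched."""
    modes = x.mode()
    return x.fillna(modes.iloc[0]) if not modes.empty else x

df["country_filled"] = df.groupby("region")["country"].transform(fill_group_mode)
print(df)
```

The north group is completed with 'France', while the south group stays NaN because there is nothing to take a mode of.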
