Pandas fillna with mode

Question

I have a data set with a column named Native Country that contains around 30,000 records. Some values are missing, represented by NaN, so I thought I would fill them with the mode() value. I wrote something like this:

data['Native Country'].fillna(data['Native Country'].mode(), inplace=True)

However, when I count the missing values:

for col_name in data.columns:
    print("column:", col_name, ".Missing:", data[col_name].isnull().sum())

it still reports the same number of NaN values for the Native Country column.

Answer 1

mode() returns a Series, so just take its first element:

data['Native Country'].fillna(data['Native Country'].mode()[0], inplace=True)

or you can do the same with assignment:

data['Native Country'] = data['Native Country'].fillna(data['Native Country'].mode()[0])
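To illustrate why the original call seems to do nothing, here is a minimal sketch on made-up data (the column values are assumptions for illustration only). fillna aligns a Series argument on index labels, and mode() returns a Series indexed from 0, so only a NaN sitting at label 0 could ever be filled.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the values are made up for illustration.
data = pd.DataFrame(
    {"Native Country": ["Cuba", np.nan, "India", "Cuba", np.nan]}
)
col = data["Native Country"]

modes = col.mode()              # a Series with a single entry at index 0
unchanged = col.fillna(modes)   # aligned on index; no NaN sits at label 0
filled = col.fillna(modes[0])   # a scalar fill reaches every NaN

print(unchanged.isna().sum())   # NaNs left after filling with the Series
print(filled.isna().sum())      # NaNs left after filling with the scalar
```

Passing the scalar is what makes the difference; the choice between inplace=True and assignment does not affect which rows are filled.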

Answer 2

Note that NaN may itself be the mode of your DataFrame: in that case you would be replacing NaN with another NaN.
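As a hedged illustration of this caveat: Series.mode() drops NaN by default (dropna=True), so NaN is only reported as the mode when dropna=False is passed. A related edge case is a column that is entirely NaN, where mode() returns an empty Series and indexing it with [0] raises. The helper name fill_with_mode below is hypothetical, not part of pandas.

```python
import numpy as np
import pandas as pd

def fill_with_mode(col: pd.Series) -> pd.Series:
    """Fill NaN with the first mode; return the column as-is if it has no mode."""
    modes = col.mode()   # dropna=True by default, so NaN is never returned here
    if modes.empty:      # happens when every value in the column is NaN
        return col
    return col.fillna(modes.iloc[0])

mostly_missing = pd.Series([np.nan, np.nan, "a"])
all_missing = pd.Series([np.nan, np.nan], dtype=object)

# With dropna=False, NaN counts as a value and wins here.
nan_is_mode = mostly_missing.mode(dropna=False).isna().iloc[0]
print(nan_is_mode)

print(fill_with_mode(mostly_missing).tolist())
print(fill_with_mode(all_missing).isna().all())
```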

Answer 3

If we fill in the missing values with fillna(df['colX'].mode()), the result of mode() is a Series, so only the first few rows, those whose index labels match the mode Series, get filled. At least that is what happens if it is done as below:

fill_mode = lambda col: col.fillna(col.mode())
df.apply(fill_mode, axis=0)

However, by simply taking the first value of the Series, fillna(df['colX'].mode()[0]), I think we risk introducing unintended bias into the data. If the sample is multimodal, taking just the first mode makes an already biased imputation method worse. For example, we would fill only with 0 if [0, 21, 99] are the equally most frequent values, or fill missing values with False when True and False are equally frequent in a given column.

I don't have a clear-cut solution here. If using the mode is a necessity, one approach could be to assign each missing value a random pick from all the tied modes (the local maxima of the frequency distribution).
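The random-pick idea could be sketched as follows. This is only one possible reading of the suggestion, not an established recipe: the helper name fill_with_random_mode, the uniform draw, and the fixed seed are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def fill_with_random_mode(col: pd.Series, rng: np.random.Generator) -> pd.Series:
    """Fill each NaN with a value drawn uniformly from the column's tied modes."""
    modes = col.mode().to_numpy()
    if modes.size == 0:          # all-NaN column: nothing to draw from
        return col
    missing = col.isna()
    filled = col.copy()
    # One independent draw per missing entry.
    filled[missing] = rng.choice(modes, size=int(missing.sum()))
    return filled

rng = np.random.default_rng(0)   # seeded only to make the sketch repeatable
col = pd.Series([0, 21, 99, 0, 21, 99, np.nan, np.nan, np.nan])
result = fill_with_random_mode(col, rng)
print(result.tolist())
```

Drawing per missing entry spreads the imputed values across the tied modes instead of piling them all onto the smallest one, which is what mode()[0] does because mode() returns its values sorted.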

Answer 4

import numpy as np
import pandas as pd

print(pd.__version__)

1.2.0

df = pd.DataFrame({'Country': [np.nan, 'France', np.nan, 'Spain', 'France'], 'Purchased': [np.nan,'Yes', 'Yes', 'No', np.nan]})

  Country Purchased
0     NaN       NaN
1  France       Yes
2     NaN       Yes
3   Spain        No
4  France       NaN

df.fillna(df.mode())  # fills only row 0, because df.mode() returns a DataFrame with a single row (index 0)

  Country Purchased
0  France       Yes
1  France       Yes
2     NaN       Yes
3   Spain        No
4  France       NaN

df = pd.DataFrame({'Country': [np.nan, 'France', np.nan, 'Spain', 'France'], 'Purchased': [np.nan,'Yes', 'Yes', 'No', np.nan]})

df.fillna(df.mode().iloc[0])  # .iloc[0] turns the one-row DataFrame into a Series keyed by column name

  Country Purchased
0  France       Yes
1  France       Yes
2  France       Yes
3   Spain        No
4  France       Yes
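One caveat worth sketching: df.mode() has a single row only when no column has tied modes. In the variation below (the data is changed from the example above purely for illustration), Country has two equally frequent values, so df.mode() returns two rows and pads the shorter column with NaN. Taking .iloc[0] still works, since row 0 always holds the first of the sorted modes for each column.

```python
import numpy as np
import pandas as pd

# Variation of the example data: 'France' and 'Spain' now tie in Country.
df = pd.DataFrame({
    "Country": [np.nan, "France", "Spain", "Spain", "France"],
    "Purchased": [np.nan, "Yes", "Yes", "No", np.nan],
})

modes = df.mode()
print(modes)   # two rows: Country is tied, Purchased is padded with NaN

filled = df.fillna(modes.iloc[0])   # row 0: first mode of each column
print(filled)
```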

Answer 5

Try something like:

fill_mode = lambda col: col.fillna(col.mode()[0])

and then apply the function:

new_df = df.apply(fill_mode, axis=0)

Answer 6

You can compute the mode (or the value from any other strategy) first and then fill with it:

num = data['Native Country'].mode()[0]
data['Native Country'].fillna(num, inplace=True)

or in one line, like this:

data['Native Country'].fillna(data['Native Country'].mode()[0], inplace=True)

Answer 7

For those who came here (as I did) to fill NAs in multiple columns, grouped by multiple columns, and ran into the problem that mode() returns nothing for groups that contain only NA values:

cols_to_fill = ['col_to_fill_NA_1', 'col_to_fill_NA_2']
group_cols = ['col_to_group_by_1', 'col_to_group_by_2']

df[cols_to_fill] = (
    df.groupby(group_cols, dropna=False)[cols_to_fill]
      .transform(lambda x: x.fillna(x.mode()[0]) if len(x.mode()) == 1 else x)
)

You can fill any number of "col_to_fill_NA" columns and group by any number of "col_to_group_by" columns. The if condition fills a group with its mode only when that group has exactly one mode; groups that contain only NAs (where mode() returns an empty Series), or that have several tied modes, are returned unchanged.
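A small worked sketch of the same pattern, reduced to one grouping column and one column to fill. The column names, the data, and the helper name fill_if_single_mode are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "c", "c", "c"],
    "value": ["x", "x", np.nan, np.nan, np.nan, "y", "z", np.nan],
})

def fill_if_single_mode(x: pd.Series) -> pd.Series:
    """Fill NaN with the group's mode only when that mode is unique."""
    modes = x.mode()
    return x.fillna(modes[0]) if len(modes) == 1 else x

df["value"] = (
    df.groupby("group", dropna=False)["value"].transform(fill_if_single_mode)
)
print(df)
```

Group "a" is filled with "x"; group "b" has only NAs and group "c" has two tied modes, so both keep their missing values.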

7 answers

Answer 1: 51, accepted, 2017-03-14 15:16:19
Answer 2: 6, 2018-06-06 10:04:27
Answer 3: 2, 2020-02-04 20:24:33
Answer 4: 1, 2021-01-19 16:18:20
Answer 5: 0, 2020-09-28 23:24:52
Answer 6: 0, 2021-01-09 14:12:22
Answer 7: 0, 2021-03-08 12:44:40
