
Pandas - filling NaNs in Categorical data

I am trying to fill missing values (NaN) using the code below:

NAN_SUBSTITUTION_VALUE = 1
g = g.fillna(NAN_SUBSTITUTION_VALUE)

but I am getting the following error:

ValueError: fill value must be in categories.

Could anybody please shed some light on this error?

Your question leaves out the important detail of what g is, in particular that it has the categorical dtype. I assume it is something like this:

g = pd.Series(["A", "B", "C", np.nan], dtype="category")

The problem you are experiencing is that fillna requires a value that already exists as a category. For instance, g.fillna("A") would work, but g.fillna("D") fails. To fill the series with a new value you can do:

g_without_nan = g.cat.add_categories("D").fillna("D")
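A minimal sketch of the behaviour described above, assuming g looks like the series guessed in this answer. The exception type seems to depend on the pandas version (the question shows a ValueError, while newer releases appear to raise a TypeError), so the sketch catches both.

```python
import numpy as np
import pandas as pd

# Assumed shape of the data: a categorical series with one missing value.
g = pd.Series(["A", "B", "C", np.nan], dtype="category")

# Filling with an existing category works.
filled_existing = g.fillna("A")

# Filling with an unknown value fails; catch both exception types
# because the one raised depends on the pandas version.
try:
    g.fillna("D")
    failed = False
except (ValueError, TypeError):
    failed = True

# Registering the new category first makes the fill valid.
g_without_nan = g.cat.add_categories("D").fillna("D")

print(filled_existing.tolist())
print(failed)
print(g_without_nan.tolist())
```

Note that add_categories returns a new series, so the original g still lacks the "D" category unless you assign the result back.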

Add the category before you fill:

g = g.cat.add_categories([1])
g = g.fillna(1)  # fillna returns a new object, so assign the result

Once you create categorical data, you can only insert values that are among its categories.

>>> df
    ID  value
0    0     20
1    1     43
2    2     45

>>> df["cat"] = df["value"].astype("category")
>>> df
    ID  value    cat
0    0     20     20
1    1     43     43
2    2     45     45

>>> df.loc[1, "cat"] = np.nan
>>> df
    ID  value    cat
0    0     20     20
1    1     43    NaN
2    2     45     45

>>> df.fillna(1)
ValueError: fill value must be in categories
>>> df.fillna(43)
    ID  value    cat
0    0     20     20
1    1     43     43
2    2     45     45
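To fill that DataFrame with a value that is not yet a category, one option is to register the value on the categorical column first. This is a sketch that rebuilds the df from the transcript above; filling only the "cat" column rather than the whole frame is my own choice here.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the transcript above.
df = pd.DataFrame({"ID": [0, 1, 2], "value": [20, 43, 45]})
df["cat"] = df["value"].astype("category")
df.loc[1, "cat"] = np.nan

# Register the fill value as a category on the categorical column only,
# then fill that column.
df["cat"] = df["cat"].cat.add_categories([1]).fillna(1)

print(df["cat"].tolist())
print(list(df["cat"].cat.categories))
```

The unused category 43 is still listed afterwards; setting a value to NaN does not remove its category.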


As many have said before, this error occurs because the column's dtype is 'category'.
I suggest converting it to string first, then calling fillna, and finally converting it back to category if needed. Note that the 'string' dtype expects a string fill value, so the substitute is converted with str(), and the resulting categories are strings.

g = g.astype('string')
# the 'string' dtype expects a string fill value
g = g.fillna(str(NAN_SUBSTITUTION_VALUE))
g = g.astype('category')
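A small worked example of this round trip, using made-up string categories. If your categories are not strings, their string form after the conversion may not be what you expect, so check the result.

```python
import numpy as np
import pandas as pd

NAN_SUBSTITUTION_VALUE = 1

# Hypothetical data: string categories with one missing value.
g = pd.Series(["A", "B", np.nan], dtype="category")

g = g.astype("string")                     # leave the categorical dtype
g = g.fillna(str(NAN_SUBSTITUTION_VALUE))  # fill with the string "1"
g = g.astype("category")                   # convert back

print(g.tolist())
print(g.dtype.name)
```

The filled entry is the string "1", not the integer 1, because every value passed through the string dtype.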

Sometimes you may want to replace NaN with values that are already present in your dataset. In that case you can use this:

import numpy as np

# collect the non-missing values of the column to sample from
# (df and field are your DataFrame and column name)
observed = df[field].dropna().to_numpy()

# replace every missing value with a randomly drawn observed value
mask = df[field].isna()
df.loc[mask, field] = np.random.choice(observed, size=mask.sum())

It works quite efficiently.
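A self-contained sketch of the same idea as a reusable function. The name fill_with_observed and the seed parameter are hypothetical, added so the random draw can be reproduced.

```python
import numpy as np
import pandas as pd


def fill_with_observed(series, seed=None):
    """Return a copy where each missing entry is a randomly drawn observed value."""
    rng = np.random.default_rng(seed)
    observed = series.dropna().to_numpy()   # values that are actually present
    mask = series.isna()                    # positions that need a value
    filled = series.copy()
    # Drawn values are existing categories, so the categorical dtype accepts them.
    filled.loc[mask] = rng.choice(observed, size=int(mask.sum()))
    return filled


s = pd.Series(["a", "b", None, "a", None], dtype="category")
filled = fill_with_observed(s, seed=0)
print(filled.isna().sum())
```

Because the replacements are sampled from the observed values, more frequent categories are more likely to be chosen, which roughly preserves the column's distribution.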

The underlying reason is given in the pandas documentation:

Categoricals can only take on a limited, and usually fixed, number of possible values (categories). In contrast to statistical categorical variables, a Categorical might have an order, but numerical operations (additions, divisions, …) are not possible.

All values of the Categorical are either in categories or np.nan. Assigning values outside of categories will raise a ValueError. Order is defined by the order of the categories, not lexical order of the values.
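These rules can be seen in a short sketch. The size labels are made up for illustration, and both ValueError and TypeError are caught because the exception raised seems to vary across pandas versions.

```python
import pandas as pd

# An ordered categorical whose category order is deliberately not alphabetical.
sizes = pd.Categorical(
    ["small", "large", "medium", None],
    categories=["small", "medium", "large"],
    ordered=True,
)

# Sorting follows the category order, not the lexical order of the labels;
# the missing value is sorted to the end.
sorted_labels = pd.Series(sizes).sort_values().tolist()

# Assigning a value outside the categories is rejected.
try:
    sizes[0] = "huge"
    rejected = False
except (ValueError, TypeError):
    rejected = True

print(sorted_labels[:3])
print(rejected)
```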

https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html
