How to replace values in multiple categoricals in a pandas DataFrame

I want to replace certain values in a DataFrame that contains several categorical columns.

import pandas as pd
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')

If I apply .replace to a single column, the result is as expected:

>>> df.s1.replace('a', 1)
0    1
1    b
2    c
Name: s1, dtype: object

If I apply the same operation to the whole DataFrame, it raises an error (abbreviated traceback):

>>> df.replace('a', 1)
ValueError: Cannot setitem on a Categorical with a new category, set the categories first

During handling of the above exception, another exception occurred:
ValueError: Wrong number of dimensions

If the DataFrame contains integer categories instead, the following happens:

df = pd.DataFrame({'s1': [1, 2, 3], 's2': [1, 3, 4]}, dtype='category')

>>> df.replace(1, 3)
    s1  s2
0   3   3
1   2   3
2   3   4

But:

>>> df.replace(1, 2)
ValueError: Wrong number of dimensions

What am I missing?

Without digging into it, this looks like a bug to me.

My workaround
pd.DataFrame.apply with pd.Series.replace
This has the advantage that you don't need to mess with changing any types.

df = pd.DataFrame({'s1': [1, 2, 3], 's2': [1, 3, 4]}, dtype='category')
df.apply(pd.Series.replace, to_replace=1, value=2)

  s1  s2
0  2   2
1  2   3
2  3   4

Or

df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
df.apply(pd.Series.replace, to_replace='a', value=1)

  s1 s2
0  1  1
1  b  c
2  c  d

@cᴏʟᴅsᴘᴇᴇᴅ's workaround

df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
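# Note: applymap was deprecated in pandas 2.1 in favor of DataFrame.map; df.astype(str) is an equivalent cast here.
df.applymap(str).replace('a', 1)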

  s1 s2
0  1  1
1  b  c
2  c  d
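
The question's output (dtype: object) suggests these workarounds can return plain object columns, and this may vary by pandas version. If the result should be categorical again, one option, which is not part of either workaround above, is to go through object dtype and cast back. Unlike applymap(str), this keeps the original value types. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'s1': [1, 2, 3], 's2': [1, 3, 4]}, dtype='category')

# Drop the categorical dtype, replace on plain values, then rebuild categoricals.
# Each column gets fresh categories inferred from its own new values.
result = df.astype(object).replace(1, 2).astype('category')

print(result['s1'].tolist())
print(result['s2'].tolist())
print(result.dtypes.tolist())
```

Note that the rebuilt columns still have independent category sets, so this only sidesteps the error for the replacement itself.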

The reason for this behavior is that each column has a different set of categories:

In [224]: df.s1.cat.categories
Out[224]: Index(['a', 'b', 'c'], dtype='object')

In [225]: df.s2.cat.categories
Out[225]: Index(['a', 'c', 'd'], dtype='object')

So if you replace with a value that is already a category in both columns, it works:

In [226]: df.replace('d','a')
Out[226]:
  s1 s2
0  a  a
1  b  c
2  c  a
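
This can be checked before calling replace. A small sketch, where columns_missing_category is a hypothetical helper (not a pandas function) that lists the columns whose categories lack the replacement value:

```python
import pandas as pd

def columns_missing_category(df, value):
    """Return the columns whose categories do not include `value`."""
    return [col for col in df.columns if value not in df[col].cat.categories]

df_str = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
df_int = pd.DataFrame({'s1': [1, 2, 3], 's2': [1, 3, 4]}, dtype='category')

print(columns_missing_category(df_str, 'a'))  # 'a' is a category in both columns
print(columns_missing_category(df_int, 3))    # 3 is a category in both columns
print(columns_missing_category(df_int, 2))    # 2 is missing from s2's categories
```

This lines up with the integer example in the question: replacing 1 with 3 worked because 3 is a category in both columns, while 2 is not a category of s2.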

As a solution, you might want to make your columns categorical manually, using:

pd.Categorical(..., categories=[...])

where categories would contain every possible value across all columns (including any value you intend to replace with).
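
A minimal sketch of that idea, assuming the shared list is simply the union of each column's existing categories. The names all_categories and shared_dtype are illustrative. It uses pd.CategoricalDtype, the dtype counterpart of pd.Categorical(..., categories=[...]), so the whole frame can be cast in one step:

```python
import pandas as pd

df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')

# Collect every category used by any column (illustrative names).
all_categories = sorted(set().union(*(df[col].cat.categories for col in df.columns)))

# One dtype shared by all columns; extend the list with any new replacement value.
shared_dtype = pd.CategoricalDtype(categories=all_categories)
df_shared = df.astype(shared_dtype)

print(df_shared['s1'].cat.categories.tolist())
print(df_shared['s2'].cat.categories.tolist())
```

Both columns now report the same categories while the data is unchanged. Whether df.replace then behaves as hoped may still depend on the pandas version, so treat this as a starting point.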
