
How to replace values in multiple categoricals in a pandas DataFrame

I want to replace certain values in a dataframe containing multiple categoricals.

import pandas as pd

df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')

If I apply .replace on a single column, the result is as expected:

>>> df.s1.replace('a', 1)
0    1
1    b
2    c
Name: s1, dtype: object

If I apply the same operation to the whole dataframe, it raises an error (traceback shortened):

>>> df.replace('a', 1)
ValueError: Cannot setitem on a Categorical with a new category, set the categories first

During handling of the above exception, another exception occurred:
ValueError: Wrong number of dimensions

If the dataframe contains integers as categories, the following happens:

df = pd.DataFrame({'s1': [1, 2, 3], 's2': [1, 3, 4]}, dtype='category')

>>> df.replace(1, 3)
    s1  s2
0   3   3
1   2   3
2   3   4

But,

>>> df.replace(1, 2)
ValueError: Wrong number of dimensions

What am I missing?

Without digging into the source, this looks like a bug to me.

My Workaround
Use pd.DataFrame.apply with pd.Series.replace
This has the advantage that you don't have to convert any dtypes before replacing.

df = pd.DataFrame({'s1': [1, 2, 3], 's2': [1, 3, 4]}, dtype='category')
df.apply(pd.Series.replace, to_replace=1, value=2)

  s1  s2
0  2   2
1  2   3
2  3   4

Or

df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
df.apply(pd.Series.replace, to_replace='a', value=1)

  s1 s2
0  1  1
1  b  c
2  c  d

@cᴏʟᴅsᴘᴇᴇᴅ's Workaround

df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
df.applymap(str).replace('a', 1)

  s1 s2
0  1  1
1  b  c
2  c  d
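
Depending on the pandas version, the workarounds above may return plain object columns (as in the single-column output in the question). If the categorical dtype is needed afterwards, one possible sketch, not taken from the answers above, is to go through object dtype explicitly and convert back. The name result is only illustrative, and each column gets its own categories again, inferred from its new values.

```python
import pandas as pd

df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')

# Drop the categorical dtype, replace on plain Python objects,
# then convert every column back to a categorical.
result = df.astype(object).replace('a', 1).astype('category')

print(result)
print(result.dtypes)
```

Because the replacement happens on object columns, it does not matter whether the new value was one of the original categories.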

The reason for this behavior is that each column has a different set of categories:

In [224]: df.s1.cat.categories
Out[224]: Index(['a', 'b', 'c'], dtype='object')

In [225]: df.s2.cat.categories
Out[225]: Index(['a', 'c', 'd'], dtype='object')

So if you replace with a value that is already a category in both columns, it works:

In [226]: df.replace('d','a')
Out[226]:
  s1 s2
0  a  a
1  b  c
2  c  a
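
The underlying restriction can be reproduced on a single pd.Categorical, without replace at all. This is a minimal sketch; the value 'x' is just an arbitrary value outside the categories, and the exact exception type is assumed to vary between pandas versions, so both are caught.

```python
import pandas as pd

cat = pd.Categorical(['a', 'b', 'c'])

try:
    cat[0] = 'x'  # 'x' is not one of the categories, so this is rejected
except (ValueError, TypeError) as exc:  # exception type depends on the pandas version
    print('rejected:', exc)

# Once 'x' is registered as a category, the same assignment succeeds.
cat = cat.add_categories(['x'])
cat[0] = 'x'
print(list(cat))
```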

As a solution, you might want to build the categorical columns manually with an explicit list of categories, using:

pd.Categorical(..., categories=[...])

where categories would contain every possible value across all columns, including any value you intend to replace with.
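
A minimal sketch of that idea, assuming the categories are all strings so they can be sorted. It uses pd.CategoricalDtype and astype rather than calling pd.Categorical per column, and the names all_categories, shared_dtype and df_shared are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')

# Collect every category that appears in any column.
all_categories = sorted(set().union(*(df[col].cat.categories for col in df.columns)))

# A single dtype shared by all columns, so any existing value is valid everywhere.
shared_dtype = pd.CategoricalDtype(categories=all_categories)
df_shared = df.astype(shared_dtype)

print(df_shared.s1.cat.categories.tolist())
print(df_shared.s2.cat.categories.tolist())
```

To allow a replacement value that does not occur in the data yet, it would have to be added to all_categories before building the dtype.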
