简体   繁体   中英

NaN replace on pandas DataFrame raises TypeError: No matching signature found

Purpose

I have a large DataFrame with varied dtypes where I have to perform a global .replace to turn both NaN, NaT and empty strings into None . The DataFrame looks like

import pandas as pd
from datetime import datetime

df = pd.DataFrame({
    'a': [n*10.0 for n in range(5)],
    'b': [datetime.now() if n%3 else None for n in range(5)],
    'c': pd.Series([f'D{n}' if n%2 else '' for n in range(5)], dtype='category'),
    'd': ['Long text chunk...' if n%3 else None for n in range(5)]
})

Which prints

      a                          b   c                   d
0   0.0                        NaT                    None
1  10.0 2020-08-13 23:35:55.533189  D1  Long text chunk...
2  20.0 2020-08-13 23:35:55.533189      Long text chunk...
3  30.0                        NaT  D3                None
4  40.0 2020-08-13 23:35:55.533189      Long text chunk...

My purpose is to bulk upload the rows into ElasticSearch, which won't accept NaN - neither NaT nor empty strings for date fields - without some setting changes I'm trying to avoid. I figured this way would be faster than individually checking every row when making the dicts.

Approach

Converting all columns to object before replacing wasn't even runnable due to the DataFrame size - I'd prefer not to convert any column at all. An approach that once worked was

df.fillna('').replace('', None)

But now, adding some category dtypes in, it raises TypeError: No matching signature found .

Question

Searching this, nothing I found was related to pandas at all. It's clearly linked to the category dtype¹, but what I don't know:

  • What's the most pythonic way of doing this while keeping integrity for all columns, especially the categorical ones ?

  • What happens behind the curtains for pandas to raise this apparently generic error in a .replace ?


¹ Edit:

I later found that the pandas implementation replace in this case reaches up to a Cython-compiled method - pandas._libs.algos.pad_inplace - which expects to fill any Series dtype except category . That's why my error mentions a signature mismatch. I still wonder if this is intended behavior, as I'd expect an ffill to work especially well in categorical columns.


Since my numeric columns were filled already, I changed column a here to reflect that. So my hassle is solely the category dtype.

The How

For one-off replace operations, it's good to avoid global conversions to object because that's costly processing-wise and memory-wise. But, as @hpaul mentioned in a comment, None is an object and not a primitive value, thus a Series must be of object type to contain it. eg a datetime Series will always turn None into NaT , because that's the primitive representation of the absence of a primitive date value . As is NaN for numeric dtypes and category .

Given that, I found this method to be best:

df.replace((np.nan, ''), (None, None))

As a result, we get:

      a                           b     c                   d
0   0.0                        None  None                None
1  10.0  2020-08-14 01:09:41.936421    D1  Long text chunk...
2  20.0  2020-08-14 01:09:41.936421  None  Long text chunk...
3  30.0                        None    D3                None
4  40.0  2020-08-14 01:09:41.936421  None  Long text chunk...

Due to also not relying on .astype or .fillna beforehand, this is both safer (better conversions¹) and more performant than other methods:

In [2]: %timeit -n 1000 df.replace((np.nan, ''), (None, None))
1.32 ms ± 47.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [3]: %timeit -n 1000 df.replace({np.nan: None, '': None})
                        # ^ pandas translates this into the first call,
                        # taking a few more milliseconds
1.36 ms ± 38.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: %timeit -n 1000 df.astype(object).where(df.notnull(), None).where(df != '', None)
2.83 ms ± 78.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

¹ pandas converts the dtypes it needs to (anything other than numerics and object itself) into object , but this method is faster because conversion is lazily done, and has the advantage of being implicitly handled by pandas. A demonstration:

In [5]: df.dtypes
a           float64
b    datetime64[ns]
c          category
d            object
dtype: object

Meanwhile, after the replace

In [6]: df.replace((np.nan, ''), (None, None)).dtypes
a    float64
b     object
c     object
d     object
dtype: object

The float64 column didn't have any empty values to replace, so it didn't change at all.

Do note this is not the same as .replace(np.nan, None).replace('', None) , which would result in the same TypeError , because...

The Why

The reason this TypeError happens goes way back into pandas' Cython implementation of the default replace method, which is called pad or forward fill. But it also has to do with API choices:

  • Cython issue: the method called in this scenario ( pandas._libs.algos.pad_inplace ) expects to fill any Series dtype except category , that's why the error mentions a signature mismatch.
  • API uncertainty: Passing None as a positional argument can be misleading - pandas treats this as if "you're not passing anything as the replace value" instead of "you're passing nothing as the replace value".

Notice what happens when converting the DataFrame to object and then using the same method that once worked:

In [7]: df.astype(object).fillna('').replace('', None)
      a                           b   c                   d
0
1  10.0  2020-08-13 21:18:42.520455  D1  Long text chunk...
2  20.0  2020-08-13 21:18:42.520455  D1  Long text chunk...
3  30.0  2020-08-13 21:18:42.520455  D3  Long text chunk...
4  40.0  2020-08-13 21:18:42.520455  D3  Long text chunk...

Values have been forward filled , as can be seen more easily in column c . This is because, in practice, .replace('', None) is the same as .replace('') , and pandas' API has taken the route of assuming the above is the kind of behavior sought by this operation - a plain forward fill. Except, as explained, that wouldn't work for category dtypes.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM