I have a large DataFrame with varied dtypes on which I have to perform a global .replace to turn NaN, NaT and empty strings into None. The DataFrame looks like
import pandas as pd
from datetime import datetime
df = pd.DataFrame({
'a': [n*10.0 for n in range(5)],
'b': [datetime.now() if n%3 else None for n in range(5)],
'c': pd.Series([f'D{n}' if n%2 else '' for n in range(5)], dtype='category'),
'd': ['Long text chunk...' if n%3 else None for n in range(5)]
})
Which prints
      a                           b   c                   d
0   0.0                         NaT                  None
1  10.0  2020-08-13 23:35:55.533189  D1  Long text chunk...
2  20.0  2020-08-13 23:35:55.533189      Long text chunk...
3  30.0                         NaT  D3               None
4  40.0  2020-08-13 23:35:55.533189      Long text chunk...
My purpose is to bulk-upload the rows into Elasticsearch, which won't accept NaN, nor NaT or empty strings for date fields, without some settings changes I'm trying to avoid. I figured a global replace would be faster than individually checking every row when making the dicts.
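For reference, the per-row approach I'm trying to avoid would look something like the sketch below (row_to_doc is a hypothetical helper name, not part of my code). It works, but it iterates in plain Python instead of letting pandas do a vectorized replace:

```python
import pandas as pd

def row_to_doc(row):
    """Swap NaN/NaT/empty strings for None in a single record dict."""
    return {k: (None if (v == '' or pd.isna(v)) else v) for k, v in row.items()}

# df.to_dict(orient='records') yields one plain dict per row
df = pd.DataFrame({'x': [1.0, float('nan')], 'y': ['', 'ok']})
docs = [row_to_doc(r) for r in df.to_dict(orient='records')]
# docs[0]['y'] and docs[1]['x'] are now None
```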
Converting all columns to object before replacing wasn't even runnable due to the DataFrame's size - I'd prefer not to convert any column at all. An approach that once worked was
df.fillna('').replace('', None)
But now, with some category dtypes added in, it raises TypeError: No matching signature found.
Searching for this error, nothing I found was related to pandas at all. It's clearly linked to the category dtype¹, but what I don't know is:
What's the most pythonic way of doing this while keeping integrity for all columns, especially the categorical ones?
What happens behind the curtains for pandas to raise this apparently generic error in a .replace?
¹ Edit: I later found that pandas' replace implementation in this case reaches a Cython-compiled method - pandas._libs.algos.pad_inplace - whose signatures cover every Series dtype except category. That's why the error mentions a signature mismatch. I still wonder whether this is intended behavior, as I'd expect an ffill to work especially well on categorical columns.
Since my numeric columns were already filled, I changed column a here to reflect that. So my hassle is solely the category dtype.
For one-off replace operations, it's good to avoid globally converting to object, because that's costly processing-wise and memory-wise. But, as @hpaul mentioned in a comment, None is an object and not a primitive value, so a Series must be of object dtype to contain it. E.g., a datetime Series will always turn None into NaT, because NaT is the primitive representation of a missing date value, just as NaN is for numeric and category dtypes.
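A quick sketch of that coercion: None survives only in an object Series, while typed Series immediately convert it to their own missing-value primitive.

```python
import pandas as pd
from datetime import datetime

dates = pd.Series([datetime(2020, 8, 13), None])  # dtype: datetime64[ns]
nums = pd.Series([1.0, None])                     # dtype: float64
objs = pd.Series([1.0, None], dtype=object)       # dtype: object

# None was coerced to NaT / NaN in the typed Series...
print(dates.iloc[1], nums.iloc[1])  # NaT nan
# ...but survives untouched in the object Series
print(objs.iloc[1])                 # None
```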
Given that, I found this method to be best:
import numpy as np
df.replace((np.nan, ''), (None, None))
As a result, we get:
      a                           b     c                   d
0   0.0                        None  None                None
1  10.0  2020-08-14 01:09:41.936421    D1  Long text chunk...
2  20.0  2020-08-14 01:09:41.936421  None  Long text chunk...
3  30.0                        None    D3                None
4  40.0  2020-08-14 01:09:41.936421  None  Long text chunk...
Since it also doesn't rely on .astype or .fillna beforehand, this is both safer (better conversions¹) and more performant than other methods:
In [2]: %timeit -n 1000 df.replace((np.nan, ''), (None, None))
1.32 ms ± 47.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [3]: %timeit -n 1000 df.replace({np.nan: None, '': None})
# ^ pandas translates this into the first call,
# taking a few more milliseconds
1.36 ms ± 38.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [4]: %timeit -n 1000 df.astype(object).where(df.notnull(), None).where(df != '', None)
2.83 ms ± 78.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
¹ pandas converts the dtypes it needs to (anything other than numerics and object itself) into object, but this method is faster because the conversion is done lazily, with the advantage of being handled implicitly by pandas. A demonstration:
In [5]: df.dtypes
a float64
b datetime64[ns]
c category
d object
dtype: object
Meanwhile, after the replace:
In [6]: df.replace((np.nan, ''), (None, None)).dtypes
a float64
b object
c object
d object
dtype: object
The float64 column didn't have any empty values to replace, so it wasn't changed at all.
Do note this is not the same as .replace(np.nan, None).replace('', None), which would raise the same TypeError, because...
The reason this TypeError happens goes way back into pandas' Cython implementation of the default replace method, which is a pad, or forward fill. But it also has to do with API choices: the Cython method (pandas._libs.algos.pad_inplace) expects to fill any Series dtype except category, which is why the error mentions a signature mismatch. Passing None as a positional argument can be misleading - pandas treats it as "you're not passing anything as the replace value" instead of "you're passing nothing as the replace value". Notice what happens when converting the DataFrame to object and then using the same method that once worked:
In [7]: df.astype(object).fillna('').replace('', None)
      a                           b   c                   d
0   0.0
1  10.0  2020-08-13 21:18:42.520455  D1  Long text chunk...
2  20.0  2020-08-13 21:18:42.520455  D1  Long text chunk...
3  30.0  2020-08-13 21:18:42.520455  D3  Long text chunk...
4  40.0  2020-08-13 21:18:42.520455  D3  Long text chunk...
Values have been forward-filled, as can be seen most easily in column c. This is because, in practice, .replace('', None) is the same as .replace(''), and pandas' API has taken the route of assuming that this is the behavior sought from the operation - a plain forward fill. Except, as explained, that doesn't work for category dtypes.
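To make that padding behavior concrete without relying on the .replace('', None) overload itself (whose semantics have shifted across pandas versions), the equivalent explicit operation on a column like c is a mask followed by an ffill - a sketch of the effect, not of pandas' internal implementation:

```python
import pandas as pd

c = pd.Series(['', 'D1', '', 'D3', ''], dtype=object)

# Mark the empty strings as missing, then forward-fill:
# this reproduces the 'pad' that .replace('', None) was silently performing
padded = c.mask(c == '').ffill()
print(padded.tolist())  # [nan, 'D1', 'D1', 'D3', 'D3']
```

The leading value stays missing because there is nothing before it to pad from, matching row 0 in the output above.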