NaN replace on pandas DataFrame raises TypeError: No matching signature found
I have a large DataFrame with varied dtypes where I have to perform a global .replace to turn NaN, NaT and empty strings into None. The DataFrame looks like:
import pandas as pd
from datetime import datetime
df = pd.DataFrame({
'a': [n*10.0 for n in range(5)],
'b': [datetime.now() if n%3 else None for n in range(5)],
'c': pd.Series([f'D{n}' if n%2 else '' for n in range(5)], dtype='category'),
'd': ['Long text chunk...' if n%3 else None for n in range(5)]
})
Which prints:
      a                          b   c                   d
0   0.0                        NaT                    None
1  10.0 2020-08-13 23:35:55.533189  D1  Long text chunk...
2  20.0 2020-08-13 23:35:55.533189      Long text chunk...
3  30.0                        NaT  D3                None
4  40.0 2020-08-13 23:35:55.533189      Long text chunk...
My purpose is to bulk upload the rows into ElasticSearch, which won't accept NaN (nor NaT or empty strings for date fields) without some setting changes I'm trying to avoid. I figured this way would be faster than individually checking every row when making the dicts.
Converting all columns to object before replacing wasn't even runnable due to the DataFrame size, and I'd prefer not to convert any column at all. An approach that once worked was:
df.fillna('').replace('', None)
But now, after adding some category dtypes, it raises TypeError: No matching signature found.
Searching for this error, nothing I found was related to pandas at all. It's clearly linked to the category dtype¹, but here is what I don't know:
What's the most pythonic way of doing this while keeping integrity for all columns, especially the categorical ones?
What happens behind the curtains for pandas to raise this apparently generic error in a .replace?
¹ Edit:
I later found that the pandas implementation of replace in this case ends up in a Cython-compiled method, pandas._libs.algos.pad_inplace, which expects to fill any Series dtype except category. That's why my error mentions a signature mismatch. I still wonder if this is intended behavior, as I'd expect an ffill to work especially well in categorical columns.
Since my numeric columns were filled already, I changed column a here to reflect that. So my hassle is solely the category dtype.
For one-off replace operations, it's good to avoid global conversions to object because that's costly both processing-wise and memory-wise. But, as @hpaul mentioned in a comment, None is an object and not a primitive value, thus a Series must be of object type to contain it. E.g. a datetime Series will always turn None into NaT, because that's the primitive representation of the absence of a primitive date value. The same goes for NaN in numeric dtypes and category.
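A quick way to see this is to build a few small Series and look at what the missing entry becomes. This is only a sketch, and the values and variable names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Each Series is built from one real value and one None.
dates = pd.Series([pd.Timestamp('2020-08-13'), None])   # datetime dtype
floats = pd.Series([1.5, None])                         # float64
cats = pd.Series(['D1', None], dtype='category')        # category
objs = pd.Series([1.5, None], dtype=object)             # object

# Only the object Series keeps None itself; the others store
# their own native "missing" marker instead.
for name, s in [('dates', dates), ('floats', floats),
                ('cats', cats), ('objs', objs)]:
    print(name, s.dtype, repr(s.iloc[1]))
```

Only the object Series hands None back unchanged, which is why any approach that ends with None in a column has to end with that column as object.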
Given that, I found this method to be best:
df.replace((np.nan, ''), (None, None))
As a result, we get:
      a                           b     c                   d
0   0.0                        None  None                None
1  10.0  2020-08-14 01:09:41.936421    D1  Long text chunk...
2  20.0  2020-08-14 01:09:41.936421  None  Long text chunk...
3  30.0                        None    D3                None
4  40.0  2020-08-14 01:09:41.936421  None  Long text chunk...
Because it doesn't rely on .astype or .fillna beforehand, this is both safer (better conversions¹) and more performant than other methods:
In [2]: %timeit -n 1000 df.replace((np.nan, ''), (None, None))
1.32 ms ± 47.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [3]: %timeit -n 1000 df.replace({np.nan: None, '': None})
# ^ pandas translates this into the first call,
# taking a few tens of microseconds more
1.36 ms ± 38.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [4]: %timeit -n 1000 df.astype(object).where(df.notnull(), None).where(df != '', None)
2.83 ms ± 78.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
¹ pandas converts the dtypes it needs to (anything other than numerics and object itself) into object, but this method is faster because the conversion is done lazily, and it has the advantage of being implicitly handled by pandas. A demonstration:
In [5]: df.dtypes
a float64
b datetime64[ns]
c category
d object
dtype: object
Meanwhile, after the replace:
In [6]: df.replace((np.nan, ''), (None, None)).dtypes
a float64
b object
c object
d object
dtype: object
The float64 column didn't have any empty values to replace, so it didn't change at all.
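For comparison, the explicit variant timed in In [4] can be sketched on a small stand-in frame. The frame below is a made-up, shortened version of the one in the question, and the names keep, cleaned and records are only illustrative. How replace treats None may differ between pandas versions, so this spelled-out form can serve as a fallback when the short call behaves differently:

```python
import pandas as pd

# Small stand-in for the question's frame (values are illustrative).
df = pd.DataFrame({
    'a': [0.0, 10.0, 20.0],
    'b': pd.to_datetime(['2020-08-13', None, '2020-08-14']),
    'c': pd.Series(['', 'D1', ''], dtype='category'),
    'd': pd.Series([None, 'Long text chunk...', None], dtype=object),
})

# Build the mask on the original frame: True where a cell holds
# a real value, False where it is missing or an empty string.
keep = df.notna() & (df != '')

# Cast to object first so None can be stored as-is, then put None
# into every cell the mask rejects.
cleaned = df.astype(object).where(keep, None)

# Row dicts, e.g. as input for a bulk upload.
records = cleaned.to_dict('records')
print(records)
```

The price is the up-front astype(object) on every column, which is exactly the cost the shorter replace call avoids for columns that have nothing to replace.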
Do note this is not the same as .replace(np.nan, None).replace('', None), which would result in the same TypeError, because...
The reason this TypeError happens goes way back into pandas' Cython implementation of the default replace method, which is called pad or forward fill. But it also has to do with API choices:
The Cython method that replace ends up calling (pandas._libs.algos.pad_inplace) expects to fill any Series dtype except category, that's why the error mentions a signature mismatch.
Passing None as a positional argument can be misleading: pandas treats this as if "you're not passing anything as the replace value" instead of "you're passing nothing as the replace value".
Notice what happens when converting the DataFrame to object and then using the same method that once worked:
In [7]: df.astype(object).fillna('').replace('', None)
a b c d
0
1 10.0 2020-08-13 21:18:42.520455 D1 Long text chunk...
2 20.0 2020-08-13 21:18:42.520455 D1 Long text chunk...
3 30.0 2020-08-13 21:18:42.520455 D3 Long text chunk...
4 40.0 2020-08-13 21:18:42.520455 D3 Long text chunk...
Values have been forward filled, as can be seen more easily in column c. This is because, in practice, .replace('', None) is the same as .replace(''), and pandas' API has taken the route of assuming the above is the kind of behavior sought by this operation: a plain forward fill. Except, as explained, that wouldn't work for category dtypes.
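Whether .replace('', None) still pads may depend on the pandas version, so the sketch below spells out both outcomes explicitly instead of relying on it. The Series c is a made-up object-dtype stand-in for column c, and padded and as_none are illustrative names:

```python
import pandas as pd

# Object-dtype stand-in for column c, with empty strings to clean up.
c = pd.Series(['', 'D1', '', 'D3', ''], dtype=object)

# What a pad / forward fill amounts to: treat the empty strings as
# missing, then carry the last valid value forward. The first entry
# has nothing before it, so it stays missing.
padded = c.mask(c == '').ffill()

# What was actually wanted: every empty string becomes None.
# The dict form states the replacement value unambiguously.
as_none = c.replace({'': None})

print(padded.tolist())
print(as_none.tolist())
```

Using the dict (or the tuple pair shown earlier) keeps None from being read as "no value given", which is what triggers the padding.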