NaN replace on pandas DataFrame raises TypeError: No matching signature found

Purpose

I have a large DataFrame with varied dtypes, on which I need a global .replace that turns NaN, NaT and empty strings into None. The DataFrame looks like this:

import pandas as pd
from datetime import datetime

df = pd.DataFrame({
    'a': [n*10.0 for n in range(5)],
    'b': [datetime.now() if n%3 else None for n in range(5)],
    'c': pd.Series([f'D{n}' if n%2 else '' for n in range(5)], dtype='category'),
    'd': ['Long text chunk...' if n%3 else None for n in range(5)]
})

Which prints:

      a                          b   c                   d
0   0.0                        NaT                    None
1  10.0 2020-08-13 23:35:55.533189  D1  Long text chunk...
2  20.0 2020-08-13 23:35:55.533189      Long text chunk...
3  30.0                        NaT  D3                None
4  40.0 2020-08-13 23:35:55.533189      Long text chunk...

My goal is to bulk upload the rows into Elasticsearch, which won't accept NaN (nor NaT or empty strings in date fields) without some setting changes I'm trying to avoid. I figured this would be faster than checking every row individually when building the dicts.

Approach

Converting all columns to object before replacing wasn't even runnable due to the DataFrame size, and I'd prefer not to convert any column at all. An approach that once worked was:

df.fillna('').replace('', None)

But now that some category dtypes have been added, it raises TypeError: No matching signature found.

Question

Searching for this error, nothing I found was related to pandas at all. It's clearly linked to the category dtype¹, but here is what I don't know:

  • What's the most pythonic way of doing this while keeping integrity for all columns, especially the categorical ones?

  • What happens behind the curtains for pandas to raise this apparently generic error in a .replace?


¹ Edit:

I later found that the pandas implementation of replace in this case reaches down to a Cython-compiled method, pandas._libs.algos.pad_inplace, which expects to fill any Series dtype except category. That's why my error mentions a signature mismatch. I still wonder if this is intended behavior, as I'd expect an ffill to work especially well in categorical columns.


Since my numeric columns were filled already, I changed column a here to reflect that. So my hassle is solely the category dtype.

The How

For one-off replace operations, it's good to avoid global conversions to object, because they are costly in both processing and memory. But, as @hpaul mentioned in a comment, None is an object and not a primitive value, so a Series must be of object dtype to contain it. For example, a datetime Series will always turn None into NaT, because that's the primitive representation of a missing date value. The same goes for NaN in numeric dtypes and category.
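To see that point in isolation, one small Series per dtype can be built and its missing slot inspected. This is only a minimal sketch; the variable names are illustrative and not part of the original post.

```python
import pandas as pd
from datetime import datetime

# The same missing entry (None) is stored differently depending on the dtype.
dates = pd.Series([datetime(2020, 8, 13), None])   # datetime column: None becomes NaT
nums = pd.Series([1.0, None])                      # float column: None becomes NaN
cats = pd.Series(['D1', None], dtype='category')   # category column: None becomes NaN
objs = pd.Series(['text', None], dtype=object)     # object column: None is kept as-is

missing = {
    'datetime': dates.iloc[1],
    'float': nums.iloc[1],
    'category': cats.iloc[1],
    'object': objs.iloc[1],
}
for name, value in missing.items():
    print(f'{name:>8}: {value!r}')
```

Only the object Series hands back a real None, which is why some conversion to object is unavoidable for the columns that contain gaps.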

Given that, I found this method to be best:

import numpy as np
df.replace((np.nan, ''), (None, None))

As a result, we get:

      a                           b     c                   d
0   0.0                        None  None                None
1  10.0  2020-08-14 01:09:41.936421    D1  Long text chunk...
2  20.0  2020-08-14 01:09:41.936421  None  Long text chunk...
3  30.0                        None    D3                None
4  40.0  2020-08-14 01:09:41.936421  None  Long text chunk...

Because it also doesn't rely on .astype or .fillna beforehand, this is both safer (better conversions¹) and more performant than other methods:

In [2]: %timeit -n 1000 df.replace((np.nan, ''), (None, None))
1.32 ms ± 47.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [3]: %timeit -n 1000 df.replace({np.nan: None, '': None})
                        # ^ pandas translates this into the first call,
                        # taking a few more microseconds
1.36 ms ± 38.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: %timeit -n 1000 df.astype(object).where(df.notnull(), None).where(df != '', None)
2.83 ms ± 78.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
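The slower variant timed in In [4] spells the conversion out explicitly, which can be handy when the behavior of .replace with None is in doubt for a given pandas version. Below is a hedged sketch of that route wrapped in a helper; the name to_none, the fixed timestamp and the final to_dict step are assumptions added for illustration, not code from the original post.

```python
import pandas as pd
from datetime import datetime


def to_none(frame):
    """Return an object-dtype copy with NaN, NaT and '' replaced by None."""
    obj = frame.astype(object)          # object columns are able to hold None
    keep = obj.notna() & (obj != '')    # True wherever a real value exists
    return obj.where(keep, None)


stamp = datetime(2020, 8, 13, 23, 35, 55)  # fixed value instead of datetime.now()
df = pd.DataFrame({
    'a': [n * 10.0 for n in range(5)],
    'b': [stamp if n % 3 else None for n in range(5)],
    'c': pd.Series([f'D{n}' if n % 2 else '' for n in range(5)], dtype='category'),
    'd': ['Long text chunk...' if n % 3 else None for n in range(5)],
})

# One dict per row, e.g. as input for a bulk upload.
records = to_none(df).to_dict('records')
print(records[0])
```

Unlike the .replace call, this converts every column to object up front, so it trades memory and speed for being explicit about what happens to each missing value.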

¹ pandas converts the dtypes it needs to (anything other than numerics and object itself) into object, but this method is faster because the conversion is done lazily, and it has the advantage of being handled implicitly by pandas. A demonstration:

In [5]: df.dtypes
a           float64
b    datetime64[ns]
c          category
d            object
dtype: object

Meanwhile, after the replace:

In [6]: df.replace((np.nan, ''), (None, None)).dtypes
a    float64
b     object
c     object
d     object
dtype: object

The float64 column didn't have any empty values to replace, so it didn't change at all.

Do note this is not the same as .replace(np.nan, None).replace('', None), which would result in the same TypeError, because...

The Why

The reason this TypeError happens goes way back to pandas' Cython implementation of the default replace method, which is called pad or forward fill. But it also has to do with API choices:

  • Cython issue: the method called in this scenario (pandas._libs.algos.pad_inplace) expects to fill any Series dtype except category, which is why the error mentions a signature mismatch.
  • API uncertainty: passing None as a positional argument can be misleading. pandas treats this as if "you're not passing anything as the replace value" instead of "you're passing nothing as the replace value".
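The API ambiguity in the second point is a general Python pitfall and can be illustrated without pandas. The sketch below is plain Python over lists, not pandas' actual code; every name in it (forward_fill, replace_none_default, replace_sentinel_default, _NOT_GIVEN) is hypothetical.

```python
_NOT_GIVEN = object()  # hypothetical sentinel meaning "argument was omitted"


def forward_fill(values, to_replace):
    """Replace each match with the closest preceding non-matching value."""
    filled, last = [], to_replace
    for item in values:
        if item != to_replace:
            last = item
        filled.append(last)
    return filled


def replace_none_default(values, to_replace, value=None):
    # None doubles as "omitted", so an explicit None also triggers the fill.
    if value is None:
        return forward_fill(values, to_replace)
    return [value if item == to_replace else item for item in values]


def replace_sentinel_default(values, to_replace, value=_NOT_GIVEN):
    # A dedicated sentinel keeps an explicit None meaningful.
    if value is _NOT_GIVEN:
        return forward_fill(values, to_replace)
    return [value if item == to_replace else item for item in values]


column = ['', 'D1', '', 'D3', '']
print(replace_none_default(column, '', None))      # forward filled
print(replace_sentinel_default(column, '', None))  # matches become None
```

With a None default, the function has no way to honor a caller who really wants None as the replacement, which is the situation described above.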

Notice what happens when converting the DataFrame to object and then using the same method that once worked:

In [7]: df.astype(object).fillna('').replace('', None)
      a                           b   c                   d
0
1  10.0  2020-08-13 21:18:42.520455  D1  Long text chunk...
2  20.0  2020-08-13 21:18:42.520455  D1  Long text chunk...
3  30.0  2020-08-13 21:18:42.520455  D3  Long text chunk...
4  40.0  2020-08-13 21:18:42.520455  D3  Long text chunk...

Values have been forward filled, as can be seen more easily in column c. This is because, in practice, .replace('', None) is the same as .replace(''), and pandas' API has taken the route of assuming the above is the kind of behavior sought by this operation: a plain forward fill. Except, as explained, that wouldn't work for category dtypes.
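For comparison, the same forward fill can be requested explicitly on an object column shaped like column c. This is a small sketch using Series.mask and Series.ffill rather than the .replace code path discussed above, so it only mirrors the result and not the internals.

```python
import pandas as pd

c = pd.Series(['', 'D1', '', 'D3', ''], dtype=object)

# Blank out the '' entries, then carry the previous value forward.
padded = c.mask(c == '').ffill()
print(padded.tolist())
```

The first entry has nothing before it to copy from, so it stays missing, just as row 0 stays blank in the output above.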
