np.isnan on arrays of dtype "object"

Question

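I'm working with numpy arrays of different data types. I would like to know which elements of any particular array are NaN. Normally, this is what np.isnan is for.

However, np.isnan isn't friendly to arrays of dtype object (or any string dtype):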

>>> str_arr = np.array(["A", "B", "C"])
>>> np.isnan(str_arr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Not implemented for this type

>>> obj_arr = np.array([1, 2, "A"], dtype=object)
>>> np.isnan(obj_arr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

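All I want from these two calls is np.array([False, False, False]). I can't just wrap my np.isnan call in a try and except TypeError and assume that any array that raises a TypeError contains no NaNs: after all, I want np.isnan(np.array([1, np.NaN, "A"])) to return np.array([False, True, False]).

My current solution is to create a new array of dtype np.float64, loop through the elements of the original array, try to put each element into the new array (leaving it as zero if that fails), and then call np.isnan on the new array. However, this is of course rather slow. (At least, for large object arrays.)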

def isnan(arr):
    if isinstance(arr, np.ndarray) and (arr.dtype == object):
        # Create a new array of dtype float64, fill it with the same values as the input array (where possible), and
        # then call np.isnan on the new array. This way, np.isnan is only called once. (Much faster than calling it on
        # every element in the input array.)
        new_arr = np.zeros((len(arr),), dtype=np.float64)
        for idx in xrange(len(arr)):
            try:
                new_arr[idx] = arr[idx]
            except Exception:
                pass
        return np.isnan(new_arr)
    else:
        try:
            return np.isnan(arr)
        except TypeError:
            return False

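This particular implementation also only works for one-dimensional arrays, and I can't think of a decent way to make the for loop run over an arbitrary number of dimensions.

Is there a more efficient way to figure out which elements of an object-dtype array are NaN?

EDIT: I'm running Python 2.7.10.

Note that [x is np.nan for x in np.array([np.nan])] returns [False]: np.nan is not always the same object in memory as a different np.nan.

I do not want the string "nan" to be treated as equivalent to np.nan: I want isnan(np.array(["nan"], dtype=object)) to return np.array([False]).

Multidimensionality isn't a big issue. (Nothing a little ravel-and-reshape won't fix. :p)

Any function that relies on the is operator to test the equivalence of two NaNs won't always work. (If you think it should, ask yourself what the is operator actually does!)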
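To make the point about is concrete, here is a small sketch (the variable names are mine, not from the question). It builds an object array holding three NaNs that are distinct Python objects and compares an identity test against the self-inequality test x != x.

```python
import numpy as np

# Three NaN values; only the first is the very object bound to the name np.nan.
nan_objects = np.array([np.nan, float("nan"), np.float64("nan")], dtype=object)

# Identity only recognises the np.nan object itself.
by_identity = [x is np.nan for x in nan_objects]

# NaN compares unequal to itself, whichever object happens to hold it.
by_self_inequality = [bool(x != x) for x in nan_objects]

print(by_identity)         # [True, False, False]
print(by_self_inequality)  # [True, True, True]
```

So an identity check silently misses any NaN that was produced by a computation or by a separate call to float("nan").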

Answer 1

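If you're willing to use the pandas library, a handy function that covers this case is pd.isnull:

pandas.isnull(obj)

Detect missing values (NaN in numeric arrays, None/NaN in object arrays)

Here is an example: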

$ python
>>> import numpy   
>>> import pandas
>>> array = numpy.asarray(['a', float('nan')], dtype=object)
>>> pandas.isnull(array)
array([False,  True])
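One thing worth checking before adopting this: as the quoted docstring says, pd.isnull also treats None as missing, which is broader than what np.isnan means. The sketch below (the array contents are my own example) shows that on a 2-D object array, and also that the string "nan" is not flagged.

```python
import numpy as np
import pandas as pd

# A 2-D object array mixing a string, a NaN, None, and the literal string "nan".
arr = np.array([["a", float("nan")], [None, "nan"]], dtype=object)

mask = pd.isnull(arr)

print(mask)
# [[False  True]
#  [ True False]]
```

If None must not count as NaN, the mask would need an extra filter, or one of the pure-NumPy approaches in the other answers.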

Answer 2

You could use a list comprehension to get the indexes of any NaNs, which may be faster in this case:

obj_arr = np.array([1, 2, np.nan, "A"], dtype=object)

inds = [i for i,n in enumerate(obj_arr) if str(n) == "nan"]

Or, if you want a boolean mask:

mask = [True if str(n) == "nan" else False for n in obj_arr]
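A caveat with the str(n) == "nan" test: it cannot tell a real NaN from the string "nan", which the question explicitly wants treated as not-NaN. The following sketch (my own example data) contrasts it with a type-checked test.

```python
import numpy as np

obj_arr = np.array([1, np.nan, "nan", "A"], dtype=object)

# String comparison: the literal string "nan" is a false positive.
str_mask = [str(n) == "nan" for n in obj_arr]

# Type check first, then np.isnan: only real float NaNs are flagged.
typed_mask = [bool(isinstance(n, float) and np.isnan(n)) for n in obj_arr]

print(str_mask)    # [False, True, True, False]
print(typed_mask)  # [False, True, False, False]
```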

Using is np.nan also seems to work, without needing to cast to str:

In [29]: obj_arr = np.array([1, 2, np.nan, "A"], dtype=object)

In [30]: [x is np.nan for x in obj_arr]
Out[30]: [False, False, True, False]

For flat and multidimensional arrays you could check the shape:

def masks(a):
    if len(a.shape) > 1:
        return [[x is np.nan for x in sub] for sub in a]
    return [x is np.nan for x in a]

If is np.nan can fail, check the type and then use np.isnan:

def masks(a):
    if len(a.shape) > 1:
        return [[isinstance(x, float) and np.isnan(x) for x in sub] for sub in a]
    return [isinstance(x, float) and np.isnan(x) for x in a]

Interestingly, x is np.nan seems to work fine when the dtype is object:

In [76]: arr = np.array([np.nan,np.nan,"3"],dtype=object)

In [77]: [x is np.nan  for x in arr]
Out[77]: [True, True, False]

In [78]: arr = np.array([np.nan,np.nan,"3"])

In [79]: [x is np.nan  for x in arr]
Out[79]: [False, False, False]

Different things happen depending on the dtype:

In [90]: arr = np.array([np.nan,np.nan,"3"])

In [91]: arr.dtype
Out[91]: dtype('S32')

In [92]: arr
Out[92]: 
array(['nan', 'nan', '3'], 
      dtype='|S32')

In [93]: [x == "nan"  for x in arr]
Out[93]: [True, True, False]

In [94]: arr = np.array([np.nan,np.nan,"3"],dtype=object)

In [95]: arr.dtype
Out[95]: dtype('O')

In [96]: arr
Out[96]: array([nan, nan, '3'], dtype=object)

In [97]: [x == "nan"  for x in arr]
Out[97]: [False, False, False]

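Apparently the NaNs get coerced to numpy.string_ when there are strings in the array, so x == "nan" works in that case. When you pass dtype=object the elements stay floats, so if you always use the object dtype the behaviour should be consistent.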
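The coercion can be checked directly. This is a minimal sketch with my own variable names; note that on Python 3 the coerced array has a unicode dtype (kind 'U') rather than the byte-string 'S32' shown in the Python 2 session above.

```python
import numpy as np

# Without dtype=object, mixing NaN and a string coerces everything to strings.
coerced = np.array([np.nan, np.nan, "3"])

# With dtype=object, each element keeps its original Python type.
kept = np.array([np.nan, np.nan, "3"], dtype=object)

print(coerced.dtype.kind)                # 'U' on Python 3 ('S' on Python 2)
print(coerced.tolist())                  # ['nan', 'nan', '3']
print([type(v).__name__ for v in kept])  # ['float', 'float', 'str']
```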

Answer 3

Define a couple of test arrays, small and bigger

In [21]: x=np.array([1,23.3, np.nan, 'str'],dtype=object)
In [22]: xb=np.tile(x,300)

Your function:

In [23]: isnan(x)
Out[23]: array([False, False,  True, False], dtype=bool)

A straightforward list comprehension, returning an array

In [24]: np.array([i is np.nan for i in x])
Out[24]: array([False, False,  True, False], dtype=bool)

np.frompyfunc has similar vectorizing power to np.vectorize, but for some reason is underused (and in my experience it is faster)

In [25]: def myisnan(x):
        return x is np.nan
In [26]: visnan=np.frompyfunc(myisnan,1,1)

In [27]: visnan(x)
Out[27]: array([False, False, True, False], dtype=object)

Since it returns dtype object, we may want to cast its values:

In [28]: visnan(x).astype(bool)
Out[28]: array([False, False,  True, False], dtype=bool)

It handles multidimensional arrays nicely:

In [29]: visnan(x.reshape(2,2)).astype(bool)
Out[29]: 
array([[False, False],
       [ True, False]], dtype=bool)

Now some timings:

In [30]: timeit isnan(xb)
1000 loops, best of 3: 1.03 ms per loop

In [31]: timeit np.array([i is np.nan for i in xb])
1000 loops, best of 3: 393 us per loop

In [32]: timeit visnan(xb).astype(bool)
1000 loops, best of 3: 382 us per loop

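An important point about the i is np.nan test: it only applies to scalars. If the array is dtype object, then iteration produces scalars. But for an array of dtype float, we get numpy.float64 values. For those, np.isnan(i) is the correct test.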

In [61]: [(i is np.nan) for i in np.array([np.nan,np.nan,1.3])]
Out[61]: [False, False, False]

In [62]: [np.isnan(i) for i in np.array([np.nan,np.nan,1.3])]
Out[62]: [True, True, False]

In [63]: [(i is np.nan) for i in np.array([np.nan,np.nan,1.3], dtype=object)]
Out[63]: [True, True, False]

In [64]: [np.isnan(i) for i in np.array([np.nan,np.nan,1.3],  dtype=object)]
...
TypeError: Not implemented for this type
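Since the identity test and np.isnan each fail for one of the two cases above, one option is to wrap a per-element test that avoids both in np.frompyfunc. This is only a sketch: the helper name _is_nan_scalar is hypothetical, and it assumes that only float instances (which include numpy.float64) should ever count as NaN.

```python
import numpy as np

def _is_nan_scalar(x):
    # Hypothetical helper: a float that is unequal to itself is NaN.
    # The isinstance check keeps strings and other objects out of the comparison.
    return isinstance(x, float) and x != x

# frompyfunc(func, nin, nout) returns a ufunc that yields object arrays.
visnan = np.frompyfunc(_is_nan_scalar, 1, 1)

# float("nan") is a different object from np.nan, so `is np.nan` would miss it.
x = np.array([1, 23.3, float("nan"), "str"], dtype=object).reshape(2, 2)
mask = visnan(x).astype(bool)

print(mask)
# [[False False]
#  [ True False]]
```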

Answer 4

I would use np.vectorize and a custom function that tests each element for NaN. So,

def _isnan(x):
    if isinstance(x, type(np.nan)):
        return np.isnan(x)
    else:
        return False

my_isnan = np.vectorize(_isnan)

Then

X = np.array([[1, 2, np.nan, "A"], [np.nan, True, [], ""]], dtype=object)
my_isnan(X)

returns

 array([[False, False,  True, False],
        [ True, False, False, False]], dtype=bool)
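One detail to be aware of with np.vectorize: without otypes it infers the output dtype by calling the function on the first element, so it cannot handle an empty input. Passing otypes=[bool] avoids that. The sketch below is my own variant of the function above, not the answer's exact code.

```python
import numpy as np

def _isnan(x):
    # Only floats can be NaN; NaN is the float that is unequal to itself.
    return isinstance(x, float) and x != x

# otypes fixes the output dtype up front, so empty inputs are accepted too.
my_isnan = np.vectorize(_isnan, otypes=[bool])

X = np.array([[1, 2, np.nan, "A"], [np.nan, True, None, ""]], dtype=object)
mask = my_isnan(X)
empty_mask = my_isnan(np.array([], dtype=object))

print(mask)
# [[False False  True False]
#  [ True False False False]]
print(empty_mask.shape, empty_mask.dtype)  # (0,) bool
```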

Answer 5

One way to do this without converting to a string or leaving the NumPy environment (also very important, IMO) is to use the equality definition of np.nan, where

In[1]: x=np.nan
In[2]: x==x
Out[2]: False

This only holds when x is NaN. So, for a NumPy array, the elementwise check

x!=x

returns True for every element that is NaN.
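As a quick illustration of the x != x idea (example data is mine): on a float array the comparison is vectorized, while for an object array the sketch below applies the same comparison element by element in Python, which does not depend on how a given NumPy version compares object arrays as a whole.

```python
import numpy as np

# Float array: the vectorized comparison marks exactly the NaN entries.
float_arr = np.array([1.0, np.nan, 3.0])
float_mask = float_arr != float_arr

# Object array: compare each element with itself at the Python level.
obj_arr = np.array([1, np.nan, "A"], dtype=object)
obj_mask = np.array([bool(v != v) for v in obj_arr.ravel()]).reshape(obj_arr.shape)

print(float_mask)  # [False  True False]
print(obj_mask)    # [False  True False]
```

This assumes every element's != returns something bool() can interpret; an element that is itself an array would need separate handling.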

Answer 6

Here is what I ended up building for myself:

FLOAT_TYPES = (float, np.float64, np.float32, complex, np.complex64, np.complex128)

def isnan(arr):
    """Equivalent of np.isnan, except made to also be friendly towards arrays of object/string dtype."""
    
    if isinstance(arr, np.ndarray):
        if arr.dtype == object:
            # An element can only be NaN if it's a float, and is not equal to itself. (NaN != NaN, by definition.)
            # NaN is the only float that doesn't equal itself, so "(x != x) and isinstance(x, float)" tests for NaN-ity.
            # Numpy's == checks identity for object arrays, so "x != x" will always return False, so can't vectorize.
            is_nan = np.array([((x != x) and isinstance(x, FLOAT_TYPES)) for x in arr.ravel()], dtype=bool)
            return is_nan.reshape(arr.shape)
        if arr.dtype.kind in "fc":  # Only [f]loats and [c]omplex numbers can be NaN
            return np.isnan(arr)
        return np.zeros(arr.shape, dtype=bool)
    
    if isinstance(arr, FLOAT_TYPES):
        return np.isnan(arr)
   
    return False

Problem description

6 solutions

Solution 1
12 2018-04-01 22:43:17

Solution 2
4 2016-03-24 10:57:22

Solution 3
1 2016-03-24 16:40:08

Solution 4
0 2016-03-24 11:18:49

Solution 5
0 2017-03-14 15:31:08

Solution 6
0 accepted 2020-07-27 08:09:28