Why does 'groupby(x, np.isnan)' behave differently to 'groupby(x) if key is nan'?

Since we're on the topic of peculiarities surrounding numpy's nan, I've discovered something that I don't understand either. I'm posting this question mainly as an extension of MSeifert's, since there may be a common reason for both of our observations.

Earlier on, I posted a solution that uses itertools.groupby on a sequence containing nan values:

return max((sum(1 for _ in group) for key, group in groupby(sequence) if key is nan), default=0)

However, I saw this answer on MSeifert's question linked above, which shows an alternative way I might have formulated this algorithm:

return max((sum(1 for _ in group) for key, group in groupby(sequence, np.isnan)), default=0)

Experiment

I've tested both of these variations with both lists and numpy arrays. The code and results are included below:

from itertools import groupby

from numpy import nan
import numpy as np


def longest_nan_run(sequence):
    return max((sum(1 for _ in group) for key, group in groupby(sequence) if key is nan), default=0)


def longest_nan_run_2(sequence):
    return max((sum(1 for _ in group) for key, group in groupby(sequence, np.isnan)), default=0)


if __name__ == '__main__':
    nan_list = [nan, nan, nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, nan, 0.16]
    nan_array = np.array(nan_list)

    print(longest_nan_run(nan_list))  # 3 - correct
    print(longest_nan_run_2(nan_list))  # 7 - incorrect
    print(longest_nan_run(nan_array))  # 0 - incorrect
    print(longest_nan_run_2(nan_array))  # 7 - incorrect

Analysis

  • Of all four combinations, only the original function applied to a list works as expected.
  • The modified function (using np.isnan) seems to work the same way for both lists and arrays.
  • The original function does not appear to find any nan values when checking arrays.

Can anyone explain these results? Again, as this question is related to MSeifert's, it's possible that an explanation of his results would explain mine too (or vice versa).


Further Investigation

To get a better picture of what's happening, I tried printing out the groups generated by groupby:

def longest_nan_run(sequence):
    print(list(list(group) for key, group in groupby(sequence) if key is nan))
    return max((sum(1 for _ in group) for key, group in groupby(sequence) if key is nan), default=0)


def longest_nan_run_2(sequence):
    print(list(list(group) for _, group in groupby(sequence, np.isnan)))
    return max((sum(1 for _ in group) for key, group in groupby(sequence, np.isnan)), default=0)

One fundamental difference (which in retrospect makes sense) is that the original function (using if key is nan) filters out everything except nan values, so all generated groups consist only of nan values, like this:

[[nan, nan, nan], [nan]]

On the other hand, the modified function also yields each run of non-nan values as a group of its own, like this:

[[nan, nan, nan], [0.16, 1.0, 0.16, 0.99990000000000001, 0.0001, 0.16, 0.10100000000000001], [nan], [0.16]]

This explains why the modified function returns 7 in both cases: it classifies each value as either "nan" or "not nan" and returns the length of the longest contiguous run of either kind.

This also means that I was wrong in my assumptions about how groupby(sequence, keyfunc) works, and that the modified function is not a viable alternative to the original.

I'm still not sure about the difference in results when running the original function on lists and arrays, though.

Item access in numpy arrays behaves differently than in lists:

nan_list[0] == nan_list[1]
# False
nan_list[0] is nan_list[1]
# True

nan_array[0] == nan_array[1]
# False
nan_array[0] is nan_array[1]
# False

x = np.array([1])
x[0] == x[0]
# True
x[0] is x[0]
# False

While the list contains references to the same object, numpy arrays 'contain' only a region of memory and create new Python objects on the fly each time an element is accessed. (Thank you user2357112, for pointing out the inaccuracy in phrasing.)

Makes sense, right? Same object returned by the list, different objects returned by the array - obviously groupby internally uses is for comparison... But wait, it's not that easy! Why does groupby(np.array([1, 1, 1, 2, 3])) work correctly?

The answer is buried in the itertools C source: line 90 shows that the function PyObject_RichCompareBool is used for comparing two keys.

rcmp = PyObject_RichCompareBool(gbo->tgtkey, gbo->currkey, Py_EQ);

Although this is basically equivalent to using == in Python, the docs note one special case:

Note: If o1 and o2 are the same object, PyObject_RichCompareBool() will always return 1 for Py_EQ and 0 for Py_NE.

This means that the comparison actually performed is (equivalent code):

if o1 is o2:
    return True
else:
    return o1 == o2

So for lists, every nan is the same object, which is therefore identified as equal to itself. In contrast, arrays give us different objects with value nan, which are compared with ==, and nan == nan always evaluates to False.
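The identity shortcut described above can be reproduced without numpy. The following is only a minimal sketch: rich_compare_bool_eq is a hypothetical pure-Python stand-in for PyObject_RichCompareBool with Py_EQ (it ignores error handling), and the nan objects are created with float("nan") rather than taken from numpy.

```python
from itertools import groupby


def rich_compare_bool_eq(o1, o2):
    """Hypothetical stand-in for PyObject_RichCompareBool(o1, o2, Py_EQ)."""
    # Identity is checked first; == is only consulted for distinct objects.
    return o1 is o2 or bool(o1 == o2)


nan = float("nan")
other_nan = float("nan")  # a second, separate float object holding nan

same = rich_compare_bool_eq(nan, nan)             # identity shortcut applies
different = rich_compare_bool_eq(nan, other_nan)  # falls through to ==

# One shared nan object: groupby sees "equal" keys and builds a single group.
groups_shared = [len(list(g)) for _, g in groupby([nan, nan, nan])]

# Three separate nan objects: every key compares unequal to the previous one.
groups_distinct = [len(list(g)) for _, g in groupby([float("nan") for _ in range(3)])]

print(same, different, groups_shared, groups_distinct)
```

The two groupby calls differ only in whether the nan values are one object or several, which mirrors the list versus array difference in the question.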

Alright, I think I've painted a clear enough picture for myself of what's going on.

There are two factors at play here:

  • My own misunderstanding of what the keyfunc argument does for groupby.
  • The (much more interesting) story of how Python represents nan values within arrays and lists, which is best explained in this answer.

Explaining the keyfunc factor

From the documentation on groupby:

It generates a break or new group every time the value of the key function changes

From the documentation on np.isnan:

For scalar input, the result is a new boolean with value True if the input is NaN; otherwise the value is False.

Based on these two things, we deduce that when we set keyfunc to np.isnan, each element in the sequence passed to groupby is mapped to either True or False, depending on whether it is a nan or not. This means that the key only changes at the boundary between nan elements and non-nan elements, and therefore that groupby only splits the sequence into contiguous blocks of nan and non-nan elements.

In contrast, the original function (which used groupby(sequence) if key is nan) uses the identity function for keyfunc (its default value). This naturally leads into the nuances of nan identity, which are explained below (and in the linked answer above), but the important point here is that the if key is nan condition filters out all groups keyed on non-nan elements.
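To make the keys visible, here is a small sketch that lists each group's key next to its length for the list from the question. It assumes math.isnan as a numpy-free stand-in for np.isnan, which gives the same truth value for plain Python scalars.

```python
from itertools import groupby
from math import isnan  # stand-in for np.isnan (assumption: scalar inputs only)

nan = float("nan")
sequence = [nan, nan, nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, nan, 0.16]

# Each group is keyed by the boolean returned from the key function, so a new
# group only starts where the sequence switches between nan and non-nan.
runs = [(key, sum(1 for _ in group)) for key, group in groupby(sequence, isnan)]

print(runs)
```

The run of seven non-nan values shows up with key False, which is the group that max ends up picking in the modified function.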

Explaining nuances in nan identity

As better explained in the answer I linked above, all occurrences of nan in the Python list are one and the same object: the list was built from the single nan imported from numpy, so every occurrence points to the same place in memory. In contrast, when using numpy arrays the nan elements are created on the fly on each access, and so are all separate objects.

This is demonstrated using the code below:

def longest_nan_run(sequence):
    print(id(nan))
    print([id(x) for x in sequence])
    return max((sum(1 for _ in group) for key, group in groupby(sequence) if key is nan), default=0)

When I run this using the list defined in the original question, I obtain this output (the first line is id(nan); list elements with that same id are the same object):

4436731128
[4436731128, 4436731128, 4436731128, 4436730432, 4435753536, 4436730432, 4436730192, 4436730048, 4436730432, 4436730552, 4436731128, 4436730432]

On the other hand, array elements seem to be handled in memory very differently:

4343850232
[4357386696, 4357386720, 4357386696, 4357386720, 4357386696, 4357386720, 4357386696, 4357386720, 4357386696, 4357386720, 4357386696, 4357386720]

Element access seems to alternate between two separate locations in memory for these temporary objects. Notice that none of the elements are identical to the nan used in the filter condition.
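The same effect can be seen with the standard library alone. The sketch below is an analogue, not numpy itself: it assumes array.array("d", ...) as a stand-in for a numpy array, since it also stores raw C doubles and builds a fresh Python float on every element access.

```python
from array import array

nan = float("nan")
nan_list = [nan, nan, nan]
packed = array("d", nan_list)  # raw C doubles, no references to the nan object

# The list hands back the very object it was built from.
list_hits = sum(1 for x in nan_list if x is nan)

# The packed array creates a new float per element, so identity never matches.
packed_hits = sum(1 for x in packed if x is nan)

print(list_hits, packed_hits)
```

Every element of packed is still a nan by value; only the identity check fails, which is exactly what breaks the if key is nan filter.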

Case Studies

We can now apply all the information we've gathered to the four separate cases used in the experiment to explain our observations.

Original function with lists

In this case, we use the default identity function as keyfunc, and we've seen that every occurrence of nan in the list is in fact the same instance. The nan used in the filter condition if key is nan is also identical to the nan elements in the list, so groupby breaks the list at the appropriate places and only the groups containing nan are retained. This is why this variant works and we obtain the correct result of 3.

Original function with arrays

Again, we use the default identity function as keyfunc, but this time all nan occurrences - including the one in the filter condition - are different objects. This means that the filter if key is nan fails for every group. Since max is then given an empty iterable, it falls back on the default value of 0.

Modified function with lists and arrays

In both of these cases, we use np.isnan as keyfunc. This causes groupby to split the sequence into contiguous runs of nan and non-nan elements.

For the list/array we used for our experiment, the longest run of nan elements is [nan, nan, nan], which has 3 elements, and the longest run of non-nan elements is [0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101], which has 7 elements.

max selects the longer of these two runs and returns 7 in both cases.
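Putting the two findings together suggests one possible repair, sketched here under assumptions: longest_nan_run_3 is a hypothetical name, and math.isnan again stands in for np.isnan. The idea is to keep the key function but filter on the key itself, so that neither nan identity nor the non-nan runs affect the result.

```python
from itertools import groupby
from math import isnan  # stand-in for np.isnan (assumption: scalar inputs only)


def longest_nan_run_3(sequence):
    """Return the length of the longest run of nan values, or 0 if none."""
    run_lengths = (
        sum(1 for _ in group)
        for is_nan, group in groupby(sequence, isnan)
        if is_nan  # keep only the runs whose key is True, i.e. the nan runs
    )
    return max(run_lengths, default=0)


nan = float("nan")
nan_list = [nan, nan, nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, nan, 0.16]
distinct_nans = [float("nan") for _ in range(4)] + [1.0]  # four separate objects

print(longest_nan_run_3(nan_list))
print(longest_nan_run_3(distinct_nans))
```

Because the filter looks at the boolean key rather than at object identity, passing np.isnan as the key function should behave the same way on a numpy array, although that variant is not exercised here.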
