Why does 'groupby(x, np.isnan)' behave differently to 'groupby(x) if key is nan'?
Since we're on the topic of peculiarities surrounding numpy's nan, I've discovered something that I don't understand either. I'm posting this question mainly as an extension of MSeifert's, since it seems there might be a common reason for both of our observations.
Earlier on, I posted a solution that involves using itertools.groupby on a sequence containing nan values:
return max((sum(1 for _ in group) for key, group in groupby(sequence) if key is nan), default=0)
However, I saw this answer on MSeifert's question linked above, which shows an alternative way I might have formulated this algorithm:
return max((sum(1 for _ in group) for key, group in groupby(sequence, np.isnan)), default=0)
Experiment
I've tested both of these variations with both lists and numpy arrays. The code and results are included below:
from itertools import groupby
from numpy import nan
import numpy as np

def longest_nan_run(sequence):
    return max((sum(1 for _ in group) for key, group in groupby(sequence) if key is nan), default=0)

def longest_nan_run_2(sequence):
    return max((sum(1 for _ in group) for key, group in groupby(sequence, np.isnan)), default=0)

if __name__ == '__main__':
    nan_list = [nan, nan, nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, nan, 0.16]
    nan_array = np.array(nan_list)
    print(longest_nan_run(nan_list))     # 3 - correct
    print(longest_nan_run_2(nan_list))   # 7 - incorrect
    print(longest_nan_run(nan_array))    # 0 - incorrect
    print(longest_nan_run_2(nan_array))  # 7 - incorrect
Analysis
- The modified function (using np.isnan) seems to work the same way for both lists and arrays.
- The original function (using if key is nan) does not seem to find any nan values when checking arrays.
Can anyone explain these results? Again, as this question is related to MSeifert's, it's possible that an explanation of his results would explain mine too (or vice versa).
Further Investigation
To get a better picture of what's happening, I tried printing out the groups generated by groupby:
def longest_nan_run(sequence):
    print(list(list(group) for key, group in groupby(sequence) if key is nan))
    return max((sum(1 for _ in group) for key, group in groupby(sequence) if key is nan), default=0)

def longest_nan_run_2(sequence):
    print(list(list(group) for _, group in groupby(sequence, np.isnan)))
    return max((sum(1 for _ in group) for key, group in groupby(sequence, np.isnan)), default=0)
One fundamental difference (which in retrospect makes sense) is that the original function (using if key is nan) will filter out everything except nan values, so all generated groups will consist only of nan values, like this:
[[nan, nan, nan], [nan]]
On the other hand, the modified function will also keep the runs of non-nan values as groups of their own, like this:
[[nan, nan, nan], [0.16, 1.0, 0.16, 0.99990000000000001, 0.0001, 0.16, 0.10100000000000001], [nan], [0.16]]
This explains why the modified function returns 7 in both cases: it's considering values as either "nan" or "not nan" and returning the length of the longest contiguous series of either.
This also means that I was wrong about my assumptions of how groupby(sequence, keyfunc) works, and that the modified function is not a viable alternative to the original.
I'm still not sure about the difference in results when running the original function on lists and arrays, though.
Item access in numpy arrays behaves differently than in lists:
nan_list[0] == nan_list[1]
# False
nan_list[0] is nan_list[1]
# True
nan_array[0] == nan_array[1]
# False
nan_array[0] is nan_array[1]
# False
x = np.array([1])
x[0] == x[0]
# True
x[0] is x[0]
# False
While the list contains references to the same object, numpy arrays 'contain' only a region of memory and create new Python objects on the fly each time an element is accessed. (Thank you user2357112, for pointing out the inaccuracy in phrasing.)
Makes sense, right? Same object returned by the list, different objects returned by the array - obviously groupby internally uses is for comparison... But wait, it's not that easy! Why does groupby(np.array([1, 1, 1, 2, 3])) work correctly?
The answer is buried in the itertools C source: line 90 shows that the function PyObject_RichCompareBool is used for comparing two keys.
rcmp = PyObject_RichCompareBool(gbo->tgtkey, gbo->currkey, Py_EQ);
Although this is basically equivalent to using == in Python, the docs note one special case:
Note: If o1 and o2 are the same object, PyObject_RichCompareBool() will always return 1 for Py_EQ and 0 for Py_NE.
This means that the comparison actually performed is (equivalent code):
if o1 is o2:
    return True
else:
    return o1 == o2
So for lists, we get the same nan object each time, which is identified as equal to itself. In contrast, arrays give us different objects with value nan, which are compared with ==, but nan == nan always evaluates to False.
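The identity shortcut can be observed from plain Python. The following is a minimal sketch and not part of the original answer: the names shared, same_object and distinct are only illustrative, and it assumes CPython, where each float('nan') call creates a new float object.

```python
from itertools import groupby

shared = float('nan')
same_object = [shared, shared, shared]                  # one nan object, three references
distinct = [float('nan'), float('nan'), float('nan')]   # three separate nan objects (assumes CPython)

# == is False for nan, even when an object is compared with itself
print(shared == shared)

# groupby compares keys by identity first, then falls back to ==
same_groups = [len(list(group)) for _, group in groupby(same_object)]
distinct_groups = [len(list(group)) for _, group in groupby(distinct)]
print(same_groups)      # [3]
print(distinct_groups)  # [1, 1, 1]
```

The references to one shared object end up in a single group, while the separate nan objects each start a new group.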
Alright, I think I've painted a clear enough picture for myself of what's going on.
There are two factors at play here:
- My own misunderstanding of what the keyfunc argument did for groupby.
- The (very interesting) story of nan values within arrays and lists, which is best explained in this answer.
Explaining the keyfunc factor
From the documentation on groupby:
It generates a break or new group every time the value of the key function changes
From the documentation on np.isnan:
For scalar input, the result is a new boolean with value True if the input is NaN; otherwise the value is False.
Based on these two things, we deduce that when we set keyfunc to np.isnan, each element in the sequence passed to groupby will be mapped to either True or False, depending on whether it is a nan or not. This means that the value of the key function will only change at the boundary between nan elements and non-nan elements, and therefore that groupby will only split the sequence into contiguous blocks of nan and non-nan elements.
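To make the mapping concrete, here is a small sketch (not from the original post) that prints each key together with the length of its group for the list used in the experiment. The name runs is only illustrative.

```python
from itertools import groupby
import numpy as np
from numpy import nan

nan_list = [nan, nan, nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, nan, 0.16]

# Each group is a contiguous block that shares the same np.isnan result
runs = [(bool(key), len(list(group))) for key, group in groupby(nan_list, np.isnan)]
print(runs)  # [(True, 3), (False, 7), (True, 1), (False, 1)]
```

The block keyed on False has 7 elements, which is the value that max picks up in the modified function.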
In contrast, the original function (which used groupby(sequence) if key is nan) will use the identity function for keyfunc (its default value). This naturally leads into the nuances of nan identity, which are explained below (and in the linked answer above), but the important point here is that the if key is nan filter will discard all groups keyed on non-nan elements.
Explaining nuances in nan identity
As better explained in the answer I linked above, all instances of nan that occur within the Python list seem to be one and the same instance (here, the single float object imported from numpy as nan). In other words, all occurrences of nan in the list refer to the same object in memory. In contrast to this, nan elements are generated on the fly when using numpy arrays, and so they are all separate objects.
This is demonstrated using the code below:
def longest_nan_run(sequence):
    print(id(nan))
    print([id(x) for x in sequence])
    return max((sum(1 for _ in group) for key, group in groupby(sequence) if key is nan), default=0)
When I run this using the list defined in the original question, I obtain this output (identical elements are highlighted):
4436731128
[4436731128, 4436731128, 4436731128, 4436730432, 4435753536, 4436730432, 4436730192, 4436730048, 4436730432, 4436730552, 4436731128, 4436730432]
On the other hand, array elements seem to be handled in memory very differently:
4343850232
[4357386696, 4357386720, 4357386696, 4357386720, 4357386696, 4357386720, 4357386696, 4357386720, 4357386696, 4357386720, 4357386696, 4357386720]
Element access seems to alternate between two separate locations in memory for storing these values. Notice that none of the elements are identical to the nan used in the filter condition.
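The same observation can be checked without reading raw ids. This is a minimal sketch under the setup of the experiment above; list_hits and array_hits are illustrative names, not part of the original code.

```python
import numpy as np
from numpy import nan

nan_list = [nan, nan, nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, nan, 0.16]
nan_array = np.array(nan_list)

# Count the elements that are the very same object as the imported nan
list_hits = sum(1 for x in nan_list if x is nan)
array_hits = sum(1 for x in nan_array if x is nan)
print(list_hits, array_hits)  # 4 0

# Elements read from the array are freshly created numpy scalars
print(type(nan_array[0]).__name__)  # float64
```

The list holds four references to the imported nan object, whereas no element read from the array is that object.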
We can now apply all the information we've gathered to the four separate cases used in the experiment to explain our observations.
Original function with lists
In this case, we use the default identity function as keyfunc, and we've seen that every occurrence of nan in the list is in fact the same instance. The nan used in the filter conditional if key is nan is also identical to the nan elements in the list, causing groupby to break the list at the appropriate places and only retain the groups containing nan. This is why this variant works and we obtain the correct result of 3.
Original function with arrays
Again, we use the default identity function as keyfunc, but this time all nan occurrences, including the one in the filter conditional, point to different objects. This means that the conditional filter if key is nan will fail for all groups. Since we can't find the maximum of an empty collection, we fall back on the default value of 0.
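As a rough illustration of this case (a sketch, not from the original post), the snippet below lists the group sizes that groupby produces for the array, and then tries a hypothetical variant, longest_nan_run_isnan_filter, that swaps the identity test for np.isnan(key) in the filter while keeping the default key function.

```python
from itertools import groupby
import numpy as np
from numpy import nan

nan_list = [nan, nan, nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, nan, 0.16]
nan_array = np.array(nan_list)

# With the default key, every nan read from the array starts a new group
group_sizes = [len(list(group)) for _, group in groupby(nan_array)]
print(group_sizes)  # twelve groups of one element each

def longest_nan_run_isnan_filter(sequence):
    # Hypothetical variant: default key function, np.isnan only in the filter
    return max((sum(1 for _ in group) for key, group in groupby(sequence) if np.isnan(key)), default=0)

print(longest_nan_run_isnan_filter(nan_list))   # 3
print(longest_nan_run_isnan_filter(nan_array))  # 1
```

So changing only the filter is not enough for arrays: the nan groups are now found, but consecutive nan values have already been split into single-element groups.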
Modified function with lists and arrays
In both of these cases, we use np.isnan as keyfunc. This will cause groupby to split the sequence into contiguous runs of nan and non-nan elements.
For the list/array we used for our experiment, the longest run of nan elements is [nan, nan, nan], which has 3 elements, and the longest run of non-nan elements is [0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101], which has 7 elements.
max will select the longer of these two runs and return 7 in both cases.
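Putting the two explanations together, one possible way to get consistent results for both lists and arrays is to keep np.isnan as the key function and additionally filter on the key itself. This is only a sketch based on the reasoning above, not something proposed in the original posts, and longest_nan_run_3 is a hypothetical name.

```python
from itertools import groupby
import numpy as np
from numpy import nan

def longest_nan_run_3(sequence):
    # The key is True for runs of nan and False otherwise; keep only the True runs
    return max((sum(1 for _ in group) for key, group in groupby(sequence, np.isnan) if key), default=0)

nan_list = [nan, nan, nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, nan, 0.16]
nan_array = np.array(nan_list)

print(longest_nan_run_3(nan_list))   # 3
print(longest_nan_run_3(nan_array))  # 3
```

Because the key is a boolean computed from the value, this variant relies neither on object identity nor on nan == nan.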