Since we're on the topic of peculiarities surrounding numpy's nan
, I've discovered something that I don't understand either. I'm posting this question mainly as an extension of MSeifert's since it seems there might be a common reason for both of our observations.
Earlier on, I posted a solution that involves using itertools.groupby
on a sequence containing nan
values:
return max((sum(1 for _ in group) for key, group in groupby(sequence) if key is nan), default=0)
However, I saw this answer on MSeifert's question linked above which shows an alternative way I might have formulated this algorithm:
return max((sum(1 for _ in group) for key, group in groupby(sequence, np.isnan)), default=0)
Experiment
I've tested both of these variations with both lists and numpy arrays. The code and results are included below:
from itertools import groupby
from numpy import nan
import numpy as np

def longest_nan_run(sequence):
    # Default key: group equal neighbours, then keep only groups whose key is the nan object
    return max((sum(1 for _ in group) for key, group in groupby(sequence) if key is nan), default=0)

def longest_nan_run_2(sequence):
    # np.isnan as key: group by "is nan" / "is not nan", with no filtering
    return max((sum(1 for _ in group) for key, group in groupby(sequence, np.isnan)), default=0)

if __name__ == '__main__':
    nan_list = [nan, nan, nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, nan, 0.16]
    nan_array = np.array(nan_list)
    print(longest_nan_run(nan_list))     # 3 - correct
    print(longest_nan_run_2(nan_list))   # 7 - incorrect
    print(longest_nan_run(nan_array))    # 0 - incorrect
    print(longest_nan_run_2(nan_array))  # 7 - incorrect
Analysis
The modified function (the one using np.isnan) seems to work the same way for both lists and arrays, returning 7 in each case.
The original function, by contrast, does not seem to find any nan values when checking arrays.
Can anyone explain these results? Again, as this question is related to MSeifert's, it's possible that an explanation of his results would explain mine too (or vice versa).
Further Investigation
To get a better picture of what's happening, I tried printing out the groups generated by groupby
:
def longest_nan_run(sequence):
    print(list(list(group) for key, group in groupby(sequence) if key is nan))
    return max((sum(1 for _ in group) for key, group in groupby(sequence) if key is nan), default=0)

def longest_nan_run_2(sequence):
    print(list(list(group) for _, group in groupby(sequence, np.isnan)))
    return max((sum(1 for _ in group) for key, group in groupby(sequence, np.isnan)), default=0)
One fundamental difference (which in retrospect makes sense) is that the original function (using if key is nan
) will filter out everything except nan
values, so all generated groups will consist only of nan
values, like this:
[[nan, nan, nan], [nan]]
On the other hand, the modified function will group all non- nan
values into their own groups, like this:
[[nan, nan, nan], [0.16, 1.0, 0.16, 0.99990000000000001, 0.0001, 0.16, 0.10100000000000001], [nan], [0.16]]
This explains why the modified function returns 7
in both cases - it's considering values as either " nan
" or "not nan
" and returning the longest contiguous series of either.
This also means that I was wrong about my assumptions of how groupby(sequence, keyfunc)
works, and that the modified function is not a viable alternative to the original.
I'm still not sure about the difference in results when running the original function on lists and arrays, though.
Item access in numpy arrays behaves differently than in lists:
nan_list[0] == nan_list[1]
# False
nan_list[0] is nan_list[1]
# True
nan_array[0] == nan_array[1]
# False
nan_array[0] is nan_array[1]
# False
x = np.array([1])
x[0] == x[0]
# True
x[0] is x[0]
# False
While the list contains references to the same object, numpy arrays 'contain' only a region of memory and create new Python objects on the fly each time an element is accessed. (Thank you user2357112, for pointing out the inaccuracy in phrasing.)
Makes sense, right? Same object returned by the list, different objects returned by the array - obviously groupby
internally uses is
for comparison... But wait, it's not that easy! Why does groupby(np.array([1, 1, 1, 2, 3]))
work correctly?
The answer is buried in the itertools C source: line 90 shows that the function PyObject_RichCompareBool
is used for comparing two keys.
rcmp = PyObject_RichCompareBool(gbo->tgtkey, gbo->currkey, Py_EQ);
Although this is basically equivalent to using ==
in Python, the docs note one special case:
Note: If o1 and o2 are the same object, PyObject_RichCompareBool() will always return 1 for Py_EQ and 0 for Py_NE.
This means that actually this comparison is performed (equivalent code):
if o1 is o2:
    return True
else:
    return o1 == o2
So for lists, every element access gives us the same nan
object, which is identified as equal to itself. In contrast, arrays give us different objects with value nan
, which are compared with ==
- but nan == nan
always evaluates as False
.
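The identity shortcut can be reproduced without numpy. The following is a minimal sketch, assuming that freshly created float("nan") objects are a fair stand-in for the separate objects an array hands out on each access; the helper eq_with_identity_shortcut is a hypothetical name that only mirrors the equivalent code shown above.

```python
from itertools import groupby

def eq_with_identity_shortcut(a, b):
    # Hypothetical helper: same logic as the equivalent code above
    return a is b or a == b

shared = float("nan")

# One nan object referenced three times: groupby sees identical keys
same_object_runs = [len(list(g)) for _, g in groupby([shared, shared, shared])]

# Three separate nan objects: identity fails, and nan == nan is False
distinct_runs = [len(list(g)) for _, g in groupby([float("nan") for _ in range(3)])]

print(same_object_runs)  # [3]
print(distinct_runs)     # [1, 1, 1]
```

The first list behaves like the Python list in the question, the second like the array: every nan ends up in a group of its own.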
Alright, I think I've painted a clear enough picture for myself of what's going on.
There are two factors at play here:
1. My misunderstanding of what the keyfunc argument did for groupby.
2. The handling of nan values within arrays and lists, which is best explained in this answer.
Explaining the keyfunc factor
From the documentation on groupby
:
It generates a break or new group every time the value of the key function changes
From the documentation on np.isnan
:
For scalar input, the result is a new boolean with value True if the input is NaN; otherwise the value is False.
Based on these two things, we deduce that when we set keyfunc
as np.isnan
, each element in the sequence passed to groupby
will be mapped to either True
or False
, depending on whether it is a nan
or not. This means that the key function will only change at the boundary between nan
elements and non- nan
elements, and therefore that groupby
will only split the sequence into contiguous blocks of nan
and non- nan
elements.
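To make the mapping to True and False visible, here is a small sketch that prints each key together with the length of its group. It uses math.isnan from the standard library in place of np.isnan (an assumption made only to keep the example free of dependencies); the grouping is the same for this data.

```python
from itertools import groupby
from math import isnan

nan = float("nan")
sequence = [nan, nan, nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, nan, 0.16]

# Each group is keyed by the boolean returned from the key function
runs = [(key, sum(1 for _ in group)) for key, group in groupby(sequence, key=isnan)]
print(runs)  # [(True, 3), (False, 7), (True, 1), (False, 1)]
```

Taking max over these lengths without looking at the key is what produces 7.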
In contrast, the original function (which used groupby(sequence) if key is nan
) will use the identity function for keyfunc
(its default value). This naturally leads into the nuances of nan
identity which is explained below (and in the linked answer above), but the important point here is that the if key is nan
will filter out all groups keyed on non- nan
elements.
Explaining nuances in nan
identity
As better explained in the answer I linked above, all instances of nan that occur within the list in this experiment are one and the same instance, because the list only stores references to the single nan object imported from numpy. In other words, all occurrences of nan in the list point to the same place in memory. In contrast to this, nan elements are generated on the fly each time a numpy array is accessed, and so are all separate objects.
This is demonstrated using the code below:
def longest_nan_run(sequence):
print(id(nan))
print([id(x) for x in sequence])
return max((sum(1 for _ in group) for key, group in groupby(sequence) if key is nan), default=0)
When I run this using the list defined in the original question, I obtain this output (identical elements are highlighted):
[, , , 4436730432, 4435753536, 4436730432, 4436730192, 4436730048, 4436730432, 4436730552, , 4436730432]
On the other hand, array elements seem to be handled in memory very differently:
4343850232 [, 4357386720, , 4357386720, , 4357386720, , 4357386720, ,, 4357386720, , 4357386720]
The function seems to alternate between two separate locations in memory for storing these values. Notice that none of the elements are identical to the nan
used in the filter condition.
We can now apply all this information we've gathered to the four separate cases used in the experiment to explain our observations.
Original function with lists
In this case, we use the default identity function as keyfunc, and we've seen that every occurrence of nan in the list is in fact the same instance. Because identical objects are treated as equal, groupby keeps consecutive nan elements together in one group. The nan used in the filter conditional if key is nan is that same object, so only the groups containing nan are retained. This is why this variant works and we obtain the correct result of 3.
Original function with arrays
Again, we use the default identity
function as keyfunc
, but this time all nan
occurrences - including the one in the filter conditional - point to different objects. This means that the conditional filter if key is nan
will fail for all groups. Since we can't find the maximum of an empty collection, we fall back on the default value of 0
.
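The list and array cases can be contrasted without numpy. This is a sketch under one assumption: a list of separately created float("nan") objects (named fresh here, a hypothetical name) imitates the new objects an array produces on each access, while shared reuses a single nan object as in the original list.

```python
from itertools import groupby

nan = float("nan")

def longest_nan_run(sequence):
    # Original function: default key, filter on identity with the module-level nan
    return max((sum(1 for _ in group) for key, group in groupby(sequence) if key is nan), default=0)

shared = [nan, nan, nan, 0.16, nan]                                  # same object reused
fresh = [float("nan"), float("nan"), float("nan"), 0.16, float("nan")]  # new object each time

shared_result = longest_nan_run(shared)
fresh_result = longest_nan_run(fresh)
print(shared_result)  # 3
print(fresh_result)   # 0
```

No key in fresh is the module-level nan, so every group is filtered out and max falls back to its default.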
Modified function with lists and arrays
In both of these cases, we use np.isnan
as keyfunc
. This will cause groupby
to split the sequence into contiguous sequences of nan
and non- nan
elements.
For the list/array we used for our experiment, the longest sequence of nan
elements is [nan, nan, nan]
, which has three elements, and the longest sequence of non- nan
elements is [0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101]
, which has 7 elements.
max
will select the longer of these two sequences and return 7
in both cases.
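Putting the two factors together suggests one possible repair: keep the key function, but retain only the groups whose key is true. The sketch below is not taken from any of the answers; the name longest_nan_run_3 is hypothetical, and math.isnan is used instead of np.isnan so the example runs with the standard library alone.

```python
from itertools import groupby
from math import isnan

def longest_nan_run_3(sequence):
    # Group by "is nan" and count only the groups whose key is True
    return max(
        (sum(1 for _ in group) for is_nan, group in groupby(sequence, key=isnan) if is_nan),
        default=0,
    )

nan = float("nan")
nan_list = [nan, nan, nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, nan, 0.16]
print(longest_nan_run_3(nan_list))  # 3
```

Because the key is a boolean rather than the element itself, this version does not depend on whether the nan values are the same object.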