
Why does 'groupby(x, np.isnan)' behave differently to 'groupby(x) if key is nan'?

Since we're on the topic of peculiarities surrounding numpy's nan , I've discovered something that I don't understand either. I'm posting this question mainly as an extension of MSeifert's since it seems there might be a common reason for both of our observations.

Earlier on, I posted a solution that involves using itertools.groupby on a sequence containing nan values:

return max((sum(1 for _ in group) for key, group in groupby(sequence) if key is nan), default=0)

However, I saw this answer on MSeifert's question linked above which shows an alternative way I might have formulated this algorithm:

return max((sum(1 for _ in group) for key, group in groupby(sequence, np.isnan)), default=0)

Experiment

I've tested both of these variations with both lists and numpy arrays. The code and results are included below:

from itertools import groupby

from numpy import nan
import numpy as np


def longest_nan_run(sequence):
    return max((sum(1 for _ in group) for key, group in groupby(sequence) if key is nan), default=0)


def longest_nan_run_2(sequence):
    return max((sum(1 for _ in group) for key, group in groupby(sequence, np.isnan)), default=0)


if __name__ == '__main__':
    nan_list = [nan, nan, nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, nan, 0.16]
    nan_array = np.array(nan_list)

    print(longest_nan_run(nan_list))  # 3 - correct
    print(longest_nan_run_2(nan_list))  # 7 - incorrect
    print(longest_nan_run(nan_array))  # 0 - incorrect
    print(longest_nan_run_2(nan_array))  # 7 - incorrect

Analysis

  • Of all four combinations, only the check against lists using the original function works as expected.
  • The modified function (using np.isnan ) seems to work the same way for both lists and arrays.
  • The original function does not appear to find any nan values when checking arrays .

Can anyone explain these results? Again, as this question is related to MSeifert's, it's possible that an explanation of his results would explain mine too (or vice versa).


Further Investigation

To get a better picture of what's happening, I tried printing out the groups generated by groupby :

def longest_nan_run(sequence):
    print(list(list(group) for key, group in groupby(sequence) if key is nan))
    return max((sum(1 for _ in group) for key, group in groupby(sequence) if key is nan), default=0)


def longest_nan_run_2(sequence):
    print(list(list(group) for _, group in groupby(sequence, np.isnan)))
    return max((sum(1 for _ in group) for key, group in groupby(sequence, np.isnan)), default=0)

One fundamental difference (which in retrospect makes sense) is that the original function (using if key is nan ) will filter out everything except nan values, so all generated groups will consist only of nan values, like this:

[[nan, nan, nan], [nan]]

On the other hand, the modified function will group all non- nan values into their own groups, like this:

[[nan, nan, nan], [0.16, 1.0, 0.16, 0.99990000000000001, 0.0001, 0.16, 0.10100000000000001], [nan], [0.16]]

This explains why the modified function returns 7 in both cases - it's considering values as either " nan " or "not nan " and returning the longest contiguous series of either.

This also means that I was wrong about my assumptions of how groupby(sequence, keyfunc) works, and that the modified function is not a viable alternative to the original.
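For completeness, the keyfunc variant can be repaired by keeping only the groups whose key is truthy. A minimal sketch (not part of the original answers; the function name is illustrative, and it reuses the imports from the experiment above):

def longest_nan_run_fixed(sequence):
    # keep only the groups whose key (np.isnan(element)) is True, i.e. the nan runs,
    # so max only sees the lengths of contiguous nan blocks
    return max((sum(1 for _ in group)
                for key, group in groupby(sequence, np.isnan) if key),
               default=0)

Under that assumption, this variant returns 3 for both the list and the array, since identity of the nan objects never enters the picture.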

I'm still not sure about the difference in results when running the original function on lists and arrays, though.

Item access in numpy arrays behaves differently than in lists:

nan_list[0] == nan_list[1]
# False
nan_list[0] is nan_list[1]
# True

nan_array[0] == nan_array[1]
# False
nan_array[0] is nan_array[1]
# False

x = np.array([1])
x[0] == x[0]
# True
x[0] is x[0]
# False

While the list contains references to the same object, numpy arrays 'contain' only a region of memory and create new Python objects on the fly each time an element is accessed. (Thank you user2357112, for pointing out the inaccuracy in phrasing.)

Makes sense, right? Same object returned by the list, different objects returned by the array - obviously groupby internally uses is for comparison... But wait, it's not that easy! Why does groupby(np.array([1, 1, 1, 2, 3])) work correctly?
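A quick check (not from the original post, reusing the imports above) confirms that grouping plain integers in an array does behave as expected, even though every access produces a fresh scalar object:

arr = np.array([1, 1, 1, 2, 3])
print(arr[0] is arr[1])  # False - each access creates a new scalar object
print([(int(key), sum(1 for _ in group)) for key, group in groupby(arr)])
# [(1, 3), (2, 1), (3, 1)] - equal values are still grouped together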

The answer is buried in the itertools C source; line 90 shows that the function PyObject_RichCompareBool is used for comparing two keys.

rcmp = PyObject_RichCompareBool(gbo->tgtkey, gbo->currkey, Py_EQ);

Although this is basically equivalent to using == in Python, the docs note one special case:

Note If o1 and o2 are the same object, PyObject_RichCompareBool() will always return 1 for Py_EQ and 0 for Py_NE .

This means that the comparison actually performed is equivalent to this code:

if o1 is o2:
    return True
else:
    return o1 == o2

So for lists, we are comparing the same nan object with itself, which the identity shortcut treats as equal. In contrast, arrays give us distinct objects with the value nan, which are compared with == - and nan == nan always evaluates to False.
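The same identity-or-equality shortcut can be observed from pure Python, since list membership tests go through the same C helper. A small illustration (assuming CPython semantics, not taken from the original answers):

x = float('nan')
print(x == x)               # False: nan never compares equal to itself
print(x in [x])             # True: `in` short-circuits on identity before trying ==
print(float('nan') in [x])  # False: a different nan object falls back to ==, which fails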

Alright, I think I've painted a clear enough picture for myself of what's going on.

There are two factors at play here:

  • My own misunderstanding of what the keyfunc argument did for groupby .
  • The (much more interesting) story of how Python represents nan values within arrays and lists, which is best explained in this answer .

Explaining the keyfunc factor

From the documentation on groupby :

It generates a break or new group every time the value of the key function changes

From the documentation on np.isnan :

For scalar input, the result is a new boolean with value True if the input is NaN; otherwise the value is False.

Based on these two things, we can deduce that when we set keyfunc to np.isnan , each element in the sequence passed to groupby will be mapped to either True or False , depending on whether or not it is nan . This means that the key function will only change at the boundary between nan elements and non- nan elements, and therefore that groupby will only split the sequence into contiguous blocks of nan and non- nan elements.
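To make the mapping concrete, here is a small illustration (the sample values are made up for demonstration; the imports and nan are reused from the experiment above):

sample = [nan, nan, 0.5, 0.7, nan]
print([bool(np.isnan(x)) for x in sample])
# [True, True, False, False, True]
print([(bool(key), list(group)) for key, group in groupby(sample, np.isnan)])
# [(True, [nan, nan]), (False, [0.5, 0.7]), (True, [nan])]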

In contrast, the original function (which used groupby(sequence) ... if key is nan ) uses the identity function for keyfunc (its default value). This naturally leads into the nuances of nan identity, which are explained below (and in the linked answer above), but the important point here is that the if key is nan check will filter out all groups keyed on non- nan elements.
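A quick way to see which groups survive that filter is to inspect the keys that groupby produces with the default keyfunc (a sketch reusing nan_list and nan from the experiment above):

print([(key, key is nan) for key, _ in groupby(nan_list)])
# [(nan, True), (0.16, False), (1, False), (0.16, False), (0.9999, False),
#  (0.0001, False), (0.16, False), (0.101, False), (nan, True), (0.16, False)]
# only the two groups keyed on the nan object itself pass `if key is nan`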

Explaining nuances in nan identity

As better explained in the answer I linked above, all occurrences of nan in the list are references to one and the same object (the list simply stores references, and every occurrence here comes from the same nan name). In contrast, nan elements are created on the fly each time a numpy array element is accessed, so they are all separate objects.

This is demonstrated using the code below:

def longest_nan_run(sequence):
    print(id(nan))
    print([id(x) for x in sequence])
    return max((sum(1 for _ in group) for key, group in groupby(sequence) if key is nan), default=0)

When I run this using the list defined in the original question, the id printed for nan and the ids at every nan position in the list are one and the same value (shown below as <id(nan)>), while the remaining floats have their own ids:

<id(nan)>
[<id(nan)>, <id(nan)>, <id(nan)>, 4436730432, 4435753536, 4436730432, 4436730192, 4436730048, 4436730432, 4436730552, <id(nan)>, 4436730432]

On the other hand, array elements seem to be handled in memory very differently (the entries shown below as <second id> all share a single value, distinct from 4357386720):

4343850232
[<second id>, 4357386720, <second id>, 4357386720, <second id>, 4357386720, <second id>, 4357386720, <second id>, 4357386720, <second id>, 4357386720]

The element ids alternate between two separate memory locations (presumably because each freshly created scalar object reuses the memory just freed by the previous one). Notice that none of the elements are identical to the nan used in the filter condition.

Case Studies

We can now apply all this information we've gathered to the four separate cases used in the experiment to explain our observations.

Original function with lists

In this case, we use the default identity function as keyfunc , and we've seen that every occurrence of nan in the list is in fact the same instance. The nan used in the filter conditional if key is nan is also identical to the nan elements in the list, causing groupby to break the list at the appropriate places and only retain the groups containing nan . This is why this variant works and we obtain the correct result of 3 .

Original function with arrays

Again, we use the default identity function as keyfunc , but this time all nan occurrences - including the one in the filter conditional - point to different objects. This means that the conditional filter if key is nan will fail for all groups. Since we can't find the maximum of an empty collection, we fall back on the default value of 0 .
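The difference can be checked directly with the nan_list and nan_array defined in the experiment above (a quick sketch, not part of the original post):

print(nan_list[0] is nan)   # True  -> `if key is nan` keeps this group
print(nan_array[0] is nan)  # False -> every group is filtered out, so max falls back to 0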

Modified function with lists and arrays

In both of these cases, we use np.isnan as keyfunc . This will cause groupby to split the sequence into contiguous sequences of nan and non- nan elements.

For the list/array we used for our experiment, the longest sequence of nan elements is [nan, nan, nan] , which has three elements, and the longest sequence of non- nan elements is [0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101] , which has 7 elements.

max will select the longer of these two sequences and return 7 in both cases.
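The group lengths for the experiment data make this explicit (a small check reusing nan_list and the imports from above):

print([(bool(key), sum(1 for _ in group)) for key, group in groupby(nan_list, np.isnan)])
# [(True, 3), (False, 7), (True, 1), (False, 1)] -> max over all group lengths is 7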
