
Python modified groupby ngroup in cuDF with list comprehension

I am trying to write a function that does something similar to pandas's groupby().ngroup() method. The difference is that I want the numbering to restart at 0 within each subgroup. So given the following data:

| EVENT_1 | EVENT_2 |
| ------- | ------- |
|       0 |       3 | 
|       0 |       3 |
|       0 |       3 |
|       0 |       5 |
|       0 |       5 |
|       0 |       5 |
|       0 |       9 |
|       0 |       9 |
|       1 |       6 |
|       1 |       6 |

I want

| EVENT_1 | EVENT_2 | EVENT_2A |
| ------- | ------- | -------- |
|       0 |       3 |        0 |
|       0 |       3 |        0 |
|       0 |       3 |        0 |
|       0 |       5 |        1 |
|       0 |       5 |        1 |
|       0 |       5 |        1 |
|       0 |       9 |        2 |
|       0 |       9 |        2 |
|       1 |       6 |        0 |
|       1 |       6 |        0 |

The best way I can think of implementing this is by doing a groupby() on EVENT_1, within each group getting the unique values of EVENT_2, and then setting EVENT_2A as the index of the unique value. For example, in the EVENT_1 == 0 group, the unique values are [3, 5, 9] and then we set EVENT_2A to the index in the unique values list for the corresponding value in EVENT_2.
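For reference, this desired behaviour can be expressed on the CPU with plain pandas: `factorize` numbers each group's unique values in order of appearance, which is exactly the mapping described above. This is only an illustrative sketch of the target output, not GPU code:

```python
import pandas as pd

df = pd.DataFrame({
    "EVENT_1": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    "EVENT_2": [3, 3, 3, 5, 5, 5, 9, 9, 6, 6],
})

# pd.factorize(s)[0] assigns 0, 1, 2, ... to unique values in order
# of appearance; applying it per EVENT_1 group restarts the count at 0
df["EVENT_2A"] = df.groupby("EVENT_1")["EVENT_2"].transform(
    lambda s: pd.factorize(s)[0]
)
print(df["EVENT_2A"].tolist())  # [0, 0, 0, 1, 1, 1, 2, 2, 0, 0]
```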

The code I have written is here. Note that EVENT_2 is always sorted with respect to EVENT_1 so finding the unique values like this in O(n) should work.

import cudf
from numba import cuda
import numpy as np

def count(EVENT_2, EVENT_2A):
    # Get unique values of EVENT_2
    uq = [EVENT_2[0]] + [x for i, x in enumerate(EVENT_2) if i > 0 and EVENT_2[i-1] != x]

    for i in range(cuda.threadIdx.x, len(EVENT_2), cuda.blockDim.x):
        # Get corresponding index for each value. This can probably be sped up by mapping 
        # values to indices
        for j, v in enumerate(uq):
            if v == EVENT_2[i]:
                EVENT_2A[i] = j
                break


if __name__ == "__main__":
    data = {
        "EVENT_1":[0,0,0,0,0,0,0,0,1,1],
        "EVENT_2":[3,3,3,5,5,5,9,9,6,6]
    }
    df = cudf.DataFrame(data)
    results = df.groupby(["EVENT_1"], method="cudf").apply_grouped(
        count, 
        incols=["EVENT_2"], 
        outcols={"EVENT_2A":np.int64}
    )
    print(results.sort_index())

The problem is that there seems to be an error related to using lists in the user-defined function count(). Numba's documentation says its nopython-mode JIT compiler can handle list comprehensions, and indeed when I use the function

from numba import jit

@jit(nopython=True)
def uq_sorted(my_list):
    return [my_list[0]] + [x for i, x in enumerate(my_list) if i > 0 and my_list[i-1] != x]

it works, although with a deprecation warning.

The error I get with cuDF is:

No implementation of function Function(<numba.cuda.compiler.DeviceFunctionTemplate object at 0x7f782a179fa0>) found for signature:
 
 >>> count <CUDA device function>(array(int64, 1d, C), array(int64, 1d, C))
 
There are 2 candidate implementations:
  - Of which 2 did not match due to:
  Overload in function 'count <CUDA device function>': File: ../../../../test.py: Line 11.
    With argument(s): '(array(int64, 1d, C), array(int64, 1d, C))':
   Rejected as the implementation raised a specific error:
     TypingError: Failed in nopython mode pipeline (step: nopython frontend)
   Unknown attribute 'append' of type list(undefined)<iv=None>
   
   File "test.py", line 12:
   def count(EVENT_2, EVENT_2A):
       uq = [EVENT_2[0]] + [x for i, x in enumerate(EVENT_2) if i > 0 and EVENT_2[i-1] != x]
       ^
   
   During: typing of get attribute at test.py (12)
   
   File "test.py", line 12:
   def count(EVENT_2, EVENT_2A):
       uq = [EVENT_2[0]] + [x for i, x in enumerate(EVENT_2) if i > 0 and EVENT_2[i-1] != x]
       ^

  raised from /project/conda_env/lib/python3.8/site-packages/numba/core/typeinfer.py:1071

During: resolving callee type: Function(<numba.cuda.compiler.DeviceFunctionTemplate object at 0x7f782a179fa0>)
During: typing of call at <string> (10)


File "<string>", line 10:
<source missing, REPL/exec in use?>

Is this related to the deprecation warning from numba? Even when I define uq as a literal list I still get an error. Any solutions to the list comprehension issue, or to my problem as a whole, are welcome. Thanks.
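One way to sidestep the list-comprehension limitation entirely is to avoid building a list of unique values at all: because EVENT_2 is sorted within each group, each row's code equals the number of value-change boundaries at or before its position. A NumPy sketch of that idea (plain CPU code illustrating the logic, not a CUDA device function):

```python
import numpy as np

def restart_codes(event_2):
    """Per-group code for a sorted 1-D sequence: cumulative count of
    positions where the value differs from its predecessor."""
    a = np.asarray(event_2)
    # 1 wherever the value changes relative to the previous element
    boundaries = np.concatenate(([0], (a[1:] != a[:-1]).astype(np.int64)))
    return np.cumsum(boundaries)

print(restart_codes([3, 3, 3, 5, 5, 5, 9, 9]).tolist())  # [0, 0, 0, 1, 1, 1, 2, 2]
```

Inside a numba-CUDA device function the same count could be kept with a plain integer accumulator in a loop, with no list construction required.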

Shout out to RAPIDS community member Inzamam who made this elegant solution.

Let's solve the problem as a whole. You don't need groupby, and you shouldn't manipulate the dataframe directly with for loops: that breaks encapsulation and parallelization, losing the benefits of GPU computing. Instead, use the dataframe API's data structures as they were intended. Here is an example.

import cudf
import numpy as np #only to create a really large array to test scale

### Your Original data
# data = { 
#         "EVENT_1":[0,0,0,0,0,0,0,0,1,1],
#         "EVENT_2":[3,3,3,5,5,5,9,9,6,6]
#     }

### your data at scale (10,000,000 rows)    
data = {
    "EVENT_1":np.random.default_rng().integers(0,10,10000000),
    "EVENT_2":np.random.default_rng().integers(12,20,10000000)
}
df = cudf.DataFrame(data)


def ngroup_test(df, col1, col2, col3):
    # Build a composite "EVENT_1,EVENT_2" key for every row
    df[col3] = df[col1].astype(str) + ',' + df[col2].astype(str)
    mapping = {}     # composite key -> per-group code
    d = {}           # group value -> {value: code}
    last_index = {}  # group value -> next code to assign
    for marker in df[col3].unique().to_array():
        first, second = marker.split(',')
        if first not in d:
            # First value seen in this group gets code 0
            d[first] = {second: 0}
            last_index[first] = 1
        elif second not in d[first]:
            # A new value within a known group gets the next code
            d[first][second] = last_index[first]
            last_index[first] += 1
        mapping[marker] = d[first][second]

    # Replace each row's composite key with its per-group code
    df[col3] = [mapping[x] for x in df[col3].to_array()]
    return df

df1 = ngroup_test(df, 'EVENT_1', 'EVENT_2', 'EVENT_2A')
df1
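An alternative that stays entirely inside the groupby API: number the (EVENT_1, EVENT_2) pairs globally with ngroup(), then subtract each EVENT_1 group's minimum code. Since the question states EVENT_2 is sorted within each group, this matches the order-of-appearance numbering. A pandas sketch; cuDF aims for the same groupby API, but whether ngroup() and transform() are available depends on the cuDF version:

```python
import pandas as pd

df = pd.DataFrame({
    "EVENT_1": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    "EVENT_2": [3, 3, 3, 5, 5, 5, 9, 9, 6, 6],
})

# Global code for each (EVENT_1, EVENT_2) pair, in sorted order...
codes = df.groupby(["EVENT_1", "EVENT_2"], sort=True).ngroup()
# ...then shift so each EVENT_1 group's codes start at 0
df["EVENT_2A"] = codes - codes.groupby(df["EVENT_1"]).transform("min")
print(df["EVENT_2A"].tolist())  # [0, 0, 0, 1, 1, 1, 2, 2, 0, 0]
```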
