
pandas groupby count values over dynamic threshold

I have a very large dataframe with two columns, name and value. I also have a dictionary of thresholds. Example:

import numpy as np
import pandas as pd

names = ['A', 'B', 'C']
df = pd.DataFrame(columns=['key', 'val'], index=range(10))
for i in range(10):
    val = np.random.randint(10)
    k = np.random.choice(names)
    df.loc[i] = [k, val]

thrsholds = {'A': 3, 'B': 5}

I want to get, for each key, a count of how many of its values exceed that key's threshold.

For example for the following df:

[example dataframe shown as an image in the original post]

The result should be: {'C': 0, 'B': 2, 'A': 3}
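
The screenshot is not reproduced here; the following is a hypothetical dataframe, consistent with the stated result, given purely for illustration:

# Hypothetical example only -- the original screenshot is not available.
df = pd.DataFrame({
    'key': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'val': [  4,   5,   9,   1,   6,   7,   2,   0,   3],
})
# 'A' (threshold 3): 4, 5, 9 exceed it -> 3
# 'B' (threshold 5): 6, 7 exceed it    -> 2
# 'C' (no threshold)                   -> 0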

Note that my df is huge (about 10 GB), so I do most of my calculations on df.groupby('key'). How can this be done in reasonable time, either on the groupby result or on the df itself? In particular, I don't want to iterate row by row; that would take too long.
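
For reference, here is a minimal vectorized sketch of the idea, assuming keys with no threshold should simply count as 0 (np.inf is used so those rows can never exceed their threshold):

import numpy as np

# Per-row threshold; keys without an entry get +inf so they can never exceed it.
thr = df['key'].map(thrsholds).fillna(np.inf)

# Boolean "above threshold" mask, summed per key (True counts as 1).
out = (df['val'] > thr).groupby(df['key']).sum().astype(int).to_dict()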


Edit: current benchmark for the solutions below.

Test code:

import pandas as pd
import numpy as np
from datetime import datetime

count = 100000000
np.random.seed(16)

names = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
         'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
df = pd.DataFrame(columns=['key', 'val'], index=range(count))
df['key'] = np.random.choice(names, count)
df['val'] = np.random.randint(10, size=count)
thrsholds = {'a': 3, 'b': 5, 'f': 8}
# ------------------------------------------------------------------------------------------

start_t = datetime.now()

df['thresholds'] = df.key.map(thrsholds)
out = df[df.val.gt(df.thresholds)].groupby('key')['val'].count().to_dict()
# Back-fill keys that never exceeded their threshold (or have no threshold) with 0.
uniq_keys = df.key.unique().tolist()
for i in uniq_keys:
    if i not in out:
        out[i] = 0

duration = datetime.now() - start_t
print("Option A")
print(out)
print(f"Time: {duration}")
# ------------------------------------------------------------------------------------------
start_t = datetime.now()

# For each group, count rows above that key's threshold; keys without a threshold get 0.
g = df.groupby('key')
res = dict()
for n, g_t in g:
    if n in thrsholds:
        res[n] = len(g_t[g_t['val'] > thrsholds[n]])
    else:
        res[n] = 0

duration = datetime.now() - start_t
print("Option B")
print(res)
print(f"Time: {duration}")

Results:

Option A
Time: 0:00:16.400732
Option B
Time: 0:00:21.000255
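
For comparison, here is a sketch of a NumPy-level variant (a hypothetical "Option C", not part of the original benchmark) that skips building a filtered copy of the frame; it could be timed the same way:

start_t = datetime.now()

# Integer-encode the keys once; codes[i] is the group number of row i.
codes, uniques = pd.factorize(df['key'])

# Per-row threshold; keys without one get +inf so they can never exceed it.
thr = df['key'].map(thrsholds).fillna(np.inf).to_numpy()
mask = df['val'].to_numpy() > thr

# Count, per group code, how many rows are above threshold (zeros included).
counts = np.bincount(codes[mask], minlength=len(uniques))
res_c = dict(zip(uniques, counts.tolist()))

duration = datetime.now() - start_t
print("Option C (sketch)")
print(res_c)
print(f"Time: {duration}")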

Use Series.map with GroupBy.count:

In [2372]: df['thresholds'] = df.key.map(thrsholds)

In [2381]: out = df[df.val.gt(df.thresholds)].groupby('key')['val'].count().to_dict()

In [2397]: uniq_keys = df.key.unique().tolist()

In [2401]: for i in uniq_keys:
      ...:     if not i in out:
      ...:         out[i] = 0
      ...: 
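
If you'd rather avoid the loop that back-fills missing keys, reindexing against the unique keys with fill_value=0 should produce the same dictionary (a sketch, assuming the thresholds column from above is already in place; not benchmarked here):

out = (df[df.val.gt(df.thresholds)]
         .groupby('key')['val'].count()
         .reindex(df['key'].unique(), fill_value=0)
         .to_dict())
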
Another option is to iterate over the groups directly:

grps = df.groupby('key')
res = dict()
for n, g in grps:
    if n in thrsholds:
        res[n] = len(g[g['val'] > thrsholds[n]])
    else:
        res[n] = 0

You could possibly write it as an unreadable one-liner dict comprehension ;-)
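
For what it's worth, that one-liner could look something like this (the same logic as the loop above, compressed into a dict comprehension; shown only as a sketch):

res = {n: int((g['val'] > thrsholds[n]).sum()) if n in thrsholds else 0
       for n, g in df.groupby('key')}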
