I have a very large DataFrame with two columns, key and value. I also have a dictionary of thresholds. Example:
import numpy as np
import pandas as pd

names = ['A', 'B', 'C']
df = pd.DataFrame(columns=['key', 'val'], index=range(10))
for i in range(10):
    val = np.random.randint(10)
    k = np.random.choice(names)
    df.loc[i] = [k, val]

thrsholds = {'A': 3, 'B': 5}
I want to get a count, per key, of how many instances exceed that key's threshold.
For example, for a given df the result might be: {'C': 0, 'B': 2, 'A': 3}
Note that my df is huge (about 10 GB), so most of my calculations are done on df.groupby('key'). How can this task be done in reasonable time, either on the groupby result or on the df itself? Iterating row by row would take too long.
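For reference, the whole comparison can be vectorized in a single pass, which also handles keys with no threshold. A minimal sketch on a small toy frame (the names and values here are illustrative, not the real 10 GB data):

```python
import pandas as pd

# Toy stand-in for the large frame
df = pd.DataFrame({'key': ['A', 'B', 'A', 'C', 'B', 'A'],
                   'val': [5, 9, 1, 2, 7, 4]})
thrsholds = {'A': 3, 'B': 5}

# Map each row's key to its threshold (NaN where no threshold exists),
# compare once, then count True values per key. A comparison against
# NaN is False, so keys without a threshold naturally count as 0.
exceeds = df['val'] > df['key'].map(thrsholds)
out = exceeds.groupby(df['key']).sum().to_dict()
```

Here `out` is `{'A': 2, 'B': 2, 'C': 0}` for the toy data above.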
Edit: current benchmarks for the solutions below.
Test code:
import pandas as pd
import numpy as np
from datetime import datetime
count = 100000000
np.random.seed(16)
names = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
df = pd.DataFrame(columns=['key', 'val'], index=range(count))
df['key'] = np.random.choice(names, count)
df['val'] = np.random.randint(10, size=count)
thrsholds = {'a': 3, 'b': 5, 'f': 8}
# ------------------------------------------------------------------------------------------
start_t = datetime.now()
df['thresholds'] = df.key.map(thrsholds)
out = df[df.val.gt(df.thresholds)].groupby('key')['val'].count().to_dict()
uniq_keys = df.key.unique().tolist()
for i in uniq_keys:
    if i not in out:
        out[i] = 0
duration = datetime.now() - start_t
print("Option A")
print(out)
print(f"Time: {duration}")
# ------------------------------------------------------------------------------------------
start_t = datetime.now()
g = df.groupby('key')
res = dict()
for n, g_t in g:
    if n in thrsholds:
        res[n] = len(g_t[g_t['val'] > thrsholds[n]])
    else:
        res[n] = 0
duration = datetime.now() - start_t
print("Option B")
print(res)
print(f"Time: {duration}")
Results:
Option A
Time: 0:00:16.400732
Option B
Time: 0:00:21.000255
Use Series.map with GroupBy.count:
In [2372]: df['thresholds'] = df.key.map(thrsholds)
In [2381]: out = df[df.val.gt(df.thresholds)].groupby('key')['val'].count().to_dict()
In [2397]: uniq_keys = df.key.unique().tolist()
In [2401]: for i in uniq_keys:
      ...:     if i not in out:
      ...:         out[i] = 0
      ...:
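The trailing loop over the unique keys can also be replaced with a reindex before converting to a dict, which fills the missing groups with 0 in one step. A sketch of the same idea on a toy frame:

```python
import pandas as pd

# Toy data; same shape as the question's frame
df = pd.DataFrame({'key': ['a', 'b', 'a', 'c'], 'val': [5, 2, 9, 1]})
thrsholds = {'a': 3, 'b': 5}

df['thresholds'] = df.key.map(thrsholds)
counts = df[df.val.gt(df.thresholds)].groupby('key')['val'].count()
# reindex over every observed key, filling absent groups with 0
out = counts.reindex(df.key.unique(), fill_value=0).to_dict()
```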
Alternatively, with an explicit groupby loop:

grps = df.groupby('key')
res = dict()
for n, g in grps:
    if n in thrsholds:
        res[n] = len(g[g['val'] > thrsholds[n]])
    else:
        res[n] = 0
You could possibly write it as an unreadable one-liner dict comprehension;-)
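For what it's worth, that one-liner might look something like this (readability is debatable):

```python
import pandas as pd

# Toy data standing in for the real frame
df = pd.DataFrame({'key': ['a', 'b', 'a', 'c'], 'val': [5, 2, 9, 1]})
thrsholds = {'a': 3, 'b': 5}

# One dict comprehension over the groupby: count exceedances per key,
# defaulting to 0 for keys that have no threshold.
res = {n: int((g['val'] > thrsholds[n]).sum()) if n in thrsholds else 0
       for n, g in df.groupby('key')}
```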