Why is pandas nlargest slower than mine?
I have a dataframe:

              ID  CAT    SCORE
0              0    0  8325804
1              0    1  1484405
...          ...  ...      ...
1999980    99999    0  4614037
1999981    99999    1  1818470

I group the data by ID and want to know, for each ID, the 2 categories with the highest SCORE. I can see two solutions:
df2 = df.groupby('ID').apply(lambda g: g.nlargest(2, columns='SCORE'))
or manually converting the frame to a list of tuples, sorting the tuples, dropping all but the top 2 per ID, and converting back to a dataframe. The first should be faster than the second, yet I observe that the manual solution wins. Why is the manual nlargest faster than the pandas solution?
import numpy as np
import pandas as pd
import time

def create_df(n=10**5, categories=20):
    np.random.seed(0)
    df = pd.DataFrame({'ID': [id_ for id_ in range(n) for c in range(categories)],
                       'CAT': [c for id_ in range(n) for c in range(categories)],
                       'SCORE': np.random.randint(10**7, size=n * categories)})
    return df
def are_dfs_equal(df1, df2):
    columns = sorted(df1.columns)
    if len(df1.columns) != len(df2.columns):
        return False
    elif not all(el1 == el2 for el1, el2 in zip(columns, sorted(df2.columns))):
        return False
    df1_list = [tuple(x) for x in df1[columns].values]
    df1_list = sorted(df1_list, reverse=True)
    df2_list = [tuple(x) for x in df2[columns].values]
    df2_list = sorted(df2_list, reverse=True)
    is_same = df1_list == df2_list
    return is_same
def manual_nlargest(df, n=2):
    df_list = [tuple(x) for x in df[['ID', 'SCORE', 'CAT']].values]
    df_list = sorted(df_list, reverse=True)
    l = []
    current_id = None
    current_id_count = 0
    for el in df_list:
        if el[0] != current_id:
            current_id = el[0]
            current_id_count = 1
        else:
            current_id_count += 1
        if current_id_count <= n:
            l.append(el)
    df = pd.DataFrame(l, columns=['ID', 'SCORE', 'CAT'])
    return df
df = create_df()
t0 = time.time()
df2 = df.groupby('ID').apply(lambda g: g.nlargest(2, columns='SCORE'))
t1 = time.time()
print('nlargest solution: {:0.2f}s'.format(t1 - t0))
t0 = time.time()
df3 = manual_nlargest(df, n=2)
t1 = time.time()
print('manual nlargest solution: {:0.2f}s'.format(t1 - t0))
print('is_same: {}'.format(are_dfs_equal(df2, df3)))
which gives:
nlargest solution: 97.76s
manual nlargest solution: 4.62s
is_same: True
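A middle ground worth noting (my own sketch, not from the original post) is pandas' `SeriesGroupBy.nlargest`, which runs on the grouped SCORE column directly instead of calling `DataFrame.nlargest` once per group via `apply`. The example data below is made up to keep the sketch small:

```python
import pandas as pd

# Tiny frame with the same ID/CAT/SCORE shape as the question's data.
df = pd.DataFrame({'ID':    [0, 0, 0, 1, 1, 1],
                   'CAT':   [0, 1, 2, 0, 1, 2],
                   'SCORE': [5, 9, 7, 3, 8, 1]})

# SeriesGroupBy.nlargest keeps the original row index in the second
# level of its MultiIndex, so the top rows can be recovered with .loc.
top2 = df.groupby('ID')['SCORE'].nlargest(2)
rows = df.loc[top2.index.get_level_values(1)]
print(rows)
```

For each ID this keeps the two highest-scoring rows, CAT included, without a Python-level lambda per group.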
I think you can use this:
df.sort_values(by=['SCORE'],ascending=False).groupby('ID').head(2)
This is the same as your manual solution, implemented with sort/head on a pandas groupby.
t0 = time.time()
df4 = df.sort_values(by=['SCORE'], ascending=False).groupby('ID').head(2)
t1 = time.time()
df3_list = [tuple(x) for x in df3[['ID', 'SCORE', 'CAT']].values]
df3_list = sorted(df3_list, reverse=True)
df4_list = [tuple(x) for x in df4[['ID', 'SCORE', 'CAT']].values]
df4_list = sorted(df4_list, reverse=True)
is_same = df3_list == df4_list
print('SORT/HEAD solution: {:0.2f}s'.format(t1 - t0))
print(is_same)
which gives:
SORT/HEAD solution: 0.08s
True
and timeit reports:
77.9 ms ± 7.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each).
As for why nlargest is slower than the other solutions: I guess calling it once per group adds overhead (%prun shows 15764409 function calls (15464352 primitive calls) in 30.293 seconds for the nlargest solution, versus 1578 function calls (1513 primitive calls) in 0.078 seconds for the sort/head solution).
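The call-count gap behind those `%prun` numbers can be reproduced outside IPython with the standard-library `cProfile`. This sketch (my own, on a small made-up frame) counts the function calls each approach makes:

```python
import cProfile
import pstats
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'ID': np.repeat(np.arange(500), 4),
                   'CAT': np.tile(np.arange(4), 500),
                   'SCORE': rng.randint(10**7, size=2000)})

def total_calls(fn):
    """Run fn under cProfile and return the total number of function calls."""
    prof = cProfile.Profile()
    prof.enable()
    fn()
    prof.disable()
    return pstats.Stats(prof).total_calls

apply_calls = total_calls(
    lambda: df.groupby('ID').apply(lambda g: g.nlargest(2, columns='SCORE')))
head_calls = total_calls(
    lambda: df.sort_values('SCORE', ascending=False).groupby('ID').head(2))
print(apply_calls, head_calls)
```

The `apply` version triggers one `nlargest` (plus DataFrame construction and alignment machinery) per group, so its call count grows with the number of groups, while sort/head does a single vectorized pass.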
Here is a solution faster than your manual one, unless I made a mistake ;) I guess nlargest() is not the fastest way to solve this problem if speed is what you need, but it is the more readable solution.
t0 = time.time()
df4 = df.sort_values(by=['ID', 'SCORE'], ascending=[True, False])
df4['cumcount'] = df4.groupby('ID')['SCORE'].cumcount()
df4 = df4[df4['cumcount'] < 2]
t1 = time.time()
print('cumcount solution: {:0.2f}s'.format(t1 - t0))
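The cumcount trick works because, once rows are sorted by SCORE (descending) within each ID, `groupby('ID').cumcount()` numbers the rows 0, 1, 2, … per group, so filtering on `cumcount < 2` keeps exactly the top two. A tiny sketch of that behaviour on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'ID':    [0, 0, 0, 1, 1],
                   'SCORE': [5, 9, 7, 3, 8]})

# Sort so the best score in each ID comes first, then number rows per group.
out = df.sort_values(['ID', 'SCORE'], ascending=[True, False])
out['cumcount'] = out.groupby('ID').cumcount()
top2 = out[out['cumcount'] < 2]
print(top2[['ID', 'SCORE']])
```

ID 0 keeps scores 9 and 7 (dropping 5), and ID 1 keeps both of its rows.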