Fill dictionary values as the sum of values from a pandas dataframe
I have a dictionary with the names of various players, with all values set to None, like so...
players = {'A': None,
           'B': None,
           'C': None,
           'D': None,
           'E': None}
a pandas DataFrame (df_1) containing the keys, i.e. the player names
col_0 col_1 col_2
----- ----- -----
0 A B C
1 A E D
2 C B A
and a DataFrame (df_2) containing the corresponding match scores
score_0 score_1 score_2
------- ------- -------
0 1 10 2
1 6 15 7
2 8 1 9
So A's total score is...
1 + 6 + 9 = 16
(0, score_0) + (1, score_0) + (2, score_2)
I want to map every player (A, B, C, ...) to their total score in the players dictionary I created earlier.
Here is some code I wrote...
for player in players:
    players[player] = df_2.loc[df_1['col_0'] == player, 'score_0'].sum()
    players[player] += df_2.loc[df_1['col_1'] == player, 'score_1'].sum()
    players[player] += df_2.loc[df_1['col_2'] == player, 'score_2'].sum()

print(players)
This produces the expected result, but I was wondering whether there is a faster, more pandas-idiomatic way. Any help would be appreciated.
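For reference, the per-column loop above runs end to end as a minimal self-contained script; the sample data is taken verbatim from the tables in the question:

```python
import pandas as pd

players = {'A': None, 'B': None, 'C': None, 'D': None, 'E': None}
df_1 = pd.DataFrame({'col_0': ['A', 'A', 'C'],
                     'col_1': ['B', 'E', 'B'],
                     'col_2': ['C', 'D', 'A']})
df_2 = pd.DataFrame({'score_0': [1, 6, 8],
                     'score_1': [10, 15, 1],
                     'score_2': [2, 7, 9]})

# For each player, sum the scores at the positions where that player's
# name appears, one (name column, score column) pair at a time.
for player in players:
    players[player] = df_2.loc[df_1['col_0'] == player, 'score_0'].sum()
    players[player] += df_2.loc[df_1['col_1'] == player, 'score_1'].sum()
    players[player] += df_2.loc[df_1['col_2'] == player, 'score_2'].sum()

print(players)
```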
You can generate such a dictionary with:
import numpy as np
result = { k: np.nansum(df_2[df_1 == k]) for k in players }
For the given sample data, this returns:
>>> { k: np.nansum(df_2[df_1 == k]) for k in players }
{'A': 16.0, 'B': 11.0, 'C': 10.0, 'D': 7.0, 'E': 15.0}
If there are no values for a given key, it maps to zero. For example, if we add a key R to players:
>>> players['R'] = None
>>> { k: np.nansum(df_2[df_1 == k]) for k in players }
{'A': 16.0, 'B': 11.0, 'C': 10.0, 'D': 7.0, 'E': 15.0, 'R': 0.0}
Or we can make this more efficient by first extracting the numpy arrays from the dataframes:
arr_2 = df_2.values
arr_1 = df_1.values
result = { k: arr_2[arr_1 == k].sum() for k in players }
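Put together, the numpy variant is a short self-contained sketch (sample frames as in the question). The key point is that arr_1 == k is a boolean mask with the same shape as arr_2, so arr_2[arr_1 == k] selects, purely by position, the score cells whose name cell equals k:

```python
import pandas as pd

df_1 = pd.DataFrame({'col_0': ['A', 'A', 'C'],
                     'col_1': ['B', 'E', 'B'],
                     'col_2': ['C', 'D', 'A']})
df_2 = pd.DataFrame({'score_0': [1, 6, 8],
                     'score_1': [10, 15, 1],
                     'score_2': [2, 7, 9]})
players = ['A', 'B', 'C', 'D', 'E']

arr_1 = df_1.values  # 3x3 array of player names
arr_2 = df_2.values  # 3x3 array of scores, same shape

# Boolean-mask indexing keeps exactly the scores at matching cells,
# then .sum() adds them up; a player with no matches sums to 0.
result = {k: arr_2[arr_1 == k].sum() for k in players}
print(result)
```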
If we define functions f (the original implementation), g (this implementation), and h (@WeNYoBen's implementation), and use timeit to measure 1000 calls on the given sample data, I get the following on an Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz (which is unfortunately rather busy at the moment):
>>> df_1 = pd.DataFrame({'col_0': ['A', 'A', 'C'], 'col_1': ['B', 'E', 'B'], 'col_2': ['C', 'D', 'A']})
>>> df_2 = pd.DataFrame({'score_0': [1, 6, 8], 'score_1': [10, 15, 1], 'score_2': [2, 7, 9]})
>>> def f():
...     for player in players:
...         players[player] = df_2.loc[df_1['col_0'] == player, 'score_0'].sum()
...         players[player] += df_2.loc[df_1['col_1'] == player, 'score_1'].sum()
...         players[player] += df_2.loc[df_1['col_2'] == player, 'score_2'].sum()
...     return players
...
>>> def g():
...     arr_2 = df_2.values
...     arr_1 = df_1.values
...     result = { k: arr_2[arr_1 == k].sum() for k in players }
...     return result
...
>>> def h():
...     return df_2.stack().groupby(df_1.stack().values).sum().to_dict()
...
>>> timeit(f, number=1000)
47.23081823496614
>>> timeit(g, number=1000)
0.32561282289680094
>>> timeit(h, number=1000)
8.169926556991413
The most important optimization is probably to work on numpy arrays instead of performing the computation at the pandas level.
Hmm, pandas stack: usually we can flatten the DataFrames first, then groupby
s=df2.stack().groupby(df1.stack().values).sum()
s
A 16
B 11
C 10
D 7
E 15
dtype: int64
s.to_dict()
{'A': 16, 'B': 11, 'C': 10, 'D': 7, 'E': 15}
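As a minimal sketch of this pipeline (using the question's sample data): stack() flattens each frame into one long Series in row-major order, so the i-th stacked score lines up with the i-th stacked name, and grouping the scores by the name array sums each player's matches:

```python
import pandas as pd

df1 = pd.DataFrame({'col_0': ['A', 'A', 'C'],
                    'col_1': ['B', 'E', 'B'],
                    'col_2': ['C', 'D', 'A']})
df2 = pd.DataFrame({'score_0': [1, 6, 8],
                    'score_1': [10, 15, 1],
                    'score_2': [2, 7, 9]})

names = df1.stack().values   # name per cell, row by row
scores = df2.stack()         # score per cell, in the same order

# Group the flattened scores by the aligned name array and sum.
s = scores.groupby(names).sum()
print(s.to_dict())
```

Note that, unlike the dictionary-comprehension approaches, this only produces keys for players that actually appear in df1.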