Is there a way to run this Python snippet faster?

Question

from collections import defaultdict
dct = defaultdict(list)
for n in range(len(res)):
    for i in indices_ordered:
        dct[i].append(res[n][i])

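Note that `res` is a list of 5000 pandas Series, and `indices_ordered` is a list of 20000 strings. On my Mac (2.3 GHz Intel Core i5, 16 GB 2133 MHz LPDDR3) this code takes 23 minutes to run. I'm fairly new to Python, but I suspect smarter coding (perhaps fewer loops) would help.

Edit:

Here is an example of how to create the data (`res` and `indices_ordered`) so that the snippet above can be run. The snippet was changed slightly to access the only field instead of going by field name, because I couldn't work out how to construct an inline Series with a field name.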

import random, string, pandas
index_sz = 20000
res_sz = 5000
indices_ordered = [''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10)) for i in range(index_sz)]
res = [pandas.Series([random.randint(0,10) for i in range(index_sz)], index = random.sample(indices_ordered, index_sz)) for i in range(res_sz)]

Answer 1

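Edit: now that test data is available, it is clear that the changes below have no effect on the run time. The techniques described only pay off when the inner loop is already very cheap (on the order of 5-10 dict lookups), so that removing some of those lookups makes it measurably cheaper still. Here, the `r[i]` item lookup is an order of magnitude more expensive than everything else, so the optimizations are simply irrelevant.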

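Your outer loop runs 5000 iterations and your inner loop runs 20000. That means you are executing 100 million iterations in 23 minutes, i.e. each iteration takes 13.8 μs. Even in Python, that is not fast.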
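
The arithmetic behind that figure can be checked in a few lines. This only restates the numbers given in the question; the variable names are mine.

```python
# Back-of-the-envelope cost per inner-loop iteration, using the question's figures.
outer_iterations = 5_000    # len(res)
inner_iterations = 20_000   # len(indices_ordered)
runtime_seconds = 23 * 60   # reported wall-clock time

total_iterations = outer_iterations * inner_iterations
microseconds_per_iteration = runtime_seconds / total_iterations * 1e6

print(total_iterations)                      # 100000000
print(round(microseconds_per_iteration, 1))  # 13.8
```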

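I would try to cut the run time by stripping all unnecessary work out of the inner loop. Specifically:

Convert `for n in range(len(res))` followed by `res[n]` into `for r in res`. I don't know how efficient item lookup is in pandas, but it is better done in the outer loop than in the inner one.
Move the `score` attribute lookup to the outer loop.
Get rid of the defaultdict: pre-create the lists and use a plain dict.
Avoid the dict entirely during the loop and work directly with the lists, pre-created in order. Build the dict only at the end.
Cache the lookup of each list's `append` method, and prepare in advance the `(append, i)` pairs the inner loop needs.

Here is code that implements the suggestions above: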

# pre-create the lists
lsts = [[] for _ in range(len(indices_ordered))]
# prepare the pairs (appendfn, i)
fast_append = [(l.append, i)
               for (l, i) in zip(lsts, indices_ordered)]

for r in res:
    # pre-fetch res[n].score
    r_score = r.score
    for append, i in fast_append:
        append(r_score[i])

# finally, create the dict out of the lists
dct = {i: lst for (i, lst) in zip(indices_ordered, lsts)}
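
The code above uses `r.score`, which the Series in the question's example data do not have. As a rough sketch, the same idea adapted to that example data might look like the following. The sizes are reduced and the `K0000`-style labels are my own stand-ins, not the question's random strings. As noted in the edit at the top of this answer, the `r[i]` lookup still dominates, so don't expect a speedup from this alone.

```python
import random

import pandas as pd

# Small stand-in for the question's data (sizes reduced so it runs quickly).
index_sz, res_sz = 50, 10
indices_ordered = [f"K{i:04d}" for i in range(index_sz)]  # hypothetical unique labels
res = [pd.Series([random.randint(0, 10) for _ in range(index_sz)],
                 index=random.sample(indices_ordered, index_sz))
       for _ in range(res_sz)]

# One list per label, in the order of indices_ordered.
lsts = [[] for _ in indices_ordered]
# Pair each list's bound append method with its label once, up front.
fast_append = [(lst.append, i) for lst, i in zip(lsts, indices_ordered)]

for r in res:
    for append, i in fast_append:
        append(r[i])  # label lookup on the Series; this is the expensive part

# Build the dict only at the end.
dct = dict(zip(indices_ordered, lsts))
```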

Answer 2

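The problem here is that you iterate over `indices_ordered` for every single value. Just drop `indices_ordered`. Scaling the sizes down by two orders of magnitude to test the timings: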

import random
import string

import numpy as np
import pandas as pd

from collections import defaultdict


index_sz = 200
res_sz = 50
indices_ordered = [''.join(random.choice(string.ascii_uppercase + string.digits)
                   for _ in range(10)) for i in range(index_sz)]

res = [pd.Series([random.randint(0,10) for i in range(index_sz)],
                  index = random.sample(indices_ordered, index_sz))
       for i in range(res_sz)]


def your_way(res, indices_ordered):
    dct = defaultdict(list)
    for n in range(len(res)):
        for i in indices_ordered:
            dct[i].append(res[n][i])


def my_way(res):
    dct = defaultdict(list)
    for item in res:
        for string_item, value in item.items():  # Series.iteritems() was removed in pandas 2.0
            dct[string_item].append(value)

This gives:

%timeit your_way(res, indices_ordered)
160 ms ± 5.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit my_way(res)
6.79 ms ± 47.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

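This removes the label-based lookup (`res[n][i]`) that the original performs for every single value, because you no longer walk `indices_ordered` for each Series and fetch the values one by one. The difference becomes more noticeable as the data grows.

Scaling up by just one order of magnitude: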

index_sz = 2000
res_sz = 500

This gives:

%timeit your_way(res, indices_ordered)
17.8 s ± 999 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit my_way(res)
543 ms ± 9.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
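
One thing worth checking before switching is that both versions produce the same lists. Here is a small sketch of that check, with both functions changed to return their dict and with made-up `K00`-style labels as an assumption. Note that the keys of the `items()` version come out in the order of the first Series' index, not in the order of `indices_ordered`, so the last line reorders them.

```python
import random
from collections import defaultdict

import pandas as pd

index_sz, res_sz = 20, 5
indices_ordered = [f"K{i:02d}" for i in range(index_sz)]  # hypothetical labels
res = [pd.Series([random.randint(0, 10) for _ in range(index_sz)],
                 index=random.sample(indices_ordered, index_sz))
       for _ in range(res_sz)]


def your_way(res, indices_ordered):
    dct = defaultdict(list)
    for n in range(len(res)):
        for i in indices_ordered:
            dct[i].append(res[n][i])  # label lookup per value
    return dct


def my_way(res):
    dct = defaultdict(list)
    for item in res:
        for label, value in item.items():  # (label, value) pairs, no lookup
            dct[label].append(value)
    return dct


by_lookup = your_way(res, indices_ordered)
by_items = my_way(res)

# Dict equality ignores key order, so this compares the lists per label.
same_contents = dict(by_lookup) == dict(by_items)

# Restore the key order of indices_ordered if it matters downstream.
ordered = {i: by_items[i] for i in indices_ordered}
```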

Answer 3

You really should use a DataFrame.

Here is a way to create the data directly:

import pandas as pd
import numpy as np
import random
import string
index_sz = 3
res_sz = 10

indices_ordered = [''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(3)) for i in range(index_sz)]

df = pd.DataFrame(np.random.randint(10, size=(res_sz, index_sz)), columns=indices_ordered)

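There is no need to sort or index anything. A DataFrame can basically be accessed like an array or a dict.

It should be much faster than juggling defaultdicts, lists and Series.

`df` now looks like this: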

>>> df
   7XQ  VTV  38Y
0    6    9    5
1    5    5    4
2    6    0    7
3    0    0    8
4    7    8    9
5    8    6    4
6    2    4    9
7    3    2    2
8    7    6    0
9    8    0    1

>>> df['7XQ']
0    6
1    5
2    6
3    0
4    7
5    8
6    2
7    3
8    7
9    8
Name: 7XQ, dtype: int64

>>> df['7XQ'][:5]
0    6
1    5
2    6
3    0
4    7
Name: 7XQ, dtype: int64

At the original sizes, this script produced a 5000 rows × 20000 columns DataFrame in under 3 seconds on my laptop.
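
If the data already exists as a list of Series, as in the question, one possible route is to build the DataFrame from that list and then ask for a dict of lists. This is a sketch on small made-up data (the `K00`-style labels are hypothetical), and I have not timed it at the original sizes.

```python
import random

import pandas as pd

index_sz, res_sz = 20, 5
indices_ordered = [f"K{i:02d}" for i in range(index_sz)]  # hypothetical labels
res = [pd.Series([random.randint(0, 10) for _ in range(index_sz)],
                 index=random.sample(indices_ordered, index_sz))
       for _ in range(res_sz)]

# Each Series becomes one row; columns are aligned by label, not by position.
df = pd.DataFrame(res)

# Put the columns in the wanted order, then turn each column into a list.
dct = df[indices_ordered].to_dict(orient="list")
```

Selecting `df[indices_ordered]` first is what fixes the key order of the resulting dict.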

Answer 4

With pandas magic (2 lines of code) on the input list of `pd.Series` objects:

all_data = pd.concat([*res])
d = all_data.groupby(all_data.index).apply(list).to_dict()

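What this does:

`pd.concat([*res])` - concatenates all the Series into a single one, keeping each Series object's index labels (pandas.concat)
`all_data.groupby(all_data.index).apply(list).to_dict()` - groups the values of `all_data` that share the same index label, collects each group's values into a list with `.apply(list)`, and finally converts the grouped result to a dict with `.to_dict()` (pandas.Series.groupby)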
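
To make the two steps concrete, here is a tiny worked example with two hand-written Series (the data is made up for illustration). It passes `res` to `pd.concat` directly, which should be equivalent to `[*res]` since that only copies the list.

```python
import pandas as pd

# Two Series with the same labels in a different order.
s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([3, 4], index=["b", "a"])
res = [s1, s2]

# Step 1: one long Series; the index is now a, b, b, a with values 1, 2, 3, 4.
all_data = pd.concat(res)

# Step 2: group by index label and collect each group's values into a list.
d = all_data.groupby(all_data.index).apply(list).to_dict()

# Grouping result: a -> [1, 4], b -> [2, 3]
print(d)
```

By default `groupby` sorts the group keys, so the dict keys come out in sorted label order rather than in the order of `indices_ordered`.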

4 answers

Answer 1
3 2019-09-17 18:56:30

Answer 2
3 (accepted) 2019-09-17 20:14:11

Answer 3
2 2019-09-17 19:36:16

Answer 4
2 2019-09-17 19:57:38
