Python: How to optimize calculations?

Question

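I am doing some text mining on a corpus, and I have a text file as output with 3000 lines like these:

dns 11 11 [2, 355, 706, 1063, 3139, 3219, 3471, 3472, 3473, 4384, 4444]

xhtml 8 11 [1651, 2208, 2815, 3487, 3517, 4480, 4481, 4504]

javascript 18 18 [49, 50, 175, 176, 355, 706, 1063, 1502, 1651, 2208, 2280, 2815, 3297, 4068, 4236, 4480, 4481, 4504]

Each line contains the word, the number of lines in which it appears, the total number of occurrences, and the numbers of those lines.

I am trying to compute the chi-squared value, and this text file is the input to my code below: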

import nltk

# NOTE: `lines` and `cpt` are not defined in this snippet.
measure = nltk.collocations.BigramAssocMeasures()

dicto = {}
for i in lines:
    tokens = nltk.wordpunct_tokenize(i)
    m = tokens[0]       # m is the word
    # keep only the line numbers: skip the leading fields and the
    # closing bracket, and drop the commas
    list_i = [t for t in tokens[4:-1] if t != ',']
    dicto[m] = list_i   # for each word, store the list of its line numbers

# for each word I calculate the chi-squared with every other word
# and my problem starts right here, I think:
# the "for" loop and the z = ...

for word1 in dicto:
    x = dicto[word1]
    vector = []

    for word2 in dicto:
        y = dicto[word2]
        z = [val for val in x if val in y]

        # Contingency matrix
        m11 = cpt - (len(x) + len(y) - len(z))
        m12 = len(x) - len(z)
        m21 = len(y) - len(z)
        m22 = len(z)

        n_ii = m11
        n_ix = m11 + m21
        n_xi = m11 + m12
        n_xx = m11 + m12 + m21 + m22

        Chi_squared = measure.chi_sq(n_ii, (n_ix, n_xi), n_xx)

        # I compare with the minimum value to check independence between words
        if Chi_squared > 3.841:
            vector.append([word1, word2, round(Chi_squared, 3)])

    # The correlations are calculated
    # I sort my vector in descending order
    final = sorted(vector, key=lambda vector: vector[2], reverse=True)

    print(word1)
    # I take the 4 best scores
    for i in final[:4]:
        print(i, end=" ")

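My problem is that the calculation takes a lot of time (I am talking about hours!). Is there anything I can change? What can I do to improve my code? Are there other Python structures I could use? Any ideas?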

Answer 1

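There are a few opportunities for speedup, but my first concern is vector. Where is it initialized? In the code as posted, it gets n^2 entries and is sorted n times! That seems unintentional. Should it be cleared? Should final be outside the loop?

final = sorted(vector, key=lambda vector: vector[2], reverse=True)

works, but the scoping is ugly. Better is:

final = sorted(vector, key=lambda entry: entry[2], reverse=True)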
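To make the concern about vector concrete, here is a minimal sketch, with made-up scores, that builds a fresh list for each word and picks the four best entries. The helper `top_pairs` and the sample data are hypothetical, and `heapq.nlargest` is a stand-in for the full sort, not what the question's code does.

```python
import heapq

def top_pairs(word1, scores, k=4, threshold=3.841):
    """Return the k highest-scoring [word1, word2, score] entries.

    `scores` maps word2 -> chi-squared value (hypothetical input).
    """
    # A fresh list per word1, so entries never accumulate across words.
    vector = [[word1, word2, round(score, 3)]
              for word2, score in scores.items()
              if score > threshold]
    # nlargest avoids sorting the whole list when only k entries are needed.
    return heapq.nlargest(k, vector, key=lambda entry: entry[2])

sample = {"xhtml": 12.5, "javascript": 40.25, "css": 2.0, "http": 7.75}
best = top_pairs("dns", sample)
```

With only a few thousand words per list the difference between `sorted` and `nlargest` is likely small; the more important part is that the list is rebuilt per word.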

In general, to solve timing problems, consider using a profiler.
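As a rough illustration of profiling, the sketch below runs the standard-library `cProfile` over a small stand-in workload. The functions `slow_intersection` and `workload` and the generated data are hypothetical; only the list-based intersection is taken from the question.

```python
import cProfile
import io
import pstats

def slow_intersection(x, y):
    # Same list-based intersection as in the question: O(len(x) * len(y)).
    return [val for val in x if val in y]

def workload():
    # Made-up data standing in for two lists of line numbers.
    x = [str(i) for i in range(0, 2000, 2)]
    y = [str(i) for i in range(0, 2000, 3)]
    return len(slow_intersection(x, y))

profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Collect the ten most expensive calls, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

For a whole script, `python -m cProfile -s cumulative script.py` gives a similar report without changing the code.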

Answer 2

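First, if the line numbers for each word are unique, use sets instead of lists: finding the intersection of two sets is much faster than finding the intersection of two lists (especially if the lists are not sorted).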
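A minimal sketch of the set-based version, assuming the line numbers have already been parsed into a dictionary like the question's `dicto` (the sample values here are made up):

```python
# Hypothetical sample: word -> line numbers, as parsed from the input file.
dicto = {
    "dns": ["2", "355", "706", "1063"],
    "javascript": ["49", "355", "706", "1063", "1502"],
}

# Convert each list to a set once, before the nested loops.
line_sets = {word: set(line_numbers) for word, line_numbers in dicto.items()}

x = line_sets["dns"]
y = line_sets["javascript"]

# Set intersection takes time roughly proportional to the smaller set,
# instead of len(x) * len(y) for the list comprehension in the question.
z = x & y
```

Only `len(z)` is needed for the contingency table, so the loss of ordering in a set does not matter here.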

Second, precompute the list lengths; right now you compute them twice for every step of the inner loop.
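One way this could look, as a sketch: the lengths are stored once per word and reused inside the nested loops. The helper `contingency`, the sample sets, and `total_lines` (a stand-in for the question's `cpt`) are assumptions for illustration. Note that `len()` is constant-time in CPython, so the gain from this step alone is probably modest.

```python
def contingency(len_x, len_y, len_z, total_lines):
    """2x2 table using the same cell definitions as the question's code."""
    m11 = total_lines - (len_x + len_y - len_z)  # lines with neither word
    m12 = len_x - len_z                          # lines with word1 only
    m21 = len_y - len_z                          # lines with word2 only
    m22 = len_z                                  # lines with both words
    return m11, m12, m21, m22

# Hypothetical data.
line_sets = {"dns": {2, 355, 706}, "xhtml": {355, 1651}, "css": {2, 706, 1651}}
total_lines = 10  # stand-in for `cpt`

# Lengths are computed once per word, outside the nested loops.
lengths = {word: len(s) for word, s in line_sets.items()}

tables = {}
for word1, x in line_sets.items():
    len_x = lengths[word1]  # hoisted out of the inner loop
    for word2, y in line_sets.items():
        len_z = len(x & y)
        tables[(word1, word2)] = contingency(len_x, lengths[word2], len_z, total_lines)
```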

Third, use numpy for this kind of calculation.
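A possible numpy version, sketched under some assumptions: line numbers are 0-based integers smaller than `total_lines`, and the score is the plain Pearson chi-squared of the 2x2 presence/absence table, computed directly rather than through NLTK. The function name `chi_squared_matrix` and the sample data are hypothetical.

```python
import numpy as np

def chi_squared_matrix(line_sets, total_lines):
    """Pearson chi-squared for every word pair from a 2x2 presence/absence table.

    Returns (words, chi2) where chi2[i, j] is the score for words[i], words[j].
    """
    words = sorted(line_sets)
    # Binary incidence matrix: rows are words, columns are line numbers.
    incidence = np.zeros((len(words), total_lines), dtype=np.int64)
    for row, word in enumerate(words):
        incidence[row, list(line_sets[word])] = 1

    both = incidence @ incidence.T            # lines containing both words
    counts = incidence.sum(axis=1)            # lines containing each word
    only_1 = counts[:, None] - both           # word i without word j
    only_2 = counts[None, :] - both           # word j without word i
    neither = total_lines - both - only_1 - only_2

    numerator = total_lines * (both * neither - only_1 * only_2) ** 2
    denominator = ((both + only_1) * (both + only_2)
                   * (only_1 + neither) * (only_2 + neither))
    # Pairs with an empty row or column total get a score of 0.
    chi2 = np.divide(numerator, denominator,
                     out=np.zeros(denominator.shape, dtype=float),
                     where=denominator != 0)
    return words, chi2

# Hypothetical data: line numbers are 0-based indices below total_lines.
line_sets = {"dns": {0, 1, 2}, "xhtml": {1, 3}, "css": {0, 2, 3}}
words, chi2 = chi_squared_matrix(line_sets, total_lines=10)
```

The single matrix product replaces the whole nested loop over word pairs. The top four partners of each word could then be read off each row, for example with `np.argsort`.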

2 answers

Answer 1
1 Accepted 2015-05-15 06:37:14

Answer 2
0 2015-05-15 06:42:30
