
Python: How to optimize calculations?

I'm doing some text mining on a corpus of words, and I have a text file output with 3000 lines like this:

dns 11 11 [2, 355, 706, 1063, 3139, 3219, 3471, 3472, 3473, 4384, 4444]

xhtml 8 11 [1651, 2208, 2815, 3487, 3517, 4480, 4481, 4504]

javascript 18 18 [49, 50, 175, 176, 355, 706, 1063, 1502, 1651, 2208, 2280, 2815, 3297, 4068, 4236, 4480, 4481, 4504]

Each line contains the word, the number of lines it appears on, the total number of appearances, and the numbers of those lines.

I'm trying to calculate the chi-squared value, and that text file is the input for my code below:

import nltk

measure = nltk.collocations.BigramAssocMeasures()

dicto = {}
for i in lines:                 #lines holds the rows of the text file above
    tokens = nltk.wordpunct_tokenize(i)
    m = tokens[0]               #m is the word
    list_i = tokens[4:]
    list_i.pop()                #drop the closing bracket
    for x in list_i:            #remove the comma tokens
        if x == ',':
            ind = list_i.index(x)
            list_i.pop(ind)
    dicto[m] = list_i           #for each word, a dictionary entry with its line numbers

#for each word I calculate the chi-squared with every other word
#and my problem starts right here, I think:
#the "for" loop and the z = ... line


for word1 in dicto :
    x=dicto[word1]
    vector = []

    for word2 in dicto :    
        y=dicto[word2]
        z=[val for val in x if val in y]

        #Contingency matrix (cpt is the total number of lines in the corpus, defined earlier)
        m11 = cpt-(len(x)+len(y)-len(z))
        m12 = len(x)-len(z)
        m21 = len(y)-len(z)
        m22 = len(z)

        n_ii =m11
        n_ix =m11+m21
        n_xi =m11+m12
        n_xx =m11+m12+m21+m22 

        Chi_squared = measure.chi_sq(n_ii, (n_ix, n_xi), n_xx)

        #I compare with the critical value to check independence between words
        if Chi_squared > 3.841:
            vector.append([word1, word2, round(Chi_squared, 3)])

    #I sort the calculated correlations in descending order
    final=sorted(vector, key=lambda vector: vector[2],reverse = True)

    print word1
    #I take the 4 best scores
    for i in final[:4]:
        print i,

My problem is that the calculation is taking too much time (I'm talking about hours!). Is there anything I can change? Anything I can do to improve my code? Any other Python structures? Any ideas?

There are a few opportunities for speedup, but my first concern is vector. Where is it initialized? In the code posted, it gets n^2 entries and is sorted n times! That seems unintentional. Should it be cleared? Should final be outside the loop?

final = sorted(vector, key=lambda vector: vector[2], reverse=True)

is functional, but has ugly scoping (the lambda parameter shadows vector); better is:

final = sorted(vector, key=lambda entry: entry[2], reverse=True)

In general, to track down timing issues, consider using a profiler.
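For example, Python's built-in cProfile module can show where the time actually goes. A minimal sketch, using a stand-in for the pairwise loop and made-up toy data (the words and line numbers here are invented for illustration):

```python
import cProfile
import io
import pstats

def slow_pairs(dicto):
    """Stand-in for the pairwise chi-squared loop above."""
    total = 0
    for w1 in dicto:
        for w2 in dicto:
            # Same list-intersection pattern as the original z = ... line.
            total += len([v for v in dicto[w1] if v in dicto[w2]])
    return total

sample = {"dns": [2, 355, 706], "xhtml": [1651, 2208], "javascript": [49, 355, 706]}

profiler = cProfile.Profile()
profiler.enable()
slow_pairs(sample)
profiler.disable()

# Print the five most expensive calls by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

On the real 3000-word dictionary, the report should point straight at the list intersection as the hot spot.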

First, if every word has unique line numbers, use sets instead of lists: computing a set intersection is much faster than intersecting lists (especially when the lists are not ordered).
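Concretely, the z = [val for val in x if val in y] line does a linear scan of y for every element of x; with sets the same intersection is a single hashed operation. A small sketch with made-up line numbers:

```python
# Hypothetical line-number lists for two words.
x_lines = [2, 355, 706, 1063, 3139]
y_lines = [355, 706, 4480]

# List version: O(len(x) * len(y)) membership tests.
z_list = [val for val in x_lines if val in y_lines]

# Set version: convert once (e.g. when building dicto), then intersect
# in roughly O(min(len(x), len(y))) on average.
x_set = set(x_lines)
y_set = set(y_lines)
z_set = x_set & y_set

print(z_list, sorted(z_set))  # both contain 355 and 706
```

The conversion to sets is best done once while building dicto, not inside the pairwise loop.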

Second, precompute the list lengths; right now you compute them twice on every single step of the inner loop.
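Putting the two ideas together, the inner loop might look like this: dicto maps each word to a set of line numbers, all lengths are computed once up front, and only len(z) is computed per pair. The data and the cpt value are assumed for the sketch (cpt is the asker's total line count, defined elsewhere):

```python
# Assumed toy data; in the real code dicto comes from the parsing loop.
dicto = {
    "dns": {2, 355, 706},
    "xhtml": {1651, 2208},
    "javascript": {49, 355, 706},
}
cpt = 5000  # assumed total number of lines in the corpus

lengths = {word: len(lines) for word, lines in dicto.items()}  # computed once

results = []
for word1, x in dicto.items():
    len_x = lengths[word1]          # hoisted out of the inner loop
    for word2, y in dicto.items():
        z = x & y
        len_y, len_z = lengths[word2], len(z)
        # First cell of the contingency matrix, as in the original code.
        m11 = cpt - (len_x + len_y - len_z)
        results.append((word1, word2, m11))
```

The same hoisting applies to the remaining cells m12, m21, and m22.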

And third, use numpy for this kind of computation.
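One way to do that (a sketch, assuming the words-by-lines occurrence matrix fits in memory, with invented toy data): encode each word as a binary row over line numbers; then a single matrix product gives every pairwise intersection size at once, and all four contingency cells fall out of broadcasting:

```python
import numpy as np

# Assumed toy data; n_lines plays the role of cpt in the asker's code.
dicto = {
    "dns": [2, 355, 706],
    "xhtml": [1651, 2208],
    "javascript": [49, 355, 706],
}
n_lines = 5000

words = sorted(dicto)
# Binary occurrence matrix: rows are words, columns are line numbers.
occ = np.zeros((len(words), n_lines), dtype=np.int64)
for i, w in enumerate(words):
    occ[i, dicto[w]] = 1

co = occ @ occ.T            # co[i, j] == len(z) for words i and j, all pairs at once
counts = occ.sum(axis=1)    # len(x) per word, computed once

# All contingency cells for all pairs, vectorized via broadcasting:
m22 = co
m12 = counts[:, None] - co
m21 = counts[None, :] - co
m11 = n_lines - (counts[:, None] + counts[None, :] - co)
```

From there the chi-squared statistic can also be computed elementwise on these arrays, replacing the whole n^2 Python loop with a handful of array operations.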
