
How do I optimize the speed of my python compression code?

I have written a compression program and tested it on a 10 KB text file, which took no less than 3 minutes. However, when I tested it with a 1 MB file, which is the assessment set by my teacher, it took over half an hour. Compared with my classmates, my times are abnormally long. It might be my computer or my code, but I don't know. Does anyone know any tips or shortcuts to make my code run faster? My compression code is below; if there are faster ways of doing the loops etc., please give me an answer (:

(By the way, my code works fine, so I'm not asking for corrections, just enhancements or tips. Thanks!)

import re #used to enable functions(loops, etc.) to find patterns in text file
import os #used for anything referring to directories(files)
from collections import Counter #used to keep track on how many times values are added

size1 = os.path.getsize('file.txt') #find the size(in bytes) of your file,    INCLUDING SPACES
print('The size of your file is ', size1,)

words = re.findall(r'\w+', open('file.txt').read()) #list of every word in the file
wordcounts = Counter(words) #counts how often each word appears (capitalised words are counted separately)
common100 = [x for x, it in wordcounts.most_common(100)] #identifies the 100 most common words

keyword = []
kcount = []
z = dict(wordcounts)
for key, value in z.items():
    keyword.append(key) #adds each keyword to the array called keywords
    kcount.append(value)

characters =['$','#','@','!','%','^','&','*','(',')','~','-','/','{','[', ']', '+','=','}','|', '?','cb',
         'dc','fd','gf','hg','kj','mk','nm','pn','qp','rq','sr','ts','vt','wv','xw','yx','zy','bc',
         'cd','df','fg','gh','jk','km','mn','np','pq','qr','rs','st','tv','vw','wx','xy','yz','cbc',
         'dcd','fdf','gfg','hgh','kjk','mkm','nmn','pnp','qpq','rqr','srs','tst','vtv','wvw','xwx',
         'yxy','zyz','ccb','ddc','ffd','ggf','hhg','kkj','mmk','nnm','ppn','qqp','rrq','ssr','tts','vvt',
         'wwv','xxw','yyx','zzy','cbb','dcc','fdd','gff','hgg','kjj','mkk','nmm','pnn','qpp','rqq','srr',
         'tss','vtt','wvv','xww','yxx','zyy','bcb','cdc','dfd','fgf','ghg','jkj','kmk','mnm','npn','pqp',
         'qrq','rsr','sts','tvt','vwv','wxw','xyx','yzy','QRQ','RSR','STS','TVT','VWV','WXW','XYX','YZY',
        'DC','FD','GF','HG','KJ','MK','NM','PN','QP','RQ','SR','TS','VT','WV','XW','YX','ZY','BC',
         'CD','DF','FG','GH','JK','KM','MN','NP','PQ','QR','RS','ST','TV','VW','WX','XY','YZ','CBC',
         'DCD','FDF','GFG','HGH','KJK','MKM','NMN','PNP','QPQ','RQR','SRS','TST','VTV','WVW','XWX',
         'YXY','ZYZ','CCB','DDC','FFD','GGF','HHG','KKJ','MMK','NNM','PPN','QQP','RRQ','SSR','TTS','VVT',
         'WWV','XXW','YYX','ZZY','CBB','DCC','FDD','GFF','HGG','KJJ','MKK','NMM','PNN','QPP','RQQ','SRR',
         'TSS','VTT','WVV','XWW','YXX','ZYY','BCB','CDC','DFD','FGF','GHG','JKJ','KMK','MNM','NPN','PQP',] #characters which I can use

symbols_words = []
char = 0
for i in common100:
    symbols_words.append(characters[char]) #pairs each common word with the symbol at the same index
    char = char + 1

print("Compression has now started")

f = 0
g = 0
no = 0
while no < 100:
    for i in common100:
        for w in words:
            if i == w and len(i)>1: #if the values in common100 are ACTUALLY in words
                place = words.index(i)#find exactly where the most common words are in the text
                symbols = symbols_words[common100.index(i)] #assigns one character with one common word
                words[place] = symbols # replaces the word with the symbol
                g = g + 1
    no = no + 1


string = words
stringMade = ' '.join(map(str, string))#makes the list into a string so you can put it into a text file
file = open("compression.txt", "w")
file.write(stringMade)#imports everything in the variable 'words' into the new file
file.close()

size2 = os.path.getsize('compression.txt')

no1 = int(size1)
no2 = int(size2)
print('Compression has finished.')
print('Your original file size has been compressed by', 100 - ((100/no1) * no2), 'percent.',
      'The size of your file now is', size2)

The first thing I see that hurts performance is:

for i in common100:
    for w in words:
        if i == w and len(i)>1:
            ...

What you are doing here is checking whether the word w is in your common100 list of words. However, that check can be done in O(1) time by using a set, instead of looping over the 100 most common words for every single word:

common_words = set(common100)
for w in words:
    if w in common_words:
        ...

Then, using something like

word_substitutes = dict(zip(common100, characters))

will give you a dictionary that maps the common words to their corresponding symbols.

Then you can simply iterate over the words:

# Iterate over all the words
# Use enumerate because we're going to modify the word in-place in the words list
for word_idx, word in enumerate(words):
    # If the current word is in the `word_substitutes` dict, then we know it's one of the
    # 'common' words and can be replaced by its symbol
    if word in word_substitutes:
        # Replaces the word in-place
        replacement_symbol = word_substitutes[word]
        words[word_idx] = replacement_symbol

This will give much better performance, because the dictionary lookup used for the common-word-to-symbol mapping takes constant time on average rather than linear time. So the overall complexity will be roughly O(N) instead of the O(N^3) you get from the two nested loops with the .index() call inside them.
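
Putting these pieces together with the rest of the question's program, a minimal sketch of the whole substitution pass could look like the following (file names and variable names are taken from the question; the characters list is the question's symbol list, shortened here for brevity):

import os
import re
from collections import Counter

size1 = os.path.getsize('file.txt')

words = re.findall(r'\w+', open('file.txt').read())
common100 = [w for w, _ in Counter(words).most_common(100)]

# Symbol list from the question -- use the full list so every common word gets a symbol
characters = ['$', '#', '@', '!', '%', '^', '&', '*', '(', ')']

# One dictionary lookup per word instead of two nested loops with .index()
word_substitutes = dict(zip(common100, characters))

for idx, word in enumerate(words):
    if len(word) > 1 and word in word_substitutes:
        words[idx] = word_substitutes[word]

with open('compression.txt', 'w') as f:
    f.write(' '.join(words))

size2 = os.path.getsize('compression.txt')
print('Your original file size has been compressed by', 100 - (100 / size1) * size2, 'percent.')

Since each word is visited only once, even the 1 MB file should take on the order of seconds rather than minutes.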

In general, you would do the following:

  • Measure how much time each "part" of your program takes. You can use a profiler (for example the one in the standard library), or simply sprinkle some `times.append(time.time())` calls into your code and compute the differences. Then you know which part of your code is slow (see the timing sketch after this list).
  • See whether you can improve the algorithm of the slow part. gnicholas' answer shows one possibility for speeding things up. The `while no < 100` looks suspicious and can probably be improved. This step requires understanding the algorithms you use. Be careful to choose the best data structures for your use case.
  • If you cannot use a better algorithm (because you already use the best way to compute something), you need to speed up the computation itself. numpy helps with numerical things, with cython you can basically compile Python code to C, and numba uses LLVM to compile.
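
As a rough illustration of the first bullet, here is a minimal sketch of hand-rolled timing with the standard library (the section labels are made up for this example); alternatively, running the whole script under the standard-library profiler with python -m cProfile -s cumtime yourscript.py shows which functions dominate the runtime.

import time

checkpoints = [('start', time.perf_counter())]

# ... read the file and count the words here ...
checkpoints.append(('counting done', time.perf_counter()))

# ... run the substitution loop here ...
checkpoints.append(('substitution done', time.perf_counter()))

# Print how long each section took
for (label, now), (_, before) in zip(checkpoints[1:], checkpoints):
    print(label, ':', round(now - before, 3), 'seconds')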

