
How do I optimize the speed of my python compression code?

I have made a compression code and have tested it on 10 KB text files, which took no less than 3 minutes. However, I've tested it with a 1 MB file, which is the assessment assigned by my teacher, and it takes longer than half an hour. Compared to my classmates', mine takes unusually long. It might be my computer or my code, but I have no idea. Does anyone know any tips or shortcuts to make my code run faster? My compression code is below; if there are any quicker ways of doing the loops, etc., please send me an answer (:

(By the way, my code DOES work, so I'm not asking for corrections, just enhancements or tips, thanks!)

import re #used to enable functions(loops, etc.) to find patterns in text file
import os #used for anything referring to directories(files)
from collections import Counter #used to count how many times each word appears

size1 = os.path.getsize('file.txt') #find the size (in bytes) of your file, INCLUDING SPACES
print('The size of your file is ', size1,)

words = re.findall(r'\w+', open('file.txt').read()) 
wordcounts = Counter(words) #counts how many times each word appears (case-sensitive)
common100 = [x for x, it in wordcounts.most_common(100)] #identifies the 100 most common words

keyword = []
kcount = []
z = dict(wordcounts)
for key, value in z.items():
    keyword.append(key) #adds each keyword to the array called keywords
    kcount.append(value)

characters =['$','#','@','!','%','^','&','*','(',')','~','-','/','{','[', ']', '+','=','}','|', '?','cb',
         'dc','fd','gf','hg','kj','mk','nm','pn','qp','rq','sr','ts','vt','wv','xw','yx','zy','bc',
         'cd','df','fg','gh','jk','km','mn','np','pq','qr','rs','st','tv','vw','wx','xy','yz','cbc',
         'dcd','fdf','gfg','hgh','kjk','mkm','nmn','pnp','qpq','rqr','srs','tst','vtv','wvw','xwx',
         'yxy','zyz','ccb','ddc','ffd','ggf','hhg','kkj','mmk','nnm','ppn','qqp','rrq','ssr','tts','vvt',
         'wwv','xxw','yyx','zzy','cbb','dcc','fdd','gff','hgg','kjj','mkk','nmm','pnn','qpp','rqq','srr',
         'tss','vtt','wvv','xww','yxx','zyy','bcb','cdc','dfd','fgf','ghg','jkj','kmk','mnm','npn','pqp',
         'qrq','rsr','sts','tvt','vwv','wxw','xyx','yzy','QRQ','RSR','STS','TVT','VWV','WXW','XYX','YZY',
        'DC','FD','GF','HG','KJ','MK','NM','PN','QP','RQ','SR','TS','VT','WV','XW','YX','ZY','BC',
         'CD','DF','FG','GH','JK','KM','MN','NP','PQ','QR','RS','ST','TV','VW','WX','XY','YZ','CBC',
         'DCD','FDF','GFG','HGH','KJK','MKM','NMN','PNP','QPQ','RQR','SRS','TST','VTV','WVW','XWX',
         'YXY','ZYZ','CCB','DDC','FFD','GGF','HHG','KKJ','MMK','NNM','PPN','QQP','RRQ','SSR','TTS','VVT',
         'WWV','XXW','YYX','ZZY','CBB','DCC','FDD','GFF','HGG','KJJ','MKK','NMM','PNN','QPP','RQQ','SRR',
         'TSS','VTT','WVV','XWW','YXX','ZYY','BCB','CDC','DFD','FGF','GHG','JKJ','KMK','MNM','NPN','PQP',] #characters which I can use

symbols_words = []
char = 0
for i in common100:
    symbols_words.append(characters[char]) #pairs each common word with one placeholder symbol
    char = char + 1

print("Compression has now started")

f = 0
g = 0
no = 0
while no < 100:
    for i in common100:
        for w in words:
            if i == w and len(i)>1: #if the values in common100 are ACTUALLY in words
                place = words.index(i)#find exactly where the most common words are in the text
                symbols = symbols_words[common100.index(i)] #assigns one character with one common word
                words[place] = symbols # replaces the word with the symbol
                g = g + 1
    no = no + 1


string = words
stringMade = ' '.join(map(str, string))#makes the list into a string so you can put it into a text file
file = open("compression.txt", "w")
file.write(stringMade)#writes the compressed text into the new file
file.close()

size2 = os.path.getsize('compression.txt')

no1 = int(size1)
no2 = int(size2)
print('Compression has finished.')
print('Your original file size has been compressed by', 100 - ((100/no1) * no2 ), 'percent.',
  'The size of your file now is ', size2)

The first thing I see that is bad for performance is:

for i in common100:
    for w in words:
        if i == w and len(i)>1:
            ...

What you are doing is checking whether the word w is in your list of common100 words. However, this check can be done in O(1) time by using a set, instead of looping through all of your top 100 words for every word.

common_words = set(common100)
for w in words:
    if w in common_words:
        ...

Using something like

word_substitutes = dict(zip(common100, characters))

will give you a dict that maps common words to their corresponding symbols.
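For illustration, with some hypothetical values (not taken from the real file) the mapping would look like this:

# Hypothetical example values, just to show the shape of the mapping
common100 = ['the', 'and', 'of']
characters = ['$', '#', '@']

word_substitutes = dict(zip(common100, characters))
# word_substitutes is now {'the': '$', 'and': '#', 'of': '@'}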

Then you can simply iterate over the words:

# Iterate over all the words
# Use enumerate because we're going to modify the word in-place in the words list
for word_idx, word in enumerate(words):
    # If the current word is in the `word_substitutes` dict, then we know its in the
    # 'common' words, and can be replaced by the symbol
    if word in word_substitutes:
        # Replaces the word in-place
        replacement_symbol = word_substitutes[word]
        words[word_idx] = replacement_symbol

This will give much better performance, because the dictionary lookup used for the common-word-to-symbol mapping takes roughly constant time on average rather than linear. So the overall pass over the words is about O(N), rather than the quadratic-or-worse behaviour you get from the nested loops with the .index() call inside them.
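Putting both ideas together, a minimal sketch of the whole replacement step could look like this (it assumes words, common100 and characters have already been built as in the original code):

# Sketch of the optimized replacement pass, assuming words, common100
# and characters already exist as in the original code
word_substitutes = dict(zip(common100, characters))  # common word -> symbol

for word_idx, word in enumerate(words):
    # average O(1) membership test, no .index() scan over the whole list
    if word in word_substitutes:
        words[word_idx] = word_substitutes[word]  # replace the word in place

stringMade = ' '.join(map(str, words))

A single pass over words already replaces every occurrence, so the outer while no < 100 loop from the original code is not needed here.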

Generally you would do the following:

  • Measure how much time each "part" of your program needs. You could use a profiler (e.g. the one in the standard library) or simply sprinkle some times.append(time.time()) calls into your code and compute the differences. Then you know which part of your code is slow (a minimal timing sketch is shown after this list).
  • See if you can improve the algorithm of the slow part. gnicholas' answer shows one possibility to speed things up. The while no < 100 loop seems suspicious; maybe it can be improved. This step needs an understanding of the algorithms you use. Be careful to select the best data structures for your use case.
  • If you can't use a better algorithm (because you already use the best way to calculate something), you need to speed up the computations themselves. Numerical code benefits from numpy, with cython you can essentially compile Python code to C, and numba uses LLVM to compile.
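As a rough sketch of the timing idea from the first bullet (read_and_count and replace_common_words are placeholder names standing in for the stages of the original script, not real functions):

import time

times = [time.time()]          # before the first stage
words = read_and_count()       # placeholder: read the file and count words
times.append(time.time())      # after reading/counting
replace_common_words(words)    # placeholder: the replacement loop
times.append(time.time())      # after the replacement loop

for i in range(1, len(times)):
    print('stage', i, 'took', times[i] - times[i - 1], 'seconds')

Alternatively, running the whole script under the standard-library profiler with python -m cProfile -s cumtime yourscript.py shows which functions take the most time.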
