How do I optimize the speed of my Python compression code?
I have written a compression program and tested it on 10 KB text files, which took no less than 3 minutes. However, when I tested it with a 1 MB file, which is the assessment assigned by my teacher, it took longer than half an hour. Compared to my classmates', mine takes unusually long. It might be my computer or my code, but I have no idea which. Does anyone know any tips or shortcuts for making my code run faster? My compression code is below; if there are any quicker ways of doing the loops, etc., please send me an answer (:
(By the way, my code DOES work, so I'm not asking for corrections, just enhancements or tips, thanks!)
import re #used to enable functions (loops, etc.) to find patterns in the text file
import os #used for anything referring to directories (files)
from collections import Counter #used to keep track of how many times values are added

size1 = os.path.getsize('file.txt') #find the size (in bytes) of your file, INCLUDING SPACES
print('The size of your file is ', size1,)

words = re.findall(r'\w+', open('file.txt').read())
wordcounts = Counter(words) #counts every word, including capitalised variants
common100 = [x for x, it in Counter(words).most_common(100)] #identifies the 100 most common words

keyword = []
kcount = []
z = dict(wordcounts)
for key, value in z.items():
    keyword.append(key) #adds each keyword to the list called keyword
    kcount.append(value)

characters =['$','#','@','!','%','^','&','*','(',')','~','-','/','{','[', ']', '+','=','}','|', '?','cb',
'dc','fd','gf','hg','kj','mk','nm','pn','qp','rq','sr','ts','vt','wv','xw','yx','zy','bc',
'cd','df','fg','gh','jk','km','mn','np','pq','qr','rs','st','tv','vw','wx','xy','yz','cbc',
'dcd','fdf','gfg','hgh','kjk','mkm','nmn','pnp','qpq','rqr','srs','tst','vtv','wvw','xwx',
'yxy','zyz','ccb','ddc','ffd','ggf','hhg','kkj','mmk','nnm','ppn','qqp','rrq','ssr','tts','vvt',
'wwv','xxw','yyx','zzy','cbb','dcc','fdd','gff','hgg','kjj','mkk','nmm','pnn','qpp','rqq','srr',
'tss','vtt','wvv','xww','yxx','zyy','bcb','cdc','dfd','fgf','ghg','jkj','kmk','mnm','npn','pqp',
'qrq','rsr','sts','tvt','vwv','wxw','xyx','yzy','QRQ','RSR','STS','TVT','VWV','WXW','XYX','YZY',
'DC','FD','GF','HG','KJ','MK','NM','PN','QP','RQ','SR','TS','VT','WV','XW','YX','ZY','BC',
'CD','DF','FG','GH','JK','KM','MN','NP','PQ','QR','RS','ST','TV','VW','WX','XY','YZ','CBC',
'DCD','FDF','GFG','HGH','KJK','MKM','NMN','PNP','QPQ','RQR','SRS','TST','VTV','WVW','XWX',
'YXY','ZYZ','CCB','DDC','FFD','GGF','HHG','KKJ','MMK','NNM','PPN','QQP','RRQ','SSR','TTS','VVT',
'WWV','XXW','YYX','ZZY','CBB','DCC','FDD','GFF','HGG','KJJ','MKK','NMM','PNN','QPP','RQQ','SRR',
'TSS','VTT','WVV','XWW','YXX','ZYY','BCB','CDC','DFD','FGF','GHG','JKJ','KMK','MNM','NPN','PQP',] #characters which I can use

symbols_words = []
char = 0
for i in common100:
    symbols_words.append(characters[char]) #assigns one symbol to each common word
    char = char + 1

print("Compression has now started")
f = 0
g = 0
no = 0
while no < 100:
    for i in common100:
        for w in words:
            if i == w and len(i) > 1: #if the values in common100 are ACTUALLY in words
                place = words.index(i) #find exactly where the most common words are in the text
                symbols = symbols_words[common100.index(i)] #assigns one character to one common word
                words[place] = symbols #replaces the word with the symbol
                g = g + 1
    no = no + 1

string = words
stringMade = ' '.join(map(str, string)) #joins the list into a string so it can be written to a text file
file = open("compression.txt", "w")
file.write(stringMade) #writes everything in the variable 'words' into the new file
file.close()

size2 = os.path.getsize('compression.txt')
no1 = int(size1)
no2 = int(size2)
print('Compression has finished.')
print('Your original file size has been compressed by', 100 - ((100/no1) * no2), 'percent.'
      'The size of your file now is ', size2)
The first thing I see that is bad for performance is:
for i in common100:
    for w in words:
        if i == w and len(i) > 1:
            ...
What you are doing is checking whether the word w is in your list of common100 words. However, this check can be done in O(1) time by using a set, instead of looping through all of your top 100 words for each word:
common_words = set(common100)
for w in words:
    if w in common_words:
        ...
Using something like

word_substitutes = dict(zip(common100, characters))

will give you a dict that maps each common word to its corresponding symbol. Then you can simply iterate over the words:
# Iterate over all the words.
# Use enumerate because we're going to modify the word in-place in the words list.
for word_idx, word in enumerate(words):
    # If the current word is in the `word_substitutes` dict, then we know it's in
    # the 'common' words and can be replaced by its symbol.
    if word in word_substitutes:
        # Replace the word in-place
        replacement_symbol = word_substitutes[word]
        words[word_idx] = replacement_symbol
This will give much better performance, because a dictionary lookup takes constant time on average rather than the linear scan you get from searching a list. So the overall complexity will be roughly O(N) instead of the O(N^3)-ish cost of the two nested loops with the .index() call inside them.
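Putting both suggestions together, the whole substitution pass can be sketched as a single loop over the words (the function name `compress` and the short symbol list here are illustrative, not from the original code):

```python
# A minimal sketch combining Counter (to find the most common words) with a
# dict (for constant-time substitution), replacing the triple nested loop.
from collections import Counter

def compress(words, symbols, top_n=100):
    """Replace the top_n most common words with short symbols, in one pass."""
    common = [w for w, _ in Counter(words).most_common(top_n)]
    substitutes = dict(zip(common, symbols))
    # dict.get(w, w) returns the symbol if w is common, otherwise w unchanged
    return [substitutes.get(w, w) for w in words]

words = "the cat sat on the mat and the cat slept".split()
compressed = compress(words, ['$', '#', '@'], top_n=3)
print(' '.join(compressed))
```

Because the dict is built once and each word is looked up in constant time on average, the cost is a single pass over the text rather than repeated scans per common word.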
Generally you would do the following: measure which part of your code is slow. You can use a profiler (such as cProfile from the standard library), or simply sprinkle some

times.append(time.time())

into your code and compute the differences. Then you know which part of your code is slow. The

while no < 100

loop seems suspicious; maybe that can be improved. This step needs an understanding of the algorithms you use. Be careful to select the best data structures for your use case.
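The "sprinkle timestamps" idea can be sketched like this; the `mark` helper and `checkpoints` list are hypothetical names, and `time.perf_counter()` is used instead of `time.time()` because it is intended for interval timing:

```python
# A minimal manual-profiling sketch: record a named timestamp after each stage,
# then print how long each stage took.
import time

checkpoints = []

def mark(label):
    """Record a named timestamp so stage durations can be computed later."""
    checkpoints.append((label, time.perf_counter()))

mark('start')
total = sum(i * i for i in range(100_000))  # stand-in for a slow stage
mark('squares summed')
words = 'a b c'.split() * 10_000            # stand-in for another stage
mark('words built')

# Each stage's duration is the difference between consecutive timestamps.
for (_, prev_t), (label, t) in zip(checkpoints, checkpoints[1:]):
    print(f'{label}: {t - prev_t:.4f}s')
```

For anything beyond a quick check, running the script under cProfile (`python -m cProfile -s cumulative script.py`) gives per-function timings without editing the code.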