What should I do to make my program run faster?
I have a Python program that processes an HTML page and builds a dictionary with URLs as keys and the MD5 sum of each file as the value. The dictionary has 6000 entries. Each URL points to a zip file that is downloaded to the machine, and the MD5 sum is checked after each download. The total size of all the files to be downloaded is 572 GB.
URLs is a dictionary with download links as keys and the MD5 sum of each file as the value.
The code is:
DownloadAllURLs(URLs)

def DownloadAllURLs(URLs):
    for eachurl in URLs:
        if os.path.isfile(eachurl):
            print eachurl, "already exists"
        else:
            print "Going to Download", eachurl
            Download(eachurl)
            CheckMd5(eachurl, os.path.basename(eachurl), URLs[eachurl])
def Download(eachurl):
    command = 'sudo wget --user=abc --password=xyz'
    command = command + " " + eachurl
    print command
    result = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE)
    out, err = result.communicate()
def CheckMd5(url, tail, md5sum):
    command = ['md5sum', tail]
    result = subprocess.Popen(command, stdout=subprocess.PIPE, stdin=subprocess.PIPE)
    md5, err = result.communicate()
    if md5[:32] == md5sum:
        print "The", tail, "is downloaded successfully with the correct md5"
    else:
        print "The", tail, "was not downloaded correctly, wrong md5"
        WriteWarcFile(url, tail)
        CheckMd5(url, tail, md5sum)
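As a side note, the md5 check does not need a subprocess at all: `hashlib` in the standard library can compute it directly, reading the file in chunks so a large zip never has to fit in memory. A minimal sketch (`md5_of_file` is a hypothetical helper, not part of the original script):

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    # Stream the file in 1 MB chunks so even multi-GB zips
    # are hashed without loading them into memory.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()
```

The returned hex digest can then be compared against the expected value from the URLs dictionary, replacing the `md5sum` subprocess call entirely.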
The above code downloads all 6000 zip files for me, but the server I am downloading from is very slow, and I sometimes get only 40-60 KB/s while downloading.
I am using the above code to download around 1-3 terabytes of data. I want to parallelize my code in Python (so the total time is reduced), but I am not sure whether to use multithreading, multiprocessing, or something else.
I am reading the tutorials but am not sure how to proceed. Thank you
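Since the work here is dominated by waiting on the network, a thread pool is usually the simplest way to run several downloads at once. A sketch using `multiprocessing.dummy` (a thread-backed pool with the same API as `multiprocessing.Pool`); `download_one` is a stand-in for the `Download`/`CheckMd5` pair above, and the sample URLs and digests are made up for illustration:

```python
from multiprocessing.dummy import Pool as ThreadPool  # threads, not processes: fine for IO-bound work

def download_one(item):
    url, md5sum = item
    # In the real script this would call Download(url) and then
    # CheckMd5(url, tail, md5sum); here it just echoes the pair back.
    return url, md5sum

# Sample data standing in for the real 6000-entry dictionary.
URLs = {"http://host/a.zip": "d41d8cd98f00b204e9800998ecf8427e",
        "http://host/b.zip": "900150983cd24fb0d6963f7d28e17f72"}

pool = ThreadPool(8)                            # up to 8 concurrent downloads
results = pool.map(download_one, URLs.items())  # blocks until all items are done
pool.close()
pool.join()
```

The pool size (8 here) is a tuning knob: with a slow remote server, a handful of parallel connections is often enough to saturate the available bandwidth, and going much higher mostly adds load on the server.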
Edit:
Thanks for all the replies. The main question I want to ask is how to apply multithreading/multiprocessing in cases like this. Suppose I am doing some operation on every URL rather than downloading it, as in the code below; can I make it any faster using multithreading or multiprocessing?
from urlparse import urlparse

ProcessAllURLs(URLs)

def ProcessAllURLs(URLs):
    for eachurl in URLs:
        x = urlparse(eachurl)
        print x.netloc
Since the processing is IO-bound, Python multithreading should work well here: the global interpreter lock will not hurt performance much.
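Applying that idea to the `ProcessAllURLs` loop, the same thread-pool pattern maps the per-URL work across worker threads. A sketch with hypothetical sample URLs (the `try`/`except` import covers both Python 2, where the module is `urlparse`, and Python 3, where it moved to `urllib.parse`):

```python
try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2
from multiprocessing.dummy import Pool as ThreadPool

def process(url):
    # The per-URL work from ProcessAllURLs: extract the host part.
    return urlparse(url).netloc

urls = ["http://example.com/a.zip", "http://mirror.example.org/b.zip"]

pool = ThreadPool(4)
netlocs = pool.map(process, urls)  # results come back in input order
pool.close()
pool.join()
# netlocs -> ['example.com', 'mirror.example.org']
```

One caveat: for pure in-memory work like `urlparse`, threads gain little because the GIL serializes CPU-bound Python code; threads pay off when each task spends most of its time waiting on the network or disk, as the downloads do.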