
What should I do to make my program run faster?

I have a Python program which processes an HTML page and creates a dictionary with URLs as keys and the md5sum of each file as the value. The dictionary has 6000 entries. Each URL points to a zip file which is downloaded to the machine, and the md5sum is checked after each file is downloaded. The total size of all the files to be downloaded is 572 GB.

URLs is a dictionary with download links as keys and the md5sum of each file as the value.

The code is:

        import os
        import subprocess

        def Download(eachurl):
            command = 'sudo wget --user=abc --password=xyz ' + eachurl
            print command
            result = subprocess.Popen(command, shell=True,
                                      stdout=subprocess.PIPE,
                                      stderr=subprocess.PIPE)
            out, err = result.communicate()

        def CheckMd5(url, tail, md5sum):
            command = ['md5sum', tail]
            result = subprocess.Popen(command, stdout=subprocess.PIPE)
            md5, err = result.communicate()
            if md5[:32] == md5sum:
                print "The", tail, "is downloaded successfully with correct md5"
            else:
                print "The", tail, "is not downloaded correctly, wrong md5"
                WriteWarcFile(url, tail)
                CheckMd5(url, tail, md5sum)

        def DownloadAllURLs(URLs):
            for eachurl in URLs:
                tail = os.path.basename(eachurl)  # wget saves the file under its basename
                if os.path.isfile(tail):
                    print tail, "already exists"
                else:
                    print "Going to Download", eachurl
                    Download(eachurl)
                    CheckMd5(eachurl, tail, URLs[eachurl])

        DownloadAllURLs(URLs)
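As a side note, the md5 check does not need a `md5sum` subprocess at all; it can be done in pure Python with `hashlib`, reading the file in chunks so large zip files do not have to fit in memory. A minimal sketch (`md5_of_file` is a hypothetical helper name, shown in Python 3 syntax):

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        # iter() calls f.read(chunk_size) until it returns b"" (end of file)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The returned hex digest can be compared directly against the expected md5sum string from the URLs dictionary.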

The above code downloads all 6000 zip files for me, but the server I am downloading from is very slow, and I sometimes get only 40-60 kbps while downloading.

I am using the above code to download around 1-3 terabytes of data. I want to parallelize my code in Python (so the time taken will be reduced), but I am not sure whether to use multithreading, multiprocessing, or something else.

I am reading the tutorials but am not sure how to proceed. Thank you.

Edited:

Thanks for all the replies. The main question I want to ask is how to apply multithreading/multiprocessing in cases like this. Suppose I am doing some operation on every URL rather than downloading it, like the code below: can I make it any faster using multithreading or multiprocessing?

    from urlparse import urlparse

    def ProcessAllURLs(URLs):
        for eachurl in URLs:
            x = urlparse(eachurl)
            print x.netloc

    ProcessAllURLs(URLs)
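For what it's worth, if the per-URL operation were CPU-heavy (unlike `urlparse`, which is trivial), a process pool could spread it across cores. A sketch using `multiprocessing.Pool` (Python 3 names; `urlparse` lives in `urllib.parse` there, and the function names here are illustrative):

```python
from multiprocessing import Pool
from urllib.parse import urlparse

def get_netloc(url):
    # The per-URL work: here, just extract the host part of the URL
    return urlparse(url).netloc

def process_all_urls(urls):
    # Fan the URLs out to a pool of worker processes (one per core by default)
    with Pool() as pool:
        return pool.map(get_netloc, urls)
```

For cheap operations like this one, the cost of shipping data between processes would outweigh any gain; processes only pay off when each call does real CPU work.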

Since the processing is IO-bound, it should be possible to use Python multithreading here: the Global Interpreter Lock will not significantly affect performance.
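A thread pool is the simplest way to apply this. A sketch using `concurrent.futures` (Python 3 syntax; `download_all` and `MAX_WORKERS` are illustrative names, and the `worker` callable stands in for the Download/CheckMd5 logic from the question):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 8  # assumed value; tune to what the server tolerates

def download_all(urls, worker):
    """urls: dict mapping download link -> expected md5sum.
    worker: callable(url, md5sum) that downloads one file and checks it."""
    results = {}
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        # Submit one task per URL; threads overlap the network waits
        futures = {pool.submit(worker, u, m): u for u, m in urls.items()}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()  # re-raises any worker exception
            except Exception as exc:
                print(url, "failed:", exc)
    return results
```

Since each thread spends most of its time waiting on the network, the GIL is released during those waits and the downloads genuinely overlap; the worker count, not the interpreter, becomes the limit.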
