
Problems with Speed during web-crawling (Python)

I would love to make this program a lot faster. It currently reads roughly 12,000 pages in 10 minutes, and I am supposed to read millions of pages, so at this rate it would take far too long. Is there anything that would significantly help the speed? I hope you know some tips. Here is my code:

from eventlet.green import urllib2
import time
import eventlet

# Create the URLs in groups of 400 (+- max for eventlet)
def web_CreateURLS():
    print str(time.asctime(time.localtime(time.time()))).split(" ")[3]
    for var_indexURLS in xrange(0, 2000000, 400):
        var_URLS = []
        for var_indexCRAWL in xrange(var_indexURLS, var_indexURLS + 400):
            var_URLS.append("http://www.nu.nl")
        web_ScanURLS(var_URLS)

# Return the HTML source per URL, retrying after a URLError
def web_ReturnHTML(url):
    try:
        return urllib2.urlopen(url).read()
    except urllib2.URLError:
        time.sleep(10)
        print "UrlError"
        return web_ReturnHTML(url)  # retry; without the return the result was lost

# Analyse the HTML source
def web_ScanURLS(var_URLS):
    pool = eventlet.GreenPool()
    for var_HTML in pool.imap(web_ReturnHTML, var_URLS):
        # do something etc..
        pass

web_CreateURLS()

I like using greenlets, but I often benefit from using multiple processes spread over lots of systems, or from just one single system letting the OS take care of all the checks and balances of running multiple processes.
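As a sketch of the single-machine variant of that idea (Python 3 here; `fetch` and `crawl` are illustrative names, and `fetch` is a stand-in for the real `urllib2.urlopen(url).read()` call), an OS-scheduled process pool might look like:

```python
from multiprocessing import Pool

def fetch(url):
    # Stand-in for the real download; the original program would call
    # urllib2.urlopen(url).read() here.
    return "html for %s" % url

def crawl(urls, processes=4):
    # The OS schedules and balances the worker processes; each worker
    # could still run its own eventlet GreenPool internally for I/O
    # concurrency within the process.
    pool = Pool(processes=processes)
    try:
        return pool.map(fetch, urls)  # results come back in input order
    finally:
        pool.close()
        pool.join()

if __name__ == "__main__":
    pages = crawl(["http://www.nu.nl/page%d" % i for i in range(8)])
    print(len(pages))
```

Each process gets its own GIL, so CPU-bound parsing scales with cores, while the green threads inside each worker keep the sockets busy.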

Check out ZeroMQ at http://zeromq.org/ for some good examples of how to build a dispatcher with a ton of listeners that do whatever the dispatcher says. Alternatively, check out execnet for a way to quickly get started executing remote or local tasks in parallel.
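A minimal stdlib stand-in for that dispatcher/listener pattern (threads and shared queues here, where ZeroMQ would use PUSH/PULL sockets across machines; `dispatch` and `listener` are illustrative names, not part of any library):

```python
import queue
import threading

def listener(tasks, results):
    # Each listener pulls work until it sees the shutdown marker.
    while True:
        url = tasks.get()
        if url is None:
            break
        # The real download would go here.
        results.put((url, "fetched %s" % url))

def dispatch(urls, n_listeners=4):
    tasks, results = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=listener, args=(tasks, results))
               for _ in range(n_listeners)]
    for w in workers:
        w.start()
    for url in urls:
        tasks.put(url)          # dispatcher pushes, listeners compete to pull
    for _ in workers:
        tasks.put(None)         # one shutdown marker per listener
    for w in workers:
        w.join()
    out = {}                    # pool results back in the dispatching thread
    while not results.empty():
        url, html = results.get()
        out[url] = html
    return out

pages = dispatch(["http://www.nu.nl/page%d" % i for i in range(8)])
```

With ZeroMQ the queues become sockets, so the listeners can live on other machines without changing the shape of the code.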

I also use http://spread.org/ a lot and have lots of systems listening to a common Spread daemon. It's a very useful message bus where results can be pooled back to, and dispatched from, a single thread pretty easily.

And then of course there is always Redis pub/sub or sync. :)

"Share the load"
