
Python limited multithreading

As you surely know, I can use multithreading to download files from the Internet faster. But if I send lots of requests to the same website, I could be blacklisted.

So could you help me to implement something like "I've got a list of urls. I want you to download all of these files but if 10 downloads are already running, wait for a slot."

I'd appreciate any help. Thanks.

binoua

This is the code I'm using (doesn't work).

import threading
import urllib2
import Queue

class OstDownloadException(Exception):
    # defined here so the snippet is self-contained
    pass

class PDBDownloader(threading.Thread):

    prefix = 'http://www.rcsb.org/pdb/files/'

    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.pdbid = None
        self.urlstr = ''
        self.content = ''

    def run(self):
        while True:
            self.pdbid = self.queue.get()
            self.urlstr = self.prefix + self.pdbid + '.pdb'
            print 'downloading', self.pdbid
            self.download()

            filename = '%s.pdb' % (self.pdbid,)
            f = open(filename, 'wt')
            f.write(self.content)
            f.close()

            self.queue.task_done()

    def download(self):
        try:
            f = urllib2.urlopen(self.urlstr)
        except urllib2.HTTPError, e:
            msg = 'HTTPError while downloading file %s at %s. '\
                    'Details: %s.' %(self.pdbid, self.urlstr, str(e))
            raise OstDownloadException, msg
        except urllib2.URLError, e:
            msg = 'URLError while downloading file %s at %s. '\
                    'RCSB server unavailable.' %(self.pdbid, self.urlstr)
            raise OstDownloadException, msg
        except Exception, e:
            raise OstDownloadException, str(e)
        else:
            self.content = f.read()

if __name__ == '__main__':

    queue = Queue.Queue()
    pdblist = ['1BTA', '3EAM', '1EGJ', '2BV9', '2X6A']

    for i in xrange(len(pdblist)):
        pdb = PDBDownloader(queue)
        pdb.setDaemon(True)
        pdb.start()

    while pdblist:
        pdbid = pdblist.pop()
        queue.put(pdbid)

    queue.join()
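The pattern the code above is reaching for (a fixed pool of worker threads feeding off a shared queue, so the pool size caps how many downloads run at once) can be sketched self-contained in modern Python 3. The real HTTP fetch is replaced here by a dummy string so the sketch runs anywhere; the names and pool size are illustrative, not from the original:

```python
import queue
import threading

POOL_SIZE = 3  # use 10 for the "10 simultaneous downloads" requirement

results = {}
results_lock = threading.Lock()

def worker(q):
    while True:
        pdbid = q.get()
        if pdbid is None:          # sentinel: no more work for this thread
            q.task_done()
            break
        # stand-in for urlopen(prefix + pdbid + '.pdb').read()
        content = 'fake contents of %s.pdb' % pdbid
        with results_lock:
            results[pdbid] = content
        q.task_done()

q = queue.Queue()
threads = [threading.Thread(target=worker, args=(q,)) for _ in range(POOL_SIZE)]
for t in threads:
    t.start()

for pdbid in ['1BTA', '3EAM', '1EGJ', '2BV9', '2X6A']:
    q.put(pdbid)
for _ in threads:
    q.put(None)                    # one sentinel per worker, so each exits cleanly

q.join()                           # wait until every queued item is marked done
for t in threads:
    t.join()
```

Because only `POOL_SIZE` workers ever pull from the queue, at most that many downloads are in flight at a time; the rest of the IDs simply wait in the queue for a slot.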

Using threads doesn't "download files from the Internet faster". You have only one network card and one internet connection, so that's just not true.

The threads are being used to wait, and you can't wait faster.

You can use a single thread and be as fast, or even faster -- just don't wait for the response of one file before starting another. In other words, use asynchronous, non-blocking network programming.

Here's a complete script that uses twisted.internet.task.coiterate to start multiple downloads at the same time, without using any kind of threading, and respecting the pool size (I'm using 2 simultaneous downloads for the demonstration, but you can change the size):

from twisted.internet import defer, task, reactor
from twisted.web import client
from twisted.python import log

@defer.inlineCallbacks
def deferMap(job, dataSource, size=1):
    successes = []
    failures = []

    def _cbGather(result, dataUnit, succeeded):
        """This will be called when any download finishes"""
        if succeeded:
            # you could save the file to disk here
            successes.append((dataUnit, result))
        else:
            failures.append((dataUnit, result))

    @apply
    def work():
        for dataUnit in dataSource:
            d = job(dataUnit).addCallbacks(_cbGather, _cbGather,
                callbackArgs=(dataUnit, True),  errbackArgs=(dataUnit, False))
            yield d

    yield defer.DeferredList([task.coiterate(work) for i in xrange(size)])
    defer.returnValue((successes, failures))

def printResults(result):
    successes, failures = result
    print "*** Got %d pages total:" % (len(successes),)
    for url, page in successes:
        print '  * %s -> %d bytes' % (url, len(page))
    if failures:
        print "*** %d pages failed download:" % (len(failures),)
        for url, failure in failures:
            print '  * %s -> %s' % (url, failure.getErrorMessage())

if __name__ == '__main__':
    import sys
    log.startLogging(sys.stdout)
    urls = ['http://twistedmatrix.com',
            'XXX',
            'http://debian.org',
            'http://python.org',
            'http://python.org/foo',
            'https://launchpad.net',
            'noway.com',
            'somedata',
        ]
    pool = deferMap(client.getPage, urls, size=2) # download 2 at once
    pool.addCallback(printResults)
    pool.addErrback(log.err).addCallback(lambda ign: reactor.stop())
    reactor.run()

Note that I included some bad urls on purpose so we can see some failures in the result:

...
2010-06-29 08:18:04-0300 [-] *** Got 4 pages total:
2010-06-29 08:18:04-0300 [-]   * http://twistedmatrix.com -> 16992 bytes
2010-06-29 08:18:04-0300 [-]   * http://python.org -> 17207 bytes
2010-06-29 08:18:04-0300 [-]   * http://debian.org -> 13820 bytes
2010-06-29 08:18:04-0300 [-]   * https://launchpad.net -> 18511 bytes
2010-06-29 08:18:04-0300 [-] *** 4 pages failed download:
2010-06-29 08:18:04-0300 [-]   * XXX -> Connection was refused by other side: 111: Connection refused.
2010-06-29 08:18:04-0300 [-]   * http://python.org/foo -> 404 Not Found
2010-06-29 08:18:04-0300 [-]   * noway.com -> Connection was refused by other side: 111: Connection refused.
2010-06-29 08:18:04-0300 [-]   * somedata -> Connection was refused by other side: 111: Connection refused.
...

Use a thread pool with a shared list of URLs. Each thread tries to pop a URL from the list and download it until none are left. pop() from a list is thread-safe in CPython.

while True:
    try:
        url = url_list.pop()
        # download URL here
    except IndexError:
        break
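Fleshed out into a complete, runnable Python 3 sketch, that loop becomes the body of each worker thread. The download itself is stubbed out with a dummy string so the example doesn't touch the network (swap the stub for a real fetch); the URLs and the pool size of 10 are illustrative:

```python
import threading

url_list = ['http://example.com/a', 'http://example.com/b',
            'http://example.com/c', 'http://example.com/d']
downloaded = []
downloaded_lock = threading.Lock()

def worker():
    while True:
        try:
            url = url_list.pop()   # list.pop() is atomic in CPython
        except IndexError:
            break                  # list exhausted: this worker is done
        data = 'contents of ' + url  # stub; replace with a real download
        with downloaded_lock:
            downloaded.append((url, data))

# 10 threads = at most 10 simultaneous downloads
threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With fewer URLs than threads, the surplus workers hit `IndexError` immediately and exit; with more URLs than threads, each worker keeps popping until the list is empty, which is exactly the "wait for a slot" behaviour the question asks for.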
