
How to speed up web crawling in Python?

I'm using the urllib.urlopen() method and BeautifulSoup for crawling. I'm not satisfied with the crawling speed, and I'm wondering what urllib actually downloads; my guess is that it has to load more than just the HTML. I couldn't find in the docs whether it reads or checks larger data (images, Flash, ...) by default.

So, if urllib does load things like images, Flash, or JS, how can I avoid GET requests for those data types?

Try requests. It implements HTTP connection pooling (connections are reused when requests go through a requests.Session), which speeds up crawling.

It also handles other things such as cookies and authentication much better than urllib, and it works well with BeautifulSoup.
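A minimal sketch of the connection-pool point, assuming the requests package is installed. The helper name fetch_all and the timeout value are illustrative choices, not something the answer above specifies.

```python
import requests


def fetch_all(urls, timeout=10):
    """Fetch each URL through one shared Session and return {url: html}.

    fetch_all and the timeout default are illustrative, not a fixed API.
    """
    pages = {}
    # A Session keeps connections alive and reuses them for requests to the
    # same host, which avoids a new TCP (and TLS) handshake per request.
    with requests.Session() as session:
        for url in urls:
            response = session.get(url, timeout=timeout)
            response.raise_for_status()  # raise on 4xx/5xx responses
            pages[url] = response.text
    return pages
```

Each value in the returned dict is the raw HTML, which can be handed to BeautifulSoup as before. Like urllib, requests downloads only the URL it is asked for; images, scripts, and other linked resources are not fetched unless the crawler requests them explicitly.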

Use threading! It's super simple. Here's an example. You can change the number of connections to suit your needs.

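# Python 3 module names; in Python 2 these are Queue and urllib.
import queue
import threading
import urllib.request

urls = [
    'http://www.google.com',
    'http://www.amazon.com',
    'http://www.ebay.com',
    'http://www.google.com',
    'http://www.amazon.com',
    'http://www.ebay.com',
    'http://www.google.com',
    'http://www.amazon.com',
    'http://www.ebay.com',
    ]

# Fill the work queue with (url, filename) pairs before any worker starts.
work_queue = queue.Queue()
for x, url in enumerate(urls):
    # Drop the scheme so the filename contains no slashes.
    filename = "datafile%s-%s" % (x, url.replace('http://', ''))
    work_queue.put((url, filename))


num_connections = 10  # number of concurrent downloads

class WorkerThread(threading.Thread):
    def __init__(self, work_queue):
        super().__init__()
        self.queue = work_queue

    def run(self):
        while True:
            try:
                url, filename = self.queue.get_nowait()
            except queue.Empty:
                return  # queue drained: let this thread finish

            try:
                urllib.request.urlretrieve(url, filename)
            except OSError as exc:
                # URLError subclasses OSError; one bad URL should not kill the worker.
                print("failed: %s (%s)" % (url, exc))

# Start the worker threads
threads = []
for dummy in range(num_connections):
    t = WorkerThread(work_queue)
    t.start()
    threads.append(t)


# Wait for all threads to finish
for thread in threads:
    thread.join()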
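
The same worker-pool idea can also be sketched with concurrent.futures from the Python 3 standard library, which manages the queue and the threads itself. The names download and download_all, the max_workers default, and the timeout are illustrative assumptions, not part of the answer above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib.request


def download(url, timeout=10):
    """Return the response body for one URL as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, max_workers=10):
    """Fetch urls concurrently; return (results, errors), both keyed by URL."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every download up front; the pool runs at most
        # max_workers of them at a time.
        futures = {pool.submit(download, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except OSError as exc:  # URLError and socket timeouts are OSError subclasses
                errors[url] = exc
    return results, errors
```

Threads help here because crawling is I/O-bound: while one thread waits on the network, the others keep working. Calling download_all(urls) with the list above would return the page bodies in one dict and any failed URLs in the other.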


 