How can I speed up web scraping in Python?

Question

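I am using the urllib.urlopen() method and BeautifulSoup for crawling. I am not satisfied with the crawling speed, and I am wondering what urllib actually fetches; my guess is that it has to load more than just the HTML. I could not find in the docs whether it reads or checks larger data (images, Flash, etc.) by default.

So, if urllib does have to load images, Flash, JS and so on, how can I avoid GET requests for those data types?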

Answer 1

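Try requests: it implements HTTP connection pooling, which speeds up crawling.

It also handles other things such as cookies and auth much better than urllib, and it works well with BeautifulSoup.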
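As a rough sketch of what this suggestion could look like: the function below fetches several pages through one `requests.Session`, which is where requests reuses pooled connections, and reads each page's title with BeautifulSoup. The name `fetch_titles`, the `timeout` value, and the choice of extracting the title are assumptions made for illustration, not part of the original answer.

```python
import requests
from bs4 import BeautifulSoup


def fetch_titles(urls, session=None, timeout=10):
    """Return a {url: page title} dict, fetching every URL through one session."""
    # A single Session keeps connections alive and reuses them per host,
    # so repeated requests to the same site skip a fresh connection setup.
    if session is None:
        session = requests.Session()
    titles = {}
    for url in urls:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
        soup = BeautifulSoup(response.text, "html.parser")
        # soup.title is None when the page has no <title> element
        titles[url] = soup.title.get_text(strip=True) if soup.title else None
    return titles
```

Calling `fetch_titles(['http://www.google.com', 'http://www.ebay.com'])` performs real network requests. Like urllib, requests downloads only the URL it is given; images, Flash, and scripts referenced by the page are not fetched unless you request them yourself.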

Answer 2

Use threads! It is very simple. Here is an example. You can change the number of connections to suit your needs.

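# Python 3 version of the example (the Python 2 Queue and urllib.urlretrieve
# names became queue and urllib.request.urlretrieve).
import queue
import threading
import urllib.request

urls = [
    'http://www.google.com',
    'http://www.amazon.com',
    'http://www.ebay.com',
    'http://www.google.com',
    'http://www.amazon.com',
    'http://www.ebay.com',
    'http://www.google.com',
    'http://www.amazon.com',
    'http://www.ebay.com',
]

# Fill the work queue with (url, filename) pairs before the workers start.
work_queue = queue.Queue()
for x, url in enumerate(urls):
    # Drop the scheme so the file name contains no slashes,
    # e.g. "datafile0-www.google.com".
    filename = "datafile%s-%s" % (x, url.replace('http://', ''))
    work_queue.put((url, filename))

num_connections = 10  # number of worker threads, i.e. simultaneous downloads


class WorkerThread(threading.Thread):
    def __init__(self, work_queue):
        super().__init__()
        self.work_queue = work_queue

    def run(self):
        while True:
            try:
                url, filename = self.work_queue.get_nowait()
            except queue.Empty:
                return  # no work left, so let this thread finish

            # Download the URL and save the response body to the file.
            urllib.request.urlretrieve(url, filename)


# Start the threads
threads = []
for _ in range(num_connections):
    t = WorkerThread(work_queue)
    t.start()
    threads.append(t)

# Wait for all threads to finish
for thread in threads:
    thread.join()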
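
On Python 3, the same worker-pool idea can also be written with the standard library's `concurrent.futures.ThreadPoolExecutor` in place of hand-written thread classes. This is a sketch of that alternative, not the answer's original method; `fetch`, `fetch_all`, the `fetch_one` parameter, and the 10-second timeout are hypothetical names and values chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib.request


def fetch(url, timeout=10):
    """Download one URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, max_workers=10):
    """Fetch every URL on a pool of threads.

    Returns (results, errors): two dicts keyed by URL. A URL listed twice
    is fetched twice but ends up under a single key.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so results can be labelled.
        future_to_url = {pool.submit(fetch_one, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failed download should not stop the others.
                errors[url] = exc
    return results, errors
```

`max_workers` plays the role of `num_connections` above. Passing a different `fetch_one` callable, for example one that goes through a shared HTTP session, changes how each page is downloaded without touching the pooling logic.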

2 solutions

Solution 1
3 2014-04-01 14:27:24

Solution 2
2 2014-04-01 14:09:47
