How to speed up web crawling in Python?
I'm using the urllib.urlopen() method and BeautifulSoup for crawling. I'm not satisfied with the crawling speed, and I'm wondering what urllib is actually loading — I'm guessing it may fetch more than just the HTML. I couldn't find in the docs whether it reads or checks bigger data (images, Flash, ...) by default.

So, if urllib has to load e.g. images, Flash, JS..., how can I avoid GET requests for such data types?
Use threading! It's super simple. Here's an example. You can change the number of connections to suit your needs.
import threading, Queue
import urllib

urls = [
    'http://www.google.com',
    'http://www.amazon.com',
    'http://www.ebay.com',
    'http://www.google.com',
    'http://www.amazon.com',
    'http://www.ebay.com',
    'http://www.google.com',
    'http://www.amazon.com',
    'http://www.ebay.com',
]

queue = Queue.Queue()
for x, url in enumerate(urls):
    filename = "datafile%s-%s" % (x, url)
    queue.put((url, filename))

num_connections = 10

class WorkerThread(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        while 1:
            try:
                url, filename = self.queue.get_nowait()
            except Queue.Empty:
                raise SystemExit
            # strip the scheme so the local filename contains no slashes
            urllib.urlretrieve(url, filename.replace('http://', ''))

# start threads
threads = []
for dummy in range(num_connections):
    t = WorkerThread(queue)
    t.start()
    threads.append(t)

# Wait for all threads to finish
for thread in threads:
    thread.join()