
How to speed up web scraping with nested urllib2.urlopen() in Python?

I have the following code to gather the number of words in each chapter of a book. In a nutshell, it opens the URL of each book, then the URLs of the chapters associated with that book.

import urllib2
from bs4 import BeautifulSoup
import re

def scrapeBook(bookId):
    url = 'http://www.qidian.com/BookReader/' + str(bookId) + '.aspx'
    try:
        words = []
        html = urllib2.urlopen(url).read()               # plain GET (no POST data)
        soup = BeautifulSoup(html, 'html.parser')        # explicit parser avoids a bs4 warning
        try:
            chapters = soup.find_all('a', rel='nofollow')  # find all relevant chapter links
            for chapter in chapters:                       # loop through chapters
                if 'title' in chapter.attrs:
                    link = chapter['href']                 # open the chapter to find its word count
                    htmlTemp = urllib2.urlopen(link).read()
                    soupTemp = BeautifulSoup(htmlTemp, 'html.parser')

                    # find out how many words there are in each chapter
                    spans = soupTemp.find_all('span')
                    for span in spans:
                        content = span.string
                        if content is not None:
                            if u'\u5b57\u6570' in content:   # u'字数' means "word count"
                                word = re.sub("[^0-9]", "", content)  # keep only the digits
                                words.append(word)
        except Exception:
            pass

        return words

    except urllib2.URLError:
        print 'Book ' + str(bookId) + ' does not exist'
Below is a sample run:

words = scrapeBook(3501537)
print words
>> [u'2532', u'2486', u'2510', u'2223', u'2349', u'2169', u'2259', u'2194', u'2151', u'2422', u'2159', u'2217', u'2158', u'2134', u'2098', u'2139', u'2216', u'2282', u'2298', u'2124', u'2242', u'2224', u'178', u'2168', u'2334', u'2132', u'2176', u'2271', u'2237']

Without doubt the code is very slow. One major reason is that I need to open the URL of each book, and for each book I need to open the URL of each chapter. Is there a way to make the process faster?

Here is another bookId that does not return an empty result: 3052409. It has hundreds of chapters, and the code runs forever.

The fact that you need to open each book and each chapter is dictated by the views exposed on the server. What you could do is implement parallel clients: create a thread pool where you offload HTTP requests as jobs to the workers, or do something similar with coroutines. A sketch follows below.
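
For instance, here is a minimal sketch of the thread-pool approach using multiprocessing.dummy from the Python 2 standard library. The fetch helper, the pool size of 8, and the chapter_links list are illustrative assumptions, not part of the original code:

import urllib2
from multiprocessing.dummy import Pool as ThreadPool   # thread-based pool, not processes

def fetch(url):
    # Download one page; return the raw HTML, or an empty string on failure.
    try:
        return urllib2.urlopen(url).read()
    except urllib2.URLError:
        return ''

# Hypothetical list of chapter URLs already collected from the book page.
chapter_links = ['http://www.qidian.com/...', 'http://www.qidian.com/...']

pool = ThreadPool(8)                       # 8 concurrent workers; tune to taste
pages = pool.map(fetch, chapter_links)     # download all chapters in parallel
pool.close()
pool.join()

# 'pages' now holds the HTML of every chapter and can be parsed
# with BeautifulSoup exactly as in the original loop.

The same idea applies one level up: the book pages themselves can be fetched through the pool before their chapter links are extracted.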

Then there's the choice of HTTP client library. I found libcurl and geventhttpclient to be more CPU-efficient than urllib2 or any of the other Python standard libraries.
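
As a rough illustration only (pycurl, the Python binding for libcurl, is assumed to be installed, and the fetch_with_curl helper is hypothetical), fetching a single page through libcurl could look like this:

import pycurl
from StringIO import StringIO   # Python 2; use io.BytesIO on Python 3

def fetch_with_curl(url):
    # Download one URL with libcurl and return the response body.
    buf = StringIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEFUNCTION, buf.write)   # libcurl streams the body into the buffer
    c.setopt(pycurl.FOLLOWLOCATION, True)       # follow redirects
    c.perform()
    c.close()
    return buf.getvalue()

html = fetch_with_curl('http://www.qidian.com/BookReader/3501537.aspx')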
