
Recursive crawling with Python and Scrapy

I'm using scrapy to crawl a site. The site has 15 listings per page and then has a next button. I am running into an issue where my Request for the next link is being called before I am finished parsing all of my listings in the pipeline. Here is the code for my spider:

# Imports this snippet needs (old Scrapy import paths from the question's era);
# MySiteLoader is the OP's own ItemLoader subclass and is not shown here.
import urlparse

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector


class MySpider(CrawlSpider):
    name = 'mysite.com'
    allowed_domains = ['mysite.com']
    start_url = 'http://www.mysite.com/'

    def start_requests(self):
        return [Request(self.start_url, callback=self.parse_listings)]

    def parse_listings(self, response):
        hxs = HtmlXPathSelector(response)
        listings = hxs.select('...')

        for listing in listings:
            il = MySiteLoader(selector=listing)
            il.add_xpath('Title', '...')
            il.add_xpath('Link', '...')

            item = il.load_item()
            listing_url = listing.select('...').extract()

            if listing_url:
                yield Request(urlparse.urljoin(response.url, listing_url[0]),
                              meta={'item': item},
                              callback=self.parse_listing_details)

        next_page_url = hxs.select('descendant::div[@id="pagination"]/'
                                   'div[@class="next-link"]/a/@href').extract()
        if next_page_url:
            yield Request(urlparse.urljoin(response.url, next_page_url[0]),
                          callback=self.parse_listings)


    def parse_listing_details(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.request.meta['item']
        details = hxs.select('...')
        il = MySiteLoader(selector=details, item=item)

        il.add_xpath('Posted_on_Date', '...')
        il.add_xpath('Description', '...')
        return il.load_item()

These lines are the problem. Like I said before, they are being executed before the spider has finished crawling the current page. On every page of the site, this causes only 3 out of 15 of my listings to be sent to the pipeline.

     if next_page_url:
            yield Request(urlparse.urljoin(response.url, next_page_url[0]),
                          callback=self.parse_listings)

This is my first spider, so this might be a design flaw on my part. Is there a better way to do this?

Scrape instead of spider?

Because your original problem requires the repeated navigation of a consecutive and repeated set of content instead of a tree of content of unknown size, use mechanize (http://wwwsearch.sourceforge.net/mechanize/) and BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/).

Here's an example of instantiating a browser using mechanize. Also, using br.follow_link(text="foo") means that, unlike the XPath in your example, the link will still be followed no matter the structure of the elements in the ancestor path. Meaning, if they update their HTML, your script is less likely to break. A looser coupling will save you some maintenance. Here is an example:

import cookielib

import mechanize

br = mechanize.Browser()
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Set all headers in one list; repeated assignments would overwrite each other.
br.addheaders = [
    ('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1'),
    ('Accept-Language', 'en-US'),
    ('Accept-Encoding', 'gzip, deflate'),
]
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.open("http://amazon.com")
br.follow_link(text="Today's Deals")
print br.response().read()

Also, in the "next 15" href there is probably something indicating pagination eg &index=15. 此外,在“接下来的15”href中,可能存在表示分页的内容,例如&index = 15。 If the total number of items on all pages is available on the first page, then: 如果第一页上的所有页面上的项目总数可用,则:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; with bs4 use: from bs4 import BeautifulSoup

soup = BeautifulSoup(br.response().read())
totalItems = soup.findAll(id="results-count-total")[0].text
startVar = [x for x in range(int(totalItems)) if x % 15 == 0]

Then just iterate over startVar, build each URL by appending the startVar value, br.open() it, and scrape the data. That way you don't have to programmatically "find" the "next" link on the page and execute a click on it to advance to the next page - you already know all the valid URLs. Minimizing code-driven manipulation of the page to only the data you need will speed up your extraction.
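A minimal sketch of that loop, continuing from the br, soup and startVar objects above. The base_url value, the index= query parameter and the result-row class are assumptions for illustration, not details taken from the original site:

# Sketch only: base_url, "index=" and "result-row" are placeholder assumptions.
base_url = "http://www.mysite.com/listings?index="

rows = []
for start in startVar:
    br.open(base_url + str(start))
    page_soup = BeautifulSoup(br.response().read())
    # Pull whatever per-listing element the site actually uses.
    for row in page_soup.findAll("div", {"class": "result-row"}):
        rows.append(row.text)

print "scraped %d listings" % len(rows)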

There are two ways of doing this sequentially:

  1. by defining a listing_url list under the class.
  2. by defining the listing_url inside parse_listings().

The only difference is verbiage. Also, suppose there are five pages of listing_urls to get, so put page = 1 under the class as well.

In the parse_listings method, only make a request once. Put all the data you need to keep track of into the meta. That being said, use parse_listings only to parse the 'front page'.

Once you reach the end of the line, return your items. This process is sequential.

class MySpider(CrawlSpider):
    name = 'mysite.com'
    allowed_domains = ['mysite.com']
    start_url = 'http://www.mysite.com/'

    listing_url = []
    page = 1

    def start_requests(self):
        return [Request(self.start_url, meta={'page': self.page}, callback=self.parse_listings)]

    def parse_listings(self, response):
        hxs = HtmlXPathSelector(response)
        listings = hxs.select('...')

        for listing in listings:
            il = MySiteLoader(selector=listing)
            il.add_xpath('Title', '...')
            il.add_xpath('Link', '...')

        items = il.load_item()

        # populate the listing_url with the scraped URLs
        self.listing_url.extend(hxs.select('...').extract())

        next_page_url = hxs.select('descendant::div[@id="pagination"]/'
                                   'div[@class="next-link"]/a/@href').extract()

        # now that the front page is done, move on to the next listing_url.pop(0)
        # add the next_page_url to the meta data
        return Request(urlparse.urljoin(response.url, self.listing_url.pop(0)),
                            meta={'page': self.page, 'items': items, 'next_page_url': next_page_url},
                            callback=self.parse_listing_details)

    def parse_listing_details(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['items']
        details = hxs.select('...')
        il = MySiteLoader(selector=details, item=item)

        il.add_xpath('Posted_on_Date', '...')
        il.add_xpath('Description', '...')
        items = il.load_item()

        # check whether there are more listing URLs to parse on this page,
        # and whether this was the last page
        if self.listing_url:
            return Request(urlparse.urljoin(response.url, self.listing_url.pop(0)),
                            meta={'page': response.meta['page'], 'items': items, 'next_page_url': response.meta['next_page_url']},
                            callback=self.parse_listing_details)
        elif not self.listing_url and response.meta['page'] != 5:
            # loop back for more URLs to crawl
            return Request(urlparse.urljoin(response.url, response.meta['next_page_url'][0]),
                            meta={'page': response.meta['page'] + 1, 'items': items},
                            callback=self.parse_listings)
        else:
            # reached the end of the pages to crawl, return data
            return il.load_item()

You can yield requests or items as many times as you need.

def parse_category(self, response):
    # hxs and l (an item loader) are assumed to be set up earlier in the method
    # Get links to other categories
    categories = hxs.select('.../@href').extract()

    # First, return the CategoryItem
    yield l.load_item()

    for url in categories:
        # Then yield a request to parse each category
        yield Request(url, self.parse_category)

I found that here: https://groups.google.com/d/msg/scrapy-users/tHAAgnuIPR4/0ImtdyIoZKYJ

See below for an updated answer, under the EDIT 2 section (updated October 6th, 2017).

Is there any specific reason that you're using yield? Yield will return a generator, which will return the Request object when .next() is invoked on it.

Change your yield statements to return statements and things should work as expected.

Here's an example of a generator:

In [1]: def foo(request):
   ...:     yield 1
   ...:     
   ...:     

In [2]: print foo(None)
<generator object foo at 0x10151c960>

In [3]: foo(None).next()
Out[3]: 1

EDIT:

Change your def start_requests(self) function to use the follow parameter.

return [Request(self.start_url, callback=self.parse_listings, follow=True)]

EDIT 2:

As of Scrapy v1.4.0, released on 2017-05-18, it is now recommended to use response.follow instead of creating scrapy.Request objects directly.

From the release notes:

There's a new response.follow method for creating requests; it is now a recommended way to create Requests in Scrapy spiders. This method makes it easier to write correct spiders; response.follow has several advantages over creating scrapy.Request objects directly:

  • it handles relative URLs;
  • it works properly with non-ascii URLs on non-UTF8 pages;
  • in addition to absolute and relative URLs it supports Selectors; for <a> elements it can also extract their href values.

So, for the OP above, change the code from:

    next_page_url = hxs.select('descendant::div[@id="pagination"]/'
                               'div[@class="next-link"]/a/@href').extract()
    if next_page_url:
        yield Request(urlparse.urljoin(response.url, next_page_url[0]),
                      callback=self.parse_listings)

to:

    next_page_url = hxs.select('descendant::div[@id="pagination"]/'
                               'div[@class="next-link"]/a/@href')
    if next_page_url:
        yield response.follow(next_page_url[0], self.parse_listings)

I just fixed this same problem in my code. I used the SQLite3 database that comes as part of Python 2.7 to fix it: each item you are collecting info about gets its unique line put into a database table in the first pass of the parse function, and each instance of the parse callback adds that item's data to its row in the table. Keep an instance counter so that the last callback parse routine knows that it is the last one, and writes the CSV file from the database or whatever. The callback can be recursive, being told in meta which parse schema (and of course which item) it was dispatched to work with. Works for me like a charm. You have SQLite3 if you have Python. Here was my post when I first discovered scrapy's limitation in this regard: Is Scrapy's asynchronicity what is hindering my CSV results file from being created straightforwardly?
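A rough sketch of that idea as a Scrapy item pipeline, using only the standard-library sqlite3 and csv modules. The pipeline class, table layout and file names below are illustrative (the field names are borrowed from the loaders in the question), and instead of a manual instance counter this sketch relies on the pipeline's close_spider hook to know when everything has arrived:

import csv
import sqlite3


class SQLiteCollectPipeline(object):
    """Sketch: park each item's data in SQLite as it arrives, then export once."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect('listings.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS listings '
            '(link TEXT PRIMARY KEY, title TEXT, posted_on TEXT, description TEXT)')

    def process_item(self, item, spider):
        # Each callback pass fills in whichever fields it has for this listing;
        # the listing URL acts as the unique key for the row.
        self.conn.execute(
            'INSERT OR REPLACE INTO listings VALUES (?, ?, ?, ?)',
            (item.get('Link'), item.get('Title'),
             item.get('Posted_on_Date'), item.get('Description')))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Everything has been collected by now; dump the table to CSV in one pass.
        with open('listings.csv', 'wb') as f:
            writer = csv.writer(f)
            writer.writerow(['link', 'title', 'posted_on', 'description'])
            writer.writerows(self.conn.execute('SELECT * FROM listings'))
        self.conn.close()

Enable it by adding the class to ITEM_PIPELINES in settings.py.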

http://autopython.blogspot.com/2014/04/recursive-scraping-using-different.html

This example shows how to scrape multiple "next" pages from a website using different techniques.

You might want to look into two things.

  1. The website you are crawling may be blocking the user agent you have defined.
  2. Try adding a DOWNLOAD_DELAY to your spider (see the settings sketch after this list).
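For example, in settings.py (both are standard Scrapy settings; the values shown are placeholders, not tuned recommendations):

# settings.py - illustrative values only
DOWNLOAD_DELAY = 2  # seconds to wait between requests to the same site
USER_AGENT = 'Mozilla/5.0 (compatible; MyCrawler/1.0)'  # replace Scrapy's default UA if it is being blocked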
