Scrapy keeps crawling the same page for different URLs on a German website

Question

I am trying to extract info on flats/rooms from a German site called WG-Gesucht. I figured out that their links follow this pattern:

http://www.wg-gesucht.de/wohnungen-in-Berlin.8.2.0.**X**.html

where X = 0, 1, 2, ...

When I paste the links into my browser, they work perfectly. However, my optimism was shattered when I tried crawling those links: in the end I only get entries corresponding to X = 0 in my database.

Here is my spider:

from scrapy.http.request import Request
from scrapy.spider import Spider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose

from scraper_app.items import WGGesuchtEntry




class WGGesuchtSpider(Spider):
    """Spider for wg-gesucht.de, Berlin"""
    name = "wggesucht"
    allowed_domains = ["wg-gesucht.de"]
    start_urls = ["http://www.wg-gesucht.de/wohnungen-in-Berlin.8.2.0.0.html"]
    # start_urls = ["http://www.wg-gesucht.de/wohnungen-in-Berlin.8.2.0.%s.html"%x for x in range(0,1)]


    entries_list_xpath = '//tr[contains(@id,"ad--")]'
    item_fields = {
        # 'title': './/span[@itemscope]/meta[@itemprop="name"]/@content',
        'rooms': './/td[2]/a/span/text()',
        'entry_date': './/td[3]/a/span/text()',
        'price': './/td[4]/a/span/b/text()',
        'size': './/td[5]/a/span/text()',
        'district': './/td[6]/a/span/text()',
        'start_date': './/td[7]/a/span/text()',
        'end_date': './/td[8]/a/span/text()',
        'link': './/@adid'
    }

    def start_requests(self):
        for i in xrange(1, 10):
            url = 'http://www.wg-gesucht.de/wohnungen-in-Berlin.8.2.0.' + str(i) + '.html'
            yield Request(url=url, callback=self.parse_items)


    def parse_items(self, response):
        """
        Default callback used by Scrapy to process downloaded responses

        # Testing contracts:
        # @url http://www.livingsocial.com/cities/15-san-francisco
        # @returns items 1
        # @scrapes title link

        """
        selector = HtmlXPathSelector(response)

        # iterate over deals
        for entry in selector.xpath(self.entries_list_xpath):
            loader = XPathItemLoader(WGGesuchtEntry(), selector=entry)

            # define processors
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()

            # iterate over fields and add xpaths to the loader
            for field, xpath in self.item_fields.iteritems():
                loader.add_xpath(field, xpath)
            yield loader.load_item()

Should I maybe use CrawlSpider instead of Spider?

Any suggestions are most welcome, thank you!

Answer 1

This looks like a cookie problem. You can check that by opening a fresh browser session and going directly to, say, the 6th page: you will receive the response for the first page instead.
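The same check can be approximated from the Scrapy side. As a minimal sketch, assuming a standard Scrapy project layout with a settings.py file, the fragment below turns on Scrapy's cookie logging so the crawl log shows which cookies the site sets and which ones the spider sends back. It only makes the behaviour visible; it does not fix anything by itself.

```python
# settings.py of the Scrapy project (a fragment, not a complete settings file)

# Log every Cookie header sent and every Set-Cookie header received,
# so the crawl log shows whether the site ties paging to a session cookie.
COOKIES_DEBUG = True

# Scrapy's cookie middleware is enabled by default; listed here only
# to make that assumption explicit.
COOKIES_ENABLED = True
```

If the log shows a session cookie being set on the first response and every later page request returning the first page's entries, that supports the cookie explanation.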

Scrapy reuses cookies for subsequent requests, so one way of solving this is not to generate all the page requests up front, but to make them one after the other, like this:

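import re

from scrapy.http import Request

start_urls = ["http://example.com/0.html"]

def parse(self, response):
    cur_index = response.meta.get('cur_index', 1)
    ...
    # use the response.url to build the following url (+1 to the trailing page index)
    new_url = re.sub(r'(\d+)\.html$', lambda m: '%d.html' % (int(m.group(1)) + 1), response.url)
    if cur_index < 10:
        # request the next page only after the current one has been parsed
        yield Request(new_url, callback=self.parse, meta={'cur_index': cur_index+1})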
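The "+1 to the index" step can also be pulled out and checked on its own, without running a crawl. The sketch below is one possible way to do it, assuming the page index is always the last number before ".html" as in the URLs from the question. The helper name next_page_url and its last_index parameter are hypothetical and not part of Scrapy or of the original answer.

```python
import re

# Hypothetical helper (not part of the original answer): bump the trailing
# page index of a URL such as ".../wohnungen-in-Berlin.8.2.0.3.html".
_PAGE_RE = re.compile(r"(\d+)(\.html)$")


def next_page_url(url, last_index):
    """Return the URL of the following page, or None once last_index is passed."""
    match = _PAGE_RE.search(url)
    if match is None:
        return None  # URL does not end in "<number>.html"
    index = int(match.group(1)) + 1
    if index > last_index:
        return None  # no more pages to request
    # Keep everything before the index, swap in the new index, re-append ".html"
    return url[:match.start()] + str(index) + match.group(2)


if __name__ == "__main__":
    # Walk the chain of page URLs the spider would request one after another
    url = "http://www.wg-gesucht.de/wohnungen-in-Berlin.8.2.0.0.html"
    while url is not None:
        print(url)
        url = next_page_url(url, last_index=3)
```

Inside the spider's callback, the result would be passed to Request only when it is not None, which replaces the explicit cur_index counter with a check on the URL itself.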

Answer 1: score 3, accepted, 2015-11-10 01:41:29