Scrapy not following links

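The following Scrapy code for returning medical information does return the first set of results, but it does not follow links. I have been studying code and checking similar questions here on Stack Overflow, but integrating them has not worked. Granted, I'm still learning. Any pointers would be appreciated.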

import urlparse

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
import w3lib.url

from yelp.items import YelpItem


class YelpSpider(BaseSpider):
    name = "yelp"
    download_delay = 10
    concurrent_requests = 1
    concurrent_requests_per_domain = 1
    allowed_domains = ["yelp.com"]
    start_urls = [
        "http://www.yelp.com/search?find_desc=cancer+treatment&find_loc=manhattan%2Cny&start=0",
        "http://www.yelp.com/search?find_desc=cancer+treatment&find_loc=manhattan%2Cny&start=20",
        "http://www.yelp.com/search?find_desc=cancer+treatment&find_loc=manhattan%2Cny&start=30",
    ]

    def parse(self, response):
        selector = Selector(response)
        for title in selector.css("span.indexed-biz-name"):
            page_url = urlparse.urljoin(response.url,
                                        title.xpath("a/@href").extract()[0])
            self.log("page URL: %s" % page_url)
            #continue
            yield Request(page_url,
                          callback=self.parse_page)

        for next_page in selector.css(u'ul > li > a.prev-next:contains(\u2192)'):
            next_url = urlparse.urljoin(response.url,
                                        next_page.xpath('@href').extract()[0])
            self.log("next URL: %s" % next_url)
            #continue
            yield Request(next_url,
                          callback=self.parse)

    def parse_page(self, response):
        selector = Selector(response)
        item = YelpItem()
        item["name"] = selector.xpath('.//h1[@itemprop="name"]/text()').extract()[0].strip()
        item["addresslocality"] = u"\n".join(
            selector.xpath('.//address[@itemprop="address"]//text()').extract()).strip()
        item["link"] = response.url
        website = selector.css('div.biz-website a')
        if website:
            website_url = website.xpath('@href').extract()[0]
            item["website"] = w3lib.url.url_query_parameter(website_url, "url")
        return item

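Your next-URL extraction and selection logic is incorrect. Target the link element that has both the "next" and "pagination-links_anchor" classes. The following works for me: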

next_url = response.css('a.pagination-links_anchor.next::attr(href)').extract_first()
if next_url:
    next_url = urlparse.urljoin(response.url, next_url)
    self.log("next URL: %s" % next_url)
    yield Request(next_url, callback=self.parse)
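
To see what that selector is doing without running a crawl, here is a rough standard-library sketch of the same idea: find the first anchor that carries both classes, then resolve its href against the page URL. It uses Python 3, where urljoin lives in urllib.parse rather than the Python 2 urlparse module. The names NextLinkFinder and find_next_url, and the sample markup, are made up for illustration; they are not part of Scrapy, and the real page markup may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Record the href of the first <a> that carries both pagination classes."""

    # Hypothetical helper: mirrors the CSS selector a.pagination-links_anchor.next
    WANTED = {"pagination-links_anchor", "next"}

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.href is not None:
            return
        attributes = dict(attrs)
        classes = set((attributes.get("class") or "").split())
        # Match only when every wanted class is present on the element.
        if self.WANTED <= classes:
            self.href = attributes.get("href")


def find_next_url(page_url, html):
    """Return the absolute URL of the next results page, or None if absent."""
    finder = NextLinkFinder()
    finder.feed(html)
    if finder.href is None:
        return None
    # Relative hrefs must be resolved against the URL of the current page.
    return urljoin(page_url, finder.href)


# Assumed markup, for illustration only.
SAMPLE_HTML = """
<div class="pagination-links">
  <a class="pagination-links_anchor prev" href="/search?find_desc=cancer+treatment&amp;start=0">Previous</a>
  <a class="pagination-links_anchor next" href="/search?find_desc=cancer+treatment&amp;start=20">Next</a>
</div>
"""

if __name__ == "__main__":
    print(find_next_url("http://www.yelp.com/search?find_desc=cancer+treatment&start=10", SAMPLE_HTML))
```

When no anchor has both classes, find_next_url returns None, which corresponds to the "if next_url:" guard above: on the last results page there is nothing further to request.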

