URLs in Scrapy crawler are not yielded to the next parser

I ran into a problem with yielding requests while trying to crawl http://www.brand-in-trend.ru. As you can see below, I'm using Scrapy and have defined a BaseSpider. The first parser works perfectly fine and returns all brands found on the start URL.

Now, when I yield a Request with the categories parser as its callback, I get neither a response nor an error. The spider just quits.

Spider:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request


class brandintrend(BaseSpider):
    name = "brandintrend"

    allowed_domains = ['trend-in-brand.ru']

    start_urls = ['http://brand-in-trend.ru/brands/']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Relative links to the individual brand pages
        brands = hxs.select('//div[@class="brandcol"]/ul/li/a/@href').extract()

        for brand in brands:
            brand = "http://www.brand-in-trend.ru" + brand
            print brand
            # request = Request(brand, callback=self.categories)
            yield Request(brand, callback=self.categories)

    def categories(self, response):
        print "Hello World"
        hxs = HtmlXPathSelector(response)
        print response.url

I have already tried the following to solve this issue:

  1. I tested the generated brand URLs (e.g. http://www.brand-in-trend.ru/brands/parker/ ) in Chrome (JavaScript turned off) and they worked fine.
  2. I put all generated brand URLs in the start_urls list and tried to yield those directly to the categories parser.
  3. I looked at this post, which unfortunately didn't solve my problem: scrapy unable to make Request() callback

If anybody has come across a similar problem, I would be grateful for a solution or advice.

Thanks in advance

J

This is because you set:

allowed_domains = [ 'trend-in-brand.ru' ]

but you are crawling URLs from a different domain:

start_urls = [ 'http://brand-in-trend.ru/brands/' ]

Compare trend-in-brand with brand-in-trend.
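To illustrate why the requests vanish silently: Scrapy's offsite filtering drops requests whose host is not covered by allowed_domains, while the start URLs bypass that filter, which is why the first parser still ran. The snippet below is only a simplified stand-in for that check, not Scrapy's actual code. `is_allowed` is a hypothetical helper, and the brand URL is the example from the question.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host equals, or is a subdomain of, an allowed domain."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


brand_url = "http://www.brand-in-trend.ru/brands/parker/"

# The misspelled domain from the question: the brand URL counts as offsite.
blocked = is_allowed(brand_url, ["trend-in-brand.ru"])

# The corrected domain: the same URL now passes the check.
passed = is_allowed(brand_url, ["brand-in-trend.ru"])

print(blocked, passed)
```

So changing the spider's setting to allowed_domains = [ 'brand-in-trend.ru' ] should let the brand requests reach the categories callback.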
