
Scrapy: spider returns nothing

This is my first time creating a spider and, despite my efforts, it continues to return nothing to my CSV export. My code is:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

class Emag(CrawlSpider):
    name = "emag"
    allowed_domains = ["emag.ro"]
    start_urls = [
        "http://www.emag.ro/"]

    rules = (Rule(SgmlLinkExtractor(allow=(r'www.emag.ro')), callback="parse", follow= True))

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//a/@href').extract()
        for site in sites:
            site = str(site)

        for clean_site in site:
            name = clean_site.xpath('//[@id=""]/span').extract()
            return name

The thing is that if I print the sites, it brings me a list of the URLs, which is OK. If I search for the name inside one of the URLs in the Scrapy shell, it will find it. The problem is when I want all the names from all the crawled links. I run it with "scrapy crawl emag > emag.csv".

Can you please give me a hint about what's wrong?

Multiple problems in the spider:

  • rules should be an iterable; there is a missing comma before the last closing parenthesis
  • no Item s specified - you need to define an Item class and return/yield instances of it from the spider callback
  • the rule callback must not be named parse - CrawlSpider implements its own crawling logic in parse(), so overriding it breaks link following; rename the callback to something like parse_item()

Here's a fixed version of the spider:

# scrapy.contrib paths match the Scrapy version used in the question;
# newer Scrapy releases moved these to scrapy.spiders and scrapy.linkextractors
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Field, Item


class MyItem(Item):
    name = Field()


class Emag(CrawlSpider):
    name = "emag"
    allowed_domains = ["emag.ro"]
    start_urls = [
        "http://www.emag.ro/"]

    # Note the trailing comma: rules must be an iterable of Rule objects.
    # The callback is named parse_item because CrawlSpider uses parse()
    # internally to apply the rules.
    rules = (Rule(SgmlLinkExtractor(allow=(r'www.emag.ro')), callback="parse_item", follow=True), )

    def parse_item(self, response):
        sel = Selector(response)
        sites = sel.xpath('//a/@href')
        for site in sites:
            item = MyItem()
            # The "*" node test is required ("//[@id=...]" is invalid XPath);
            # put the real id you are targeting in place of the empty string.
            item['name'] = site.xpath('//*[@id=""]/span').extract()
            yield item
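
A note on the export command: "scrapy crawl emag > emag.csv" only redirects the log output into the file, not the scraped items, which is consistent with the CSV coming out empty. Scrapy's built-in feed exports write the items themselves; a minimal sketch, reusing the file name from the question:

scrapy crawl emag -o emag.csv

With -o, each item yielded from the callback becomes one row in the CSV, while the log still goes to the console.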

One problem might be that you have been forbidden by the site's robots.txt. You can check that in the log trace. If so, go to your settings.py and set ROBOTSTXT_OBEY = False. That solved my issue.
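
For reference, a minimal sketch of the relevant setting (projects generated by recent Scrapy versions enable robots.txt handling by default in the generated settings.py):

# settings.py - only the relevant setting shown
# When robots.txt is obeyed and the site disallows crawling, the spider
# gets no pages and therefore exports nothing.
ROBOTSTXT_OBEY = False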
