
Scrapy is not crawling any URLs

I placed my code in the scrapy shell to test my XPath, and everything seems OK. However, I cannot see why the spider crawls 0 pages. Here is the log output:

    2019-02-27 18:04:47 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: jumia)
    2019-02-27 18:04:47 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 2.7.15+ (default, Nov 28 2018, 16:27:22) - [GCC 8.2.0], pyOpenSSL 18.0.0 (OpenSSL 1.1.0j 20 Nov 2018), cryptography 2.4.2, Platform Linux-4.19.0-kali1-amd64-x86_64-with-Kali-kali-rolling-kali-rolling
    2019-02-27 18:04:47 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'jumia.spiders', 'SPIDER_MODULES': ['jumia.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'jumia'}
    2019-02-27 18:04:47 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.memusage.MemoryUsage',
     'scrapy.extensions.logstats.LogStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.corestats.CoreStats']
    2019-02-27 18:04:47 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2019-02-27 18:04:47 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2019-02-27 18:04:47 [scrapy.middleware] INFO: Enabled item pipelines: []
    2019-02-27 18:04:47 [scrapy.core.engine] INFO: Spider opened
    2019-02-27 18:04:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2019-02-27 18:04:47 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6029
    2019-02-27 18:04:47 [scrapy.core.engine] INFO: Closing spider (finished)
    2019-02-27 18:04:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'finish_reason': 'finished',
     'finish_time': datetime.datetime(2019, 2, 27, 17, 4, 47, 950397),
     'log_count/DEBUG': 1,
     'log_count/INFO': 7,
     'memusage/max': 53383168,
     'memusage/startup': 53383168,
     'start_time': datetime.datetime(2019, 2, 27, 17, 4, 47, 947520)}
    2019-02-27 18:04:47 [scrapy.core.engine] INFO: Spider closed (finished)

Here is my spider code:

    import scrapy
    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import MapCompose
    from scrapy.loader.processors import TakeFirst
    from jumia.items import JumiaItem


    class ProductDetails(scrapy.Spider):
        name = "jumiaProject"
        start_url = ["https://www.jumia.com.ng/computing/hp/"]

        def parse(self, response):
            search_results = response.css('section.products.-mabaya > div')

            for product in search_results:
                product_loader = ItemLoader(item=JumiaItem(), selector=product)
                product_loader.add_css('brand', 'h2.title > span.brand::text')
                product_loader.add_css('name', 'h2.title > span.name::text')
                product_loader.add_css('link', 'a.link::attr(href)')
                yield product_loader.load_item()

Here is my items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy.loader.processors import MapCompose
class JumiatesteItem(scrapy.Item):
    # define the fields for your item here like:
    name  = scrapy.Field()
    brand = scrapy.Field()
    price = scrapy.Field()
    link  = scrapy.Field()

The correct attribute name in your spider should be start_urls, not start_url. Scrapy's default start_requests() reads only start_urls; the misspelled start_url is just an ignored class attribute, so no requests are ever scheduled and the spider closes immediately with 0 pages crawled.
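Here is a minimal sketch of the corrected spider, assuming jumia.items actually exports the JumiaItem class your spider imports (note that the items.py you posted defines JumiatesteItem, so that name would need to line up as well):

    import scrapy
    from scrapy.loader import ItemLoader
    from jumia.items import JumiaItem  # assumes items.py defines JumiaItem

    class ProductDetails(scrapy.Spider):
        name = "jumiaProject"
        # start_urls (plural) is the attribute the default start_requests() iterates;
        # the misspelled start_url was silently ignored, so nothing was scheduled
        start_urls = ["https://www.jumia.com.ng/computing/hp/"]

        def parse(self, response):
            for product in response.css('section.products.-mabaya > div'):
                product_loader = ItemLoader(item=JumiaItem(), selector=product)
                product_loader.add_css('brand', 'h2.title > span.brand::text')
                product_loader.add_css('name', 'h2.title > span.name::text')
                product_loader.add_css('link', 'a.link::attr(href)')
                yield product_loader.load_item()

With that one rename, the log should show requests being made to the start URL instead of the spider closing right after "Spider opened".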
