
Scrapy: crawler doesn't crawl

I'm obviously new to Python, Scrapy and programming in general. I'm trying to scrape this site but my code doesn't seem to work. All the examples and tutorials I found deal with simple, plain websites. Or maybe I just can't get my head around it. There are hundreds of results I need to scrape, and I really don't want to do it manually.

So for now I'm just trying to get the href from the div object to check whether it works. It doesn't.

import scrapy
import requests


class QuotesSpider(scrapy.Spider):
    name = "items"

    def start_requests(self):
        urls = [
            'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/dealerslist/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        list_doc = open('list_doc.txt', 'w')
        for item in response.css('div.row.m-dealer_list__row'):
            yield {
                'text': item.css('a::attr(href)').extract(),

            }

When run from the terminal (ignoring robots.txt) it returns:

2019-01-30 23:57:13 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2019-01-30 23:57:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-01-30 23:57:13 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-01-30 23:57:13 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-01-30 23:57:13 [scrapy.core.engine] INFO: Spider opened
2019-01-30 23:57:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-30 23:57:13 [scrapy.extensions.telnet] DEBUG: Telnet console listening on #NUMBER
2019-01-30 23:57:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/dealerslist/> (referer: None)
2019-01-30 23:57:16 [scrapy.core.engine] INFO: Closing spider (finished)
2019-01-30 23:57:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 276,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 70592,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 1, 31, 2, 57, 16, 541215),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'memusage/max': 57974784,
 'memusage/startup': 57974784,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 1, 31, 2, 57, 13, 861593)}
2019-01-30 23:57:16 [scrapy.core.engine] INFO: Spider closed (finished)

Thanks for any help you can provide.

As far as I can see, there are really no such elements on the page:

In [2]: fetch("https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/dealerslist/")
2019-01-31 09:31:47 [scrapy.core.engine] INFO: Spider opened
2019-01-31 09:31:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/dealerslist/> (referer: None, latency: 0.87 s)

In [3]: response.css('div.row.m-dealer_list__row')
Out[3]: []

But if you try:

In [4]: response.css('div.m-dealer_citylist__card a::text').extract()
Out[4]: 
[u'25 DE MAYO - BS AS',
 u'25 DE MAYO - LA PAMP',
 u'25 DE MAYO',
 u'9 DE ABRIL',
...
 u'ZENON PEREYRA',
 u'Z\xc1RATE']
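
So the selector from the question matches nothing, while the city cards do carry the links. Here is a minimal sketch of a spider that yields those city links instead (the spider name is made up, and it assumes the m-dealer_citylist__card class seen above is still what the page uses):

import scrapy


class CityLinksSpider(scrapy.Spider):
    # hypothetical spider, just to illustrate the selector found above
    name = "city_links"
    start_urls = [
        'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/dealerslist/',
    ]

    def parse(self, response):
        # each city card wraps a link to that city's dealer page
        for link in response.css('div.m-dealer_citylist__card a'):
            yield {
                'city': link.css('::text').extract_first(),
                'url': response.urljoin(link.css('::attr(href)').extract_first()),
            }

Each yielded dict holds the city name and the absolute URL of its dealer list, which you could then follow with further requests.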

I have visited the website you are trying to scrape, but your CSS selector does not seem to match any element in the HTML.

There is no tag with the class name m-dealer_list__row.

All I see is m-dealer_citylist.

To describe your CSS:

Your CSS says you are extracting a div element with two classes: one is row and the second one is m-dealer_list__row.

If you want a div with the row class, and then any tag with the m-dealer_list__row class anywhere inside that div, you can try this:

"div.row .m-dealer_list__row"

This one is for you; work through it and understand it yourself:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "items"

    def start_requests(self):
        urls = [
            'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/dealerslist/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        list_doc = open('list_doc.txt', 'w')
        # each dealer entry on the page is a .trackingTeaser block
        for item in response.css('.trackingTeaser'):
            # the first link inside the teaser points to the dealer's detail page
            href = item.css('a::attr(href)').extract_first()
            href = response.urljoin(href)
            # join the whitespace-split pieces of the link text, then write "name: url"
            list_doc.write(''.join(item.css('a::text').re(r'([^ \n\t]+)')) + ': ' + href + '\n')
        list_doc.close()

OUTPUT:

[screenshot of the resulting list_doc.txt]
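
As a side note, rather than opening list_doc.txt by hand inside parse, you could yield items and let Scrapy's feed exports write the file for you. A sketch of that variant, assuming the same .trackingTeaser class as above:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "items"
    start_urls = [
        'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/dealerslist/',
    ]

    def parse(self, response):
        # one item per dealer teaser; no manual file handling needed
        for item in response.css('.trackingTeaser'):
            yield {
                'name': ' '.join(item.css('a::text').re(r'([^ \n\t]+)')),
                'url': response.urljoin(item.css('a::attr(href)').extract_first()),
            }

Running it with scrapy crawl items -o dealers.csv then produces a CSV with one row per dealer.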
