
Scrapy AttributeError: 'SoundseasySpider' object has no attribute 'crawler'

I am trying to scrape some data from the webpage soundseasy.com.au, but sometimes I get the error:

AttributeError: 'SoundseasySpider' object has no attribute 'crawler'

Here is my code, which uses a Selenium web driver (the self.browser instance) to fetch data from the dynamic page:

import scrapy
from ProductsScraper.items import ProductDataItem, ProductDataLoader
from utilities.common import MODE_SINGLE
from utilities.DynamicPageLoader import DynamicPageLoader

class SoundseasySpider(scrapy.Spider):
    # NOTE: name, allowed_domains, start_urls, pages_counts, the Selenium-backed
    # self.browser instance and the parse_product callback are defined elsewhere
    # in the project and omitted here.

    def start_requests(self):
        # scrape multi-page data
        for page_count, url in zip(self.pages_counts, self.start_urls):
            yield scrapy.Request(url=url, callback=self.multi_parse,
                                 meta={'page_count': page_count},
                                 dont_filter=True)

    def multi_parse(self, response):
        """
        Fetches the page, extracts the product URL links and scrapes each one
        by calling parse_product.
        """
        selector = self.get_dynamic_page(url=response.url,
                                         page_count=response.meta.get('page_count', '1'))
        product_urls = selector.xpath('//div[@class="isp_product_info"]/a/@href').extract()
        self.logger.info('{} items should be scraped from the page: {},'
                         ' scroll_count:{}'.format(len(product_urls),
                                                   response.url,
                                                   response.meta.get('page_count', '1')))
        for product_url in product_urls:
            # construct the absolute url
            url = "https://www.{}{}".format(self.allowed_domains[0], product_url)
            yield scrapy.Request(url=url, callback=self.parse_product, dont_filter=True)

    def get_dynamic_page(self, url, page_count):
        """
        Fetch the dynamic page using DynamicPageLoader and return a Selector object.
        """
        # construct the search page url with the page count included
        pages_url = url + '&page_num={}'.format(page_count)
        self.logger.info("get_dynamic_page: {}".format(pages_url))
        self.browser.load_page(pages_url)
        return scrapy.Selector(text=self.browser.get_html_page())
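
As an aside, the absolute URL can also be built with Scrapy's response.urljoin, which resolves relative hrefs against the current response URL; a sketch of the same loop from multi_parse:

    for product_url in product_urls:
        # urljoin handles both relative and absolute hrefs correctly
        yield scrapy.Request(url=response.urljoin(product_url),
                             callback=self.parse_product, dont_filter=True)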

What am I doing wrong? Any help is appreciated.

EDIT: Here is the full exception traceback:

  File "/home/user/python3.6.1/lib/python3.6/site-packages/Twisted-17.9.0-py3.6-linux-x86_64.egg/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/home/user/python3.6.1/lib/python3.6/site-packages/Twisted-17.9.0-py3.6-linux-x86_64.egg/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/user/python3.6.1/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/swampblu/python3.6.1/lib/python3.6/site-packages/Twisted-17.9.0-py3.6-linux-x86_64.egg/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/home/swampblu/python3.6.1/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 66, in process_exception
    spider=spider)
  File "/home/swampblu/python3.6.1/lib/python3.6/site-packages/scrapy/downloadermiddlewares/retry.py", line 61, in process_exception
    return self._retry(request, exception, spider)
  File "/home/swampblu/python3.6.1/lib/python3.6/site-packages/scrapy/downloadermiddlewares/retry.py", line 71, in _retry
    stats = spider.crawler.stats
AttributeError: 'SoundsEasySpider' object has no attribute 'crawler'
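
For context, the last frame shows why the message is misleading: the retry middleware reads spider.crawler.stats, and spider.crawler is only assigned when Scrapy builds the spider through Spider.from_crawler(). If from_crawler() is overridden without calling the base implementation (or the spider is instantiated directly), the attribute never gets set, and the first connection error then surfaces as this AttributeError instead. A minimal sketch of a safe override, assuming nothing beyond the standard scrapy.Spider API:

import scrapy

class SoundseasySpider(scrapy.Spider):
    name = 'soundseasy'  # hypothetical name for this sketch

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # The base implementation creates the spider and assigns
        # spider.crawler (and spider.settings); skipping it leaves the
        # attribute missing, which is exactly what _retry() trips over.
        spider = super().from_crawler(crawler, *args, **kwargs)
        # ...extra per-crawler setup would go here...
        return spider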

The problem was caused by anti-scraping protection: the servers were rejecting requests, and the resulting retries raised the AttributeError above. I fixed it by enabling the AutoThrottle extension.
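
For reference, a minimal sketch of enabling AutoThrottle in the project's settings.py (the delay values here are illustrative, not the exact ones used):

# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5           # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60            # highest delay under heavy latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote server
# AUTOTHROTTLE_DEBUG = True            # log throttling stats for every response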
