
Error in Spider Crawling on Scrapy, User Agent not working

I'm quite new to web scraping in Python. I'm currently trying to crawl Amazon's latest books. As in many tutorials, I use the random User-Agent middleware from scrapy-fake-useragent.
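For context, the middleware is enabled in settings.py with roughly the standard scrapy-fake-useragent wiring (a sketch; my full settings.py is in the pastebin links below):

    # settings.py -- typical scrapy-fake-useragent setup (sketch)
    DOWNLOADER_MIDDLEWARES = {
        # Disable the built-ins that would otherwise set the UA / handle retries
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
        # Pick a random User-Agent per request and rotate it on retries
        'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
        'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware': 401,
    }
    FAKEUSERAGENT_PROVIDERS = [
        'scrapy_fake_useragent.providers.FakeUserAgentProvider',
    ]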

At first I managed to crawl the web page. However, in the past few days, Python only returns "Spider error processing". Perhaps Amazon is blocking the user agent, or there's something missing in my code which I cannot find.

Here's what the terminal returns:

2020-10-22 01:37:59 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: scrapyamazon)
2020-10-22 01:37:59 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.3 (default, Jul  2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform Windows-10-10.0.18362-SP0 
2020-10-22 01:37:59 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-22 01:37:59 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapyamazon',
 'NEWSPIDER_MODULE': 'scrapyamazon.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['scrapyamazon.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
2020-10-22 01:37:59 [scrapy.extensions.telnet] INFO: Telnet Password: cd809e0ec7c2ec6a
2020-10-22 01:37:59 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2020-10-22 01:37:59 [faker.factory] DEBUG: Not in REPL -> leaving logger event level as is.
2020-10-22 01:37:59 [scrapy_fake_useragent.middleware] DEBUG: Loaded User-Agent provider: scrapy_fake_useragent.providers.FakeUserAgentProvider
2020-10-22 01:37:59 [scrapy_fake_useragent.middleware] INFO: Using '<class 'scrapy_fake_useragent.providers.FakeUserAgentProvider'>' as the User-Agent provider   
2020-10-22 01:37:59 [scrapy_fake_useragent.middleware] DEBUG: Loaded User-Agent provider: scrapy_fake_useragent.providers.FakeUserAgentProvider
2020-10-22 01:37:59 [scrapy_fake_useragent.middleware] INFO: Using '<class 'scrapy_fake_useragent.providers.FakeUserAgentProvider'>' as the User-Agent provider   
2020-10-22 01:37:59 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware',
 'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-22 01:37:59 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-22 01:37:59 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapyamazon.pipelines.ScrapyamazonPipeline']
2020-10-22 01:37:59 [scrapy.core.engine] INFO: Spider opened
2020-10-22 01:37:59 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-22 01:37:59 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-22 01:38:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/robots.txt> (referer: None)
2020-10-22 01:38:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/Books-Last-30-days/s?rh=n%3A283155%2Cp_n_publication_date%3A1250226011> (referer: None)
2020-10-22 01:38:01 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.amazon.com/Books-Last-30-days/s?rh=n%3A283155%2Cp_n_publication_date%3A1250226011> (referer: None)
Traceback (most recent call last):
  File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
    yield next(it)
  File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\utils\python.py", line 347, in __next__
    return next(self.data)
  File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\utils\python.py", line 347, in __next__
    return next(self.data)
  File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 340, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\freud\Documents\Demos\vstoolbox\scrapyamazon\scrapyamazon\spiders\amazon_spider.py", line 22, in parse
    price_kindle = response.css("div[./a[contains(text(),'Kindle')]]")
  File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\http\response\text.py", line 142, in css
    return self.selector.css(query)
  File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\parsel\selector.py", line 264, in css
    return self.xpath(self._css2xpath(query))
  File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\parsel\selector.py", line 267, in _css2xpath
    return self._csstranslator.css_to_xpath(query)
  File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\parsel\csstranslator.py", line 109, in css_to_xpath
    return super(HTMLTranslator, self).css_to_xpath(css, prefix)
  File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\cssselect\xpath.py", line 192, in css_to_xpath
    for selector in parse(css))
  File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\cssselect\parser.py", line 415, in parse
    return list(parse_selector_group(stream))
  File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\cssselect\parser.py", line 428, in parse_selector_group
    yield Selector(*parse_selector(stream))
  File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\cssselect\parser.py", line 436, in parse_selector
    result, pseudo_element = parse_simple_selector(stream)
  File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\cssselect\parser.py", line 498, in parse_simple_selector
    result = parse_attrib(result, stream)
  File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\cssselect\parser.py", line 569, in parse_attrib
    attrib = stream.next_ident_or_star()
  File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\cssselect\parser.py", line 829, in next_ident_or_star
    raise SelectorSyntaxError(
  File "<string>", line None
cssselect.parser.SelectorSyntaxError: Expected ident or '*', got <DELIM '.' at 4>
2020-10-22 01:38:01 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-22 01:38:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 678,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 4242,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 1.312107,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 10, 21, 18, 38, 1, 219171),
 'log_count/DEBUG': 5,
 'log_count/ERROR': 1,
 'log_count/INFO': 12,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/SelectorSyntaxError': 1,
 'start_time': datetime.datetime(2020, 10, 21, 18, 37, 59, 907064)}
2020-10-22 01:38:01 [scrapy.core.engine] INFO: Spider closed (finished)

Sorry to paste everything here, but I'm not sure where the error stems from. My guess is that it has something to do with the user agent being blocked/not recognised, based on this line:

2020-10-22 01:38:01 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.amazon.com/Books-Last-30-days/s?rh=n%3A283155%2Cp_n_publication_date%3A1250226011> (referer: None)

For your reference, here's my full code:

  1. amazon_spider.py: https://pastebin.com/tBQqa2jQ
  2. items.py: https://pastebin.com/rkUTqWSz
  3. pipelines.py: https://pastebin.com/2fkYf57f
  4. settings.py: https://pastebin.com/ZtLXqsyW

Thank you in advance for your help!

The problem is that you are calling the CSS selector method, response.css(), but passing it an XPath expression. cssselect then tries to parse the XPath as a CSS selector and chokes on the '.' inside the predicate, which is exactly the SelectorSyntaxError: Expected ident or '*', got <DELIM '.' at 4> in your traceback. It is not a User-Agent problem: both requests came back with HTTP 200.

price_kindle = response.css("div[./a[contains(text(),'Kindle')]]")

Change it to response.xpath(). Note that the expression also needs a leading // so it searches the whole document rather than only direct children of the root node:

price_kindle = response.xpath("//div[./a[contains(text(),'Kindle')]]")
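XPath is the right tool here anyway: CSS selectors cannot match on an element's text, so there is no pure-CSS equivalent of contains(text(),'Kindle'). For illustration, a minimal sketch of the corrected selector inside a parse callback (the item key and inner price XPath are illustrative assumptions, not taken from your pastebin):

    import scrapy

    class AmazonBooksSketchSpider(scrapy.Spider):
        # Hypothetical spider for illustration; adapt name/start_urls to your project.
        name = "amazon_books_sketch"
        start_urls = [
            "https://www.amazon.com/Books-Last-30-days/"
            "s?rh=n%3A283155%2Cp_n_publication_date%3A1250226011"
        ]

        def parse(self, response):
            # // searches the whole page for <div>s that have a child <a>
            # whose text contains 'Kindle'.
            for div in response.xpath("//div[./a[contains(text(),'Kindle')]]"):
                yield {
                    # .get() is the modern spelling of .extract_first()
                    "price_kindle": div.xpath(
                        ".//span[contains(@class,'a-offscreen')]/text()"
                    ).get(),
                }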

By the way, this is unrelated to the problem, but you are assigning values two times to the same variable, so the first will get overwritten by the second. Here:

    price_kindle = response.css("div[./a[contains(text(),'Kindle')]]")
    price_kindle = price_hardcover.xpath("./following-sibling::div//span[contains(@class,'a-offscreen')]/text()").extract()
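If the second line is the one doing the real work, you can simply drop the first assignment (a sketch, assuming price_hardcover is already a selector for the hardcover price block, as it appears to be in your spider):

    # The response.css(...) assignment above is dead code: it is overwritten
    # immediately, so the Kindle price comes only from this expression.
    price_kindle = price_hardcover.xpath(
        "./following-sibling::div//span[contains(@class,'a-offscreen')]/text()"
    ).extract()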

However, as mentioned, this is unrelated to the question; just a heads-up.
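You can also verify the corrected selector interactively in Scrapy's shell before re-running the spider:

    scrapy shell "https://www.amazon.com/Books-Last-30-days/s?rh=n%3A283155%2Cp_n_publication_date%3A1250226011"
    >>> response.xpath("//div[./a[contains(text(),'Kindle')]]")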
