
Error in scraping an ecommerce website daraz.pk

I am trying to scrape daraz.pk and ran into this error. The spider scrapes all the values on the page until the last one, which comes back as None, and then the spider throws "'NoneType' object is not iterable". I have tried exception handling, but it didn't work. I'm sharing my code here in case anyone can help. I'm using Selenium and Scrapy together to get the descriptions of items on the items page.


import scrapy
from selenium.webdriver import Chrome
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from ..items import EcomItem
class DarazSpider(scrapy.Spider):
    name = 'daraz'
    def start_requests(self):
        path = 'C:\Program Files (x86)\chromedriver.exe'
        driver = Chrome(executable_path=path)
        driver.get('https://www.daraz.pk/')
        electronics = driver.find_element(By.NAME, 'q')
        electronics.send_keys('Books')
        electronics.send_keys(Keys.RETURN)
        link_elements = driver.find_elements(By.XPATH,'/html/body/div[3]/div/div[2]/div/div/div/div[2]/div/div/div/div[2]/div[2]/a[text()]')
        for link_el in link_elements:
                    href = link_el.text
                    print(href)
    def parse(self, response):
        pass


Here is the error:


Traceback (most recent call last):
    d = crawler.crawl(*args, **kwargs)
  File "C:\Users\Intag\New folder (2)\lib\site-packages\twisted\internet\defer.py", line 1905, in unwindGenerator
    return _cancellableInlineCallbacks(gen)
  File "C:\Users\Intag\New folder (2)\lib\site-packages\twisted\internet\defer.py", line 1815, in _cancellableInlineCallbacks
    _inlineCallbacks(None, gen, status)
--- <exception caught here> ---
  File "C:\Users\Intag\New folder (2)\lib\site-packages\twisted\internet\defer.py", line 1660, in _inlineCallbacks
    result = current_context.run(gen.send, result)
  File "C:\Users\Intag\New folder (2)\lib\site-packages\scrapy\crawler.py", line 103, in crawl
    start_requests = iter(self.spider.start_requests())
builtins.TypeError: 'NoneType' object is not iterable
2022-08-06 10:29:20 [twisted] CRITICAL:
Traceback (most recent call last):
  File "C:\Users\Intag\New folder (2)\lib\site-packages\twisted\internet\defer.py", line 1660, in _inlineCallbacks
    result = current_context.run(gen.send, result)
  File "C:\Users\Intag\New folder (2)\lib\site-packages\scrapy\crawler.py", line 103, in crawl
    start_requests = iter(self.spider.start_requests())
TypeError: 'NoneType' object is not iterable


You can get the desired data from the API. The data is loaded dynamically by JavaScript through an API that responds to GET requests with JSON. Requesting that endpoint directly is the easiest and most robust way to grab the data.
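For context on the traceback in the question: it fails at `iter(self.spider.start_requests())`, which suggests that `start_requests()` returned `None`. A Python function that never yields or returns a value does exactly that. The following is a minimal sketch in plain Python, without Scrapy or Selenium; the function names and the stand-in link list are hypothetical and only illustrate the mechanism.

```python
def start_requests_without_yield():
    # Does some work but never yields or returns anything,
    # so Python implicitly returns None.
    links = ["link-1", "link-2"]  # stand-in for scraped link texts
    for link in links:
        pass


def start_requests_with_yield():
    # Contains `yield`, so calling it returns a generator (an iterable).
    for link in ["link-1", "link-2"]:
        yield link


def can_iterate(value):
    # Mirrors what a framework does when it calls iter() on the result.
    try:
        iter(value)
        return True
    except TypeError:
        return False


if __name__ == "__main__":
    print(can_iterate(start_requests_without_yield()))
    print(can_iterate(start_requests_with_yield()))
```

In a spider, the equivalent fix would be for `start_requests()` to yield `scrapy.Request` objects rather than only printing values.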

Example spider that requests the API directly:

import json

import scrapy
from scrapy.crawler import CrawlerProcess


class TestSpider(scrapy.Spider):
    name = 'test'

    # Throttle requests: one at a time, one second apart.
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
    }

    def start_requests(self):
        headers = {
            'content-type': 'application/json',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
        }
        # Search endpoint that the page calls via AJAX; it returns JSON.
        api_url = 'https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1'
        yield scrapy.Request(
            url=api_url,
            method='GET',
            headers=headers,
            callback=self.parse,
        )

    def parse(self, response):
        # The response body is JSON; products are under mods -> listItems.
        resp = json.loads(response.body)
        for item in resp['mods']['listItems']:
            # productUrl is protocol-relative, so prepend the scheme.
            yield {
                'productUrl': 'https:' + item['productUrl']
            }


if __name__ == "__main__":
    # CrawlerProcess takes settings; the spider class is passed to crawl().
    process = CrawlerProcess()
    process.crawl(TestSpider)
    process.start()

Output:

Crawled (200) <GET https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1> (referer: None)   
2022-08-06 12:08:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1>
{'productUrl': 'https://www.daraz.pk/products/5-i144834997-s1306536157.html?search=1'}        
2022-08-06 12:08:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1>
{'productUrl': 'https://www.daraz.pk/products/4-i146864039-s1309826616.html?search=1'}        
2022-08-06 12:08:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1>
{'productUrl': 'https://www.daraz.pk/products/-i229320627-s1449691508.html?search=1'}
2022-08-06 12:08:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1>
{'productUrl': 'https://www.daraz.pk/products/-i229571902-s1449944276.html?search=1'}
2022-08-06 12:08:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1>
{'productUrl': 'https://www.daraz.pk/products/-i219883778-s1432847877.html?search=1'}
2022-08-06 12:08:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1>
{'productUrl': 'https://www.daraz.pk/products/pmc-nmdcat-nums-agha-khan-2022-i209146784-s1415196801.html?search=1'}
2022-08-06 12:08:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1>
{'productUrl': 'https://www.daraz.pk/products/nmdcat-bookmbbscommbbscompkpmc-mdcat-practice-books-2022entry-test-preparation-booksentry-test-booksentry-test-preparation-books-2022guide-for-solved-past-paper-papers-exam-exams-test-tests-book-n-books-bnb-multan-ghar-kitab-mkg-new-fareed-fbc-i276082277-s1491310765.html?search=1'}
2022-08-06 12:08:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1>
{'productUrl': 'https://www.daraz.pk/products/tenses-made-easy-by-efzal-anware-mufti-i209992860-s1416720338.html?search=1'}
2022-08-06 12:08:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1>
{'productUrl': 'https://www.daraz.pk/products/sk-original-golden-13medical-books-in-urdu-i198834812-s1395012400.html?search=1'}
2022-08-06 12:08:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1>
{'productUrl': 'https://www.daraz.pk/products/-i242170073-s1461239796.html?search=1'}
2022-08-06 12:08:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1>
{'productUrl': 'https://www.daraz.pk/products/-i270001029-s1483708982.html?search=1'}
2022-08-06 12:08:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1>
{'productUrl': 'https://www.daraz.pk/products/css-pms-iqra-ud-din-css-o-css-2022-css-2023-i220043944-s1433189818.html?search=1'}

... and so on ...
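To go beyond the first page, one option is to build the API URL from its query parameters and to read the JSON defensively. The sketch below is an assumption-laden illustration, not the site's documented API: it assumes that only `ajax`, `page`, and `q` are required (the tracking parameters in the URL above are dropped, which is unverified), and it assumes the `mods` -> `listItems` -> `productUrl` shape used in the spider above. The helper names `build_search_url` and `extract_product_urls` and the sample payload are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.daraz.pk/catalog/"


def build_search_url(query, page=1):
    # Assumption: ajax, page and q are enough; tracking parameters are omitted.
    params = {"ajax": "true", "page": page, "q": query}
    return f"{BASE_URL}?{urlencode(params)}"


def extract_product_urls(payload):
    # Tolerate missing keys so an empty or unexpected page yields no items
    # instead of raising KeyError or TypeError.
    items = (payload.get("mods") or {}).get("listItems") or []
    urls = []
    for item in items:
        product_url = item.get("productUrl")
        if product_url:
            # productUrl is protocol-relative, so prepend the scheme.
            urls.append("https:" + product_url)
    return urls


if __name__ == "__main__":
    # Hypothetical payload with the same shape the spider reads.
    sample = {
        "mods": {
            "listItems": [
                {"productUrl": "//www.daraz.pk/products/example-i1-s1.html"},
                {"name": "item without a URL"},
            ]
        }
    }
    print(build_search_url("books", page=2))
    print(extract_product_urls(sample))
```

Inside the spider, `parse` could yield the extracted URLs and then yield a new `scrapy.Request` for `build_search_url('books', page + 1)` until a page returns no items.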

