
Scraping Amazon with Scrapy

I am new to Scrapy, and I am trying to scrape details about different laptops from amazon.in. I tried the code below but I am getting an error. I have included the code along with the error. Can anyone suggest a solution?

Spider:

# -*- coding: utf-8 -*-
import scrapy


class AmazonLaptopsSpider(scrapy.Spider):
    name = 'amazon_laptops'
    allowed_domains = ['www.amazon.in']
    #start_urls = ['https://www.amazon.in/s?i=computers&bbn=976392031&rh=n%3A14584413031&ref=mega_elec_s23_2_1_1_5']


    def start_requests(self):
        yield scrapy.Request(url='https://www.amazon.in/s?k=laptops&ref=nb_sb_noss_2',callback=self.parse,headers={'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"})

    def parse(self, response):
        products=response.xpath("//div[@class='s-include-content-margin s-border-bottom s-latency-cf-section']/div/div[2]/div[2]/div")
        for product in products:
            link='https://www.amazon.in/'+product.xpath(".//div/div/div/div/h2/a/@href").get()
            yield{

            'name':product.xpath(".//div/div/div/div/h2/a/span/text()").get(),
            'rating':product.xpath(".//div/div/div[@class='sg-col-inner']/div[@class='a-section a-spacing-none a-spacing-top-micro']/div[@class='a-row a-size-small']/span[1]/@aria-label").get(),
            'No_of_reviewers':product.xpath(".//div/div/div/div[2]/div/span[2]/@aria-label").get(),
            'Discounted_Price':product.xpath(".//div[2]/div[1]/div/div[1]/div/div/a/span[@class='a-price']/span[@class='a-offscreen']/text()").get(),
            'Original_Price':product.xpath(".//div[2]/div[1]/div/div[1]/div/div/a/span[@class='a-price a-text-price']/span[@class='a-offscreen']/text()").get(),
            }
            yield response.follow(url=link,callback=self.parse_det,headers={'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"})



        next_page=response.urljoin(response.xpath("//li[@class='a-last']/a/@href").get())

        if next_page:
            yield scrapy.Request(url=next_page,callback=self.parse,headers={'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"})

    def parse_det(self,response):
        deets=response.xpath("//div[@class='column col1 ']/div/div[2]/div[@class='attrG']/div[@class='pdTab']/table/tbody")
        for det in deets:
            if det.xpath(".//tr[1]/td[@class='label']/text()").get=='Brand':
                yield{'Brand':det.xpath(".//tr[1]/td[@class='value']/text()").get()}
            if det.xpath(".///tr[4]/td[@class='label']/text()").get=='Item Weight':
                yield {'weight':det.xpath(".//tr[4]/td[@class='value']/text()").get()}
            if det.xpath(".//tr[8]/td[@class='label']/text()").get=='RAM Size':
                yield {'RAM':det.xpath(".//tr[8]/td[@class='value']/text()").get()}
            if det.xpath(".//tr[11]/td[@class='label']/text()").get=='Hard Drive Size':
                yield {'Hard disk size':det.xpath(".//tr[11]/td[@class='value']/text()").get()}
            if det.xpath(".//tr[14]/td[@class='label']/text()").get=='Processor Brand':
                yield {'Processor brand':det.xpath(".//tr[16]/td[@class='value']/text()").get()}
            if det.xpath(".//tr[18]/td[@class='label']/text()").get=='Processor Type':
                yield {'Processor Type':det.xpath(".//tr[18]/td[@class='value']/text()").get()}
            if det.xpath(".//tr[20]/td[@class='label']/text()").get=='Graphic Card Description':
                yield {'Graphic card description':det.xpath(".//tr[20]/td[@class='values']/text()").get()}
            if det.xpath(".//tr[23]/td[@class='label']/text()").get=='Screen Size':
                yield {'Screen size':det.xpath(".//tr[23]/td[@class='value']/text()").get()}

Error:

Spider error processing <GET https://www.amazon.in/Dell-3595-15-6-inch-Microsoft-Integrated/dp/B0839L8XW1/ref=sr_1_9?dchild=1&keywords=laptops&qid=1591612091&sr=8-9> (referer: https://www.amazon.in/s?k=laptops&ref=nb_sb_noss_2)

I can scrape everything from multiple pages, but the error occurs whenever Scrapy follows the link to any individual laptop.

The part where the error occurs:

 DEBUG: Scraped from <200 https://www.amazon.in/s?k=laptops&ref=nb_sb_noss_2>
{'name': 'ASUS VivoBook 15 X509FA-EJ341T 15.6-inch Laptop (8th Gen Core i3-8145U/4GB/1TB HDD/Windows 10 Home (64bit)/Intel Integrated UHD 620 Graphics), Transparent Silver', 'rating': '4.4 out of 5 stars', 'No_of_reviewers': '7', 'Discounted_Price': '₹30,900', 'Original_Price': '₹36,690'}
2020-06-08 16:02:18 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.amazon.in/HP-eq0132AU-15-6-inch-Windows-Graphics/dp/B08978XKP8/ref=sr_1_2_sspa?dchild=1&keywords=laptops&qid=1591612091&sr=8-2-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUExMEpUVThUSUJYMFMyJmVuY3J5cHRlZElkPUEwMjMzNDcwMU4zUzlMVzY1UFY2RyZlbmNyeXB0ZWRBZElkPUEwODQ0MjQxMjQzVzVCN01QNllVUCZ3aWRnZXROYW1lPXNwX2F0ZiZhY3Rpb249Y2xpY2tSZWRpcmVjdCZkb05vdExvZ0NsaWNrPXRydWU=> from <GET https://www.amazon.in/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_2?ie=UTF8&adId=A0844241243W5B7MP6YUP&url=%2FHP-eq0132AU-15-6-inch-Windows-Graphics%2Fdp%2FB08978XKP8%2Fref%3Dsr_1_2_sspa%3Fdchild%3D1%26keywords%3Dlaptops%26qid%3D1591612091%26sr%3D8-2-spons%26psc%3D1&qualifier=1591612091&id=1813870958375055&widgetName=sp_atf>
2020-06-08 16:02:19 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.amazon.in/Lenovo-Ideapad-Generation-Windows-81VD0082IN/dp/B08667RQSK/ref=sr_1_1_sspa?dchild=1&keywords=laptops&qid=1591612091&sr=8-1-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUExMEpUVThUSUJYMFMyJmVuY3J5cHRlZElkPUEwMjMzNDcwMU4zUzlMVzY1UFY2RyZlbmNyeXB0ZWRBZElkPUEwNjAyNTgzMldVS0Q2VjU5RUkxUSZ3aWRnZXROYW1lPXNwX2F0ZiZhY3Rpb249Y2xpY2tSZWRpcmVjdCZkb05vdExvZ0NsaWNrPXRydWU=> from <GET https://www.amazon.in/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A06025832WUKD6V59EI1Q&url=%2FLenovo-Ideapad-Generation-Windows-81VD0082IN%2Fdp%2FB08667RQSK%2Fref%3Dsr_1_1_sspa%3Fdchild%3D1%26keywords%3Dlaptops%26qid%3D1591612091%26sr%3D8-1-spons%26psc%3D1&qualifier=1591612091&id=1813870958375055&widgetName=sp_atf>
2020-06-08 16:02:19 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.amazon.in/Inspiron-5370-13-3-inch-i7-8550U-Graphics/dp/B07B6K4YM6/ref=sr_1_12_sspa?dchild=1&keywords=laptops&qid=1591612091&sr=8-12-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUExMEpUVThUSUJYMFMyJmVuY3J5cHRlZElkPUEwMjMzNDcwMU4zUzlMVzY1UFY2RyZlbmNyeXB0ZWRBZElkPUEwMzA0NzUxMk5NMTFNQktEWDdPWSZ3aWRnZXROYW1lPXNwX210ZiZhY3Rpb249Y2xpY2tSZWRpcmVjdCZkb05vdExvZ0NsaWNrPXRydWU=> from <GET https://www.amazon.in/gp/slredirect/picassoRedirect.html/ref=pa_sp_mtf_aps_sr_pg1_2?ie=UTF8&adId=A03047512NM11MBKDX7OY&url=%2FInspiron-5370-13-3-inch-i7-8550U-Graphics%2Fdp%2FB07B6K4YM6%2Fref%3Dsr_1_12_sspa%3Fdchild%3D1%26keywords%3Dlaptops%26qid%3D1591612091%26sr%3D8-12-spons%26psc%3D1&qualifier=1591612091&id=1813870958375055&widgetName=sp_mtf>
2020-06-08 16:02:19 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.amazon.in/Acer-14-inch-Windows-Charcoal-SF514-54T/dp/B082FHZW6V/ref=sr_1_11_sspa?dchild=1&keywords=laptops&qid=1591612091&sr=8-11-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUExMEpUVThUSUJYMFMyJmVuY3J5cHRlZElkPUEwMjMzNDcwMU4zUzlMVzY1UFY2RyZlbmNyeXB0ZWRBZElkPUEwMjc2NzQ3MlpTSTJOUUU5S0FKQSZ3aWRnZXROYW1lPXNwX210ZiZhY3Rpb249Y2xpY2tSZWRpcmVjdCZkb05vdExvZ0NsaWNrPXRydWU=> from <GET https://www.amazon.in/gp/slredirect/picassoRedirect.html/ref=pa_sp_mtf_aps_sr_pg1_1?ie=UTF8&adId=A02767472ZSI2NQE9KAJA&url=%2FAcer-14-inch-Windows-Charcoal-SF514-54T%2Fdp%2FB082FHZW6V%2Fref%3Dsr_1_11_sspa%3Fdchild%3D1%26keywords%3Dlaptops%26qid%3D1591612091%26sr%3D8-11-spons%26psc%3D1&qualifier=1591612091&id=1813870958375055&widgetName=sp_mtf>
2020-06-08 16:02:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.in/Lenovo-V145-AMD-A6-Laptop-Windows-81MTA000IH/dp/B083C9RDCW/ref=sr_1_7?dchild=1&keywords=laptops&qid=1591612091&sr=8-7> (referer: https://www.amazon.in/s?k=laptops&ref=nb_sb_noss_2)
2020-06-08 16:02:20 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.amazon.in/Lenovo-V145-AMD-A6-Laptop-Windows-81MTA000IH/dp/B083C9RDCW/ref=sr_1_7?dchild=1&keywords=laptops&qid=1591612091&sr=8-7> (referer: https://www.amazon.in/s?k=laptops&ref=nb_sb_noss_2)

There are several problems in your code:

Yield problem

Ideally, you would create a single dictionary, add all the fields you want to it, and then yield the final dictionary once. You should do something like the following:

item = dict()
if det.xpath(".//tr[1]/td[@class='label']/text()").get() == 'Brand':
    item['Brand'] = det.xpath(".//tr[1]/td[@class='value']/text()").get()
elif det.xpath(".//tr[4]/td[@class='label']/text()").get() == 'Item Weight':
    item['weight'] = det.xpath(".//tr[4]/td[@class='value']/text()").get()
.
.
.
yield item
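
Going further, instead of hardcoding row indices like `tr[1]`, `tr[4]`, you could iterate every row and map the labels you care about to item fields. A minimal sketch of that pattern (pure Python; the `rows` argument stands in for the `(label, value)` text pairs the spider would extract from each `<tr>`):

```python
# Sketch: map spec-table labels to item fields without hardcoding row indices.
# WANTED maps the label text on the page to the field name you want in the item.
WANTED = {
    'Brand': 'Brand',
    'Item Weight': 'weight',
    'RAM Size': 'RAM',
    'Hard Drive Size': 'Hard disk size',
    'Processor Brand': 'Processor brand',
    'Processor Type': 'Processor Type',
    'Graphic Card Description': 'Graphic card description',
    'Screen Size': 'Screen size',
}

def build_item(rows):
    """rows: iterable of (label, value) strings scraped from the table rows."""
    item = {}
    for label, value in rows:
        label = label.strip()
        if label in WANTED:
            item[WANTED[label]] = value.strip()
    return item
```

This way the spider does not break when Amazon reorders or omits rows in the specification table, which varies from product to product.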

get() problem

You have a typo in several places where you call get(). The syntax is incorrect: it is get(), not get.

if det.xpath(".//tr[11]/td[@class='label']/text()").get=='Hard Drive Size':

should be

if det.xpath(".//tr[11]/td[@class='label']/text()").get()=='Hard Drive Size':
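
This typo fails silently: without the parentheses you compare the bound method object itself against the string, which is never equal, so every `if` branch is skipped. A tiny illustration (the `Sel` class is just a stand-in for a Scrapy selector):

```python
# Forgetting the parentheses compares the method object, not its return value.
class Sel:
    def get(self):
        return 'Brand'

s = Sel()
print(s.get == 'Brand')    # False: bound method object, not the string it returns
print(s.get() == 'Brand')  # True: the method is actually called
```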

Wrong XPath

This is what is causing the immediate error. Note that you have three slashes:

if det.xpath(".///tr[4]/td[@class='label']/text()").get() == 'Item Weight':

should be

if det.xpath(".//tr[4]/td[@class='label']/text()").get() == 'Item Weight':
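
A triple slash is simply not valid path syntax, so the selector raises an exception as soon as parse_det runs, which is the spider error you see in the log. You can reproduce the same failure class with the standard library's ElementTree (lxml/parsel reject the expression with a similar error):

```python
# '//' means the descendant axis; a third slash makes the path invalid.
import xml.etree.ElementTree as ET

root = ET.fromstring("<table><tr><td>Brand</td></tr></table>")

print(root.findall(".//tr"))   # fine: finds descendant <tr> elements

try:
    root.findall(".///tr")     # the typo: rejected at parse time
except SyntaxError as exc:
    print("rejected:", exc)
```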

