
Correcting an XPath

This is my output. I want to remove '4.9 out of 5 stars' and '1,795 ratings' from it:

'4.9 out of 5 stars', '1,795 ratings', '#3,626 in Kitchen & Dining (', 'See Top 100 in Kitchen & Dining', ')', '#18 in', 'Measuring Spoons'

This is the page link: https://www.amazon.com/OXO-Squeeze-Silicone-Measuring-Stay-Cool/dp/B01434TUTU/ref=sr_1_41?crid=10FYGF4D5KRO0&keywords=measuring+tools+%26+scales&qid=1646057599&sprefix=measuring +tools+and+scal%2Caps%2C363&sr=8-41

This is my code:

from scrapy import Spider
from scrapy.http import Request


class AuthorSpider(Spider):
    name = 'pushpa'
    start_urls = ['https://www.amazon.com/s?k=measuring+tools+%26+scales&crid=10FYGF4D5KRO0&sprefix=measuring+tools+and+scal%2Caps%2C363&ref=nb_sb_ss_ts-doa-p_4_24']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    
    def parse(self, response):
        books = response.xpath("//div//h2//@href").extract()
        for book in books:
            url = response.urljoin(book)
            yield Request(url, callback=self.parse_book)




    def parse_book(self, response):
        coordinate = response.xpath("//table[@id='productDetails_detailBullets_sections1']//td//span//text()")[2:].extract()
        coordinate = [i.strip() for i in coordinate]
        # remove empty strings
        coordinate = [i for i in coordinate if i]
        yield {
            'Best_sellerrank': coordinate
        }
    

Your XPath is correct. You only need list slicing and the join() method to get the desired output.

from scrapy import Spider
from scrapy.http import Request


class AuthorSpider(Spider):
    name = 'pushpa'
    start_urls = ['https://www.amazon.com/s?k=measuring+tools+%26+scales&crid=10FYGF4D5KRO0&sprefix=measuring+tools+and+scal%2Caps%2C363&ref=nb_sb_ss_ts-doa-p_4_24']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'
    }
    
    def parse(self, response):
        books = response.xpath("//div//h2//@href").extract()
        for book in books:
            url = response.urljoin(book)
            yield Request(url, callback=self.parse_book)




    def parse_book(self, response):
        coordinate = response.xpath("//table[@id='productDetails_detailBullets_sections1']//td//span//text()")[2:].extract()
        coordinate = [i.strip() for i in coordinate]
        # remove empty strings
        coordinate = [i for i in coordinate if i]
        # keep only the first two entries
        coordinate = coordinate[:2]
        # join the remaining entries into a single string
        coordinate = ', '.join(coordinate)

        yield {
            'Best_sellerrank': coordinate
        }

Output:

{'Best_sellerrank': '4.7 out of 5 stars,  43,078 ratings'}
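
Note that coordinate[:2] keeps the first two entries (the rating and the rating count), which is what the output above shows. If the goal is instead to drop those two entries, as the question asks, the slice would go the other way. The following is only a minimal sketch on the sample list from the question, without Scrapy. It assumes the rating and the rating count always come first in the extracted list, which may not hold for every product page. clean_rank is a hypothetical helper name, not part of Scrapy.

```python
def clean_rank(values, skip=2):
    """Strip whitespace, drop empty strings, then drop the first `skip` entries.

    `skip=2` is an assumption: it presumes the rating and the rating
    count are always the first two non-empty strings.
    """
    cleaned = [v.strip() for v in values]
    cleaned = [v for v in cleaned if v]
    return cleaned[skip:]


# Sample data copied from the output shown in the question.
sample = [
    '4.9 out of 5 stars', '1,795 ratings',
    '#3,626 in Kitchen & Dining (', 'See Top 100 in Kitchen & Dining', ')',
    '#18 in', 'Measuring Spoons',
]

rank_parts = clean_rank(sample)
# Join the remaining pieces into one string for the yielded item.
best_seller_rank = ' '.join(rank_parts)
print(best_seller_rank)
```

In the spider, the same idea would amount to replacing coordinate[:2] with coordinate[2:] before the join. A more robust approach would select only the table row that holds the rank, but that depends on page markup not shown here.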
