python scrapy get list of urls from products page

I am trying to scrape this page: https://www.jomashop.com/watches.html?dir=desc&order=bestsellers

My current code to get a list of the product URLs is:

def start_requests(self):
    yield scrapy.Request(
        url=self.start_url,
        callback=self.parse_product_url,
        headers=self.headers,
        dont_filter=True,
        meta={"page": 1},
    )

def parse_product_url(self, response):
    if response.status in self.handle_httpstatus_list:
        time.sleep(5)
        yield response.request
    else:
        meta_product = response.meta
        meta_product["rank"] = 0
        meta_product["category_url"] = response.url
        product_urls = response.xpath("//a[@class='productName-link']/a/@href").extract()

but product_urls consistently comes back empty. I am not well versed in XPath syntax, so I suspect there is a problem with how I have written the expression. Any help getting the product URLs is appreciated. Thank you!

I have also tried:

response.xpath("//div[@class='product-details']/h2[@class='productName-link']/text()")
response.xpath("//a[@class='productName-link']/a/@href").extract()
response.xpath("//*[contains(@class, 'product-details')]/a/@href").extract()
response.xpath("//*[contains(@class, 'productName-link')]/a/@href").extract()
response.xpath("//*[@id='product-list-wrapper']/ul/li[4]/div/div[2]/h2/a").extract()
response.xpath("/html/body/div[1]/div/main/div[1]/div[2]/div[2]/div[2]/ul/li[4]/div/div[2]/h2/a").extract()

I feel like I am so close, but none of these seem to be working.

It doesn't look like that page serves any links with that class in its source (the source being the data the web server sends to the client, which in this case is your Python script). If you look at the page source, it is mostly just JavaScript. But if you inspect the DOM from within a browser, you do see the links you are looking for, which means the page (most likely) uses JavaScript to populate the DOM with those elements: the browser runs the JavaScript, and the JavaScript builds the DOM. Scrapy does not execute any JavaScript; all it has access to is the data sent initially by the web server. Scrapy's documentation refers to this as dynamically-loaded content and offers this suggestion:

Selecting dynamically-loaded content

Some webpages show the desired data when you load them in a web browser. However, when you download them using Scrapy, you cannot reach the desired data using selectors. When this happens, the recommended approach is to find the data source and extract the data from it.
