
XPath for the href attribute of an a element

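I am handling pagination here. How can I get the href value from the HTML below? I cannot use //a[@data-page-number ='2']/@href, because the 2 becomes 3 on the following page and keeps changing after that.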

 <a data-page-number="2" data-offset="30" href="/Restaurants-g297633-oa30-Kochi_Cochin_Ernakulam_District_Kerala.html#EATERY_LIST_CONTENTS" class="nav next rndBtn ui_button primary taLnk" onclick=" require('common/Radio')('restaurant-filters').emit('paginate', this.getAttribute('data-offset'));; ta.trackEventOnPage('STANDARD_PAGINATION', 'next', '2', 0); return false; "> Next </a>

You want to get the href attribute of the Next button.


The button's onclick attribute contains the value next, so we can use that to filter out all the other a tags.

Using the Scrapy shell as an example:

In [1]: url='https://www.tripadvisor.in/Restaurants-g297633-Kochi_Cochin_Ernakulam_District_Kerala.html#EATERY_LIST_CON
   ...: TENTS'

In [2]: req = scrapy.Request(url=url)

In [3]: fetch(req)
[scrapy.core.engine] INFO: Spider opened
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.in/Restaurants-g297633-Kochi_Cochin_Ernakulam_District_Kerala.html#EATERY_LIST_CONTENTS> (referer: None)

In [4]: response.xpath('//a[contains(@onclick, "next")]/@href').get()
Out[4]: '/Restaurants-g297633-oa30-Kochi_Cochin_Ernakulam_District_Kerala.html#EATERY_LIST_CONTENTS'
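
The same filter can be tried offline, without Scrapy or a network request. This is only a sketch under a few assumptions: it uses the standard library's ElementTree, whose XPath subset has no contains(), so the substring test is done in plain Python instead. The Next anchor is trimmed from the question, while the page() helper, the Previous anchor, and the offset rule (30 results per page) are hypothetical.

```python
import xml.etree.ElementTree as ET

URL_TEMPLATE = ("/Restaurants-g297633-oa{offset}-Kochi_Cochin_Ernakulam_"
                "District_Kerala.html#EATERY_LIST_CONTENTS")


def page(n):
    """Build trimmed pagination markup whose Next link points to page n.

    Hypothetical helper: only the Next anchor follows the question's HTML.
    """
    offset = (n - 1) * 30  # assumption: 30 results per page
    href = URL_TEMPLATE.format(offset=offset)
    return (
        "<div>"
        "<a href=\"/previous.html\" "
        "onclick=\"ta.trackEventOnPage('STANDARD_PAGINATION', 'previous', '0', 0);\">"
        "Previous</a>"
        f"<a data-page-number=\"{n}\" data-offset=\"{offset}\" href=\"{href}\" "
        f"onclick=\"ta.trackEventOnPage('STANDARD_PAGINATION', 'next', '{n}', 0);\">"
        "Next</a>"
        "</div>"
    )


def next_href(markup):
    """Return the href of the first a element whose onclick mentions 'next'."""
    root = ET.fromstring(markup)
    for a in root.iter("a"):
        # Python equivalent of the predicate contains(@onclick, "next")
        if "next" in a.get("onclick", ""):
            return a.get("href")
    return None


print(next_href(page(2)))
print(next_href(page(3)))
```

Because the filter never looks at data-page-number, it keeps working when that value changes from 2 to 3.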
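
//*[@class="unified pagination js_pageLinks"]/a[2]/@href

The XPath expression above works for next-page pagination. //*[@class="unified pagination js_pageLinks"]/a selects both the previous-page and the next-page URLs, so you have to pick the next-page URL by position, which is what a[2] does.

Of course, when you inspect elements in the browser, turn off JavaScript; otherwise dynamically generated elements will be mixed in with the static ones.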
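
To illustrate the positional step, here is a small sketch using the standard library's ElementTree, which supports class-equality and [position] predicates but not /@href, so the attribute is read with .get(). The markup is hypothetical: it assumes the pagination container holds exactly a Previous anchor followed by a Next anchor, which is what the expression above relies on.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination block: Previous first, Next second.
MARKUP = """
<body>
  <div class="unified pagination js_pageLinks">
    <a href="/Restaurants-g297633-oa30-Kochi_Cochin_Ernakulam_District_Kerala.html">Previous</a>
    <a href="/Restaurants-g297633-oa90-Kochi_Cochin_Ernakulam_District_Kerala.html">Next</a>
  </div>
</body>
"""

root = ET.fromstring(MARKUP.strip())
container = ".//*[@class='unified pagination js_pageLinks']"

# Without a position, both links are selected.
all_hrefs = [a.get("href") for a in root.findall(f"{container}/a")]

# a[2] keeps only the second anchor (XPath positions are 1-based).
second = root.find(f"{container}/a[2]")
next_href = second.get("href") if second is not None else None

print(all_hrefs)
print(next_href)
```

If a page renders only one anchor in that container, a[2] matches nothing, which is why the spider checks the result before following it.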

Complete working code for pagination:

import scrapy


class TestSpider(scrapy.Spider):
    name = 'tes'
    start_urls = ['https://www.tripadvisor.in/Restaurants-g297633-oa60-Kochi_Cochin_Ernakulam_District_Kerala.html#EATERY_LIST_CONTENTS']

    def parse(self, response):
        # One item per restaurant card: keep the last text node of its first title link.
        for card in response.xpath('//*[@class="zdCeB Vt o"]'):
            yield {'Title': card.xpath('.//a[@class="Lwqic Cj b"][1]//text()').getall()[-1]}

        # The second link in the pagination block is the next-page link.
        next_page = response.xpath('//*[@class="unified pagination js_pageLinks"]/a[2]/@href').get()
        if next_page:
            # The href is relative, so resolve it against the current page URL.
            next_page_url = response.urljoin(next_page)
            yield scrapy.Request(next_page_url, callback=self.parse)

Output:

{'Title': 'Vanitha Hotel'}
2022-09-25 22:39:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html#EATERY_LIST_CONTENTS> (referer: https://www.tripadvisor.in/Restaurants-g297633-oa1080-Kochi_Cochin_Ernakulam_District_Kerala.html)
2022-09-25 22:39:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html>
{'Title': 'The Muyal RESTAURANT'}
2022-09-25 22:39:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html>
{'Title': 'K K R Food Products'}
2022-09-25 22:39:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html>
{'Title': 'Akathalam Homely Food'}
2022-09-25 22:39:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html>
{'Title': 'Thanneer Mathan Restaurant'}
2022-09-25 22:39:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html>
{'Title': 'Holly Hock'}
2022-09-25 22:39:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html>
{'Title': 'Cochin Halwa Centre'}
2022-09-25 22:39:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html>
{'Title': 'Canvas Restaurant Pizzeria'}
2022-09-25 22:39:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html>
{'Title': 'Canvas Restaurant & Pizzeria'}
2022-09-25 22:39:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html>
{'Title': 'Cafe Delaviz'}
2022-09-25 22:39:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html>
{'Title': 'Cafe Sora'}
2022-09-25 22:39:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html>
{'Title': 'Honey Dew Bakery'}
2022-09-25 22:39:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html>
{'Title': 'Food Barrel Restaurant'}
2022-09-25 22:39:07 [scrapy.core.engine] INFO: Closing spider (finished)
2022-09-25 22:39:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 152484,
 'downloader/request_count': 36,
 'downloader/request_method_count/GET': 36,
 'downloader/response_bytes': 4029630,
 'downloader/response_count': 36,
 'downloader/response_status_count/200': 36,
 'elapsed_time_seconds': 62.328141,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 9, 25, 16, 39, 7, 777225),
 'httpcompression/response_bytes': 22935503,
 'httpcompression/response_count': 36,
 'item_scraped_count': 1062,
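
The spider calls response.urljoin because the extracted href is relative. As a rough illustration of that step outside Scrapy, the standard library's urllib.parse.urljoin resolves such a link against the page URL in the same way for this case. The resolve_next() helper and the oa90 link are hypothetical; the page URL is the start URL from the spider.

```python
from urllib.parse import urljoin

PAGE_URL = ("https://www.tripadvisor.in/Restaurants-g297633-oa60-"
            "Kochi_Cochin_Ernakulam_District_Kerala.html#EATERY_LIST_CONTENTS")


def resolve_next(page_url, href):
    """Return an absolute URL for href, or None when there is no next link."""
    if not href:
        # Mirrors the `if next_page:` guard: the last page has no next link.
        return None
    return urljoin(page_url, href)


# Hypothetical relative link, shaped like the href in the question.
relative = ("/Restaurants-g297633-oa90-"
            "Kochi_Cochin_Ernakulam_District_Kerala.html#EATERY_LIST_CONTENTS")

print(resolve_next(PAGE_URL, relative))
print(resolve_next(PAGE_URL, None))
```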

You can use

"//a[@data-page-number]/@href"

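This locates a elements that have a data-page-number attribute. I guess this should be a unique locator.
UPD
You used the wrong tool for validation.
xpather.com is a better tool for validating XPath expressions.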
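
The attribute-presence predicate can also be checked locally rather than in an online tester. The sketch below uses the standard library's ElementTree, which understands [@data-page-number] but not the trailing /@href, so the attribute is read with .get(). The first anchor is trimmed from the question; the wrapping div and the second anchor are hypothetical.

```python
import xml.etree.ElementTree as ET

MARKUP = """
<div>
  <a data-page-number="2" data-offset="30"
     href="/Restaurants-g297633-oa30-Kochi_Cochin_Ernakulam_District_Kerala.html#EATERY_LIST_CONTENTS"
     class="nav next rndBtn ui_button primary taLnk">Next</a>
  <a href="/some-other-link.html">Other</a>
</div>
"""

root = ET.fromstring(MARKUP.strip())

# Select every a element that carries a data-page-number attribute,
# whatever its value is.
matches = root.findall(".//a[@data-page-number]")
hrefs = [a.get("href") for a in matches]

print(len(hrefs))
print(hrefs[0])
```

If the real page also puts data-page-number on its numbered page links, this expression returns several hrefs, so it is worth counting the matches before relying on it.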
