
How can I extract car links from this website using Python Scrapy?

I am trying to extract all the car links from this website: https://www.euroncap.com/en/ratings-rewards/electric-vehicles/#?selectedMake=0&selectedMakeName=Select%20a%20make&selectedModel=0&selectedStar=&includeFullSafetyPackage=true&includeStandardSafetyPackage=true&selectedModelName=All&selectedProtocols=45155,41776&selectedClasses=1202,1199,1201,1196,1205,1203,1198,1179,40250,1197,1204,1180,34736,44997&allClasses=true&allProtocols=false&allDriverAssistanceTechnologies=false&selectedDriverAssistanceTechnologies=&thirdRowFitment=false

For example, I am trying to extract the link for the "Volvo C40 Recharge". To do that with Python Scrapy I used response.css('div.rating-table-row-c.c9 a').xpath('@href').extract(), but the output I get is ['/en{{assessment.Url}}'], whereas the actual URL is "/en/results/volvo/c40-recharge/45878". How can I extract this?

This data is rendered with JavaScript, so you can't get it directly with Scrapy (unless you use scrapy-splash, scrapy-selenium, or similar). You can see this by disabling JavaScript and reloading the page.
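A quick way to confirm this from the raw HTML is to check whether the extracted hrefs still contain unrendered template placeholders. This is a minimal sketch, assuming the placeholders use the `{{...}}` syntax seen in the question's output; the helper name `is_unrendered` is hypothetical.

```python
import re

# Matches client-side template placeholders such as {{assessment.Url}}.
PLACEHOLDER = re.compile(r"\{\{.*?\}\}")


def is_unrendered(href):
    """Return True if the href still contains a {{...}} placeholder."""
    return bool(PLACEHOLDER.search(href))


# The first href is what Scrapy returned in the question,
# the second is the link as it appears in a browser.
hrefs = ["/en{{assessment.Url}}", "/en/results/volvo/c40-recharge/45878"]
flags = [is_unrendered(h) for h in hrefs]
print(flags)
```

If every extracted href is flagged, the links are filled in client-side and the raw HTML alone will not contain them.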

If you open the Network tab in devtools, you can see that the page fetches its data from a JSON endpoint, so you can get the data you want directly from there.

Example using scrapy shell:

In [1]: headers = {
   ...: "Accept": "application/json, text/plain, */*",
   ...: "Accept-Encoding": "gzip, deflate, br",
   ...: "Accept-Language": "en-US,en;q=0.5",
   ...: "Cache-Control": "no-cache",
   ...: "Connection": "keep-alive",
   ...: "DNT": "1",
   ...: "Host": "www.euroncap.com",
   ...: "Pragma": "no-cache",
   ...: "Referer": "https://www.euroncap.com/en/ratings-rewards/electric-vehicles/",
   ...: "Sec-Fetch-Dest": "empty",
   ...: "Sec-Fetch-Mode": "cors",
   ...: "Sec-Fetch-Site": "same-origin",
   ...: "Sec-GPC": "1",
   ...: "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
   ...: }

In [2]: req = scrapy.Request(
   ...:     url='https://www.euroncap.com/Umbraco/EuroNCAP/SearchApi/GetAssessmentSearch?protocols=45155,41776&make=0&model=0&carClasses=1202,1199,1201,1196,1205,1203,1198,1179,40250,1197,1204,1180,34736,44997&driverAssistanceTechnologies=&allProtocols=false&allClasses=true&allDriverAssistanceTechnologies=false&includeFullSafetyPackage=true&includeStandardSafetyPackage=true&showOnlyHybrid=true&showOnlyFleet=false&starNumber=&thirdRowFitment=false',
   ...:     headers=headers,
   ...: )

In [3]: fetch(req)
[scrapy.core.engine] INFO: Spider opened
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.euroncap.com/Umbraco/EuroNCAP/SearchApi/GetAssessmentSearch?protocols=45155,41776&make=0&model=0&carClasses=1202,1199,1201,1196,1205,1203,1198,1179,40250,1197,1204,1180,34736,44997&driverAssistanceTechnologies=&allProtocols=false&allClasses=true&allDriverAssistanceTechnologies=false&includeFullSafetyPackage=true&includeStandardSafetyPackage=true&showOnlyHybrid=true&showOnlyFleet=false&starNumber=&thirdRowFitment=false> (referer: https://www.euroncap.com/en/ratings-rewards/electric-vehicles/)

In [4]: jsonData = response.json()

# The specific URL you requested (check the JSON file and loop through the data however you want to).
In [5]: print(jsonData['AssessmentSearchResults'][0]['Assessments'][1]['Url'])
/results/volvo/c40-recharge/45878
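To collect every car link rather than a single one, the same JSON can be walked in a loop. This is a minimal sketch, assuming the response has the shape implied by the lookup above (a list under `AssessmentSearchResults`, each entry holding an `Assessments` list whose items carry a `Url`), and assuming the `/en` prefix taken from the page template `/en{{assessment.Url}}` in the question. The helper name `extract_car_links` is hypothetical, and the sample data is hand-written for illustration; the second entry is a made-up placeholder, not real site data.

```python
# Assumption: links on the page are "/en" + Url, per the template in the question.
BASE_URL = "https://www.euroncap.com/en"


def extract_car_links(json_data):
    """Return absolute links for every assessment in the parsed JSON."""
    links = []
    for group in json_data.get("AssessmentSearchResults", []):
        for assessment in group.get("Assessments", []):
            url = assessment.get("Url")
            if url:  # skip entries without a Url
                links.append(BASE_URL + url)
    return links


# Hand-written sample mirroring the assumed structure.
sample = {
    "AssessmentSearchResults": [
        {
            "Assessments": [
                {"Url": "/results/example-make/example-model/0"},  # made-up entry
                {"Url": "/results/volvo/c40-recharge/45878"},
            ]
        }
    ]
}

links = extract_car_links(sample)
for link in links:
    print(link)
```

In the shell session above, the same function could be called as `extract_car_links(jsonData)`.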


Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you republish them, please credit this site's URL or the original source.
