
Choosing appropriate locators when scraping dynamic content with Python and Selenium

I am trying to understand the correct way to select specific elements of a webpage using Python and Selenium. I am uncertain what dictates which approach to take, such as XPath, CSS selectors, and so on.

https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu

 <a class="consumer-product-card__StyledLink-ncbvk2-1 jpGhIo" href="/embedded-menu/berkshire-roots/menu/cbd-tincture-2-1-225mg"> <span>CBD Tincture 2:1 225mg Details</span> <div class="product-card__Container-sc-7s6mw-0 iWHVJj"> <div class="product-card__Content-sc-7s6mw-1 cfcIOW"> <div class="product-information__Container-sc-65h5ke-0 ejVwks"> <img class="product-information__StyledProductImage-sc-65h5ke-1 jupjtQ" width="218" height="218" src="https://images.dutchie.com/0f253b35120facc1465b75b08bfd4d66?auto=format&amp;dpr=1&amp;bg=FFFFFF&amp;crop=faces&amp;fit=fill&amp;w=218&amp;h=218&amp;ixlib=react-7.2.0" alt="" srcset="https://images.dutchie.com/0f253b35120facc1465b75b08bfd4d66?auto=format&amp;dpr=2&amp;bg=FFFFFF&amp;crop=faces&amp;fit=fill&amp;w=218&amp;h=218&amp;ixlib=react-7.2.0 2x, https://images.dutchie.com/0f253b35120facc1465b75b08bfd4d66?auto=format&amp;dpr=3&amp;bg=FFFFFF&amp;crop=faces&amp;fit=fill&amp;w=218&amp;h=218&amp;ixlib=react-7.2.0 3x"> <div class="product-information__ProductInfo-sc-65h5ke-2 bwhblJ"> <div class="product-information__Price-sc-65h5ke-7 eEqLUB">$36.95</div> <div class="product-information__BrandContainer-sc-65h5ke-5 dlSlvE list-only"> <div class="product-information__Brand-sc-65h5ke-6 ftehWE">Berkshire Roots</div> </div> <div class="product-information__TitleContainer-sc-65h5ke-3 fOoVwz list-only false"> <div class="product-information__Title-sc-65h5ke-4 eBIyJW --line2">CBD Tincture 2:1 225mg</div> </div> <div class="product-information__TitleContainer-sc-65h5ke-3 fOoVwz mobile-and-card"> <div class="product-information__Title-sc-65h5ke-4 eBIyJW">CBD Tincture 2:1</div> <div class="product-information__Title-sc-65h5ke-4 eBIyJW --line2"> 225mg</div> </div> <div class="product-information__DetailsContainer-sc-65h5ke-9 ifqkuO"> <div class="product-information__Strain-sc-65h5ke-10 eWkod --high-cbd">High CBD</div> <div class="product-information__PotencyInfo-sc-65h5ke-14 gUReQf"><b>THC:&nbsp;</b>72.3 
mg&nbsp;&nbsp;|&nbsp;&nbsp;<b>CBD:&nbsp;</b>160.3 mg</div> </div> </div> </div> <div class="product-weights__Container-nwgli1-0 gwUwAi"> <div class="product-weights__Weights-nwgli1-1 kiObrJ"> <div aria-label="Add 0.41g to cart for $36.95" data-cy="product-card-weight" class="weight__Container-sc-11f1l3-2 dNvnhd"> <div class="weight__Price-sc-11f1l3-4 ZtHqz">$36.95</div> <div class="weight__IconContainer-sc-11f1l3-1 zqIJt"> <svg xmlns="http://www.w3.org/2000/svg" width="11" height="11" viewBox="0 0 10 10"> <path fill="#A6ACB3" fill-rule="nonzero" d="M9.176 5c0-.407-.031-.723-.438-.723l-3.022.007.007-3.022c0-.407-.326-.428-.722-.438-.407 0-.723.03-.722.436l.003 3.012-3.022.007c-.406 0-.426.325-.436.722-.01.396.031.722.438.722l3.022-.007.003 3.012c0.407.326.427.723.438.407 0.722-.03.721-.437l-.003-3.011 3.012.003c.406 0.437-.315.436-.722z"></path> </svg> </div> </div> <div class="product-weights__Fill-nwgli1-2 dtfdkt"></div> </div> </div> </div> </div> </a>

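How can I loop over each "consumer-product-card" without scrolling to the bottom of the page? Or do I need to force the page to scroll first? Is targeting "consumer-product-card" the right approach, or does XPath make more sense? For whatever reason, I am having a hard time understanding which is ideal, or even how to select one instance, then the next, and the next, until I reach the end.

Thank you.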

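This is an opinion-based question.

I would probably use the simplest CSS selector I can find that uniquely identifies the element. XPath is slower, and in my experience it can be more brittle and makes it harder to come up with good element selectors. But there is no "right" way.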
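As a rough illustration of "simplest selector that still identifies the element": the class names in the posted HTML (for example consumer-product-card__StyledLink-ncbvk2-1 jpGhIo) look machine-generated, so it is an assumption here that the trailing hash may change when the site is rebuilt. A minimal sketch that matches only the readable prefix, in both CSS and XPath form, could look like this. The helper names are hypothetical, and data-cy="product-card-weight" is taken from the posted HTML.

```python
def css_by_class_prefix(prefix: str, tag: str = "") -> str:
    """CSS selector for elements whose class attribute contains `prefix`."""
    return f"{tag}[class*='{prefix}']"


def xpath_by_class_prefix(prefix: str, tag: str = "*") -> str:
    """XPath equivalent of css_by_class_prefix."""
    return f"//{tag}[contains(@class, '{prefix}')]"


# The same card link expressed both ways, without the generated hash suffix.
CARD_LINK_CSS = css_by_class_prefix("consumer-product-card__StyledLink", tag="a")
CARD_LINK_XPATH = xpath_by_class_prefix("consumer-product-card__StyledLink", tag="a")

# Dedicated attributes, when present, tend to be even simpler to target.
WEIGHT_BUTTON_CSS = "[data-cy='product-card-weight']"


def find_card_links(driver):
    """Return all product card links; `driver` is a Selenium WebDriver."""
    # Imported here so the selector helpers above work without Selenium installed.
    from selenium.webdriver.common.by import By

    return driver.find_elements(By.CSS_SELECTOR, CARD_LINK_CSS)
```

Both selectors should match the same elements, so the choice between them is mostly a matter of readability.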

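I am a bit confused about the goal of the rest of the question. I think we need more details, along with the code you used to attempt this.

Also, your HTML is formatted as a single line, which makes it hard to read.

To find all the cards, use: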

driver.find_elements(By.XPATH, "//div[contains(@class,'consumer-product-card__InViewContainer-ncbvk2-0 dWfGpk')]")

Then use the link I gave you in your previous question as an example.

Update

A solution to start from:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Selenium 4 style: the driver path is passed through a Service object, and
# find_element(s)(By..., ...) replaces the older find_element(s)_by_* helpers.
driver = webdriver.Chrome(service=Service('/snap/bin/chromium.chromedriver'))

driver.get('https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu')

# Wait up to 30 seconds for the product cards to be present in the DOM.
wait = WebDriverWait(driver, 30)
card_selector = ".product-card__Content-sc-7s6mw-1.cfcIOW"
cards = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, card_selector)))

# Collect the title of every card that has loaded so far, then print them.
data = []
for card in cards:
    name = card.find_element(By.CSS_SELECTOR, ".product-information__TitleContainer-sc-65h5ke-3.fOoVwz.list-only").text
    data.append(name)
for i in data:
    print(i)

driver.quit()

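It waits for the cards and prints their names. Scrolling and the other elements are a separate problem, though. I find CSS selectors a better fit for this case.

The result is three items: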

Rick Simpson Oil (RSO)
Live Sugar - Purple Pineapple Express
Live Sugar - Gelato #33
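
Only three items come back, which suggests the remaining cards are added as the page scrolls. That is an assumption, not something verified against the site. Under that assumption, one possible sketch is to keep scrolling until the number of cards stops growing and only then read the titles. The function name, the prefix-based selectors, and the pause and round limits are all hypothetical, and the embedded menu may need a different scroll target than the window.

```python
import time

CSS = "css selector"  # the string value of Selenium's By.CSS_SELECTOR

# Assumed selectors: readable class prefixes from the posted HTML, no hash suffix.
CARD_CSS = "[class*='product-card__Content']"
TITLE_CSS = "[class*='product-information__TitleContainer'].list-only"


def collect_card_names(driver, pause=1.0, max_rounds=50):
    """Scroll until no new cards appear, then return the title of each card."""
    seen = 0
    for _ in range(max_rounds):
        count = len(driver.find_elements(CSS, CARD_CSS))
        if count == seen:
            break  # the last scroll loaded nothing new
        seen = count
        # Jump to the bottom so that lazily loaded cards get a chance to render.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
    cards = driver.find_elements(CSS, CARD_CSS)
    return [card.find_element(CSS, TITLE_CSS).text for card in cards]
```

It would be called as collect_card_names(driver) after the initial wait has succeeded, since it returns an empty list if no cards are present yet.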

