
Choosing appropriate locators when scraping dynamic content with Python and Selenium

I am trying to understand the correct way to select specific elements of a webpage using Python and Selenium. I am uncertain what dictates which approach to take, such as XPath or CSS selectors.

https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu

 <a class="consumer-product-card__StyledLink-ncbvk2-1 jpGhIo" href="/embedded-menu/berkshire-roots/menu/cbd-tincture-2-1-225mg"> <span>CBD Tincture 2:1 225mg Details</span> <div class="product-card__Container-sc-7s6mw-0 iWHVJj"> <div class="product-card__Content-sc-7s6mw-1 cfcIOW"> <div class="product-information__Container-sc-65h5ke-0 ejVwks"> <img class="product-information__StyledProductImage-sc-65h5ke-1 jupjtQ" width="218" height="218" src="https://images.dutchie.com/0f253b35120facc1465b75b08bfd4d66?auto=format&amp;dpr=1&amp;bg=FFFFFF&amp;crop=faces&amp;fit=fill&amp;w=218&amp;h=218&amp;ixlib=react-7.2.0" alt="" srcset="https://images.dutchie.com/0f253b35120facc1465b75b08bfd4d66?auto=format&amp;dpr=2&amp;bg=FFFFFF&amp;crop=faces&amp;fit=fill&amp;w=218&amp;h=218&amp;ixlib=react-7.2.0 2x, https://images.dutchie.com/0f253b35120facc1465b75b08bfd4d66?auto=format&amp;dpr=3&amp;bg=FFFFFF&amp;crop=faces&amp;fit=fill&amp;w=218&amp;h=218&amp;ixlib=react-7.2.0 3x"> <div class="product-information__ProductInfo-sc-65h5ke-2 bwhblJ"> <div class="product-information__Price-sc-65h5ke-7 eEqLUB">$36.95</div> <div class="product-information__BrandContainer-sc-65h5ke-5 dlSlvE list-only"> <div class="product-information__Brand-sc-65h5ke-6 ftehWE">Berkshire Roots</div> </div> <div class="product-information__TitleContainer-sc-65h5ke-3 fOoVwz list-only false"> <div class="product-information__Title-sc-65h5ke-4 eBIyJW --line2">CBD Tincture 2:1 225mg</div> </div> <div class="product-information__TitleContainer-sc-65h5ke-3 fOoVwz mobile-and-card"> <div class="product-information__Title-sc-65h5ke-4 eBIyJW">CBD Tincture 2:1</div> <div class="product-information__Title-sc-65h5ke-4 eBIyJW --line2"> 225mg</div> </div> <div class="product-information__DetailsContainer-sc-65h5ke-9 ifqkuO"> <div class="product-information__Strain-sc-65h5ke-10 eWkod --high-cbd">High CBD</div> <div class="product-information__PotencyInfo-sc-65h5ke-14 gUReQf"><b>THC:&nbsp;</b>72.3 
mg&nbsp;&nbsp;|&nbsp;&nbsp;<b>CBD:&nbsp;</b>160.3 mg</div> </div> </div> </div> <div class="product-weights__Container-nwgli1-0 gwUwAi"> <div class="product-weights__Weights-nwgli1-1 kiObrJ"> <div aria-label="Add 0.41g to cart for $36.95" data-cy="product-card-weight" class="weight__Container-sc-11f1l3-2 dNvnhd"> <div class="weight__Price-sc-11f1l3-4 ZtHqz">$36.95</div> <div class="weight__IconContainer-sc-11f1l3-1 zqIJt"> <svg xmlns="http://www.w3.org/2000/svg" width="11" height="11" viewBox="0 0 10 10"> <path fill="#A6ACB3" fill-rule="nonzero" d="M9.176 5c0-.407-.031-.723-.438-.723l-3.022.007.007-3.022c0-.407-.326-.428-.722-.438-.407 0-.723.03-.722.436l.003 3.012-3.022.007c-.406 0-.426.325-.436.722-.01.396.031.722.438.722l3.022-.007.003 3.012c0.407.326.427.723.438.407 0.722-.03.721-.437l-.003-3.011 3.012.003c.406 0.437-.315.436-.722z"></path> </svg> </div> </div> <div class="product-weights__Fill-nwgli1-2 dtfdkt"></div> </div> </div> </div> </div> </a>

How would I use a loop of sorts to access each and every "consumer-product-card" without having scrolled to the bottom of the page? Or would I need to force the page to scroll first? Is the "consumer-product-card" approach correct, or would XPath make more sense? With either, I find it difficult to understand which is ideal and why, or even how to select one card, then the next, and so on until I reach the end.

Thank you.

This is kind of an opinionated question.

I would use the simplest CSS selector I can find that uniquely identifies the element. XPath tends to be slower and, in my experience, more brittle, and good XPath expressions are harder to write. But there is no "correct" approach.
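To make either kind of locator less brittle on this page: the class names in the question's HTML look machine-generated (for example "consumer-product-card__StyledLink-ncbvk2-1 jpGhIo"), so it may be safer to match only on the readable prefix. The sketch below assumes the trailing hashes can change between site builds, which I have not verified. The helper names are made up for illustration.

```python
def css_by_class_prefix(tag: str, prefix: str) -> str:
    """Build a CSS selector for <tag> elements whose class attribute contains `prefix`."""
    return f"{tag}[class*='{prefix}']"


def xpath_by_class_prefix(tag: str, prefix: str) -> str:
    """Build the equivalent XPath expression using contains() on @class."""
    return f"//{tag}[contains(@class, '{prefix}')]"


# Locators for the product-card link shown in the question's HTML.
# Only the readable prefix is used; the hashed suffixes are left out on purpose.
CARD_LINK_CSS = css_by_class_prefix("a", "consumer-product-card__StyledLink")
CARD_LINK_XPATH = xpath_by_class_prefix("a", "consumer-product-card__StyledLink")

print(CARD_LINK_CSS)
print(CARD_LINK_XPATH)
```

With a live driver these would be passed as driver.find_elements(By.CSS_SELECTOR, CARD_LINK_CSS) or driver.find_elements(By.XPATH, CARD_LINK_XPATH). The data-cy="product-card-weight" attribute in the same HTML is another candidate for a stable hook, e.g. [data-cy='product-card-weight'].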

I'm a little confused regarding the goal of the rest of the question. I think we would need some more detail and the code you've used to attempt this.

Also, your HTML is formatted on one line and very hard to view.

To find all cards use:

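# Selenium 4 removed find_elements_by_xpath; use find_elements with By.XPATH
# (requires: from selenium.webdriver.common.by import By)
driver.find_elements(By.XPATH, "//div[contains(@class,'consumer-product-card__InViewContainer-ncbvk2-0 dWfGpk')]")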

Then use the links I gave you in the previous question as examples.
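As a rough illustration of looping over the matched cards, here is a minimal sketch. It assumes each card container wraps an <a href="..."> link like the one in the question's HTML. collect_card_links is a hypothetical helper, not part of Selenium. The locator strategies are written as plain strings so the function does not depend on a particular Selenium import; they are the same values as By.XPATH and By.CSS_SELECTOR.

```python
XPATH = "xpath"        # same string as selenium.webdriver.common.by.By.XPATH
CSS = "css selector"   # same string as selenium.webdriver.common.by.By.CSS_SELECTOR


def collect_card_links(driver, card_xpath):
    """Return the href of the first link inside every card matched by card_xpath.

    `driver` is expected to be a Selenium WebDriver (or anything exposing the
    same find_elements / get_attribute methods). Cards without a link are skipped.
    """
    links = []
    for card in driver.find_elements(XPATH, card_xpath):
        # find_elements (plural) returns an empty list instead of raising
        # NoSuchElementException when a card has no link.
        anchors = card.find_elements(CSS, "a[href]")
        if anchors:
            links.append(anchors[0].get_attribute("href"))
    return links
```

Usage would look like collect_card_links(driver, "//div[contains(@class,'consumer-product-card__InViewContainer')]") once the page has loaded.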

UPDATE

Solution to start with:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Selenium 4 style: the chromedriver path is passed through a Service object.
driver = webdriver.Chrome(service=Service(executable_path='/snap/bin/chromium.chromedriver'))

driver.get('https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu')

# Wait up to 30 seconds until at least one product card is present in the DOM.
# presence_of_all_elements_located returns the list of matching elements.
wait = WebDriverWait(driver, 30)
cards = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".product-card__Content-sc-7s6mw-1.cfcIOW")))

# Collect the product name from each card, then print the names.
data = []
for card in cards:
    name = card.find_element(By.CSS_SELECTOR, ".product-information__TitleContainer-sc-65h5ke-3.fOoVwz.list-only").text
    data.append(name)
for i in data:
    print(i)

driver.quit()

It waits for the cards to load and prints their names. Scrolling and extracting other elements are separate questions. I found CSS selectors more suitable for this case.

Result is three items:

Rick Simpson Oil (RSO)
Live Sugar - Purple Pineapple Express
Live Sugar - Gelato #33
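On the scrolling part of the question: only three names come back, which suggests (this is my guess, not something I confirmed) that the page renders more cards only as they come into view. A common pattern is to scroll to the bottom repeatedly until the number of matched cards stops growing. The sketch below assumes the window itself scrolls and that already-loaded cards stay in the DOM; if the menu sits in an iframe or a scrollable container, or the page removes off-screen cards, it would need adapting. scroll_until_stable is a hypothetical helper.

```python
import time


def scroll_until_stable(driver, css_selector, pause=1.0, max_rounds=20):
    """Scroll to the page bottom until the count of matching elements stops growing.

    Returns the final number of elements matching css_selector.
    "css selector" is the same string as Selenium's By.CSS_SELECTOR.
    """
    seen = len(driver.find_elements("css selector", css_selector))
    for _ in range(max_rounds):
        # Jump to the current bottom of the document to trigger lazy loading.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude wait for new cards to render
        now = len(driver.find_elements("css selector", css_selector))
        if now == seen:
            break  # nothing new appeared, assume the end was reached
        seen = now
    return seen
```

After it returns, the same find_elements call as in the solution above should pick up every card that is in the DOM at that point. The fixed pause is the weakest part; an explicit wait on the card count would be more robust.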
