
Choosing appropriate locators when scraping dynamic content with Python and Selenium

I am trying to understand the correct way to select specific elements of a webpage using Python and Selenium. I am uncertain what dictates which approach to take, such as XPath or CSS.

https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu

 <a class="consumer-product-card__StyledLink-ncbvk2-1 jpGhIo" href="/embedded-menu/berkshire-roots/menu/cbd-tincture-2-1-225mg"> <span>CBD Tincture 2:1 225mg Details</span> <div class="product-card__Container-sc-7s6mw-0 iWHVJj"> <div class="product-card__Content-sc-7s6mw-1 cfcIOW"> <div class="product-information__Container-sc-65h5ke-0 ejVwks"> <img class="product-information__StyledProductImage-sc-65h5ke-1 jupjtQ" width="218" height="218" src="https://images.dutchie.com/0f253b35120facc1465b75b08bfd4d66?auto=format&amp;dpr=1&amp;bg=FFFFFF&amp;crop=faces&amp;fit=fill&amp;w=218&amp;h=218&amp;ixlib=react-7.2.0" alt="" srcset="https://images.dutchie.com/0f253b35120facc1465b75b08bfd4d66?auto=format&amp;dpr=2&amp;bg=FFFFFF&amp;crop=faces&amp;fit=fill&amp;w=218&amp;h=218&amp;ixlib=react-7.2.0 2x, https://images.dutchie.com/0f253b35120facc1465b75b08bfd4d66?auto=format&amp;dpr=3&amp;bg=FFFFFF&amp;crop=faces&amp;fit=fill&amp;w=218&amp;h=218&amp;ixlib=react-7.2.0 3x"> <div class="product-information__ProductInfo-sc-65h5ke-2 bwhblJ"> <div class="product-information__Price-sc-65h5ke-7 eEqLUB">$36.95</div> <div class="product-information__BrandContainer-sc-65h5ke-5 dlSlvE list-only"> <div class="product-information__Brand-sc-65h5ke-6 ftehWE">Berkshire Roots</div> </div> <div class="product-information__TitleContainer-sc-65h5ke-3 fOoVwz list-only false"> <div class="product-information__Title-sc-65h5ke-4 eBIyJW --line2">CBD Tincture 2:1 225mg</div> </div> <div class="product-information__TitleContainer-sc-65h5ke-3 fOoVwz mobile-and-card"> <div class="product-information__Title-sc-65h5ke-4 eBIyJW">CBD Tincture 2:1</div> <div class="product-information__Title-sc-65h5ke-4 eBIyJW --line2"> 225mg</div> </div> <div class="product-information__DetailsContainer-sc-65h5ke-9 ifqkuO"> <div class="product-information__Strain-sc-65h5ke-10 eWkod --high-cbd">High CBD</div> <div class="product-information__PotencyInfo-sc-65h5ke-14 gUReQf"><b>THC:&nbsp;</b>72.3 
mg&nbsp;&nbsp;|&nbsp;&nbsp;<b>CBD:&nbsp;</b>160.3 mg</div> </div> </div> </div> <div class="product-weights__Container-nwgli1-0 gwUwAi"> <div class="product-weights__Weights-nwgli1-1 kiObrJ"> <div aria-label="Add 0.41g to cart for $36.95" data-cy="product-card-weight" class="weight__Container-sc-11f1l3-2 dNvnhd"> <div class="weight__Price-sc-11f1l3-4 ZtHqz">$36.95</div> <div class="weight__IconContainer-sc-11f1l3-1 zqIJt"> <svg xmlns="http://www.w3.org/2000/svg" width="11" height="11" viewBox="0 0 10 10"> <path fill="#A6ACB3" fill-rule="nonzero" d="M9.176 5c0-.407-.031-.723-.438-.723l-3.022.007.007-3.022c0-.407-.326-.428-.722-.438-.407 0-.723.03-.722.436l.003 3.012-3.022.007c-.406 0-.426.325-.436.722-.01.396.031.722.438.722l3.022-.007.003 3.012c0.407.326.427.723.438.407 0.722-.03.721-.437l-.003-3.011 3.012.003c.406 0.437-.315.436-.722z"></path> </svg> </div> </div> <div class="product-weights__Fill-nwgli1-2 dtfdkt"></div> </div> </div> </div> </div> </a>

How would I use a loop of some sort to access each and every "consumer-product-card" without having scrolled to the bottom of the page? Or would I need to force the page to scroll first? Is the "consumer-product-card" approach correct, or would XPath make more sense? With either, I find it difficult to understand which is ideal and why, or even how to select one instance, then the next, and the next, until I reach the end.

Thank you.

This is a somewhat opinion-based question.

I would likely use the simplest CSS selector I can find that uniquely identifies the element. XPath is slower and, in my experience, more brittle, and it is harder to find good XPath selectors for elements. But there is no "correct" approach.
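As a rough illustration of the "simplest selector" idea: the class names in the posted HTML (for example "consumer-product-card__StyledLink-ncbvk2-1 jpGhIo") look auto-generated, so matching only the readable prefix may hold up better than matching the full string. The sketch below builds such locators as plain tuples. The helper names are hypothetical, and the idea that the suffixes change between builds is an assumption, not something confirmed for this site.

```python
# Hypothetical helpers that build Selenium locator tuples from a class prefix.
# The strings "css selector" and "xpath" are the values of By.CSS_SELECTOR
# and By.XPATH, so the tuples can be passed to find_elements directly.

def css_prefix_locator(prefix, tag="*"):
    """Match elements whose class attribute contains `prefix` (CSS)."""
    return ("css selector", f"{tag}[class*='{prefix}']")


def xpath_prefix_locator(prefix, tag="*"):
    """The same match expressed as XPath, for comparison."""
    return ("xpath", f"//{tag}[contains(@class, '{prefix}')]")


# Locators for the card link and the price shown in the question's HTML.
CARD_LINK = css_prefix_locator("consumer-product-card__StyledLink", tag="a")
PRICE = css_prefix_locator("product-information__Price", tag="div")
```

With a live driver this would be used as driver.find_elements(*CARD_LINK), and a price inside one card as card.find_element(*PRICE). The data-cy="product-card-weight" attribute in the posted HTML is another candidate, via the selector [data-cy='product-card-weight'].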

I'm a little confused regarding the goal of the rest of the question. I think we would need some more detail and the code you've used to attempt this.

Also, your HTML is formatted on one line and very hard to view.

To find all cards, use:

# Selenium 4 syntax; By comes from selenium.webdriver.common.by
driver.find_elements(By.XPATH, "//div[contains(@class,'consumer-product-card__InViewContainer-ncbvk2-0 dWfGpk')]")

Then use the links I gave you in the previous question as an example.

UPDATE

A solution to start with:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Selenium 4 style: the chromedriver path is passed through a Service object
driver = webdriver.Chrome(service=Service('/snap/bin/chromium.chromedriver'))

driver.get('https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu')

# Wait up to 30 seconds for at least one product card to be present in the DOM
wait = WebDriverWait(driver, 30)
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".product-card__Content-sc-7s6mw-1.cfcIOW")))
cards = driver.find_elements(By.CSS_SELECTOR, ".product-card__Content-sc-7s6mw-1.cfcIOW")

# Collect the product name from each card, searching relative to the card element
data = []
for card in cards:
    name = card.find_element(By.CSS_SELECTOR, ".product-information__TitleContainer-sc-65h5ke-3.fOoVwz.list-only").text
    data.append(name)
for i in data:
    print(i)

It waits for the cards and prints their names. Scrolling and the other elements are completely different questions. I found CSS selectors more suitable for this case.

The result is three items:

Rick Simpson Oil (RSO)
Live Sugar - Purple Pineapple Express
Live Sugar - Gelato #33
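
Only three names come back, which may mean the remaining cards are rendered only as they scroll into view. That is an assumption about this page, not something verified here. If it holds, one common approach is to scroll and re-query until the number of matched cards stops growing. The sketch below is a minimal version of that idea; the function name and its pause and max_rounds parameters are hypothetical, and it assumes the window itself scrolls rather than an inner container.

```python
import time


def load_all_cards(driver, locator, pause=1.0, max_rounds=50):
    """Scroll to the page bottom until no new elements match `locator`.

    `driver` is assumed to be a Selenium WebDriver; `locator` is a
    (strategy, selector) tuple such as ("css selector", ".some-class").
    """
    by, value = locator
    previous_count = -1
    cards = []
    for _ in range(max_rounds):
        cards = driver.find_elements(by, value)
        if len(cards) == previous_count:
            break  # the last scroll added nothing, so assume the list is complete
        previous_count = len(cards)
        # Jump to the bottom of the document to trigger any lazy loading.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude fixed pause while new cards render
    return cards
```

It could be called as load_all_cards(driver, ("css selector", ".product-card__Content-sc-7s6mw-1.cfcIOW")) in place of the single find_elements call above. The fixed pause is the weakest part; an explicit wait on the card count would be more robust.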
