
Choosing appropriate locators when scraping dynamic content with Python and Selenium

I am trying to understand the correct way to select specific elements of a webpage using Python and Selenium. I am uncertain what dictates which approach to take, such as XPath, CSS selectors, and so on.

https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu

 <a class="consumer-product-card__StyledLink-ncbvk2-1 jpGhIo" href="/embedded-menu/berkshire-roots/menu/cbd-tincture-2-1-225mg"> <span>CBD Tincture 2:1 225mg Details</span> <div class="product-card__Container-sc-7s6mw-0 iWHVJj"> <div class="product-card__Content-sc-7s6mw-1 cfcIOW"> <div class="product-information__Container-sc-65h5ke-0 ejVwks"> <img class="product-information__StyledProductImage-sc-65h5ke-1 jupjtQ" width="218" height="218" src="https://images.dutchie.com/0f253b35120facc1465b75b08bfd4d66?auto=format&amp;dpr=1&amp;bg=FFFFFF&amp;crop=faces&amp;fit=fill&amp;w=218&amp;h=218&amp;ixlib=react-7.2.0" alt="" srcset="https://images.dutchie.com/0f253b35120facc1465b75b08bfd4d66?auto=format&amp;dpr=2&amp;bg=FFFFFF&amp;crop=faces&amp;fit=fill&amp;w=218&amp;h=218&amp;ixlib=react-7.2.0 2x, https://images.dutchie.com/0f253b35120facc1465b75b08bfd4d66?auto=format&amp;dpr=3&amp;bg=FFFFFF&amp;crop=faces&amp;fit=fill&amp;w=218&amp;h=218&amp;ixlib=react-7.2.0 3x"> <div class="product-information__ProductInfo-sc-65h5ke-2 bwhblJ"> <div class="product-information__Price-sc-65h5ke-7 eEqLUB">$36.95</div> <div class="product-information__BrandContainer-sc-65h5ke-5 dlSlvE list-only"> <div class="product-information__Brand-sc-65h5ke-6 ftehWE">Berkshire Roots</div> </div> <div class="product-information__TitleContainer-sc-65h5ke-3 fOoVwz list-only false"> <div class="product-information__Title-sc-65h5ke-4 eBIyJW --line2">CBD Tincture 2:1 225mg</div> </div> <div class="product-information__TitleContainer-sc-65h5ke-3 fOoVwz mobile-and-card"> <div class="product-information__Title-sc-65h5ke-4 eBIyJW">CBD Tincture 2:1</div> <div class="product-information__Title-sc-65h5ke-4 eBIyJW --line2"> 225mg</div> </div> <div class="product-information__DetailsContainer-sc-65h5ke-9 ifqkuO"> <div class="product-information__Strain-sc-65h5ke-10 eWkod --high-cbd">High CBD</div> <div class="product-information__PotencyInfo-sc-65h5ke-14 gUReQf"><b>THC:&nbsp;</b>72.3 
mg&nbsp;&nbsp;|&nbsp;&nbsp;<b>CBD:&nbsp;</b>160.3 mg</div> </div> </div> </div> <div class="product-weights__Container-nwgli1-0 gwUwAi"> <div class="product-weights__Weights-nwgli1-1 kiObrJ"> <div aria-label="Add 0.41g to cart for $36.95" data-cy="product-card-weight" class="weight__Container-sc-11f1l3-2 dNvnhd"> <div class="weight__Price-sc-11f1l3-4 ZtHqz">$36.95</div> <div class="weight__IconContainer-sc-11f1l3-1 zqIJt"> <svg xmlns="http://www.w3.org/2000/svg" width="11" height="11" viewBox="0 0 10 10"> <path fill="#A6ACB3" fill-rule="nonzero" d="M9.176 5c0-.407-.031-.723-.438-.723l-3.022.007.007-3.022c0-.407-.326-.428-.722-.438-.407 0-.723.03-.722.436l.003 3.012-3.022.007c-.406 0-.426.325-.436.722-.01.396.031.722.438.722l3.022-.007.003 3.012c0.407.326.427.723.438.407 0.722-.03.721-.437l-.003-3.011 3.012.003c.406 0.437-.315.436-.722z"></path> </svg> </div> </div> <div class="product-weights__Fill-nwgli1-2 dtfdkt"></div> </div> </div> </div> </div> </a>

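How can I use some kind of loop to access each "consumer product card" without scrolling to the bottom of the page? Or do I need to force the page to scroll first? Is the "consumer product card" approach correct, or does XPath make more sense? For whatever reason, I am having a hard time understanding which is ideal, or even how to select one instance, then the next, and the next, until I reach the end.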

Thank you.

This is an opinion-based question.

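I would probably use the simplest CSS selector I can find that uniquely identifies the element. XPath is slower, and in my experience it tends to be more brittle and makes it harder to arrive at a good element selector. But there is no "right" way.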
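As a rough illustration of that trade-off, the sketch below builds both kinds of locator for the cards in the question. It assumes (this is not confirmed for the site) that the readable prefix of each generated class name, such as `product-information__Price`, changes less often than the hashed suffix that follows it. The helper names are made up for this example.

```python
def css_class_contains(fragment: str) -> str:
    """Return a CSS selector for elements whose class attribute contains `fragment`."""
    return f"[class*='{fragment}']"


def xpath_class_contains(fragment: str, tag: str = "*") -> str:
    """Return the equivalent XPath expression; `tag` narrows the element type."""
    return f"//{tag}[contains(@class, '{fragment}')]"


# Class-name prefixes copied from the HTML in the question (hashed suffixes dropped).
CARD_LINK = "consumer-product-card__StyledLink"
PRICE = "product-information__Price"

# The same card link, expressed both ways.
print(css_class_contains(CARD_LINK))
print(xpath_class_contains(CARD_LINK, "a"))

# A descendant selector: the price inside a card link.
print(css_class_contains(CARD_LINK) + " " + css_class_contains(PRICE))

# An attribute selector on a test hook from the question's HTML is simpler still.
print("[data-cy='product-card-weight']")
```

Either form can then be passed to Selenium, for example `driver.find_elements(By.CSS_SELECTOR, css_class_contains(CARD_LINK))` or `driver.find_elements(By.XPATH, xpath_class_contains(CARD_LINK, "a"))`.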

I am a bit confused about the goal of the rest of the question. I think we need more details, along with the code you used when attempting this.

Also, your HTML is formatted as a single line, which makes it hard to read.

To find all the cards, use:

driver.find_elements(By.XPATH, "//div[contains(@class,'consumer-product-card__InViewContainer-ncbvk2-0 dWfGpk')]")

Then use the link I gave you in your previous question as an example.
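Since that linked example is not reproduced here, the following is a minimal sketch of the usual per-card pattern: find every card once, then run further lookups relative to each card element. The `FIELDS` map and both function names are hypothetical. The selectors are derived from the HTML in the question and assume the readable class-name prefixes stay the same.

```python
CSS = "css selector"  # the string value of selenium's By.CSS_SELECTOR

# Hypothetical field map built from class-name prefixes in the question's HTML.
# The trailing "-" keeps "Title-" from also matching "TitleContainer-".
FIELDS = {
    "name": "[class*='product-information__Title-']",
    "brand": "[class*='product-information__Brand-']",
    "price": "[class*='product-information__Price-']",
}


def card_to_record(card) -> dict:
    """Read each field from inside one card element; missing fields become None."""
    record = {}
    for field, selector in FIELDS.items():
        # find_elements returns an empty list instead of raising when nothing matches.
        matches = card.find_elements(CSS, selector)
        record[field] = matches[0].text.strip() if matches else None
    return record


def scrape_cards(driver, card_selector: str) -> list:
    """Locate every card, then extract the fields relative to each one."""
    return [card_to_record(card) for card in driver.find_elements(CSS, card_selector)]
```

Calling `find_elements` on a card rather than on the driver limits the search to that card's subtree, so the same short selector works for every card. In a real script, `By.CSS_SELECTOR` from `selenium.webdriver.common.by` would normally replace the `CSS` constant.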

Update

A solution to start with:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Selenium 4: the chromedriver path is passed through a Service object.
driver = webdriver.Chrome(service=Service('/snap/bin/chromium.chromedriver'))

driver.get('https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu')

# Wait up to 30 seconds for at least one product card to be present in the DOM.
card_selector = ".product-card__Content-sc-7s6mw-1.cfcIOW"
wait = WebDriverWait(driver, 30)
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, card_selector)))
cards = driver.find_elements(By.CSS_SELECTOR, card_selector)

# Look up the title inside each card, not in the whole page.
data = []
for card in cards:
    name = card.find_element(By.CSS_SELECTOR, ".product-information__TitleContainer-sc-65h5ke-3.fOoVwz.list-only").text
    data.append(name)
for i in data:
    print(i)

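It waits for the cards and prints their names. Scrolling and the other elements, however, are a completely different question. I find CSS selectors a better fit for this case.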

The result is three items:

Rick Simpson Oil (RSO)
Live Sugar - Purple Pineapple Express
Live Sugar - Gelato #33
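On the scrolling part of the question: only three names come back, which may mean that further cards are rendered only once they scroll into view. That is a guess based on the `InViewContainer` class name and has not been verified against the site. Under that assumption, one possible sketch is to keep scrolling until the number of matching cards stops growing. The function name and its parameters are hypothetical, and the approach would not work if the page removes off-screen cards from the DOM or scrolls inside a nested container.

```python
import time

CSS = "css selector"  # the string value of selenium's By.CSS_SELECTOR


def scroll_until_stable(driver, card_selector: str, pause: float = 1.0, max_rounds: int = 20) -> int:
    """Scroll to the bottom repeatedly until no new cards appear; return the final count."""
    count = len(driver.find_elements(CSS, card_selector))
    for _ in range(max_rounds):
        # Jump to the current bottom of the page to trigger any lazy rendering.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude fixed wait for newly rendered cards
        new_count = len(driver.find_elements(CSS, card_selector))
        if new_count == count:
            break  # nothing new was rendered, so assume the end of the list
        count = new_count
    return count
```

After `scroll_until_stable(driver, ".product-card__Content-sc-7s6mw-1.cfcIOW")` returns, the cards can be collected and looped over exactly as in the solution above. The `max_rounds` cap is there so that an endlessly growing page cannot keep the loop running forever.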

