Can't find page elements using Selenium Python

I am trying to extract the review text from this page.

Here's a condensed version of the HTML shown in my Chrome browser inspector:

<div id="module_product_review" class="pdp-block module">
    <div class="lazyload-wrapper ">
        <div class="pdp-mod-review" data-spm="ratings_reviews" lazada_pdp_review="expose" itemid="1615006548" data-nosnippet="true" data-aplus-ae="x1_490e4591" data-spm-anchor-id="a2o42.pdp_revamp.0.ratings_reviews.508466b1OJjCoH">
            <div>...</div>
            <div>...</div>
            <div>
                <div class="mod-reviews">
                    <div class="item">
                        <div class="top">...</div>
                        <div class="middle">...</div>
                        <div class="item-content">
                            <div class="content" data-spm-anchor-id="a2o42.pdp_revamp.ratings_reviews.i3.508466b1OJjCoH">Slim and light. feel good. better if providing 16G version.</div>
                            <div class="review-image">...</div>
                            <div class="skuInfo">Color Family:MYSTIC SILVER</div>
                            <div class="bottom">...</div>
                            <div class="dialogs"></div>
                        </div>
                        <div class="seller-reply-wrapper">...</div>
                    </div>
                    <div class="item">...</div>
                    <div class="item">...</div>
                    <div class="item">...</div>
                    <div class="item">...</div>
                </div>
            </div>
        </div>
    </div>
</div>

I'm trying to extract the text "Slim and light. feel good. better if providing 16G version." from the class="content" element.

But when I try to retrieve the id="module_product_review" element using Selenium in Python, this is what I get instead:

<div class="pdp-block module" id="module_product_review">
    <div class="lazyload-wrapper">
        <div class="lazy-load-placeholder">
            <div class="lazy-load-skeleton">
            </div>
        </div>
    </div>
</div>

This is my code:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

op = webdriver.ChromeOptions()
op.add_argument('--headless')
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=op)
driver.get("https://www.lazada.sg/products/huawei-matebook-d14-laptop-14-fullview-display-intel-i5-processor-8gb512gb-intel-uhd-graphics-i1615006548-s7594078907.html?spm=a2o42.searchlist.list.3.15064828Od60kh&search=1&freeshipping=1")
module_product_review = driver.find_element(By.ID, "module_product_review")
html = module_product_review.get_attribute("outerHTML")
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())

I thought it might have been because I was retrieving the element before it was fully loaded, so I tried sleeping for 30 seconds before calling find_element(), but I still get the same result. As far as I can tell, it's not an issue of iframes or shadow roots either.

Is there some other issue that I'm missing?

The element whose text you are trying to read is initially outside the visible viewport, so you first have to scroll it into view.
Also, since you are working in headless mode, you should set the window size explicitly. The default window size in headless mode is much smaller than the one normally used.
You should also use explicit waits with expected conditions, so that elements are accessed only once they are ready.
This should work better:

import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from webdriver_manager.chrome import ChromeDriverManager

op = webdriver.ChromeOptions()
op.add_argument('--headless')
# Set the window size on the same options object, before the driver is created
op.add_argument('--window-size=1920,1080')
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=op)
wait = WebDriverWait(driver, 20)
actions = ActionChains(driver)
driver.get("https://www.lazada.sg/products/huawei-matebook-d14-laptop-14-fullview-display-intel-i5-processor-8gb512gb-intel-uhd-graphics-i1615006548-s7594078907.html?spm=a2o42.searchlist.list.3.15064828Od60kh&search=1&freeshipping=1")
# Wait until the review container exists in the DOM
element = wait.until(EC.presence_of_element_located((By.ID, "module_product_review")))
time.sleep(1)
# Scroll the container into view so the page can load the reviews
actions.move_to_element(element).perform()
module_product_review = wait.until(EC.visibility_of_element_located((By.ID, "module_product_review")))
# now you can do what you want here
html = module_product_review.get_attribute("outerHTML")
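
If the skeleton markup from the question still comes back after scrolling, one option is to wait on a custom condition instead of a fixed sleep. The sketch below is only an illustration: the function name `reviews_rendered` is hypothetical, and it assumes the placeholder keeps the `lazy-load-placeholder` class shown in the question's output. `WebDriverWait.until` accepts any callable that takes the driver and returns a truthy value when the wait should end.

```python
# Class name taken from the placeholder HTML in the question (an assumption
# about the page; it may change if the site is updated).
PLACEHOLDER_CLASS = "lazy-load-placeholder"


def reviews_rendered(driver):
    """Hypothetical wait condition: return the review container once the
    lazy-load skeleton is gone, otherwise False so the wait keeps polling."""
    # "id" is the string value of By.ID, so no Selenium import is needed here.
    container = driver.find_element("id", "module_product_review")
    html = container.get_attribute("outerHTML") or ""
    if PLACEHOLDER_CLASS in html:
        return False
    return container
```

After the scroll step it could be used as `module_product_review = wait.until(reviews_rendered)`, which raises a `TimeoutException` if the placeholder never disappears within the wait's timeout.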

Also, to find that specific element and get its text, you could use a more precise locator, like this:

your_text = wait.until(EC.visibility_of_element_located((By.XPATH, "(//div[@id='module_product_review']//div[@class='item']//div[@class='content'])[1]"))).text

You can use this after scrolling, as mentioned above.
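
The XPath above returns only the first review. To read every review that is currently rendered, a minimal sketch along the same lines could look like this. The helper name `collect_review_texts` and the CSS selector are assumptions derived from the condensed HTML in the question, not something confirmed against the live page.

```python
def collect_review_texts(driver):
    """Hypothetical helper: return the text of every rendered review.

    The selector mirrors the structure in the question:
    #module_product_review > ... > .item > .item-content > .content
    """
    # "css selector" is the string value of By.CSS_SELECTOR.
    elements = driver.find_elements(
        "css selector", "#module_product_review .item .content"
    )
    texts = []
    for element in elements:
        text = element.text.strip()
        if text:  # skip reviews that contain only images or a rating
            texts.append(text)
    return texts
```

Call it only after the scroll and wait steps, for example `print(collect_review_texts(driver))`; before the reviews are rendered it simply returns an empty list.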
