
Can't find page elements using Selenium python

I am trying to extract the review text from this page.

Here's a condensed version of the html shown in my chrome browser inspector:

<div id="module_product_review" class="pdp-block module">
    <div class="lazyload-wrapper ">
        <div class="pdp-mod-review" data-spm="ratings_reviews" lazada_pdp_review="expose" itemid="1615006548" data-nosnippet="true" data-aplus-ae="x1_490e4591" data-spm-anchor-id="a2o42.pdp_revamp.0.ratings_reviews.508466b1OJjCoH">
            <div>...</div>
            <div>...</div>
            <div>
                <div class="mod-reviews">
                    <div class="item">
                        <div class="top">...</div>
                        <div class="middle">...</div>
                        <div class="item-content">
                            <div class="content" data-spm-anchor-id="a2o42.pdp_revamp.ratings_reviews.i3.508466b1OJjCoH">Slim and light. feel good. better if providing 16G version.</div>
                            <div class="review-image">...</div>
                            <div class="skuInfo">Color Family:MYSTIC SILVER</div>
                            <div class="bottom">...</div>
                            <div class="dialogs"></div>
                        </div>
                        <div class="seller-reply-wrapper">...</div>
                    </div>
                    <div class="item">...</div>
                    <div class="item">...</div>
                    <div class="item">...</div>
                    <div class="item">...</div>
                </div>
            </div>
        </div>
    </div>
</div>

I'm trying to extract the "Slim and light. feel good. better if providing 16G version." text from the class="content" element.

But when I try to retrieve the id="module_product_review" element using Selenium in python, this is what I get instead:

<div class="pdp-block module" id="module_product_review">
    <div class="lazyload-wrapper">
        <div class="lazy-load-placeholder">
            <div class="lazy-load-skeleton">
            </div>
        </div>
    </div>
</div>

This is my code:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

op = webdriver.ChromeOptions()
op.add_argument('--headless')
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=op)
driver.get("https://www.lazada.sg/products/huawei-matebook-d14-laptop-14-fullview-display-intel-i5-processor-8gb512gb-intel-uhd-graphics-i1615006548-s7594078907.html?spm=a2o42.searchlist.list.3.15064828Od60kh&search=1&freeshipping=1")
module_product_review = driver.find_element(By.ID, "module_product_review")
html = module_product_review.get_attribute("outerHTML")
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())

I thought it might be because I was retrieving the element before it had fully loaded, so I tried sleeping for 30 seconds before calling find_element(), but I still get the same result. As far as I can tell, it's not an issue of iframes or shadow roots either.

Is there some other issue that I'm missing?

The element whose text you are trying to read is initially outside the visible viewport, so its lazy-loaded content has not been rendered yet. You first have to scroll that element into view.
Also, since you are working in headless mode, you should set the window size. The default window size in headless mode is much smaller than the one normally used.
You should also use explicit waits with expected conditions, so that elements are accessed only once they are ready.
This should work better:

import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

op = webdriver.ChromeOptions()
op.add_argument('--headless')
# Set the window size before the driver is created; headless mode defaults to a small window
op.add_argument("--window-size=1920,1080")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=op)
wait = WebDriverWait(driver, 20)
actions = ActionChains(driver)
driver.get("https://www.lazada.sg/products/huawei-matebook-d14-laptop-14-fullview-display-intel-i5-processor-8gb512gb-intel-uhd-graphics-i1615006548-s7594078907.html?spm=a2o42.searchlist.list.3.15064828Od60kh&search=1&freeshipping=1")
# Wait until the review module exists in the DOM
element = wait.until(EC.presence_of_element_located((By.ID, "module_product_review")))
time.sleep(1)
# Scroll the module into view so the lazy-loaded reviews get rendered
actions.move_to_element(element).perform()
module_product_review = wait.until(EC.visibility_of_element_located((By.ID, "module_product_review")))
# Now you can do what you want here
html = module_product_review.get_attribute("outerHTML")
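The visibility check on the outer module may pass while the lazy-load skeleton is still in place, so it can help to wait on the review content itself. Below is a minimal sketch of a custom wait condition, assuming the `driver` and `wait` objects from the code above and the class names shown in the question. The names `reviews_loaded` and `fetch_review_texts` and the two selector constants are made up for illustration, and `scrollIntoView` is used here as an alternative to `ActionChains`.

```python
# Selectors based on the class names shown in the question (assumed to be stable)
PLACEHOLDER_CSS = "#module_product_review .lazy-load-placeholder"
REVIEW_CSS = "#module_product_review .item .content"


def reviews_loaded(driver):
    """Wait condition: return the review elements once loaded, else False."""
    # "css selector" is the string value behind By.CSS_SELECTOR
    if driver.find_elements("css selector", PLACEHOLDER_CSS):
        return False  # the skeleton is still there, keep waiting
    reviews = driver.find_elements("css selector", REVIEW_CSS)
    return reviews or False


def fetch_review_texts(driver, wait):
    """Scroll the review module into view, wait for reviews, return their text."""
    # "id" is the string value behind By.ID
    module = driver.find_element("id", "module_product_review")
    driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", module)
    # WebDriverWait.until() calls the condition with the driver until it is truthy
    reviews = wait.until(reviews_loaded)
    return [review.text for review in reviews]
```

With the objects defined earlier, `fetch_review_texts(driver, wait)` would return a list with one string per loaded review, or raise a `TimeoutException` if the reviews never appear within the wait's timeout.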

Also, in order to find that specific element and get that specific text you could use something more precise, like this:

your_text = wait.until(EC.visibility_of_element_located((By.XPATH, "(//div[@id='module_product_review']//div[@class='item']//div[@class='content'])[1]"))).text

You can use this after scrolling the element into view, as described above.
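If you would rather keep working from the `outerHTML` string, as in the question, the review texts can also be pulled out of it once the real markup has loaded. The sketch below uses the standard-library `html.parser` in place of BeautifulSoup so that it has no extra dependencies. It assumes the structure from the condensed HTML in the question (a `content` div directly inside an `item-content` div), and `ReviewTextParser`, `extract_review_texts` and `is_placeholder` are illustrative names.

```python
from html.parser import HTMLParser


class ReviewTextParser(HTMLParser):
    """Collect the text of each <div class="content"> directly inside
    a <div class="item-content">."""

    def __init__(self):
        super().__init__()
        self.stack = []      # class list of every currently open <div>
        self.texts = []      # finished review texts
        self._buffer = None  # pieces of the review being read, else None
        self._depth = 0      # stack depth of the review div being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        parent = self.stack[-1] if self.stack else []
        self.stack.append(classes)
        if self._buffer is None and "content" in classes and "item-content" in parent:
            self._buffer = []
            self._depth = len(self.stack)

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "div" or not self.stack:
            return
        # Closing the review div itself: store the collected text
        if self._buffer is not None and len(self.stack) == self._depth:
            self.texts.append("".join(self._buffer).strip())
            self._buffer = None
        self.stack.pop()


def extract_review_texts(html):
    """Return the review texts found in an outerHTML string."""
    parser = ReviewTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


def is_placeholder(html):
    """True if the markup still contains the lazy-load skeleton."""
    return "lazy-load-placeholder" in html
```

Calling `is_placeholder(html)` first is a quick way to tell whether the scroll worked: for the skeleton markup shown in the question it returns `True`, and `extract_review_texts(html)` returns an empty list.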
