
Selenium Web Scraping With Beautiful Soup on Dynamic Content and Hidden Data Table

Really need help from this community!

I am scraping dynamic content in Python using Selenium and Beautiful Soup. The problem is that the pricing data table cannot be parsed into Python, even with the following code:

html = browser.execute_script('return document.body.innerHTML')
sel_soup = BeautifulSoup(html, 'html.parser')

However, I later found that if I click the 'View All Prices' button on the page before running the code above, I can parse that data table into Python.

My question is: how can I parse and access that hidden, dynamically loaded td tag data in Python without using Selenium to click every 'View All Prices' button, since there are so many of them?

The URL of the website I am scraping is https://www.cruisecritic.com/cruiseto/cruiseitineraries.cfm?port=122 , and the attached picture shows the HTML of the dynamic data table I need.

Really appreciate the help from this community!

You should target the element after it has loaded and pass it in as arguments[0], rather than grabbing the entire page via document:

html_of_interest=driver.execute_script('return arguments[0].innerHTML',element)
sel_soup=BeautifulSoup(html_of_interest, 'html.parser')
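Once `html_of_interest` is parsed into `sel_soup`, the pricing cells can be pulled out with `find_all`. A minimal sketch, using an invented HTML fragment in place of the live innerHTML (the real table markup on the site will differ):

```python
from bs4 import BeautifulSoup

# Sample fragment standing in for the innerHTML returned by execute_script;
# the actual markup of the pricing table is an assumption here.
html_of_interest = """
<table>
  <tr><td>Interior</td><td>$499</td></tr>
  <tr><td>Balcony</td><td>$899</td></tr>
</table>
"""

sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
# get_text(strip=True) drops surrounding whitespace inside each cell
cells = [td.get_text(strip=True) for td in sel_soup.find_all('td')]
print(cells)  # ['Interior', '$499', 'Balcony', '$899']
```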

This covers two practical cases:

1.

The element is not yet loaded in the DOM and you need to wait for it:

from time import sleep
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

browser.get("url")
sleep(experimental)  # get() usually returns only after the page is loaded, but sometimes JS keeps running after the load event

try:
    element = WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.ID, 'your_id_of_interest')))
    print("element is ready, do the thing!")
    html_of_interest = browser.execute_script('return arguments[0].innerHTML', element)
    sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
except TimeoutException:
    print("Something's wrong!")

2.

The element is in a shadow root and you need to expand the shadow root first. This is probably not your situation, but I will mention it here since it is relevant for future reference. Example:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()


def expand_shadow_element(element):
    # Ask the browser for the element's shadow root so we can read inside it
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
    return shadow_root


driver.get("chrome://settings")
root1 = driver.find_element_by_tag_name('settings-ui')

html_of_interest = driver.execute_script('return arguments[0].innerHTML', root1)
sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
sel_soup  # empty -- the shadow root has not been expanded yet

shadow_root1 = expand_shadow_element(root1)

html_of_interest = driver.execute_script('return arguments[0].innerHTML', shadow_root1)
sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
sel_soup  # now contains the shadow DOM content

