
(Python) Scraping data from a website with 'style:hidden' tags?

I'm using Selenium to try and get data from a website. But the data I want is stored in 'hidden' tags, so it's not visible when I pull the source. Is there any way to get around this? Are there different types of hidden?

I presume it's hidden because I'm also using Firebug, which can see the source on the page I'm trying to scrape, but it 'greys-out' that source, which I've read is an indication of that source being hidden with the style:hidden tag.

What is probably happening is that the website loads the additional data after the initial page load through JavaScript (for example XMLHttpRequest), or hides it with CSS. Firebug shows you the DOM in its current state, after those scripts have run, which is why it can show content that is missing from the raw source. With WebDriver you can drive the browser to load a page and interact with it. The difficulty arises when some of the information only appears after a specific user interaction. In that case, use WebDriver to perform the same sequence of actions so that the DOM changes accordingly.

You might also try changing the element's CSS properties to make it visible.
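As a rough illustration of the CSS approach, the hypothetical helper below overrides the element's inline style through execute_script. It assumes the element has already been located and is hidden by its own display or visibility rules; a hidden ancestor, zero size, or off-screen positioning would need different handling:

```python
def reveal(driver, element):
    """Force `element` to be displayed, then return its (now visible) text.

    `driver` is a Selenium WebDriver and `element` a WebElement found earlier.
    Assumption: the element is hidden by its own `display` / `visibility`
    rules, not by a parent element.
    """
    driver.execute_script(
        "arguments[0].style.display = 'block';"
        "arguments[0].style.visibility = 'visible';",
        element,
    )
    # WebElement.text only returns text the browser considers visible,
    # which is why the style is changed before reading it.
    return element.text
```

If only the text is needed, reading element.get_attribute("textContent") avoids touching the page's styling at all.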

Since you didn't provide any code showing what you are trying to do, it is hard to help more precisely. But you will find plenty of WebDriver code examples in Python in the official documentation.

One of my specific reasons for scraping with Selenium is to make sure the JavaScript-generated parts of each page are fully rendered before I start searching for content. I use this line to wait for the content I want to be loaded:

WebDriverWait(self.driver, 30).until(EC.presence_of_element_located((By.XPATH, my_xpath)))

The '30' is a 30-second timeout. If it is exceeded, a TimeoutException is raised, so you will want to put the call in a try ... except block. Change my_xpath to match the tags you want. Even if the style is marked as hidden, Selenium can still see the element, because presence_of_element_located only checks that it is in the DOM, not that it is visible.


 