(Python) Scraping data from a website with 'style:hidden' tags?

I'm using Selenium to try to get data from a website, but the data I want is stored in 'hidden' tags, so it's not visible when I pull the source. Is there any way to get around this? Are there different types of hidden?

I presume it's hidden because I'm also using Firebug, which can see the source of the page I'm trying to scrape but greys it out. I've read that greyed-out source indicates it is hidden with the style:hidden tag.

What is probably happening is that the website loads the additional data through JavaScript and/or XMLHttpRequest, or reveals it through CSS. Firebug shows you the DOM once all of that has completed. With a WebDriver you can drive the browser to load a page and interact with it. The difficulty is that some of the additional information only appears after a specific user interaction. One way to handle this is to have WebDriver perform the same sequence of actions, so that the DOM changes accordingly.
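A minimal sketch of replaying one interaction, assuming the data appears after a single click. The helper name `click_then_read`, the URL, and the selectors `#show-more` and `#details` are hypothetical placeholders, not taken from the question; `find_element`, `click`, and `get_property` are the only Selenium calls used.

```python
def click_then_read(driver, trigger_locator, target_locator):
    """Replay one user action, then read the element it reveals.

    Locators are (strategy, value) tuples such as ("css selector", "#details"),
    the same form produced by selenium.webdriver.common.by.By.
    """
    # Perform the interaction that makes the page load or reveal the data.
    driver.find_element(*trigger_locator).click()
    # Look the target up again in the updated DOM.
    target = driver.find_element(*target_locator)
    # The textContent property includes text that CSS keeps off screen.
    return target.get_property("textContent")


# Usage with a real browser (needs Selenium and a local Firefox):
#
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   driver.get("https://example.com/page")            # hypothetical URL
#   print(click_then_read(driver,
#                         ("css selector", "#show-more"),   # hypothetical
#                         ("css selector", "#details")))    # hypothetical
#   driver.quit()
```

If the page fetches the data asynchronously after the click, the second lookup can run too early; an explicit wait between the two steps would cover that case.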

You might also want to change the element's CSS properties to make it visible.
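One way this could look, assuming the element is hidden by an inline or stylesheet `display: none` or `visibility: hidden` rule. The names `UNHIDE_SCRIPT` and `make_visible` are made up for illustration; `execute_script` is the Selenium call that runs JavaScript in the page, and the element passed to it is available there as `arguments[0]`.

```python
# JavaScript that overrides the two most common hiding rules inline.
UNHIDE_SCRIPT = (
    "arguments[0].style.display = 'block';"
    "arguments[0].style.visibility = 'visible';"
)


def make_visible(driver, element):
    """Set inline styles on element so the browser renders it."""
    # The element is passed as an argument and appears as arguments[0] in the script.
    driver.execute_script(UNHIDE_SCRIPT, element)
    return element


# Usage with a real browser (needs Selenium; the selector is hypothetical):
#
#   element = driver.find_element("css selector", "#details")
#   print(make_visible(driver, element).text)
```

This is only a sketch: it will not help if a hidden ancestor is what keeps the element off screen, or if the page's stylesheet marks its rule `!important`.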

Since you didn't provide any code showing what you are trying to do, it is hard to give more precise help. You will find plenty of WebDriver code examples in Python in the official documentation.

One of my specific reasons for scraping with Selenium is to make sure the JavaScript-created parts of each page are fully rendered before I start searching for content. I use this line to wait for the content I want to be loaded:

WebDriverWait(self.driver, 30).until(EC.presence_of_element_located((By.XPATH, my_xpath)))

The '30' is a 30-second timeout; if it is exceeded, a TimeoutException is raised, so you will want to put the call in a try ... except block. Change my_xpath to match the tags you want. Even if the style is marked as hidden, Selenium can still find the element, because presence_of_element_located only checks that it is in the DOM, not that it is visible.
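The wait and its try ... except block might be wrapped up roughly like this. The function name `wait_for_element` and the choice to return `None` on timeout are assumptions made for the sketch; the imports are the ones the `WebDriverWait` line relies on.

```python
def wait_for_element(driver, xpath, timeout=30):
    """Return the first element matching xpath, or None if it never appears."""
    # Local imports keep the helper self-contained; move them to module level in real code.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    try:
        # Polls the DOM until the element exists, whether or not it is visible.
        return WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.XPATH, xpath))
        )
    except TimeoutException:
        # Nothing matched within `timeout` seconds.
        return None


# Usage with a real browser (the XPath is hypothetical):
#
#   element = wait_for_element(driver, "//div[@id='details']")
#   if element is not None:
#       print(element.get_property("textContent"))
```

Note that `element.text` returns only rendered text, so for an element that is still hidden it comes back empty; reading the `textContent` property, as in the usage comment, avoids that.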


 