Deriving text from Javascript webpage using Selenium

I am trying to extract the text "This station managed by the Delta Flow Projects Office" from this website: https://waterdata.usgs.gov/ca/nwis/uv?site_no=381504121404001 . This line is located under the div with class stationContainer. Since this is a dynamic webpage, I'm using Selenium to get the HTML.

This is the HTML from the website:

[Image: screenshot of the page's HTML]

This is my code:

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
url = "https://waterdata.usgs.gov/ca/nwis/uv?site_no=381504121404001"
browser.get(url) #navigate to the page
innerHTML = browser.execute_script("return document.body.innerHTML")
elem = browser.find_elements_by_xpath("//div[@class='stationContainer']")

print (elem)

I get this result from my print statement:

selenium.webdriver.remote.webelement.WebElement (session="96fc124c0e2d1fd4cd86f61db272d52a", element="0.5862443940581294-1")

I'm hoping to get the text by searching through the div class, but it seems I'm not going about this the right way.

elem is a list, not a string. Try this:

# find_elements_by_xpath returns a list, so take the first match
elem = browser.find_elements_by_xpath("//div[@class='stationContainer']")[0]
print(elem.text)

That prints out all the content, so you probably need a better selector or a way to parse the rest of it out.
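One caveat with indexing `[0]`: when nothing matches, the `find_elements_*` methods return an empty list, so the index raises an IndexError. A minimal sketch of a guard follows. The helper name `first_text` and the `FakeElement` stand-in are hypothetical, used only so the example runs without a browser; with Selenium you would pass in the list returned by the driver.

```python
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement; only .text is modelled."""

    def __init__(self, text):
        self.text = text


def first_text(elements):
    """Return the .text of the first element, or None if the list is empty."""
    return elements[0].text if elements else None


print(first_text([FakeElement("station text")]))  # list with one match
print(first_text([]))                             # list with no matches
```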

print (elem.text)

elem is a WebElement object, hence the printed message. If you want to access the text, you need to add .text to the end, or if you want to grab some other attribute you can do something like elem.get_attribute('innerHTML').

Also, since the div element has a lot of other text, you're going to get a lot more text than you want. I haven't looked into other similar pages, but perhaps you could extract what's between </form> and <br><br> in the div's HTML.
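The idea of cutting out what sits between </form> and <br><br> can be sketched with plain string handling. This is only an illustration under assumptions: `text_between` is a hypothetical helper, and `sample_inner_html` is a made-up, simplified stand-in for the div's real innerHTML, which I have not reproduced here.

```python
def text_between(html, start_marker="</form>", end_marker="<br><br>"):
    """Return the stripped text between the first start_marker and the
    next end_marker, or None if either marker is missing."""
    _, found_start, rest = html.partition(start_marker)
    if not found_start:
        return None
    wanted, found_end, _ = rest.partition(end_marker)
    if not found_end:
        return None
    return wanted.strip()


# Hypothetical, simplified stand-in for elem.get_attribute('innerHTML').
sample_inner_html = (
    "<form><input type='submit'></form>\n"
    "This station managed by the Delta Flow Projects Office\n"
    "<br><br><table></table>"
)

print(text_between(sample_inner_html))
```

With Selenium, the string passed in would be elem.get_attribute('innerHTML'). The markers are parameters because different tools serialize the line breaks differently (for example, BeautifulSoup writes them as <br/>).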

Well, the content you want to scrape is not actually dynamic. You can use bs4 to fetch the content of the div with class stationContainer. What makes this a bit challenging is that the text you're searching for is not enclosed in tags of its own. So a solution is some simple string manipulation to extract the content between the </form> tag and the <br/><br/> tags, like so:

from bs4 import BeautifulSoup
from requests import get

soup = BeautifulSoup(get('https://your_url_here').text, "html.parser")

# For each stationContainer div, keep only the text between </form> and <br/><br/>
for i in soup.find_all('div', attrs={'class': "stationContainer"}):
    print(str(i).split('</form>')[1].split('<br/><br/>')[0].strip())

This code produces the appropriate result!
