Deriving text from a JavaScript webpage using Selenium
I am trying to extract the text "This station managed by the Delta Flow Projects Office" from this website: https://waterdata.usgs.gov/ca/nwis/uv?site_no=381504121404001 . This line is located under the div with class stationContainer. Since this is a dynamic webpage, I'm using Selenium to retrieve the HTML.
This is the HTML from the website.
This is my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
browser = webdriver.Chrome()
url = "https://waterdata.usgs.gov/ca/nwis/uv?site_no=381504121404001"
browser.get(url) #navigate to the page
innerHTML = browser.execute_script("return document.body.innerHTML")
elem = browser.find_elements_by_xpath("//div[@class='stationContainer']")
print (elem)
I get this result from my print statement:
selenium.webdriver.remote.webelement.WebElement (session="96fc124c0e2d1fd4cd86f61db272d52a", element="0.5862443940581294-1")
I'm hoping to get the text by searching for the div class, but it seems I'm not going about this the right way.
elem is a list, not a string. Try this:
elem = browser.find_elements_by_xpath("//div[@class='stationContainer']")[0]
print(elem.text)
That prints out all of the div's content, so you probably need a more specific selector or a way to parse out the part you want.
print (elem.text)
elem is a WebElement object, hence the printed message. If you want to access the text, you need to add .text to the end, or if you want to grab some other attribute you can do something like elem.get_attribute('innerHTML').
Also, since the div element has a lot of other text, you're going to get a lot more text than you want. I haven't looked into other similar pages, but perhaps you could extract what's between </form> and <br><br> in the div's HTML.
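As a rough illustration of that idea, here is a minimal sketch that pulls out whatever sits between </form> and the following double line break. The helper name extract_station_note and the SAMPLE_HTML string are made up for this example; the sample only mimics the structure described above and is not the real page markup. It also assumes the target text contains no nested tags.

```python
import re

# Hypothetical sample that mimics the structure described above: the target
# sentence sits between </form> and a double line break inside the div.
SAMPLE_HTML = """
<div class="stationContainer">
  <form action="/ca/nwis/uv" method="get"><input type="submit" value="GO"></form>
  This station managed by the Delta Flow Projects Office
  <br><br>
  <table><tr><td>Other station details</td></tr></table>
</div>
"""

# Matches the text after </form> up to two consecutive <br> tags,
# accepting <br>, <br/> and <br /> in any letter case.
_NOTE_RE = re.compile(
    r"</form>(.*?)(?:<br\s*/?>\s*){2}",
    re.IGNORECASE | re.DOTALL,
)


def extract_station_note(html):
    """Return the text between </form> and the next double <br>, or None."""
    match = _NOTE_RE.search(html)
    if match is None:
        return None
    # Collapse runs of whitespace so the result is a single clean line.
    return " ".join(match.group(1).split())


print(extract_station_note(SAMPLE_HTML))
```

With Selenium, the string passed in would be something like elem.get_attribute('innerHTML') for the stationContainer element. The pattern tolerates both <br> and <br/> because the browser and an HTML parser may serialize the tag differently.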
Well, the content you want to scrape is not actually dynamic. You can use bs4 to fetch the content of the div with class stationContainer. What makes this a bit challenging is that the text you're looking for is not enclosed in a tag of its own. So one solution is a simple string manipulation that extracts the content between the </form> and the <br/><br/> tags, like so:
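from bs4 import BeautifulSoup
from requests import get

# Fetch the static page and parse it (replace the placeholder URL).
soup = BeautifulSoup(get('https://your_url_here').text, "html.parser")
for i in soup.find_all('div', attrs={'class': "stationContainer"}):
    # Keep only the text between </form> and the first <br/><br/>.
    print(str(i).split('</form>')[1].split('<br/><br/>')[0].strip())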
This code produces the appropriate result!