How to use Selenium to get the HTML source of a specific element?

Question

The page I'm looking at contains:

<div id='1'> <p> text 1 <h1> text 2 </h1> text 3 <p> text 4 </p> </p> </div>

I want to get all the text in the div except the text inside the <h> elements (that is, "text 1", "text 3" and "text 4"). There may be several <h> elements or none at all, and there may be several <p> elements, possibly nested one inside another, or none.

My idea was to get the full HTML source of the div and strip the <h> elements with a regex. But selenium.get_text does not return the HTML, only the text (all of it!).

I know I can use selenium.get_html_source and then search for the element I need with a regex, but that seems wasteful since Selenium already knows how to find the element.

Does anyone have a better solution? Thanks :)

Answer 1

The following code will give you the HTML inside the div element:

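from selenium import selenium  # legacy Selenium RC client

# browser and my_url are placeholders for the browser launcher string and the base URL
sel = selenium('localhost', 4444, browser, my_url)
sel.start()  # the RC session must be started before issuing commands
html = sel.get_eval("this.browserbot.getCurrentWindow().document.getElementById('1').innerHTML")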

Then you can use BeautifulSoup to parse it and extract what you really want.

I hope it helps.
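The parsing step that Answer 1 leaves open could look roughly like the sketch below. It swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages; the class name TextOutsideHeadings and the helper text_outside_headings are made up for illustration, and the sample string is the inner HTML of the div from the question.

```python
import re
from html.parser import HTMLParser

# Matches heading tag names h1 through h6 (HTMLParser lowercases tag names).
HEADING = re.compile(r"^h[1-6]$")


class TextOutsideHeadings(HTMLParser):
    """Collects text nodes that are not inside any <h1>..<h6> element."""

    def __init__(self):
        super().__init__()
        self.heading_depth = 0  # how many heading elements are currently open
        self.chunks = []        # stripped, non-empty text pieces found so far

    def handle_starttag(self, tag, attrs):
        if HEADING.match(tag):
            self.heading_depth += 1

    def handle_endtag(self, tag):
        if HEADING.match(tag) and self.heading_depth > 0:
            self.heading_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.heading_depth == 0:
            self.chunks.append(text)


def text_outside_headings(html):
    """Return the list of text pieces in html that lie outside heading tags."""
    parser = TextOutsideHeadings()
    parser.feed(html)
    parser.close()
    return parser.chunks


if __name__ == "__main__":
    sample = "<p> text 1 <h1> text 2 </h1> text 3 <p> text 4 </p> </p>"
    print(text_outside_headings(sample))
```

Because the parser only tracks whether a heading is open, it does not care how the <p> elements are nested, which matches the "several, nested, or none" requirement in the question.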

Answer 2

Use XPath. From selenium.py:

Without an explicit locator prefix, Selenium uses the following default strategies:

dom, for locators starting with "document."

xpath, for locators starting with "//"

identifier, otherwise

In your case, you could try:

selenium.get_text("//div[@id='1']/descendant::*[not(self::h1)]")

You can learn more about XPath here.

PS: I don't know whether there is good HTML documentation for python-selenium; I haven't found any. On the other hand, the docstrings in the selenium.py file seem to constitute comprehensive documentation, so I'd suggest reading the source to get a better understanding of how it works.

Answer 3

What about using jQuery?

Edit:

First you have to add the required .js files; you can get them from www.jQuery.com.

Then all you need to do is call a simple jQuery selector:

alert($("div#1").html());

Answer 4

The selected answer does not work in Python 3 at the time of writing. Use this instead:

from selenium import webdriver

wd = webdriver.Firefox()
wd.get(url)
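# Double quotes outside so the single-quoted id inside the JavaScript does not end the Python string
html = wd.execute_script("return window.document.getElementById('1').innerHTML")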
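Quoting an element id inside JavaScript inside a Python string is easy to get wrong. One way around it, sketched here under the assumption that you have a Selenium WebDriver instance, is to pass the id as a script argument so it never has to be embedded in the JavaScript source. The helper name element_inner_html is hypothetical; execute_script forwarding extra arguments to the script as arguments[0], arguments[1], and so on is standard WebDriver behaviour.

```python
def element_inner_html(driver, element_id):
    """Return the innerHTML of the element with the given id, or None if absent.

    `driver` is assumed to be a Selenium WebDriver (or any object exposing a
    compatible execute_script(script, *args) method).
    """
    script = (
        "var el = document.getElementById(arguments[0]);"
        "return el ? el.innerHTML : null;"
    )
    # The id travels as a script argument, so no quote escaping is needed.
    return driver.execute_script(script, element_id)


# Usage sketch (needs a real browser, so it is left as a comment):
#
#   from selenium import webdriver
#   wd = webdriver.Firefox()
#   wd.get(url)                      # url is a placeholder
#   html = element_inner_html(wd, "1")
```

The resulting HTML string can then be parsed in Python to drop the heading elements.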

4 answers

Answer 1: 9, accepted, 2009-11-29 20:48:21

Answer 2: 4, 2009-11-29 18:14:55

Answer 3: 1, 2009-11-29 18:07:07

Answer 4: 0, 2016-03-06 07:46:42
