How to get text of an element in Selenium WebDriver, without including child element text?

<div id="a">This is some
   <div id="b">text</div>
</div>

Getting "This is some" is non-trivial. For instance, this returns "This is some text":

driver.find_element_by_id('a').text

How does one, in a general way, get the text of a specific element without including the text of its children?

(I'm providing an answer below but will leave the question open in case someone can come up with a less hideous solution.)

Here's a general solution:

def get_text_excluding_children(driver, element):
    return driver.execute_script("""
    return jQuery(arguments[0]).contents().filter(function() {
        return this.nodeType == Node.TEXT_NODE;
    }).text();
    """, element)

The element passed to the function can be anything obtained from the find_element...() methods (i.e. it can be a WebElement object).

Or, if you don't have jQuery or don't want to use it, you can replace the body of the function above with this:

return driver.execute_script("""
var parent = arguments[0];
var child = parent.firstChild;
var ret = "";
while(child) {
    if (child.nodeType === Node.TEXT_NODE)
        ret += child.textContent;
    child = child.nextSibling;
}
return ret;
""", element) 

I'm actually using this code in a test suite.
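
For reference, here is a sketch of how the jQuery-free variant might look once assembled into the same function. The constant name OWN_TEXT_SCRIPT is an assumption made for illustration; the JavaScript is the snippet from above, unchanged in behavior.

```python
# JavaScript that concatenates only the direct text-node children
# of the element passed in as arguments[0].
OWN_TEXT_SCRIPT = """
var parent = arguments[0];
var child = parent.firstChild;
var ret = "";
while (child) {
    if (child.nodeType === Node.TEXT_NODE)
        ret += child.textContent;
    child = child.nextSibling;
}
return ret;
"""


def get_text_excluding_children(driver, element):
    """Return the text of `element` without the text of its descendants."""
    return driver.execute_script(OWN_TEXT_SCRIPT, element)
```

The returned string keeps whatever whitespace surrounds the text in the markup, so calling .strip() on the result is usually worthwhile.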

In the HTML which you have shared:

<div id="a">This is some
   <div id="b">text</div>
</div>

The text This is some is within a text node. To depict the text node in a structured way:

<div id="a">
    This is some
   <div id="b">text</div>
</div>

This use case

To extract and print the text This is some from the text node using Selenium's client, you have two ways, as follows:

  • Using splitlines(): You can identify the parent element, i.e. <div id="a">, extract its innerHTML, and then use splitlines() as follows:

    • using xpath:

       print(driver.find_element_by_xpath("//div[@id='a']").get_attribute("innerHTML").splitlines()[0])

    • using css_selector:

       print(driver.find_element_by_css_selector("div#a").get_attribute("innerHTML").splitlines()[0])

  • Using execute_script(): You can also use the execute_script() method, which synchronously executes JavaScript in the current window/frame, as follows:

    • using xpath and firstChild:

       parent_element = driver.find_element_by_xpath("//div[@id='a']")
       print(driver.execute_script('return arguments[0].firstChild.textContent;', parent_element).strip())

    • using xpath and childNodes[n] (childNodes is zero-indexed; in this HTML the text node is the first child, so n is 0):

       parent_element = driver.find_element_by_xpath("//div[@id='a']")
       print(driver.execute_script('return arguments[0].childNodes[0].textContent;', parent_element).strip())
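
To see why the splitlines() variant works, and where it is fragile, here is a small sketch on plain strings that needs no browser. The helper name first_line and the sample strings are assumptions made for illustration; the first string mirrors what innerHTML would plausibly contain for the HTML above.

```python
def first_line(inner_html):
    """Return the first line of an innerHTML string, as splitlines()[0] does."""
    return inner_html.splitlines()[0]


# Markup with a line break before the child element, as in the question.
multi_line = 'This is some\n   <div id="b">text</div>\n'
print(first_line(multi_line))

# The same content written on one line: the child markup is no longer cut off.
one_line = 'This is some <div id="b">text</div>'
print(first_line(one_line))
```

The approach depends on the line break in the page source, so it only suits markup whose layout you control or know in advance.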
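
def get_true_text(tag):
    # Direct child elements of the tag.
    children = tag.find_elements_by_xpath('*')
    original_text = tag.text
    for child in children:
        # Remove the first occurrence of each child's text from the parent's text.
        original_text = original_text.replace(child.text, '', 1)
    return original_text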

You don't have to do a replace: you can get the length of the children's text, subtract that from the overall length, and slice into the original text. That should be substantially faster.
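
A minimal sketch of the two ideas side by side, working on plain strings so it runs without a browser. The function names and the sample text are assumptions made for illustration, and the slicing version is only valid when all child text comes after the element's own text, as it does in the example HTML.

```python
def own_text_by_replace(full_text, child_texts):
    """Remove the first occurrence of each child's text."""
    for child_text in child_texts:
        full_text = full_text.replace(child_text, '', 1)
    return full_text


def own_text_by_slicing(full_text, child_texts):
    """Drop as many trailing characters as the children contribute.

    Assumes every child's text sits at the end of the parent's text.
    """
    child_length = sum(len(text) for text in child_texts)
    return full_text[:len(full_text) - child_length]


# Roughly what element.text might return for <div id="a">, plus its child's text.
full = "This is some\ntext"
children = ["text"]
print(repr(own_text_by_replace(full, children)))
print(repr(own_text_by_slicing(full, children)))
```

If child text can appear between pieces of the parent's own text, the slicing shortcut gives wrong results and the replace version is the safer of the two.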

Unfortunately, Selenium was only built to work with Elements, not Text nodes.

If you try to use a function like find_element_by_xpath to target the text nodes, Selenium will throw an InvalidSelectorException.

One workaround is to grab the relevant HTML with Selenium and then use an HTML parsing library like BeautifulSoup, which can handle text nodes more elegantly.

import bs4
from bs4 import BeautifulSoup

inner_html = driver.find_elements_by_css_selector('#a')[0].get_attribute("innerHTML")
inner_soup = BeautifulSoup(inner_html, 'html.parser')

outer_html = driver.find_elements_by_css_selector('#a')[0].get_attribute("outerHTML")
outer_soup = BeautifulSoup(outer_html, 'html.parser')

From there, there are several ways to search for the Text content. You'll have to experiment to see what works best for your use case.

Here's a simple one-liner that may be sufficient:

inner_soup.find(text=True)

If that doesn't work, then you can loop through the element's child nodes with .contents and check their object type.

BeautifulSoup has four types of elements, and the one that you'll be interested in is the NavigableString type, which is produced by Text nodes. By contrast, Elements will have a type of Tag.

contents = inner_soup.contents

for bs4_object in contents:

    if (type(bs4_object) == bs4.Tag):
        print("This object is an Element.")

    elif (type(bs4_object) == bs4.NavigableString):
        print("This object is a Text node.")
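
The same "parse the HTML outside the browser" idea can also be sketched without BeautifulSoup. The version below swaps in the standard library's html.parser so it runs without extra installs; the names DirectTextCollector and direct_text, and the short list of void tags, are assumptions made for illustration. It expects an outerHTML string and keeps only text that sits directly inside the outermost element.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag; the list is deliberately short.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class DirectTextCollector(HTMLParser):
    """Collect text that sits directly inside the outermost element."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        # Depth 1 means "inside the outermost element, outside any child".
        if self.depth == 1:
            self.chunks.append(data)


def direct_text(outer_html):
    parser = DirectTextCollector()
    parser.feed(outer_html)
    parser.close()
    return "".join(parser.chunks).strip()


outer_html = '<div id="a">This is some\n   <div id="b">text</div>\n</div>'
print(direct_text(outer_html))
```

This is less forgiving of malformed markup than BeautifulSoup, so treat it as a fallback rather than a replacement.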

Note that BeautifulSoup doesn't support XPath expressions. If you need those, then you can use some of the workarounds in this thread.
