Extracting nested text from HTML using XPath

Question

I'm trying to extract textual content from an HTML page that looks something like this:

<div class="content">
    <div class="section">
      Lorem <a href="..." class="link">ipsum</a> 
      dolor <a href="..." class="link">sit</a> amet, 
      consectetur <a href="..." class="link">adipiscing</a> elit
    </div>

    <div class="section">
      sed do <a href="..." class="link">eiusmod</a> tempor 
      incididunt <a href="..." class="link">ut</a> labore 
      et <a href="..." class="link">dolore</a>
    </div>
</div>

I just want to extract the text portion:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore

My XPath (2.0) expression is //*[contains(@class, 'section')]. When I evaluate it using javax.xml.xpath.XPathExpression, I only retrieve the text that's outside the links:

Lorem dolor amet, consectetur elit, sed do tempor incididunt labore et

I haven't used XPath before. Is there a better expression to extract the full text? Thanks.

Answer 1

Your expression returns complete XML elements. Your processor then converts each element to text in order to return it as a string, which is basically the same as if you had executed

//*[contains(@class, 'section')]/text()

In contrast, you can also get the text nodes inside the child elements by using the string() function:

//*[contains(@class, 'section')]/string()
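
For reference, here is a minimal Java sketch of the same idea. As far as I know, the XPath engine bundled with the JDK (javax.xml.xpath) implements XPath 1.0, where string() cannot be used as a path step, so this sketch selects the elements first and then takes the string value of each one with normalize-space(.). The class name SectionStrings and the method sectionStrings are made up for illustration, and the code assumes the page is well-formed XML (real-world HTML may need a more tolerant parser).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
final class SectionStrings {

    // Returns the whitespace-normalized string value of every element
    // whose class attribute contains "section".
    static List<String> sectionStrings(String xml) throws Exception {
        // Assumption: the input is well-formed XML.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Step 1: select the section elements as a node set.
        NodeList sections = (NodeList) xpath.evaluate(
                "//*[contains(@class, 'section')]", doc, XPathConstants.NODESET);

        // Step 2: take the string value of each element. The string value
        // concatenates all descendant text nodes, including the link text.
        List<String> result = new ArrayList<>();
        for (int i = 0; i < sections.getLength(); i++) {
            result.add(xpath.evaluate("normalize-space(.)", sections.item(i)));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<div class=\"content\">"
                + "<div class=\"section\">Lorem <a href=\"...\" class=\"link\">ipsum</a> dolor</div>"
                + "<div class=\"section\">sed do <a href=\"...\" class=\"link\">eiusmod</a> tempor</div>"
                + "</div>";
        // One line of text per section.
        sectionStrings(xml).forEach(System.out::println);
    }
}
```

With an XPath 2.0 processor, the /string() expression above should do this in a single step.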

Another way, as pointed out by Mathias Müller in the comments, would be to use

//*[contains(@class, 'section')]//text()

which returns all text nodes that are descendants of the matched elements.
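
A rough Java sketch of the //text() variant, which also works with XPath 1.0: evaluate the expression as a node set and join the text nodes yourself. The names DescendantText and allText are hypothetical, the input is assumed to be well-formed XML, and joining the pieces with a single space is my own choice for readable output.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
final class DescendantText {

    // Collects every text node below the "section" elements and joins
    // them into one whitespace-normalized string.
    static String allText(String xml) throws Exception {
        // Assumption: the input is well-formed XML.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // NODESET is required here; asking for a plain string would only
        // give the value of the first matching text node.
        NodeList textNodes = (NodeList) xpath.evaluate(
                "//*[contains(@class, 'section')]//text()", doc, XPathConstants.NODESET);

        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < textNodes.getLength(); i++) {
            // A space between nodes keeps adjacent sections apart; note that
            // it would also split a word if a link sat in the middle of one.
            sb.append(textNodes.item(i).getNodeValue()).append(' ');
        }
        // Collapse line breaks and indentation into single spaces.
        return sb.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<div class=\"content\">"
                + "<div class=\"section\">Lorem <a href=\"...\" class=\"link\">ipsum</a> dolor</div>"
                + "<div class=\"section\">sed do <a href=\"...\" class=\"link\">eiusmod</a> tempor</div>"
                + "</div>";
        System.out.println(allText(xml));
    }
}
```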

Answer 1: 3, accepted, 2015-01-16 09:13:47
