Extracting nested text from HTML using XPath
I'm trying to extract textual content from an HTML page that looks something like this:
<div class="content">
<div class="section">
Lorem <a href="..." class="link">ipsum</a>
dolor <a href="..." class="link">sit</a> amet,
consectetur <a href="..." class="link">adipiscing</a> elit
</div>
<div class="section">
sed do <a href="..." class="link">eiusmod</a> tempor
incididunt <a href="..." class="link">ut</a> labore
et <a href="..." class="link">dolore</a>
</div>
</div>
I just want to extract the text portion:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore
My XPath (2.0) expression is //*[contains(@class, 'section')]. When I evaluate it using javax.xml.xpath.XPathExpression, I only retrieve the text that's outside the links:
Lorem dolor amet, consectetur elit, sed do tempor incididunt labore et
I haven't used XPath before. Is there a better expression to extract the full text? Thanks.
Your expression returns a complete XML element. Your processor then returns it as a string by converting the element to text, which is basically the same as if you had executed
//*[contains(@class, 'section')]/text()
In contrast, you can get all of the text, including the text inside child elements, by using the string() function:
//*[contains(@class, 'section')]/string()
Another way, as pointed out by Mathias Müller in the comments, would be to use
//*[contains(@class, 'section')]//text()
which returns all text nodes that are descendants of the matched elements.
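For the javax.xml.xpath API mentioned in the question, one possible approach is to select the section elements as a node set and read each element's full text through the DOM. The following is only a minimal sketch, under these assumptions: the input is well-formed XML (as the sample markup is), the class name SectionText and the method extract are hypothetical names made up for illustration, and the XPath implementation is the one bundled with the JDK, which supports XPath 1.0 (so a trailing /string() step would need an XPath 2.0 processor instead).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
final class SectionText {

    // Returns the full text (including text inside nested links) of every
    // element whose class attribute contains "section".
    // Assumption: the input string is well-formed XML.
    static List<String> extract(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        // Ask for a node set so we get the elements themselves.
        NodeList sections = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate("//*[contains(@class, 'section')]", doc, XPathConstants.NODESET);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < sections.getLength(); i++) {
            // getTextContent() concatenates the text of all descendant nodes.
            String text = sections.item(i).getTextContent();
            // Collapse line breaks and runs of spaces into single spaces.
            result.add(text.replaceAll("\\s+", " ").trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<div class=\"content\">"
                + "<div class=\"section\">Lorem <a href=\"...\" class=\"link\">ipsum</a> dolor</div>"
                + "<div class=\"section\">sed do <a href=\"...\" class=\"link\">eiusmod</a> tempor</div>"
                + "</div>";
        for (String s : extract(xhtml)) {
            System.out.println(s);
        }
    }
}
```

Run against the two shortened sections in main, this prints "Lorem ipsum dolor" and "sed do eiusmod tempor" on separate lines. Real-world HTML is often not well-formed XML, in which case it would have to be cleaned up or parsed with an HTML-aware parser first.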