Jsoup: getting the text between tags, by tag

Question

Scenario: I used Apache Tika to extract XHTML from a DOCX file. I need to parse this XHTML to get the text between particular tags (e.g. div or p tags). For this I use Jsoup.

Problem: The XHTML originally contains this text:

some text [tab-space][tab-space] other text.

But with Jsoup I get this:

some text other text.

So the tab spaces are missing, but I need the text as is, i.e. including the tab spaces. Is it possible to do this with Jsoup, or is there another Java library that can?

Answer 1

Use the getWholeText method of TextNode, which returns the node's text without the whitespace normalization that Element.text() applies: https://jsoup.org/apidocs/org/jsoup/nodes/TextNode.html#getWholeText--

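import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.*;

final Document doc = Jsoup.parse(new File(".\\source.xhtml"), "UTF-8");

for (Element result : doc.select("div")) {
    // Assumes the first child of each div is a text node; getWholeText() keeps its whitespace.
    final String text = ((TextNode) result.childNode(0)).getWholeText();
    System.out.println(text);
}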
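
The difference between the two calls can be seen on a small in-memory string. This is only a sketch: it assumes jsoup is on the classpath, the input string is a made-up stand-in for the Tika output, and the class name WholeTextDemo and the helper rawOwnText are hypothetical names chosen for illustration. Unlike the cast on childNode(0), the helper walks textNodes(), so it also copes with a div whose first child is an element or that has no children at all.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class WholeTextDemo {

    /** Concatenates the direct text children of an element without normalizing whitespace. */
    static String rawOwnText(Element element) {
        StringBuilder sb = new StringBuilder();
        for (TextNode node : element.textNodes()) {
            sb.append(node.getWholeText());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical input standing in for the Tika output; \t is a literal tab character.
        String xhtml = "<div>some text\t\tother text.</div>";
        Document doc = Jsoup.parse(xhtml);
        Element div = doc.select("div").first();

        // Element.text() collapses runs of whitespace into a single space.
        System.out.println("[" + div.text() + "]");
        // getWholeText() on each text node keeps the tabs as they were parsed.
        System.out.println("[" + rawOwnText(div) + "]");
    }
}
```

Note that rawOwnText only covers the element's own text, not text inside nested tags. If your jsoup version provides Element.wholeText(), that may be a simpler way to get the unnormalized text of an element including its descendants.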

Answer 1: 5, accepted, 2016-05-19 16:05:29
