Avoid removal of spaces and newlines while parsing HTML using jsoup

Question

I have the following sample code.

String sample = "<html>\n"
        + "<head>\n"
        + "</head>\n"
        + "<body>\n"
        + "This is a sample on              parsing html body using jsoup\n"
        + "This is a sample on              parsing html body using jsoup\n"
        + "</body>\n"
        + "</html>";

Document doc = Jsoup.parse(sample);
String output = doc.body().text();

The output I get is

This is a sample on parsing html body using jsoup This is a sample on parsing html body using jsoup

But I want the output to be

This is a sample on              parsing html body using jsoup
This is a sample on              parsing html body using jsoup

How can I parse it so that I get this output? Or is there another way to do this in Java?

Answer 1

You can disable pretty-printing on the document to get the output you want. However, you also have to change .text() to .html().

Document doc = Jsoup.parse(sample);
doc.outputSettings(new Document.OutputSettings().prettyPrint(false));
String output = doc.body().html();

Answer 2

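The HTML specification requires runs of whitespace characters to be collapsed into a single space. So when parsing your sample, the parser is correct to eliminate the extra whitespace.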
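To make the collapsing rule concrete, here is a minimal sketch in plain Java that approximates it with a regular expression. It is not jsoup's actual implementation; the class and method names (`CollapseDemo`, `collapse`) are hypothetical and exist only for illustration.

```java
public class CollapseDemo {
    // Approximation of HTML whitespace collapsing (an assumption, not jsoup's code):
    // strip leading/trailing whitespace, then turn every run of whitespace
    // (spaces, tabs, newlines) into a single space.
    static String collapse(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // The body text from the question, including its newlines.
        String body = "\nThis is a sample on              parsing html body using jsoup\n"
                + "This is a sample on              parsing html body using jsoup\n";
        System.out.println(collapse(body));
    }
}
```

Applied to the body text from the question, this produces the same single-line result the asker reported from doc.body().text().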

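I don't think you can change how the parser works. You could add a preprocessing step that replaces runs of multiple spaces with non-breaking spaces (&nbsp;), which are not collapsed. The side effect, of course, is that those spaces will be non-breaking (which doesn't matter if you really only want the rendered text, as with doc.body().text()).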
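The preprocessing step described above might look roughly like the following sketch. It uses only the Java standard library; the names `NbspPreprocessor` and `protectSpaces` are hypothetical. It works on the raw string, so it would also rewrite runs of spaces inside tags, and it does nothing for newlines. Whether a given jsoup version's text() keeps non-breaking spaces intact is an assumption worth verifying before relying on this.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NbspPreprocessor {
    // Matches runs of two or more ordinary spaces; single spaces are left alone.
    private static final Pattern SPACE_RUN = Pattern.compile(" {2,}");

    // Replaces every space in such a run with the &nbsp; entity.
    static String protectSpaces(String html) {
        Matcher m = SPACE_RUN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            StringBuilder replacement = new StringBuilder();
            for (int i = 0; i < m.group().length(); i++) {
                replacement.append("&nbsp;");
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(protectSpaces("sample on     parsing html"));
    }
}
```

The result of protectSpaces(sample) would then be passed to Jsoup.parse in place of the original string.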

Answer 1: score 3, accepted, 2016-11-03 08:23:09
Answer 2: score 0, 2016-11-03 08:53:05
