Avoid removing spaces and newlines when parsing HTML with jsoup

Question

I have the sample code below.

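import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// A Java string literal cannot span lines, so the sample is built with explicit \n.
String sample = "<html>\n"
        + "<head>\n"
        + "</head>\n"
        + "<body>\n"
        + "This is a sample on              parsing html body using jsoup\n"
        + "This is a sample on              parsing html body using jsoup\n"
        + "</body>\n"
        + "</html>";

Document doc = Jsoup.parse(sample);
String output = doc.body().text();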

I get this output:

This is a sample on parsing html body using jsoup This is a sample on parsing html body using jsoup

But I want this output:

This is a sample on              parsing html body using jsoup
This is a sample on              parsing html body using jsoup

How do I parse it so that I get this output? Or is there another way to do this in Java?

Answer 1

You can disable pretty printing for the document to get the output you want. You also have to change .text() to .html().

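Document doc = Jsoup.parse(sample);
// Turn off pretty printing so the output is not re-indented or re-wrapped
doc.outputSettings(new Document.OutputSettings().prettyPrint(false));
// Use html() instead of text() to get the body content with its original whitespace
String output = doc.body().html();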

Answer 2

The HTML specification requires that runs of whitespace characters be collapsed into a single space. Therefore, when parsing the sample, the parser correctly eliminates the superfluous whitespace characters.
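To make the collapsing concrete, here is a minimal sketch in plain Java that approximates the effect on the body text from the question. It is only an illustration of the rule, not jsoup's actual implementation; the class name WhitespaceCollapseDemo and the method collapse are made up for this example.

```java
final class WhitespaceCollapseDemo {

    // Hypothetical helper: replaces every run of spaces, tabs, and line
    // breaks with a single space, then trims the ends. This only
    // approximates HTML-style whitespace collapsing.
    static String collapse(String text) {
        return text.replaceAll("[ \\t\\n\\f\\r]+", " ").trim();
    }

    public static void main(String[] args) {
        String body = "\nThis is a sample on              parsing html body using jsoup\n"
                + "This is a sample on              parsing html body using jsoup\n";
        // Both lines end up on one line with single spaces between words.
        System.out.println(collapse(body));
    }
}
```

Running it on the sample body gives the same single-line text the question reports from doc.body().text().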

I don't think you can change how the parser works. You could add a preprocessing step that replaces runs of multiple spaces with non-breaking spaces (&nbsp;), which do not collapse. The side effect, of course, is that those spaces become non-breaking, which doesn't matter if you only want the rendered text, as in doc.body().text().
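The preprocessing step could look roughly like the sketch below. It uses only the Java standard library and would run on the HTML string before it is handed to the parser. The names NbspPreprocessor and protectSpaceRuns are hypothetical, and the approach is deliberately naive: it rewrites every run of two or more spaces anywhere in the string, including inside tags, and it does nothing about line breaks. Whether the non-breaking spaces survive text extraction may depend on the jsoup version, so treat that as an assumption to check against your own setup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NbspPreprocessor {

    // Matches runs of two or more ordinary spaces; single spaces are left alone.
    private static final Pattern SPACE_RUN = Pattern.compile(" {2,}");

    // Hypothetical helper: replaces each space in a run of two or more
    // spaces with the &nbsp; entity, keeping the run length unchanged.
    static String protectSpaceRuns(String html) {
        Matcher m = SPACE_RUN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            StringBuilder replacement = new StringBuilder();
            for (int i = 0; i < m.group().length(); i++) {
                replacement.append("&nbsp;");
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: on&nbsp;&nbsp;&nbsp;parsing html
        System.out.println(protectSpaceRuns("on   parsing html"));
    }
}
```

You would call protectSpaceRuns(sample) first and pass the result to Jsoup.parse instead of the raw string.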

Answer 1
3, accepted, 2016-11-03 08:23:09

Answer 2
0, 2016-11-03 08:53:05
