Avoid removal of spaces and newlines while parsing HTML using jsoup

I have the following sample code.

String sample = "<html>\n"
        + "<head>\n"
        + "</head>\n"
        + "<body>\n"
        + "This is a sample on              parsing html body using jsoup\n"
        + "This is a sample on              parsing html body using jsoup\n"
        + "</body>\n"
        + "</html>";

Document doc = Jsoup.parse(sample);
String output = doc.body().text();

I get this output:

This is a sample on parsing html body using jsoup This is a sample on parsing html body using jsoup

But I want this output:

This is a sample on              parsing html body using jsoup
This is a sample on              parsing html body using jsoup

How do I parse it so that I get this output? Or is there another way to do this in Java?

You can disable pretty printing on the document to get the output you want. You also have to change .text() to .html().

Document doc = Jsoup.parse(sample);
doc.outputSettings(new Document.OutputSettings().prettyPrint(false));
String output = doc.body().html();

HTML rendering rules collapse runs of whitespace characters into a single space. Therefore, when the text of the sample is extracted, the superfluous whitespace characters are correctly eliminated.

I don't think you can change how the parser works. You could add a preprocessing step where you replace multiple whitespace characters with non-breaking spaces (&nbsp;), which will not collapse. The side effect, of course, is that those spaces would be non-breaking (which doesn't matter if you really just want to use the rendered text, as in doc.body().text()).
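The preprocessing step described above could look something like the following. This is only a minimal sketch using the standard library: the class name NbspPreprocessor and the method protectSpaceRuns are hypothetical names chosen for illustration, not part of jsoup. It assumes that only runs of two or more ordinary spaces need protecting. It does not touch newlines, and it would also rewrite space runs inside tags or attribute values, so it is only suitable for simple input like the sample. Whether text() then keeps the non-breaking spaces may depend on the jsoup version in use, which is not verified here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of jsoup.
final class NbspPreprocessor {

    // Matches runs of two or more ordinary space characters.
    private static final Pattern SPACE_RUN = Pattern.compile(" {2,}");

    // Replaces each run of 2+ spaces with the same number of &nbsp; entities.
    // Single spaces, tabs, and newlines are left unchanged.
    static String protectSpaceRuns(String html) {
        Matcher m = SPACE_RUN.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one.
            out.append(html, last, m.start());
            // Emit one entity per space in the matched run.
            for (int i = m.start(); i < m.end(); i++) {
                out.append("&nbsp;");
            }
            last = m.end();
        }
        // Copy whatever follows the final match.
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "<body>a   b c</body>";
        // prints <body>a&nbsp;&nbsp;&nbsp;b c</body>
        System.out.println(protectSpaceRuns(sample));
    }
}
```

The result of protectSpaceRuns(sample) would then be passed to Jsoup.parse in place of the raw string.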
