
How to parse new line from HTML using Jsoup

When I parse an HTML file using jsoup, text that spans multiple lines in the HTML file (separated by <br />) comes back as a single line with no newlines (\n). How can I parse the multi-line HTML document into multi-line strings?

I am using the method: Element.text()

E.g.:

The HTML contains C code that is displayed correctly on multiple lines in the HTML file, but when I extract the text, all of it comes back on a single line without newline characters.

Replace <br /> with something else and then back, like this:

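// needs: import org.jsoup.Jsoup; import org.jsoup.nodes.Document;
Document doc = Jsoup.connect("http://www.ejemplo.html").get(); // the fetched page still contains the <br>'s
// $$$ instead of <br>; the tag may be serialized as <br /> or <br>, depending on jsoup version and output settings
String temp = doc.html().replace("<br />", "$$$").replace("<br>", "$$$");
doc = Jsoup.parse(temp); // parse again

// text() keeps the placeholder, so swapping it back restores the new lines (\n)
String text = doc.body().text().replace("$$$", "\n");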

The text() method of Element (and TextNode) calls appendWhitespaceIfBr(...), which replaces every <br /> (or whitespace) with a single space. Unfortunately, I see no mechanism for turning this off without modifying the jsoup code.

But maybe you can try replacing all <br /> tags with a new subclass of Node.
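A related idea that avoids subclassing Node is to walk the parsed tree yourself and emit a newline whenever a <br> element is visited. The following is only a minimal sketch under that assumption, not something jsoup provides: the class BrAwareText and the method textWithLineBreaks are hypothetical names made up for this example, while Node.traverse(...), NodeVisitor, TextNode.text() and Element.tagName() are existing jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

class BrAwareText {

    // Hypothetical helper (not part of jsoup): collects the text below `root`
    // and appends '\n' for every <br> element that is encountered.
    static String textWithLineBreaks(Element root) {
        final StringBuilder out = new StringBuilder();
        root.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // text() returns the whitespace-normalized text of this node
                    out.append(((TextNode) node).text());
                } else if (node instanceof Element
                        && "br".equals(((Element) node).tagName())) {
                    out.append('\n');
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // nothing to do when leaving a node
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse("<p>int a = 1;<br />int b = 2;<br />return a + b;</p>");
        // Prints the three statements on three separate lines
        System.out.println(textWithLineBreaks(doc.body()));
    }
}
```

Because each text node is still whitespace-normalized, line breaks that exist only in the HTML source (not as <br>) are collapsed to spaces; only real <br> tags become \n. Block-level elements such as <p> are not handled here and would need their own rule.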
