
Removing HTML entities while preserving line breaks with JSoup

I have been using JSoup to parse lyrics, and it has been great until now, but I have run into a problem.

I can use Node.html() to return the full HTML of the desired node, which retains line breaks like this:

Glóandi augu, silfurnátt
<br />Bl&oacute;&eth; alv&ouml;ru, starir &aacute;
<br />&Oacute;&eth;ur hundur er &iacute; v&iacute;gam&oacute;&eth;, &iacute; maga... m&eacute;r
<br />
<br />Kolni&eth;ur gref, kvik sem dreg h&eacute;r
<br />Kolni&eth;ur svart, hvergi bjart n&eacute;

But, as you can see, this has the unfortunate side effect of retaining HTML entities and tags.

However, if I use Node.text(), I get a better-looking result, free of tags and entities:

Glóandi augu, silfurnátt Blóð alvöru, starir á Óður hundur er í vígamóð, í maga... mér Kolniður gref, kvik sem dreg hér Kolniður svart,

This has another unfortunate side effect: the line breaks are removed and everything is compressed into a single line.

Simply replacing <br /> in the node before calling Node.text() yields the same result. It seems that the method itself compresses the text onto a single line, ignoring newlines.

Is it possible to have the best of both worlds, with tags and entities replaced correctly while preserving the line breaks? Or is there another way of decoding entities and removing tags without having to replace them manually?

(Disclaimer) I haven't used this API, but a quick look at the docs suggests that you could visit each descendant node and dump out its text contents. Line breaks could be inserted when special tags like <br> are encountered.

The TextNode.getWholeText() call also looks useful.
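The traversal idea above could be sketched roughly as follows. This is only an illustration, assuming jsoup is on the classpath; the class name `BreakPreservingText` and the method `extract` are made up for this example. It walks the parsed body with a `NodeVisitor`, appends the already-decoded text of each `TextNode`, and emits a newline for every `<br>`.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

public class BreakPreservingText {

    /** Hypothetical helper: decoded text with one '\n' per <br> element. */
    public static String extract(String html) {
        Document doc = Jsoup.parse(html);
        final StringBuilder out = new StringBuilder();

        doc.body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // Entities were decoded during parsing; text() also collapses
                    // whitespace runs (including source newlines) to single spaces.
                    out.append(((TextNode) node).text());
                } else if ("br".equals(node.nodeName())) {
                    out.append('\n');
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });

        // Drop the spaces left around each inserted line break.
        return out.toString().replaceAll(" *\n *", "\n").trim();
    }

    public static void main(String[] args) {
        System.out.println(extract("Bl&oacute;&eth; alv&ouml;ru<br />starir &aacute;"));
    }
}
```

Swapping `text()` for `getWholeText()` would keep the source whitespace of each text node untouched instead of normalising it, which may or may not be what you want for lyrics.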

Based on another answer from Stack Overflow, I added a few fixes and came up with:

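    // Swap <br> tags and raw newlines for a placeholder, since text() collapses whitespace.
    String text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2nl").replaceAll("\n", "br2nl")).text();
    // Turn each placeholder (and a single space after it, if any) back into a newline.
    text = text.replaceAll("br2nl ", "\n").replaceAll("br2nl", "\n").trim();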

Hope this helps.
