How to remove HTML entities in Jsoup?

Question

How can I remove HTML entities using Jsoup? If I use Element.toString(), I get:

(...)
       <td>Letter &oacute;</td> //valid: <td>Letter ó</td>
(...)

Answer 1

I believe you can specify an encoding when you create a Jsoup Document, something like this:

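// StringUtils and CharEncoding come from Apache Commons Lang; EscapeMode is org.jsoup.nodes.Entities.EscapeMode
Document newDocument = Jsoup.parse(htmlString, StringUtils.EMPTY, Parser.htmlParser());
newDocument.outputSettings().escapeMode(EscapeMode.base);
newDocument.outputSettings().charset(CharEncoding.UTF_8);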
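
A minimal, self-contained sketch of the same idea, assuming only Jsoup is on the classpath. The class name EntityOutputDemo and the helper tdHtml are made up for illustration. It uses plain "" and "UTF-8" in place of the Commons Lang constants, and it swaps in EscapeMode.xhtml for the answer's EscapeMode.base, because xhtml limits named-entity escaping to the XML basics; whether base alone is enough may depend on the Jsoup version. Pretty-printing is turned off only to keep the output on one line.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Entities;
import org.jsoup.parser.Parser;

public class EntityOutputDemo {
    // Hypothetical helper: serializes the first <td> found in the given HTML,
    // or returns an empty string when there is none.
    static String tdHtml(String html) {
        Document doc = Jsoup.parse(html, "", Parser.htmlParser());
        doc.outputSettings()
           .charset("UTF-8")                        // output charset can represent the accented letter
           .escapeMode(Entities.EscapeMode.xhtml)   // assumption: xhtml instead of the answer's base
           .prettyPrint(false);                     // no added indentation or line breaks
        Element td = doc.select("td").first();
        return td == null ? "" : td.outerHtml();
    }

    public static void main(String[] args) {
        // The cell is wrapped in a table so the HTML parser keeps the <td>.
        String html = "<table><tr><td>Letter &oacute;</td></tr></table>";
        System.out.println(tdHtml(html));
    }
}
```

If only the text is needed rather than the markup, Element.text() returns the decoded text content without tags.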

Answer 2

This may be off-topic for your question, but if you just want to decode HTML entities without any other changes to the string (no tag processing, no comment stripping, etc.), you can use org.jsoup.parser.Parser.unescapeEntities, e.g.:

assert Parser.unescapeEntities("x &asymp; <i>y</i>\n", true)
    .equals("x ≈ <i>y</i>\n");
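
To make the contrast with full parsing concrete, here is a small sketch, assuming Jsoup is on the classpath. The class UnescapeDemo and its two helpers are hypothetical names used only for this illustration. One helper decodes entities and leaves the tags in the string; the other parses the string as HTML and returns only its text.

```java
import org.jsoup.Jsoup;
import org.jsoup.parser.Parser;

public class UnescapeDemo {
    // Decodes entities only; markup such as <i> stays in the string.
    // The second argument says whether the input is an attribute value.
    static String decodeOnly(String s) {
        return Parser.unescapeEntities(s, false);
    }

    // Parses the input as HTML and returns the text content, so tags are dropped.
    static String textOnly(String s) {
        return Jsoup.parse(s).text();
    }

    public static void main(String[] args) {
        String input = "x &asymp; <i>y</i>";
        System.out.println(decodeOnly(input));
        System.out.println(textOnly(input));
    }
}
```

Which helper fits depends on whether the surrounding markup has to survive: decodeOnly keeps it, textOnly discards it.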

2 answers

Answer 1
3 Accepted 2013-11-13 21:04:31

Answer 2
3 2017-09-11 23:13:41
