Jsoup parsing of HTML tag attributes

Question

For example:

<html>
   <head></head>
   <body sometag='"'></body>
</html>

When I use Jsoup to parse this HTML like so:

Document doc = Jsoup.parse(html);
doc.outputSettings().prettyPrint(false);
System.out.println(doc.toString());

the output becomes:

<html>
   <head></head>
   <body sometag="&quot;"></body>
</html>

Note the ' and ": the single quotes around the attribute value have become double quotes, and the " inside it has become &quot;. I don't want Jsoup to rewrite the ' and " characters; I just need to get some text out. Is there any way to stop Jsoup from doing this? Thanks a lot.

Answer 1

Just don't use an HTML parser. Use an XML parser instead:

Document doc = Jsoup.parse(html, "", Parser.xmlParser());
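If the goal is only to get the text of the attribute, it may help to see that the parsed value is the plain " character, and that &quot; only appears when the document is written back out. The sketch below illustrates this with the JDK's built-in DOM parser rather than jsoup, purely so it runs without extra dependencies; the class and method names are hypothetical, and it assumes the input is well-formed XML. In jsoup the analogous read would be doc.body().attr("sometag").

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class AttributeValueSketch {

    // Parses the markup as XML and returns the raw value of the named
    // attribute on the first <body> element (hypothetical helper).
    static String readBodyAttribute(String markup, String attributeName) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            Element body = (Element) doc.getElementsByTagName("body").item(0);
            return body.getAttribute(attributeName);
        } catch (Exception e) {
            // Assumption: any failure here means the markup was not well-formed XML.
            throw new IllegalArgumentException("Could not parse markup as XML", e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><head></head><body sometag='\"'></body></html>";
        // The value is the single character ", not the entity &quot;.
        System.out.println(readBodyAttribute(html, "sometag"));
    }
}
```

Reading the value this way sidesteps the quoting question, since the choice between ' and " as delimiters is a property of the serialized markup and not of the attribute value.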

Answer 2

I played around a bit with different kinds of string escaping, and the easiest way to achieve this is the following, though it may not be exactly what you are after:

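import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;
import org.apache.commons.text.StringEscapeUtils; // assumed: Apache Commons Text (also exists in Commons Lang)

String html = "<html> <head> </head> <body sometag='\"'> </body> </html>";

Document doc = Jsoup.parse(html);
doc.outputSettings().escapeMode(Entities.EscapeMode.xhtml);
System.out.println(StringEscapeUtils.unescapeXml(doc.toString()));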
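To make the unescaping step concrete, here is a minimal sketch of what it does to the serialized markup, written with plain string replacement instead of Apache Commons. The helper unescapeXmlBasic is hypothetical and handles only the five predefined XML entities, which is an assumption made for brevity; the real unescapeXml also covers numeric character references.

```java
public class XmlUnescapeSketch {

    // Replaces the five predefined XML entities with their characters.
    // "&amp;" is handled last so that "&amp;quot;" ends up as "&quot;"
    // and is not unescaped twice.
    static String unescapeXmlBasic(String text) {
        return text
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // The serialized form from the question.
        String serialized = "<body sometag=\"&quot;\"></body>";
        System.out.println(unescapeXmlBasic(serialized));
    }
}
```

Note that unescaping the whole serialized document turns sometag="&quot;" into sometag=""", which is no longer well-formed markup, so this approach suits output meant for display or plain-text use and not output that will be parsed again.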

Answer timestamps

Answer 1: 2018-02-08 06:20:19
Answer 2: 2018-02-08 11:25:47
