
jsoup parse html tag attribute

For example, given this HTML:

<html>
   <head></head>
   <body sometag='"'></body>
</html>

When I parse this HTML with jsoup like this:

Document doc = Jsoup.parse(html);
doc.outputSettings().prettyPrint(false);
System.out.println(doc.toString());

the output becomes:

<html>
   <head></head>
   <body sometag="&quot;"></body>
</html>

Notice the ' and " in the attribute: I don't want jsoup to replace the single quotes with double quotes or to escape the " as &quot;. I just need to extract some text. Is there any way to stop jsoup from doing this? Thanks a lot.

Just don't use an HTML parser. Use an XML parser instead.

Document doc = Jsoup.parse(html, "", Parser.xmlParser());
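As a rough sanity check that is independent of jsoup, the sketch below uses the JDK's built-in DOM parser (not jsoup's Parser.xmlParser()) to read the attribute from both the original markup and the re-serialized markup. The class name AttributeQuoteCheck and the helper readSometag are made up for illustration. Both forms decode to the same one-character value, which suggests the difference lies only in how the attribute is written out, not in what it contains.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttributeQuoteCheck {

    // Hypothetical helper: parses a well-formed snippet with the JDK's DOM
    // parser and returns the decoded value of the "sometag" attribute on <body>.
    static String readSometag(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            Element body = (Element) doc.getElementsByTagName("body").item(0);
            return body.getAttribute("sometag");
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the example short.
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        // The markup from the question, and the form jsoup printed for it.
        String input = "<html><head></head><body sometag='\"'></body></html>";
        String output = "<html><head></head><body sometag=\"&quot;\"></body></html>";

        // Both serializations carry the same attribute value: a single ".
        System.out.println(readSometag(input).equals(readSometag(output))); // prints: true
    }
}
```

If the goal is only to get the text out, reading the attribute value from the parsed document sidesteps the quoting question entirely.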

After experimenting with different kinds of string escaping, the easiest way I found to achieve this is the following, though it may not be exactly what you're after:

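// StringEscapeUtils is not part of jsoup; it is presumably the Apache Commons class of that name.
String html = "<html> <head> </head> <body sometag='\"'> </body> </html>";

Document doc = Jsoup.parse(html);
// Use the minimal XHTML entity set when serializing.
doc.outputSettings().escapeMode(Entities.EscapeMode.xhtml);
// Turn entities such as &quot; back into literal characters in the printed string.
System.out.println(StringEscapeUtils.unescapeXml(doc.toString()));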
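If adding a library just for the unescaping step is undesirable, here is a minimal JDK-only sketch of the same idea. The class XmlUnescapeSketch and the method unescapeBasicXml are hypothetical names, and the method is only an approximation of unescapeXml: it reverses the five predefined XML entities and ignores numeric character references.

```java
class XmlUnescapeSketch {

    // Reverses only the five predefined XML entities. "&amp;" is handled last
    // so that input like "&amp;quot;" becomes "&quot;" rather than a bare quote.
    static String unescapeBasicXml(String text) {
        return text
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // The body element as it appears in the serialized output from the question.
        String serialized = "<body sometag=\"&quot;\"></body>";
        System.out.println(unescapeBasicXml(serialized));
        // prints: <body sometag="""></body>
    }
}
```

Note that the result keeps the double-quote delimiters, so the attribute is written as `"""`. That string is no longer well-formed markup, so this approach only suits output meant for display, not output that will be parsed again.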


 