简体   繁体   中英

How do I get parsed HTML special characters using JSOUP

I am using JSoup to get the H1 tag value from a webpage, this tag contains the following HTML.

Hexyl β-D-glucopyranoside

When I use the .text() method I get the following. (Note the ?) I assume this is because it cannot work out the HTML for the "β" character. How do I get this value as rendered on a webpage.

Hexyl ?-D-glucopyranoside


Do I need to do some kind of conversion after I have picked up the text I want?

Here is my code.

        String check = "<title>Hexyl &#946;-D-glucopyranoside &#8805;98.0% (TLC) | &#8805; &#8805;</title>";
        Document doc3 = Jsoup.parse(check);
        doc3.outputSettings().escapeMode(Entities.EscapeMode.base); // default

        doc3.outputSettings().charset("UTF-8");
        System.out.println("UTF-8: " + doc3.html());
        //doc3.outputSettings().charset("ISO 8859-1");
        doc3.outputSettings().charset("ASCII");
        System.out.println("ASCII: " + doc3.html());`

-----Output at console-----

    UTF-8: <html>
    <head>
    <title>Hexyl ?-D-glucopyranoside ?98.0% (TLC) | ? ? </title>
     </head>
    <body></body>
   </html>
   ASCII: <html>
    <head>
    <title>Hexyl &#946;-D-glucopyranoside &#8805;98.0% (TLC) | &#8805; &#8805;</title>
     </head>
    <body></body>
    </html>

Looks like the IDE you're using is using the wrong character encoding.

It's nothing to do with your code as I've ran it and it's fine (outputs the weird characters). If you're using Eclipse go to the run configuration settings for that particular project and click the 'common' tab then choose UTF-8.

It's too late to set charset after parsing a document. I had the same problem once, tried to do it your way and failed miserably.

This worked for me:

String url = "url to html page";
InputStream is is =new URL(url).openStream();
org.jsoup.nodes.Document doc = org.jsoup.Jsoup.parse(is , "ISO-8859-2", url);

If I have html text only as string, I convert it to InputString first ( http://www.kodejava.org/examples/265.html )

InputStream is = new ByteArrayInputStream(text.getBytes("UTF-8"));

then read it with correct charset:

BufferedReaderr = new BufferedReader(new InputStreamReader(is, "UTF-8"), 4*1024);
StringBuilder total = new StringBuilder();
String line = "";
while ((line = r.readLine()) != null) {
     total.append(line);
}
r.close();
is.close();
String html = total.toString();

...and parse:

doc = org.jsoup.Jsoup.parse(html);

The important thing is to somehow get InputStream object and from here there're ways to use your desired charset with it. Maybe it can be done in a more strightforward way. But it works.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM