简体   繁体   中英

UTF-8 response with servlet

I am reading HTTP response from a Perl page in a Servlet like this:

public String getHTML(String urlToRead) {
        URL url;
        HttpURLConnection conn;
        BufferedReader rd;
        String line;
        String result = "";
        try {
           url = new URL(urlToRead);
           conn = (HttpURLConnection) url.openConnection();
           conn.setRequestMethod("GET");
           conn.setRequestProperty("Accept-Charset", "UTF-8");
           conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");

           rd = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
           while ((line = rd.readLine()) != null) {
              byte [] b = line.getBytes();
              result += new String(b, "UTF-8");
           }
           rd.close();
        } catch (Exception e) {
           e.printStackTrace();
        }
        return result;
   }

I am displaying this result with this code:

response.setContentType("text/plain; charset=UTF-8");

        PrintWriter out = new PrintWriter(new OutputStreamWriter(response.getOutputStream(), "UTF-8"), true);


        try {

            String query = request.getParameter("query");
            String type = request.getParameter("type");

            String res = getHTML(url);
            out.write(res);

        } finally {            
            out.close();
        }

But the response still is not encoded as UTF-8. What am I doing wrong?

Thanks in advance.

That call to line.getBytes() looks suspicious. You should probably make it line.getBytes("UTF-8") if you are certain that what is returned is UTF-8 encoded. Additionally, I'm not sure why it is even necessary. A typical approach to getting data out of a BufferedReader is to use a StringBuilder to continue appending each String retrieved from readLine into a result. The conversion back and forth between String and byte[] is unnecessary.

Change result into a StringBuilder and do this:

while ((line = rd.readLine()) != null) {
    result.append(line);
}

Here is where you break the chain of character encoding conversions:

       while ((line = rd.readLine()) != null) {
          byte [] b = line.getBytes();  // NOT UTF-8
          result += new String(b, "UTF-8");
       }

From String#getBytes() javadoc:

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array

And, defaullt charset is probably not UTF-8.

But why do all the conversions in the first place? Just read the raw bytes from the source and write the raw bytes to the consumer. It's supposed to be UTF-8 all the way.

I also faced the same problem in another scenario, but just do it I believe it will work:

byte[] b = line.getBytes(UTF8_CHARSET);

in the while loop:

while ((line = rd.readLine()) != null) {
          byte [] b = line.getBytes();  // NOT UTF-8
          result += new String(b, "UTF-8");
       }

In my case, I have do add another configuration .

Previously, I was writing the page this way:

try (PrintStream printStream = new PrintStream(response.getOutputStream()) {
        printStream.print(pageInjecting);
}

I changed to:

try (PrintStream printStream = new PrintStream(response.getOutputStream(), false, "UTF-8")) {
        printStream.print(pageInjecting);
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM