
Keep unicode characters in Java string

I'm writing a crawler in Java to crawl some websites, which may contain non-ASCII Unicode characters such as "£". When I store the content (the source HTML) in a Java String and write it out, these characters get lost and are replaced by the question mark "?". I'd like to know how to keep them intact. The related code is as follows:

protected String readWebPage(String weburl) throws IOException {
    HttpClient httpclient = new DefaultHttpClient();

    HttpGet httpget = new HttpGet(weburl);
    ResponseHandler<String> responseHandler = new BasicResponseHandler();
    String responseBody = httpclient.execute(httpget, responseHandler);
    // responseBody now contains the contents of the page
    httpclient.getConnectionManager().shutdown();
    return responseBody;
}

// function call
String res = readWebPage(url);
PrintWriter out = new PrintWriter(outDir+name+".html");
out.println(res);
out.close();

And later when doing character matches, I also want to be able to do something like:

if(text.indexOf("£")>=0)

I don't know whether Java will recognize that character and do what I want.

Any input will be greatly appreciated. Thanks in advance.

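Use the following code:

// Encode the file explicitly as UTF-8 instead of relying on the platform default
// (StandardCharsets is in java.nio.charset)
FileOutputStream fileStream = new FileOutputStream(outDir+name+".html");
OutputStreamWriter outputStreamWriter = new OutputStreamWriter(fileStream, StandardCharsets.UTF_8);
PrintWriter out = new PrintWriter(outputStreamWriter);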

From the Charset documentation:

A character-encoding scheme is a mapping between one or more coded character sets and a set of octet (eight-bit byte) sequences. UTF-8, UTF-16, ISO 2022, and EUC are examples of character-encoding schemes. Encoding schemes are often associated with a particular coded character set; UTF-8, for example, is used only to encode Unicode. Some schemes, however, are associated with multiple coded character sets; EUC, for example, can be used to encode characters in a variety of Asian coded character sets.

Your non-ASCII characters are either getting lost on input to Java or on output.

Java works with Unicode strings internally so you have to tell it how to decode input and encode output.
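To make the "?" symptom concrete, here is a minimal sketch (the class and method names are made up for illustration) that encodes a string to bytes and decodes it again with the same charset. A charset that cannot represent "£" substitutes its replacement byte, which for US-ASCII is the question mark.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Encode the string with the given charset, then decode the bytes with the same charset.
    public static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String pound = "\u00A3"; // POUND SIGN, written as an escape to keep this file ASCII

        // UTF-8 can represent the character, so it survives the round trip.
        System.out.println(roundTrip(pound, StandardCharsets.UTF_8).equals(pound)); // prints true

        // US-ASCII cannot, so getBytes substitutes the replacement byte '?'.
        System.out.println(roundTrip(pound, StandardCharsets.US_ASCII)); // prints ?
    }
}
```

The same substitution happens silently inside a writer whose charset cannot represent the character, which is why the damage only shows up in the output file.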

Let's assume that HttpClient is correctly interpreting the response from the remote server and is decoding the response correctly.

Next up, you have to ensure that the content is encoded correctly when you write it to disk. Without an explicit charset, PrintWriter uses the platform default encoding, which is derived from the local environment and may not be able to represent every character. To force the encoding, pass the charset name to PrintWriter:

PrintWriter out = new PrintWriter(outDir+name+".html", "UTF-8");

Then check your output.html with a text editor, such as Notepad++, running in UTF-8 mode to ensure that you can still see non-ASCII chars.
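If you would rather check the file programmatically than in an editor, something along these lines should work. It is only a sketch: the class name, method name and temporary file are made up for the example. It writes the text with an explicit UTF-8 PrintWriter and reads the raw bytes back, so you can see exactly what landed on disk.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8WriteCheck {
    // Writes the text to a temporary file as UTF-8 and returns the raw bytes that were written.
    public static byte[] writeAndReadBack(String text) {
        try {
            Path file = Files.createTempFile("page", ".html");
            try (PrintWriter out = new PrintWriter(file.toFile(), "UTF-8")) {
                out.print(text);
            }
            byte[] bytes = Files.readAllBytes(file);
            Files.delete(file);
            return bytes;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // UTF-8 encodes U+00A3 (the pound sign) as the two bytes C2 A3.
        for (byte b : writeAndReadBack("\u00A3")) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```

If you see a single 3F byte (the "?") instead of C2 A3, the character was already lost before or during the write.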

If you can't, then you'll need to turn your attention to the input: HttpClient. See the answer "Set response encoding with HttpClient 3.1" for clues if your remote server is lying about the character encoding.
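On the input side, the idea is to take the response body as raw bytes and decide the charset yourself. The sketch below uses only the JDK and assumes you already have the body bytes and the Content-Type header value. How you get those out of HttpClient is not shown, and the class, the charsetOf helper and the fallback parameter are hypothetical names, not part of any library.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResponseDecoding {
    // Finds charset=NAME (optionally quoted) inside a Content-Type value.
    private static final Pattern CHARSET =
            Pattern.compile("(?i)charset\\s*=\\s*[\"']?([-\\w.:]+)");

    // Hypothetical helper: the charset named in the header, or the fallback if none is usable.
    public static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Unknown or malformed charset name: use the fallback below.
                }
            }
        }
        return fallback;
    }

    // Decodes the raw body bytes with the charset chosen above.
    public static String decode(byte[] body, String contentType, Charset fallback) {
        return new String(body, charsetOf(contentType, fallback));
    }
}
```

If the server declares the wrong charset, pass null as the header value and the charset you know to be right as the fallback, so the declaration is ignored.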

In answer to your sub-question: you can use non-ASCII chars, such as "£", in your source code if you tell Java what character encoding your source code is in. This is a parameter to javac (-encoding), but as you're likely to be using an IDE, you can simply set the character encoding of your file in the properties and the IDE will do the rest. The most portable thing to do is to set the character encoding in your IDE to "UTF-8". Eclipse allows you to set the character encoding for the whole project or for individual files.
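If you want the match to work no matter how the source file is saved, one option is to write the character as a Unicode escape, which keeps the source pure ASCII. A small sketch (the class and method names are just for illustration):

```java
public class PoundMatch {
    // U+00A3 POUND SIGN as an escape, so the source file encoding no longer matters.
    public static final char POUND = '\u00A3';

    // True if the text contains the pound sign anywhere.
    public static boolean containsPound(String text) {
        return text.indexOf(POUND) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(containsPound("Price: \u00A35.00")); // prints true
        System.out.println(containsPound("Price: $5.00"));      // prints false
    }
}
```

This only addresses the source code side. The crawled text still has to be decoded correctly, otherwise there is no "£" left in the String to find.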

There are two steps: save the loaded String (in Java always Unicode) as UTF-8, and make sure the saved page declares that encoding. A browser opening the file from the file system has no HTTP headers to consult, only the HTML meta tags, so you need to make sure there is something like

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

1. Patch the HTML charset declaration of the original page to UTF-8 first.

String res2 = res.replaceFirst("charset=([-\\w]+)", "charset=UTF-8")
         .replaceFirst("charset=([\"'])([-\\w]+)\\1", "charset=$1UTF-8$1");
// Relies on replaceFirst returning the same String instance when nothing matched
if (res2 == res) { // No charset given
      res2 = res.replaceFirst("(?i)</head>",
              "<meta http-equiv='Content-Type' content='text/html; charset=UTF-8' />$0");
}
res = res2;

This handles an HTML meta tag with either Content-Type or the (HTML5) charset attribute.

2. Write the HTML as UTF-8.

PrintWriter out = new PrintWriter(outDir+name+".html", "UTF-8");
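For reference, here is a self-contained variant of the patching step. It is a sketch rather than a drop-in replacement: the class and method names are made up, it uses a single pattern for the quoted and unquoted forms, and it checks for a match with a Matcher instead of comparing String references. It is still a plain regex, not an HTML parser, so it will also rewrite a "charset=" that happens to appear elsewhere in the page before the meta tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetPatcher {
    // Matches charset=NAME, charset="NAME" or charset='NAME', case-insensitively.
    // Group 1 is everything up to the name, including an opening quote if present.
    private static final Pattern DECLARED =
            Pattern.compile("(?i)(charset\\s*=\\s*[\"']?)[-\\w]+");
    private static final Pattern HEAD_END = Pattern.compile("(?i)</head>");

    public static String declareUtf8(String html) {
        Matcher declared = DECLARED.matcher(html);
        if (declared.find()) {
            // Keep the original prefix and swap only the charset name.
            return declared.replaceFirst("$1UTF-8");
        }
        // No declaration found: add a meta tag just before the closing head tag.
        return HEAD_END.matcher(html).replaceFirst(
                "<meta http-equiv='Content-Type' content='text/html; charset=UTF-8' />$0");
    }
}
```

A page without a head element is returned unchanged, so you may want to handle that case separately.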
