Java file.encoding on reading UTF-8 file and handling UTF-8 string

I am trying to read a UTF-8 encoded XML file and pass the UTF-8 string to native code (a C++ DLL).

My problem is best explained with a sample program:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;

public class UniCodeTest {

    private static void testByteConversion(String input) throws UnsupportedEncodingException  {    

        byte[] utf_8 = input.getBytes("UTF-8");  // encode the String as UTF-8 bytes
        String test = new String(utf_8);         // decode those bytes with the platform default charset (file.encoding)
        byte[] utf_8_converted = test.getBytes();// encode again with the default charset; in effect this is what the JNI wrapper on the C++ side reads as char*

        // simple workaround to print hex values; "& 0xFF" keeps a negative byte
        // from being sign-extended (e0 instead of ffffffe0)
        String utfString = "";
        for (int i = 0; i < utf_8.length; i++) {
            utfString += " " + Integer.toHexString(utf_8[i] & 0xFF);
        }

        String convertedUtfString = "";
        for (int i = 0; i < utf_8_converted.length; i++) {
            convertedUtfString += " " + Integer.toHexString(utf_8_converted[i] & 0xFF);
        }
        if (utfString.equals(convertedUtfString))   {
            System.out.println("Success" ); 
        }
        else {
            System.out.println("Failure" ); 
        }
    }

    public static void main(String[] args) {
        try {
            File inFile = new File("c:/test.txt");
            BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(inFile), "UTF8"));
            String str;
            while ((str = in.readLine()) != null) {
                testByteConversion(str);
            }
            in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

The test file is stored in UTF-8 and contains Tamil text:

#just test
 நனமை
 நன்மை

I did the following experiments:

  1. With the file.encoding property set to 'UTF-8', I get 'Success' for both inputs.

  2. With file.encoding set to 'CP-1252', I get 'Success' for the first input and 'Failure' for the second.

Here are the bytes I got in the failure case:

utf_8           :  e0 ae a8 e0 ae a9 e0 af 8d e0 ae ae e0 af 88
utf_8_converted :  e0 ae a8 e0 ae a9 e0 af 3f e0 ae ae e0 af 88

I do not understand why 8d is converted into 3f when file.encoding is set to CP-1252. Can anyone please explain?

I am missing the link between file.encoding and String handling.

Thanks in advance :)

I have only skimmed your post, but this step looks odd:

byte[] utf_8 = input.getBytes("UTF-8");  // convert unicode string to UTF-8
String test = new String(utf_8); 

You take a String in Java (a sequence of Unicode characters, independent of any byte encoding) and turn it into bytes with a given encoding (UTF-8). You then construct a new String from those bytes without specifying the encoding. In effect, test now contains the UTF-8 bytes decoded with the system encoding, which may or may not give a valid result, depending on what you put in the string and which system encoding you have.

In the next step you get the bytes back out of the horrific entity that is "test", again in the default encoding. That only returns the original bytes if every byte of the UTF-8 data is a valid character in the system encoding. In CP-1252 the byte 8d has no character assigned, so it is decoded to the replacement character U+FFFD, which CP-1252 cannot encode, and getBytes() writes '?' (3f) in its place. Even when it does work, the step is basically a useless move, because it uses the same system encoding you used to construct test:

byte[] utf_8_converted = test.getBytes();
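
To see the effect without touching file.encoding, the round trip can be reproduced with the charset named explicitly. This is only a minimal sketch: the class and method names (LossyRoundTrip, toHex, roundTrip) are made up for illustration, and it assumes the JVM provides the windows-1252 charset, which is common but not guaranteed by the Java specification.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LossyRoundTrip {

    // Format bytes as space-separated two-digit hex values.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Decode UTF-8 bytes with the wrong charset, then encode with that same charset.
    // This mirrors new String(utf_8) followed by test.getBytes() from the question.
    static byte[] roundTrip(byte[] utf8, Charset wrong) {
        String misdecoded = new String(utf8, wrong);
        return misdecoded.getBytes(wrong);
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 is available on this JVM.
        Charset cp1252 = Charset.forName("windows-1252");
        // The second test line from the question, written with escapes.
        byte[] utf8 = "\u0BA8\u0BA9\u0BCD\u0BAE\u0BC8".getBytes(StandardCharsets.UTF_8);
        System.out.println("utf_8           : " + toHex(utf8));
        System.out.println("utf_8_converted : " + toHex(roundTrip(utf8, cp1252)));
    }
}
```

The first test line survives the same round trip only because none of its UTF-8 bytes falls on an unassigned CP-1252 position.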

I think this statement is causing the issue:
byte[] utf_8_converted = test.getBytes();

From the documentation of String.getBytes() API:

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

The behavior of this method when this string cannot be encoded in the default charset is unspecified. The CharsetEncoder class should be used when more control over the encoding process is required.
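
As a rough illustration of the CharsetEncoder route the documentation mentions, the sketch below encodes with error reporting switched on, so an unencodable character raises an exception instead of silently turning into '?'. The names StrictEncode and encodeStrict are hypothetical, and windows-1252 is assumed to be available on the JVM.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

class StrictEncode {

    // Encode text with the given charset, throwing instead of substituting '?'.
    static byte[] encodeStrict(String text, Charset charset) throws CharacterCodingException {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buffer = encoder.encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 is available on this JVM.
        Charset cp1252 = Charset.forName("windows-1252");
        try {
            // Tamil text has no representation in windows-1252.
            encodeStrict("\u0BA8\u0BA9\u0BCD\u0BAE\u0BC8", cp1252);
            System.out.println("encoded without loss");
        } catch (CharacterCodingException e) {
            System.out.println("cannot encode: " + e);
        }
    }
}
```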

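Point to note: the default charset used for this conversion follows file.encoding, so it is not necessarily UTF-8.

Try this, and pass the same charset when building the String, because new String(utf_8) also decodes with the platform default:

String test = new String(utf_8, "UTF-8");
byte[] utf_8_converted = test.getBytes("UTF-8");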
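
A minimal sketch of the round trip with UTF-8 named on both the decoding and the encoding side, so that file.encoding plays no part. The class and method names (ExplicitUtf8, roundTrip) are invented for the example; StandardCharsets.UTF_8 requires Java 7 or later and avoids the checked UnsupportedEncodingException.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ExplicitUtf8 {

    // Decode and re-encode with UTF-8 stated explicitly in both directions.
    static byte[] roundTrip(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The second test line from the question, written with escapes.
        byte[] utf8 = "\u0BA8\u0BA9\u0BCD\u0BAE\u0BC8".getBytes(StandardCharsets.UTF_8);
        boolean same = Arrays.equals(utf8, roundTrip(utf8));
        System.out.println(same ? "Success" : "Failure");
    }
}
```

With both calls pinned to UTF-8, the result should not change when the program is started with a different file.encoding.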
