
UTF-8 encoding CSV file

I have a CSV file that I saved from Excel using the "CSV UTF-8" format. My Java code reads the file into a byte array and then decodes it:

String result = new String(b, 0, b.length, "UTF-8");

But somehow the content "Montréal" becomes "Montr?al" when it is saved to the DB. What might be the problem?

The environment is Unix, with:

-bash-4.1$ locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=

BTW, it works on my Windows machine: when I run the same code there, I see the correct "Montréal" in the DB. So my guess is that the Unix environment has some default locale setting that forces the use of a default encoding.

Thanks

I don't have your complete code, but I tried the following code and it works for me:

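    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    public class ReadCsv {
        public static void main(String[] args) throws IOException {
            String path = "c:/Book2.csv";

            // Decode explicitly as UTF-8 instead of relying on the platform default;
            // try-with-resources closes the reader even if reading fails.
            try (BufferedReader br = new BufferedReader(new InputStreamReader(
                    new FileInputStream(path), StandardCharsets.UTF_8))) {
                String line;
                while ((line = br.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }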

If you see "Montr?al" printed on your console, don't worry. It does not necessarily mean that the program is not working; your console may simply not support printing UTF-8 characters. To check, set a breakpoint and inspect the variable in a debugger to see whether it holds the value you expect.

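If you see the correct value in the debugger but a "?" in the printed output, the String variable holds the right value, and the "?" is introduced only when the text is printed. You can then write the value to a file or DB, as long as that step also uses an encoding that can represent the character.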
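If a debugger is not available on the Unix box, one way to inspect the value without trusting the console is to print every non-ASCII character as a Unicode escape. This is only a minimal sketch; the class name `EncodingDebug` and the helper `escapeNonAscii` are hypothetical names introduced for illustration.

```java
public class EncodingDebug {

    // Hypothetical helper: keeps ASCII characters as they are and replaces
    // every other char with a backslash-u escape of its UTF-16 code unit.
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the string decoded from the CSV file.
        String decoded = "Montr\u00e9al";
        // The output is pure ASCII, so the console encoding cannot distort it.
        System.out.println(escapeNonAscii(decoded));
    }
}
```

A correctly decoded "é" shows up as `\u00e9`. If the character was already lost before this point, the output contains a literal `?` instead.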

If you see "?" when you query your DB, the tool you are using may not be displaying the output correctly. Try reading the DB value back in Java code and check it in a debugger. I usually use PuTTY to query the DB so that I can see the double-byte characters correctly. That's all I have; hope that helps.

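You may need to use ISO/IEC 8859-1 instead of UTF-8 if the file is not actually UTF-8 encoded; the list of character encodings on Wikipedia explains the difference. Basically, ISO/IEC 8859-1 is a single-byte encoding commonly used for Western European languages, whereas UTF-8 is a variable-length encoding of Unicode.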
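To see which charset matches a given file, it can help to decode the same bytes both ways and compare. The sketch below hard-codes the bytes that a UTF-8 file would contain for "Montréal" (the "é" is the two bytes `0xC3 0xA9`); the class name `DecodeComparison` and the `decode` helper are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeComparison {

    // Hypothetical helper: decodes the bytes with an explicit charset.
    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    // "Montréal" as UTF-8 bytes: 0xC3 0xA9 is the encoding of U+00E9.
    static byte[] sampleUtf8Bytes() {
        return new byte[] {'M', 'o', 'n', 't', 'r', (byte) 0xC3, (byte) 0xA9, 'a', 'l'};
    }

    public static void main(String[] args) {
        byte[] bytes = sampleUtf8Bytes();
        String asUtf8 = decode(bytes, StandardCharsets.UTF_8);
        String asLatin1 = decode(bytes, StandardCharsets.ISO_8859_1);
        // Print lengths rather than the text, so the console encoding does not matter.
        System.out.println("UTF-8 length:      " + asUtf8.length());   // 8 chars
        System.out.println("ISO-8859-1 length: " + asLatin1.length()); // 9 chars
    }
}
```

Decoding UTF-8 bytes as ISO-8859-1 turns the single "é" into two characters ("Ã©"), not into "?". A "?" therefore suggests that the character is being lost at a later step, when the string is encoded again into a charset that cannot represent it.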

Also, you can check your terminal encoding, maybe the problem is there.
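Beyond the terminal, it may be worth checking which default charset the JVM picked up from the `C` locale shown in the question. The sketch below prints the default and shows what happens when "Montréal" is encoded into US-ASCII, which is what a JVM often defaults to under that locale (an assumption about the asker's setup, not something the question confirms). The class name `DefaultCharsetCheck` and the `roundTrip` helper are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetCheck {

    // Hypothetical helper: encodes the text into the given charset and decodes it again.
    // String.getBytes(Charset) replaces unmappable characters with the charset's
    // replacement byte, which is '?' for US-ASCII.
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // What the JVM uses whenever no charset is given explicitly.
        System.out.println("default charset = " + Charset.defaultCharset());
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));

        String city = "Montr\u00e9al";
        // US-ASCII cannot represent U+00E9, so it comes back as '?'.
        System.out.println(roundTrip(city, StandardCharsets.US_ASCII));
        // UTF-8 can represent it, so the round trip is lossless.
        System.out.println(roundTrip(city, StandardCharsets.UTF_8).equals(city));
    }
}
```

If the first line reports an ASCII charset on the Unix machine, any step that converts strings to bytes without naming a charset could produce the "Montr?al" seen in the DB. Passing an explicit charset at each such step, or starting the JVM with `-Dfile.encoding=UTF-8`, is something to try.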


 