简体   繁体   中英

Convert file from Cp1252 to utf -8 java

User uploads file with the character encoding : Cp1252

Since my mysql table columns Collation as utf8_bin, I try to convert the file to utf-8 before putting the data into table using LOAD DATA INFILE command.

Java source code:

OutputStream output = new FileOutputStream(destpath);
InputStream input = new FileInputStream(filepath);
BufferedReader reader = new BufferedReader(new InputStreamReader(input, "windows-1252"));
BufferedWriter writ = new BufferedWriter(new OutputStreamWriter(output, "UTF8"));
String in;
while ((in = reader.readLine()) != null) {
    writ.write(in);
    writ.newLine();
}
writ.flush();
writ.close();

It seems that characters are not converted correctly. Converted unicode file has and box symbols at multiple places. How to convert file efficiently to uft-8? Thanks.

One way of verifying the conversion process is to configure the charset decoder and encoder to bail out on errors instead of silently replacing the erroneous characters with special characters:

CharsetDecoder inDec=Charset.forName("windows-1252").newDecoder()
  .onMalformedInput(CodingErrorAction.REPORT)
  .onUnmappableCharacter(CodingErrorAction.REPORT);

CharsetEncoder outEnc=StandardCharsets.UTF_8.newEncoder()
  .onMalformedInput(CodingErrorAction.REPORT)
  .onUnmappableCharacter(CodingErrorAction.REPORT);

try(FileInputStream is=new FileInputStream(filepath);
    BufferedReader reader=new BufferedReader(new InputStreamReader(is, inDec));
    FileOutputStream fw=new FileOutputStream(destpath);
    BufferedWriter out=new BufferedWriter(new OutputStreamWriter(fw, outEnc))) {

    for(String in; (in = reader.readLine()) != null; ) {
        out.write(in);
        out.newLine();
    }
}

Note that the output encoder is configured for symmetry here, but UTF-8 is capable of encoding every unicode character, however, doing it symmetric will help once you want to use the same code for performing other conversions.

Further, note that this won't help if the input file is in a different encoding but misinterpreting the bytes leads to valid characters. One thing to consider is whether the input encoding "windows-1252" actually meant the system's default encoding (and whether that is really the same). If in doubt, you may use Charset.defaultCharset() instead of Charset.forName("windows-1252") when the actually intended conversion is defaultUTF-8 .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM