
Convert file from Cp1252 to UTF-8 in Java

A user uploads a file with the character encoding Cp1252.

Since my MySQL table columns use the utf8_bin collation, I try to convert the file to UTF-8 before loading the data into the table with the LOAD DATA INFILE command.

Java source code:

OutputStream output = new FileOutputStream(destpath);
InputStream input = new FileInputStream(filepath);
BufferedReader reader = new BufferedReader(new InputStreamReader(input, "windows-1252"));
BufferedWriter writ = new BufferedWriter(new OutputStreamWriter(output, "UTF8"));
String in;
while ((in = reader.readLine()) != null) {
    writ.write(in);
    writ.newLine();
}
writ.flush();
writ.close();

It seems that the characters are not converted correctly. The converted file has stray symbols, including boxes, in multiple places. How can I convert the file to UTF-8 correctly and efficiently? Thanks.

One way of verifying the conversion process is to configure the charset decoder and encoder to bail out on errors instead of silently replacing the erroneous characters with special characters:

CharsetDecoder inDec=Charset.forName("windows-1252").newDecoder()
  .onMalformedInput(CodingErrorAction.REPORT)
  .onUnmappableCharacter(CodingErrorAction.REPORT);

CharsetEncoder outEnc=StandardCharsets.UTF_8.newEncoder()
  .onMalformedInput(CodingErrorAction.REPORT)
  .onUnmappableCharacter(CodingErrorAction.REPORT);

try(FileInputStream is=new FileInputStream(filepath);
    BufferedReader reader=new BufferedReader(new InputStreamReader(is, inDec));
    FileOutputStream fw=new FileOutputStream(destpath);
    BufferedWriter out=new BufferedWriter(new OutputStreamWriter(fw, outEnc))) {

    for(String in; (in = reader.readLine()) != null; ) {
        out.write(in);
        out.newLine();
    }
}

Note that the output encoder is configured the same way only for symmetry. UTF-8 is capable of encoding every Unicode character, so the encoder settings are not expected to matter for this particular conversion, but keeping the setup symmetric will help once you want to use the same code for other conversions.
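To illustrate reusing the same code for other conversions, here is a minimal sketch of the strict setup with both charsets passed as parameters. The class name StrictTranscoder and the method convert are hypothetical names chosen for this illustration and are not part of the code above. As a further assumption of the sketch, it copies characters through a buffer rather than line by line, which leaves the file's line terminators unchanged.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class; the names are illustrative only.
public class StrictTranscoder {

    // Re-encodes the file src (read as charset "from") into dst (written as charset "to").
    // Throws a CharacterCodingException (a subclass of IOException) on a malformed
    // or unmappable character instead of silently replacing it.
    public static void convert(Path src, Charset from, Path dst, Charset to) throws IOException {
        CharsetDecoder dec = from.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharsetEncoder enc = to.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);

        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(Files.newInputStream(src), dec));
             BufferedWriter writer = new BufferedWriter(
                 new OutputStreamWriter(Files.newOutputStream(dst), enc))) {
            // Copy raw characters so that line terminators pass through untouched.
            char[] buf = new char[8192];
            for (int n; (n = reader.read(buf)) != -1; ) {
                writer.write(buf, 0, n);
            }
        }
    }
}
```

With a target such as ISO-8859-1, the encoder settings start to matter: a call like StrictTranscoder.convert(src, Charset.forName("windows-1252"), dst, StandardCharsets.ISO_8859_1) should fail with an exception if the input contains a euro sign, because that character has no ISO-8859-1 representation.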

Further, note that this won't help if the input file is actually in a different encoding and misinterpreting its bytes still yields valid characters. One thing to consider is whether the input encoding "windows-1252" was actually meant to be the system's default encoding (and whether the two are really the same). If in doubt, you may use Charset.defaultCharset() instead of Charset.forName("windows-1252") when the intended conversion is from the default encoding to UTF-8.
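The limitation with a wrong source encoding can be seen in a small sketch. The class name WrongSourceCharsetDemo and the method decodeStrict are hypothetical and exist only for this illustration. The sketch takes bytes that are really UTF-8 and decodes them strictly as windows-1252: no exception is raised, because each byte happens to map to a valid windows-1252 character.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
public class WrongSourceCharsetDemo {

    // Strictly decodes the bytes as windows-1252; throws on invalid input.
    public static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return Charset.forName("windows-1252").newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(bytes))
            .toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // U+00E9 (e with acute accent) stored as UTF-8 is the two bytes 0xC3 0xA9.
        byte[] utf8Bytes = "\u00E9".getBytes(StandardCharsets.UTF_8);

        // Both bytes are valid windows-1252 characters (U+00C3 and U+00A9),
        // so strict decoding succeeds and yields two wrong characters.
        String decoded = decodeStrict(utf8Bytes);
        System.out.println(decoded.equals("\u00C3\u00A9")); // prints "true"
    }
}
```

Strict error reporting therefore only catches bytes that are invalid in the assumed encoding. Whether the assumed encoding is the right one still has to be established separately.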
