
Convert ANSI characters to UTF-8 in Java

Is there a way to convert an ANSI string to UTF-8 using Java?

I have a custom serializer that uses the readUTF and writeUTF methods of DataInputStream and DataOutputStream to deserialize and serialize strings. If I receive a string that is encoded in ANSI and is too long (around 100,000 characters), I get this error:

Caused by: java.io.UTFDataFormatException: encoded string too long: 106958 bytes

However, in my JUnit tests I am able to create a string of 120,000 'a's, and it works perfectly.

I have checked several related posts but am still getting errors.

This error is not caused by character encoding. It means the length of the UTF data is wrong.

EDIT: I just realized this is a write error, not a read error.

The UTF length prefix is only 2 bytes, so it can describe at most 64K (65535) UTF-8 bytes. You are trying to write more than 100K, so it is not going to work.
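The 2-byte limit is easy to observe directly. The following is a minimal sketch, not part of the original answer; the class name WriteUtfLimitDemo and its helper methods are made up for illustration. It feeds strings of different sizes to writeUTF and reports whether each one is accepted.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical demo class: probes the size limit of DataOutputStream.writeUTF.
final class WriteUtfLimitDemo {

    /** Builds a string consisting of {@code count} copies of {@code c}. */
    static String repeat(char c, int count) {
        char[] chars = new char[count];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    /** Returns true if writeUTF accepts the string, false if it rejects it as too long. */
    static boolean fitsInWriteUTF(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            return false; // the encoded form exceeds 65535 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // ASCII characters take one byte each, so 65535 of them just fit.
        System.out.println(fitsInWriteUTF(repeat('a', 65535)));      // true
        System.out.println(fitsInWriteUTF(repeat('a', 65536)));      // false
        // U+00E9 takes two bytes, so the limit is hit at half as many characters.
        System.out.println(fitsInWriteUTF(repeat('\u00e9', 32768))); // false
    }
}
```

Note that the limit applies to the encoded byte count, not to the number of characters, which is why text containing non-ASCII characters reaches it sooner.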

This limit is hardcoded in writeUTF, and there is no way to raise it:

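// Excerpt from the JDK's DataOutputStream.writeUTF:
if (utflen > 65535)
    throw new UTFDataFormatException(
            "encoded string too long: " + utflen + " bytes");

// Separate snippet: decode ASCII bytes into a String, then re-encode it as UTF-8.
byte[] asciiBytes = ...;
String unicode = new String(asciiBytes, "US-ASCII");
byte[] utfBytes = unicode.getBytes("UTF-8");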

Which ANSI codepage? There are many different character encodings that are all referred to as "ANSI". The DOS codepage is 437 (without the drawing symbols). If you use codepage 850, this will work:

String unicode = new String(bytes, "IBM850");

(where bytes is an array containing the ANSI characters). After that, you can convert this string into a byte array in any encoding using unicode.getBytes(encoding).

Windows often uses codepage 1252 (use "windows-1252" for that).
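Putting the two steps together, a conversion from codepage 1252 to UTF-8 might look like the sketch below. The class name AnsiToUtf8 and the method toUtf8 are hypothetical, and the sketch assumes the running JVM provides the "windows-1252" charset (only a handful of charsets such as UTF-8 and ISO-8859-1 are guaranteed on every Java platform).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decode "ANSI" bytes with an explicit codepage, re-encode as UTF-8.
final class AnsiToUtf8 {

    /** Decodes the bytes using the given source charset and re-encodes the text as UTF-8. */
    static byte[] toUtf8(byte[] ansiBytes, Charset sourceCharset) {
        String text = new String(ansiBytes, sourceCharset);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumption: the input really is windows-1252; pick the codepage that matches your data.
        Charset cp1252 = Charset.forName("windows-1252");

        // 0x80 is the euro sign in windows-1252; it becomes three bytes (E2 82 AC) in UTF-8.
        byte[] utf8 = toUtf8(new byte[] {(byte) 0x80}, cp1252);
        for (byte b : utf8) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```

Passing a Charset object instead of a charset name avoids the checked UnsupportedEncodingException that the String-name overloads declare.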

ZZ Coder already answered the question, but I have written a more detailed explanation and suggested a workaround on this blog. Basically, the problem is in DataOutputStream, because it restricts the writable String to 64KB. There are other possible workarounds to sidestep the issue; some might work without breaking the actual binary data format one is using...
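One possible workaround, sketched here under my own assumptions rather than taken from the blog post, is to stop using writeUTF for long strings and write a 4-byte length followed by the raw UTF-8 bytes. The class LongStringCodec and its methods are hypothetical names. This does change the binary format: data written this way cannot be read back with readUTF, and it uses standard UTF-8 rather than the modified UTF-8 that writeUTF produces.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical codec: int length prefix + UTF-8 bytes instead of writeUTF/readUTF.
final class LongStringCodec {

    /** Writes a 4-byte length followed by the string's UTF-8 bytes. */
    static void writeLongString(DataOutputStream out, String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(utf8.length);
        out.write(utf8);
    }

    /** Reads a string previously written by writeLongString. */
    static String readLongString(DataInputStream in) throws IOException {
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("negative string length: " + length);
        }
        byte[] utf8 = new byte[length];
        in.readFully(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    /** Round-trips a string through both helpers using in-memory streams. */
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            writeLongString(new DataOutputStream(buffer), s);
            return readLongString(
                    new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Same size as the string in the question's error message.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 106958; i++) {
            sb.append((char) ('a' + i % 26));
        }
        String big = sb.toString();
        System.out.println(big.equals(roundTrip(big))); // true
    }
}
```

If existing data must stay readable with readUTF, a format-preserving alternative would be needed instead, for example splitting a long string into several chunks that each fit under the limit.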
