
Java UTF8 encoding

I have a scenario in which some special characters are stored in a database (Sybase) in the system's default encoding, and I have to fetch this data and send it to a third party in UTF-8 encoding using a Java program.

There is a precondition that the data sent to the third party must not exceed a defined maximum size. Since a character may be replaced by 2 or 3 bytes upon conversion to UTF-8, my logic dictates that after getting the data from the database I must encode it into a UTF-8 string and then split the string. The following are my observations:

When any special character, such as a Chinese or Greek character or any other character outside the ASCII range, is encountered and I convert it to UTF-8, a single character may be represented by more than 1 byte.
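To make the size concern concrete, here is a minimal sketch of counting UTF-8 bytes per character and splitting on a byte budget rather than on a character count. The class name Utf8Chunker, its methods, and the 4-byte limit in main are hypothetical and chosen only for illustration; the real maximum size would come from the third party.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class Utf8Chunker {
    // Number of bytes the UTF-8 encoding of one code point occupies.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    // Splits text into pieces whose UTF-8 encoding is at most maxBytes each,
    // never cutting through the middle of a code point.
    static List<String> split(String text, int maxBytes) {
        if (maxBytes < 4) {
            throw new IllegalArgumentException("maxBytes must be at least 4");
        }
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int currentBytes = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            int len = utf8Length(cp);
            if (currentBytes + len > maxBytes) {
                // The next code point would exceed the budget: close this chunk.
                chunks.add(current.toString());
                current.setLength(0);
                currentBytes = 0;
            }
            current.appendCodePoint(cp);
            currentBytes += len;
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        String sample = "a\u03A9\u4E2D"; // Latin 'a', Greek Omega, one Chinese character
        // 1 + 2 + 3 bytes in UTF-8
        System.out.println(sample.getBytes(StandardCharsets.UTF_8).length);
        for (String chunk : split(sample, 4)) {
            System.out.println(chunk + " -> "
                    + chunk.getBytes(StandardCharsets.UTF_8).length + " bytes");
        }
    }
}
```

Splitting the String first and encoding each piece afterwards avoids cutting a multi-byte sequence in half, which could happen if the byte array itself were split at a fixed offset.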

So how can I be sure that the conversion is correct? For the conversion I am using the following:

// storing the data from the database in a string
// (getDataFromDatabase() is a placeholder for the actual fetch)
String s = getDataFromDatabase();

// converting all the data into a byte array in UTF-8 encoding
byte[] b = s.getBytes("UTF-8");

// creating a new string, as my split logic is based on the string format
String newString = new String(b, "UTF-8");

But when I output this newString to the console I get ? for the special characters.
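One way to narrow down where the ? comes from, sketched under the assumption that the string fetched from the database is well-formed: encoding to UTF-8 and decoding again gives back an equal String, so the round trip itself should not be the source of the problem. Printing the code points in ASCII-only form sidesteps the console's encoding. RoundTripCheck and describe are hypothetical names, and the sample string only stands in for the database value.

```java
import java.nio.charset.StandardCharsets;

class RoundTripCheck {
    // Renders each code point as U+XXXX so the content can be inspected
    // without depending on the console's encoding or fonts.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "\u201Cquoted\u201D \u4E2D"; // stand-in for the database value
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        String newString = new String(b, StandardCharsets.UTF_8);

        // Whether the round trip preserved the string.
        System.out.println(s.equals(newString));
        // The actual content, independent of how the console renders it.
        System.out.println(describe(newString));
    }
}
```

If describe already shows U+003F where a special character was expected, the data was probably damaged before this code ran, for example while being read from the database. If it shows the expected code points, the damage is more likely happening at output time.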

So I have some doubts:

  • If my conversion logic is wrong, how can I correct it?
  • After converting to UTF-8, can I double-check whether the conversion is OK? I mean, is it the correct message that needs to be sent to the third party? I assume that if the message is not user-readable after conversion, then there is some problem with the conversion.

I would like to have some points of view from all the experts out there.

Please let me know if any further info is needed from my side.

You say you're writing the Unicode to a text file, but that requires a conversion from Unicode.

But a conversion to what? That depends on how you open the file.

For example, System.out.println(myUnicodeString) will convert the Unicode to the encoding that System.out was constructed with, most likely your platform's default encoding. If you're running Windows, then this is likely to be windows-1252.
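As a rough illustration of the point about System.out, the sketch below wraps a stream in a PrintStream with an explicit UTF-8 encoding instead of relying on the platform default. Utf8Console and utf8 are hypothetical names. Whether the characters then display correctly still depends on the terminal interpreting the bytes as UTF-8 and having a suitable font.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8Console {
    // Wraps any stream in a PrintStream that encodes text as UTF-8.
    static PrintStream utf8(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform is required to support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8(System.out);
        out.println("\u03A9 \u4E2D"); // Greek Omega and one Chinese character
    }
}
```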

If you tell Java to use UTF-8 encoding when it writes to a file, you'll get a file containing UTF-8:

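// OutputStreamWriter performs the char-to-byte conversion using UTF-8
try (PrintWriter pw = new PrintWriter(new OutputStreamWriter(new FileOutputStream("filename.txt"), "UTF-8"))) {
    pw.println(myUnicodeString);
}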

Please use a hex editor to verify whether your output is correctly formatted UTF-8. There is no other way to tell for sure whether what you see is correct or not.
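If no hex editor is at hand, roughly the same check can be done from Java by printing the bytes as hex. This is only a sketch; HexDump and toHex are hypothetical names. A left curly quote (U+201C) should appear as E2 80 9C in UTF-8, while a lone 3F means a literal question mark was written.

```java
import java.nio.charset.StandardCharsets;

class HexDump {
    // Formats bytes as space-separated, two-digit uppercase hex values.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Left curly quote encoded as UTF-8.
        System.out.println(toHex("\u201C".getBytes(StandardCharsets.UTF_8)));
        // A plain question mark, for comparison.
        System.out.println(toHex("?".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The same method can be applied to the contents of the output file, for example after reading it with java.nio.file.Files.readAllBytes.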

And read this if you have not already: http://www.joelonsoftware.com/articles/Unicode.html

Use this for proper conversion. This one goes from ISO-8859-1 to UTF-8 (it only helps when the string's UTF-8 bytes were mistakenly decoded as ISO-8859-1):

public String to_utf8(String fieldvalue) throws UnsupportedEncodingException{

        String fieldvalue_utf8 = new String(fieldvalue.getBytes("ISO-8859-1"), "UTF-8");
        return fieldvalue_utf8;
}
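A small sketch of when this kind of re-decoding helps and when it does harm. MojibakeDemo and latin1ToUtf8 are hypothetical names, and the wrong decoding step is simulated here rather than coming from a real database.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Same idea as to_utf8 above: take the chars back to ISO-8859-1 bytes
    // and decode those bytes as UTF-8.
    static String latin1ToUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00E9"; // e with acute accent

        // Simulate the mistake: UTF-8 bytes decoded as ISO-8859-1.
        String garbled = new String(original.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);

        // Repairing the garbled string recovers the original.
        System.out.println(latin1ToUtf8(garbled).equals(original));

        // Applying the same call to a correctly decoded string damages it,
        // because the single byte E9 is not valid UTF-8 on its own.
        System.out.println(latin1ToUtf8(original).equals(original));
    }
}
```

So this conversion is a repair for one specific decoding mistake, not a general way to turn a String into UTF-8.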

Java strings are Unicode, but not all Java components support full Unicode strings, especially AWT components and lightweight Swing components. So you may have perfectly good strings but get junk in your console output.

Thanks all for your replies.

As some of you suggested, I already tried writing it to a text file; however, in the text file I also got ? for my special characters. So I have the following observations:

a) Encoding is a two-fold process: first you change the string from one encoding to another at the byte level, and then you also need the required font for the new character set.

b) If we are encoding a string, we are really encoding bytes. In the current scenario, I copy double quotes from MS Word and insert them into a Sybase database. After fetching the data from the db, I write it to a txt file, where I get the same ? for the double quotes. However, if I copy the same data directly from the db into MS Word or EditPlus, I can see the actual characters, so I am not able to comprehend this problem. As per my understanding, during encoding we should be concerned only with the byte values, which are the real representation, and not with the String object we construct from these byte arrays. However, if my encoded information is not human-readable, how can the other party validate it and read it? (I am guessing they would be reading bytes, but if a junk character like ? has been introduced for a special character during UTF-8 encoding, is that not a loss of information?)
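On the information-loss question, here is a sketch of one way a ? can be introduced. It assumes, purely for illustration, that somewhere in the chain the text passes through a charset that has no curly quotes, such as ISO-8859-1; whether that is what happens in this Sybase setup is not known from the thread. LossyEncodingCheck and canEncode are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LossyEncodingCheck {
    // True if every character of text can be represented in the charset.
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String quotes = "\u201Chello\u201D"; // curly quotes as typed in MS Word

        // ISO-8859-1 has no curly quotes, so getBytes substitutes '?' (0x3F).
        byte[] latin1 = quotes.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println((char) latin1[0]);

        // Checking up front shows which charsets would lose information.
        System.out.println(canEncode(quotes, StandardCharsets.ISO_8859_1));
        System.out.println(canEncode(quotes, StandardCharsets.UTF_8));
    }
}
```

Once a character has been replaced by 0x3F the original cannot be recovered from the bytes, so that substitution is a real loss. UTF-8 itself can represent every Unicode character, which suggests looking for the loss in a step before or after the UTF-8 encoding.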

I would really appreciate your views on my observations. What approach should I follow from here?


 