How to convert UTF-8 character to ISO Latin 1?

I need to convert a UTF-8 trademark sign to ISO Latin 1 and save it into a database, which is also ISO Latin 1 encoded.

How can I do that in Java?

I've tried something like

String s2 = new String(s1.getBytes("ISO-8859-1"), "utf-8");

but it doesn't seem to work as I expected.

A string in Java is always in Unicode (UTF-16, effectively). Conversions are only necessary when you're trying to go from text to a binary encoding or vice versa.
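To illustrate the point, here is a minimal sketch (the class and method names are made up for this example) showing that the charset only matters at the moment a String is turned into bytes. The same String produces different byte arrays depending on the encoding chosen:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: one String, two binary encodings.
class EncodingDemo {
    // Encode the in-memory (UTF-16) String as UTF-8 bytes.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Encode the same String as ISO-8859-1 bytes.
    static byte[] latin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String s = "caf\u00e9"; // "cafe" with an acute accent on the e
        System.out.println(Arrays.toString(utf8(s)));   // [99, 97, 102, -61, -87]
        System.out.println(Arrays.toString(latin1(s))); // [99, 97, 102, -23]
    }
}
```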

What's the character involved? Are you sure it's even present in ISO Latin 1? If it is, I'd expect that character to be stored by your database without any problem. There's no such thing as a "UTF-8 trademark sign". You could have "the bytes representing the trademark sign UTF-8 encoded" but that would be a byte array, not a string.

EDIT: If you mean the Unicode trademark character U+2122, that's outside the range of ISO-Latin-1. There's the registered trademark character U+00AE, which isn't the same thing (either in appearance or in legal meaning, IIRC) but may be better than nothing - if you want to use that then just use:

String replaced = original.replace('\u2122', '\u00ae');
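If you would rather check up front whether a string fits into ISO Latin 1, something like the following sketch may help. It relies on CharsetEncoder.canEncode from java.nio.charset; the class and method names are only illustrative:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: tests whether every character of a String
// can be represented in ISO-8859-1.
class Latin1Check {
    static boolean fitsLatin1(String s) {
        // A fresh encoder per call, since encoders are not thread-safe.
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        System.out.println(fitsLatin1("\u2122")); // false: U+2122 is not in Latin 1
        System.out.println(fitsLatin1("\u00ae")); // true: U+00AE is in Latin 1
        // Without such a check, getBytes silently substitutes '?' (0x3F).
        System.out.println("\u2122".getBytes(StandardCharsets.ISO_8859_1)[0]); // 63
    }
}
```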

As far as I understand, you are trying to store a string (s1) that contains non-Latin-1 characters in a DB that only supports ISO-8859-1.

  • First, I agree with the others that this is a dirty idea.
    Note that CP1252 is close to ISO-8859-1 (1 byte per character) and includes

  • Now, to answer your question, I think you did the opposite.
    You want to reinterpret the UTF-8 bytes as ISO-8859-1 characters:

     String s2 = new String(s1.getBytes("UTF-8"), "ISO-8859-1"); 

    This way, s2 is a character String that, once encoded in ISO-8859-1, yields a byte array identical to the UTF-8 encoding of s1.

    To retrieve the original string, you would do

     String s1 = new String(s2.getBytes("ISO-8859-1"),"UTF-8"); 
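The two conversions above can be put together as a small round-trip sketch. The names wrap and unwrap are invented here for illustration; the sketch only assumes that Java's ISO-8859-1 charset maps each byte value to the code point with the same number:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: smuggles UTF-8 bytes through a Latin-1-only channel.
class Utf8InLatin1 {
    // Reinterpret the UTF-8 bytes of s1 as ISO-8859-1 characters.
    static String wrap(String s1) {
        return new String(s1.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverse the trick: recover the original String from the wrapped one.
    static String unwrap(String s2) {
        return new String(s2.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u2122";                        // trademark sign
        String wrapped = wrap(original);
        System.out.println(wrapped.length());              // 3: one char per UTF-8 byte
        System.out.println(unwrap(wrapped).equals(original)); // true
    }
}
```

Note that the wrapped string is mojibake to anything that reads the column as real Latin 1 text, which is part of why the idea is described as dirty.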

BUT WAIT! When doing this, you are hoping that any byte can be decoded with ISO-8859-1, that your DB will accept such data, etc.

In fact, this is not guaranteed, because officially ISO-8859-1 does not define characters for every byte value, for instance 0x80 to 0x9F.

Then,

byte[] b = { -97, -100, -128 };
System.out.println( new String(b,"ISO-8859-1") );

may display ???, since those bytes decode to unprintable control characters.

However, in Java, s.getBytes("ISO-8859-1") does restore the initial array.
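A short sketch of that round trip, with invented class and method names, might look like this. It shows that the bytes 0x80 to 0x9F come back unchanged because Java decodes them to the code points U+0080 to U+009F:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo: bytes in the 0x80..0x9F range survive a Latin-1 round trip in Java.
class Latin1RoundTrip {
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] b = { -97, -100, -128 };          // 0x9F, 0x9C, 0x80
        String s = decode(b);
        // Each byte becomes the char with the same numeric value.
        System.out.println((int) s.charAt(0));   // 159 (U+009F)
        System.out.println(Arrays.equals(encode(s), b)); // true
    }
}
```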

  1. Read what Jon Skeet told you. The code you posted is rubbish (it takes the ISO-8859-1 encoded form of your String and interprets it as if it were UTF-8, which accomplishes nothing useful).
  2. The ISO-8859-1 encoding (aka Latin1) doesn't contain the Trademark character "™".

I had a similar problem and solved it by converting the untranslatable characters into entities. If you later display the information as HTML, you are fine anyway.

If not, you could convert them back to Unicode.

Example in Python with the trademark sign:

s = u'yellow bananas\u2122'.encode('latin1', 'xmlcharrefreplace')
# s is 'yellow bananas&#8482;' (a bytes object, b'...', in Python 3)
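Since the question is about Java, here is a rough Java counterpart of the same idea. As far as I know the JDK has no built-in equivalent of xmlcharrefreplace, so this is a hand-rolled sketch with an invented method name, not a library API:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: replaces everything outside Latin 1 with a numeric
// character reference, so the result is safe to encode as ISO-8859-1.
class EntityEscaper {
    static String toLatin1WithEntities(String s) {
        StringBuilder sb = new StringBuilder();
        // Iterate by code point so characters outside the BMP get one reference.
        s.codePoints().forEach(cp -> {
            if (cp <= 0xFF) {
                sb.append((char) cp);                   // representable in ISO-8859-1
            } else {
                sb.append("&#").append(cp).append(';'); // decimal character reference
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String escaped = toLatin1WithEntities("yellow bananas\u2122");
        System.out.println(escaped); // yellow bananas&#8482;
        // The escaped text now encodes to Latin 1 without any '?' substitution.
        byte[] bytes = escaped.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(bytes.length); // 21
    }
}
```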

