Java encoding - corrupted French characters

I have a system that receives French text from a third party, but I am having a hard time making it readable.

String frenchReceipt = "RETIR�E"; // The original Text should be "RETIRÉE"

I tried every combination I could think of to convert the string using the UTF-8 and ISO-8859-1 encodings:

String frenchReceipt = "RETIR�E"; // The original Text should be "RETIRÉE"

byte[] b1 = new String(frenchReceipt.getBytes()).getBytes("UTF-8"); 
System.out.println(new String(b1));  // RETIR�E

byte[] b2 = new String(frenchReceipt.getBytes()).getBytes("ISO-8859-1"); 
System.out.println(new String(b2));  // RETIR�E

byte[] b3 = new String(frenchReceipt.getBytes(), "UTF-8").getBytes(); 
System.out.println(new String(b3));  // RETIR?E 

byte[] b4 = new String(frenchReceipt.getBytes(), "UTF-8").getBytes(); 
System.out.println(new String(b4));  //RETIR?E

byte[] b5 = new String(frenchReceipt.getBytes(), "ISO-8859-1").getBytes("UTF-8"); 
System.out.println(new String(b5));  //RETIR�E

byte[] b6 = new String(frenchReceipt.getBytes(), "UTF-8").getBytes("ISO-8859-1"); 
System.out.println(new String(b6));  //RETIR?E

byte[] b7 = new String(frenchReceipt.getBytes(), "UTF-8").getBytes("UTF-8"); 
System.out.println(new String(b7));  //RETIR�E

byte[] b8 = new String(frenchReceipt.getBytes(), "ISO-8859-1").getBytes("ISO-8859-1"); 
System.out.println(new String(b8));  //RETIR�E

As you can see, nothing fixes the problem.

Please advise.

Update: The third-party partner confirmed that the data is sent to my application in ISO-8859-1 encoding.

� is the Unicode replacement character (U+FFFD, encoded as EF|BF|BD in UTF-8). It is used to indicate that a system was unable to decode or render the correct symbol. Once it appears, the original byte is gone, so you have no chance of converting � back into É.

frenchReceipt doesn't contain any byte sequence that could be converted into É, because it is declared as:

String frenchReceipt = "RETIR�E";
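To illustrate why the damage cannot be undone, here is a minimal sketch. It assumes the partner sends the ISO-8859-1 bytes for "RETIRÉE" and that somewhere along the way they are decoded as UTF-8. The class and method names are hypothetical and exist only for this example.

```java
import java.nio.charset.StandardCharsets;

class LossyDecodeDemo {
    // "RETIRÉE" encoded as ISO-8859-1; the single byte 0xC9 is É.
    static final byte[] WIRE = {0x52, 0x45, 0x54, 0x49, 0x52, (byte) 0xC9, 0x45};

    // Decoding with the wrong charset: 0xC9 followed by 0x45 is not valid UTF-8,
    // so the decoder substitutes U+FFFD for the offending byte.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Decoding with the charset the sender actually used.
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String wrong = decodeAsUtf8(WIRE);
        String right = decodeAsLatin1(WIRE);

        // The sixth character of the wrongly decoded string is the replacement character.
        System.out.println(wrong.charAt(5) == '\uFFFD');
        // The correctly decoded string matches the expected text.
        System.out.println(right.equals("RETIR\u00C9E"));

        // Re-encoding the damaged string cannot restore 0xC9: U+FFFD has no
        // ISO-8859-1 mapping, so getBytes substitutes '?' (0x3F) instead.
        byte[] reencoded = wrong.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(reencoded[5] == 0x3F);
    }
}
```

The '?' substitution in the last step is consistent with the "RETIR?E" outputs in the question: once U+FFFD is in the string, no later combination of getBytes and new String knows which character it replaced.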

Your code snippet below should work fine, but you have to use the correct byte source.

byte[] b2 = new String(frenchReceipt.getBytes()).getBytes("ISO-8859-1");
System.out.println(new String(b2));

So if you read "RETIRÉE" as bytes from a data source and get 52|45|54|49|52|C9|45 (which is what ISO-8859-1 should produce), you'll get the proper result. If the data source already contains the byte sequence EF|BF|BD, the only option you have is search and replace, but in that case there is no way to tell, for example, ä from É, because both have been replaced by the same sequence.
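One way to check which of the two situations applies is to dump the raw bytes before any decoding takes place. The following is only a sketch; ByteInspector and its methods are hypothetical helpers, not part of any library.

```java
class ByteInspector {
    // Formats bytes as upper-case hex pairs separated by '|', e.g. 52|45|54.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i++) {
            if (i > 0) {
                sb.append('|');
            }
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }

    // Returns true if the data contains EF BF BD, the UTF-8 encoding of U+FFFD,
    // which means the original character was already lost upstream.
    static boolean containsUtf8Replacement(byte[] data) {
        for (int i = 0; i + 2 < data.length; i++) {
            if ((data[i] & 0xFF) == 0xEF
                    && (data[i + 1] & 0xFF) == 0xBF
                    && (data[i + 2] & 0xFF) == 0xBD) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Intact ISO-8859-1 bytes for "RETIRÉE".
        byte[] intact = {0x52, 0x45, 0x54, 0x49, 0x52, (byte) 0xC9, 0x45};
        // The same text after É was replaced by U+FFFD and encoded as UTF-8.
        byte[] damaged = {0x52, 0x45, 0x54, 0x49, 0x52,
                (byte) 0xEF, (byte) 0xBF, (byte) 0xBD, 0x45};

        System.out.println(toHex(intact) + " damaged=" + containsUtf8Replacement(intact));
        System.out.println(toHex(damaged) + " damaged=" + containsUtf8Replacement(damaged));
    }
}
```

If the dump shows C9, decoding with ISO-8859-1 is enough. If it shows EF|BF|BD, the fix has to happen earlier, at the point where the bytes are first read.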

Update: Since the data is delivered over TCP, reading the stream with an explicit charset

new BufferedReader(new InputStreamReader(connection.getInputStream(),"ISO-8859-1"))

solved the issue.
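For reference, a self-contained sketch of that fix follows. A ByteArrayInputStream stands in for connection.getInputStream() so the example runs without a network connection; Latin1LineReader and readLines are hypothetical names chosen for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class Latin1LineReader {
    // Reads all lines from the stream, decoding the bytes as ISO-8859-1
    // instead of relying on the platform default charset.
    static List<String> readLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.ISO_8859_1))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the TCP stream: "RETIRÉE\n" encoded as ISO-8859-1.
        byte[] wire = {0x52, 0x45, 0x54, 0x49, 0x52, (byte) 0xC9, 0x45, 0x0A};
        List<String> lines = readLines(new ByteArrayInputStream(wire));
        System.out.println(lines.get(0).equals("RETIR\u00C9E"));
    }
}
```

Using the StandardCharsets.ISO_8859_1 constant rather than the string "ISO-8859-1" avoids the checked UnsupportedEncodingException, but both forms select the same charset.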

