
Recover wrongly encoded character (Java)

We ran some Java code from cron on Linux to persist thousands of records to a production database. The locale charmap on that box was "ANSI_X3.4-1968". We took the following steps before persisting the records to the database:

  1. Use StringEscapeUtils.unescapeHtml4 on the text
  2. Write the String in UTF-8 format and persist it in the database

The problem is that after these steps special characters show up as "?". Is it possible to revert them to the original characters? I have simulated the problem with the following steps.

  1. Change the Eclipse encoding to "ANSI_X3.4-1968"
  2. Write the following lines of code


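String insertSpecial = StringEscapeUtils.unescapeHtml4("&times;");
System.out.println(insertSpecial);
// getBytes() with no argument encodes with the platform default charset (ANSI_X3.4-1968, i.e. US-ASCII, here)
String uni = new String(insertSpecial.getBytes(), "UTF-8"); // This value is currently in DB
System.out.println(uni);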

Now I want to get back "×" from the String "uni". Any help will be appreciated.

Basically no. Your biggest mistake was new String(insertSpecial.getBytes(), "UTF-8"); which again shows that character encoding is surprisingly difficult to handle.

What that piece of code does, step by step:

  1. Give me the bytes from insertSpecial in the platform encoding
  2. Create a new String from the bytes, claiming that the bytes are UTF-8 (even though they were obtained in the platform encoding just before)
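
The two steps can be reproduced on any machine by naming the charset explicitly. This is a minimal sketch, assuming the platform encoding on the cron box was effectively US-ASCII (ANSI_X3.4-1968 is an alias for it); the class name RoundTripDemo and the helper roundTrip are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Hypothetical helper: mimics new String(s.getBytes(), "UTF-8")
    // on a machine whose default charset is US-ASCII.
    static String roundTrip(String s) {
        // Step 1: encode with the (simulated) platform charset.
        // Characters that US-ASCII cannot represent become the byte 0x3F ('?').
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        // Step 2: decode those bytes as if they were UTF-8.
        // ASCII bytes are valid UTF-8, so the '?' survives unchanged.
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "2 \u00D7 3"; // "2 × 3", written with an escape so the source file stays ASCII
        System.out.println(roundTrip(original)); // prints: 2 ? 3
    }
}
```

The damage happens in step 1; step 2 only reinterprets bytes that have already lost the original character.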

I've seen this code several times, and unfortunately it only breaks things. It's completely unnecessary, and it wouldn't "convert" anything even if it were written correctly. If the platform encoding is not UTF-8, it will most likely destroy any special characters (or even the whole String, if the platform encoding differs enough from the one given in the String constructor).
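
For contrast, here is a rough sketch of why no such "conversion" is needed: a Java String is already Unicode, and a charset only matters at the point where text becomes bytes (or bytes become text). When the same charset is named explicitly on both sides, nothing is lost. The class and method names below are hypothetical, and how the bytes actually reach the database depends on the driver and connection settings, which the question does not show.

```java
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
    // Encode text to bytes with an explicitly named charset,
    // instead of relying on the platform default.
    static byte[] encodeUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode with the same charset that produced the bytes.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u00D7"; // the multiplication sign
        byte[] bytes = encodeUtf8(text); // two bytes: 0xC3 0x97
        System.out.println(bytes.length); // prints: 2
        System.out.println(decodeUtf8(bytes).equals(text)); // prints: true
    }
}
```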

The question mark is a placeholder for a character that could not be converted, meaning the original character is gone for good.
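
A small illustration of why the loss cannot be undone, again assuming US-ASCII as the stand-in for the platform encoding (LossDemo and asciiBytes are illustrative names): several different characters, including a genuine question mark, all encode to the same single byte, so the stored value carries no trace of which one it used to be.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LossDemo {
    // Encode with US-ASCII; unmappable characters are replaced by '?' (byte 63).
    static byte[] asciiBytes(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // multiplication sign, e with acute accent, euro sign, and a real question mark
        String[] inputs = {"\u00D7", "\u00E9", "\u20AC", "?"};
        for (String s : inputs) {
            System.out.println(Arrays.toString(asciiBytes(s))); // prints [63] every time
        }
    }
}
```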

Here's some reading so you won't make that mistake again: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)


 