
UTF-8 to ASCII conversion in Java

I have a string that contains UTF-8 encoded text.

String str = "100µF";

My desired output for the above string is "100µF".

I searched Stack Overflow and found the code below:

public static String decompose(String s) {
    return java.text.Normalizer.normalize(s, java.text.Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+","");
}

But the output I got for the above string was "100AµF".

This is an XY problem.

The problem here is that your String was created from bytes using an incorrect charset that assumes one byte is one character, such as ISO 8859-1.

But the bytes are not ASCII and they are not ISO 8859-1. The bytes are a UTF-8 representation of text.

Do not replace any characters. Do not normalize the string. The only correct solution is to revert the incorrectly decoded String back to bytes, then correctly decode the bytes using UTF-8:

byte[] originalBytes = str.getBytes(StandardCharsets.ISO_8859_1);

str = new String(originalBytes, StandardCharsets.UTF_8);
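The round trip can be checked end to end with a small sketch. The class name MojibakeRepair and the method repair are hypothetical, chosen only for illustration, and the example assumes the wrong charset really was ISO 8859-1 (with a different single-byte charset such as windows-1252, some bytes may not survive the round trip). The text is written with Unicode escapes so the result does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class MojibakeRepair {

    // Re-encode the mis-decoded text as ISO 8859-1 to recover the original
    // bytes, then decode those bytes as UTF-8.
    static String repair(String misdecoded) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "100µF": \u00B5 is MICRO SIGN.
        String correct = "100\u00B5F";

        // Simulate the bug: UTF-8 bytes decoded as if they were ISO 8859-1.
        byte[] utf8 = correct.getBytes(StandardCharsets.UTF_8);
        String broken = new String(utf8, StandardCharsets.ISO_8859_1);

        // The broken string contains Â (\u00C2) in front of µ.
        System.out.println(broken.equals("100\u00C2\u00B5F")); // true
        // Reverting to bytes and decoding as UTF-8 restores the text.
        System.out.println(repair(broken).equals(correct));     // true
    }
}
```

In UTF-8, µ is the two bytes 0xC2 0xB5. Read one byte at a time as ISO 8859-1, those become Â and µ, which is where the stray character comes from.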

There is no µ char in ASCII, so you can't write it in ASCII.
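As a quick check of that claim, the sketch below asks the US-ASCII charset whether it can encode the string and shows what a lossy conversion produces. The class name AsciiCheck and both method names are hypothetical; only the standard java.nio.charset API is used.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class AsciiCheck {

    // True only if every character of s can be encoded in US-ASCII.
    static boolean fitsInAscii(String s) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(s);
    }

    // String.getBytes(Charset) substitutes the charset's replacement byte
    // (for US-ASCII, '?') for every character it cannot encode.
    static String toAsciiLossy(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String text = "100\u00B5F"; // "100µF"
        System.out.println(fitsInAscii(text));  // false
        System.out.println(toAsciiLossy(text)); // 100?F
    }
}
```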

Java Strings are sequences of Unicode characters (internally encoded as UTF-16), so the problem you have depends either on how you read this string or on how you write it.

Normally this kind of problem is solved by creating an OutputStreamWriter(OutputStream out, String charsetName) or an InputStreamReader(InputStream in, String charsetName) with the correct character set.

So if, for example, you get your string from a UTF-8 encoded file, you should create a reader like this:

Reader in = new InputStreamReader(new FileInputStream("some_file.txt"), "UTF-8");

Or if you are writing to an ISO-Latin-1 file, you should create the Writer like this:

Writer out = new OutputStreamWriter(new FileOutputStream("some_file.txt"), "ISO-8859-1");
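A fuller sketch of the same idea, assuming a temporary file stands in for some_file.txt. The class name CharsetIo and the methods write and read are hypothetical. It uses the constructor overloads that take a Charset instead of a charset name, which avoids the checked UnsupportedEncodingException, and it shows that reading a UTF-8 file with the wrong charset reproduces the extra Â.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, for illustration only.
final class CharsetIo {

    // Write text to the file using an explicit charset.
    static void write(Path file, String text, Charset cs) throws IOException {
        try (Writer out = new OutputStreamWriter(new FileOutputStream(file.toFile()), cs)) {
            out.write(text);
        }
    }

    // Read the whole file using an explicit charset.
    static String read(Path file, Charset cs) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new InputStreamReader(new FileInputStream(file.toFile()), cs)) {
            char[] buf = new char[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("charset-demo", ".txt");
        try {
            write(file, "100\u00B5F", StandardCharsets.UTF_8); // "100µF"

            // Matching charset: the text comes back intact.
            System.out.println(read(file, StandardCharsets.UTF_8).equals("100\u00B5F")); // true

            // Wrong charset: the two UTF-8 bytes of µ turn into Â followed by µ.
            System.out.println(read(file, StandardCharsets.ISO_8859_1).equals("100\u00C2\u00B5F")); // true
        } finally {
            Files.delete(file);
        }
    }
}
```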

Similar things can happen with an HTTP request or response, depending on how the body of each is interpreted by the application server or the browser. If that is your case, please add some detail to your question.

You are dealing with µ (U+00B5, MICRO SIGN) and Â (U+00C2, LATIN CAPITAL LETTER A WITH CIRCUMFLEX). Both these characters belong to the Latin-1 Supplement Unicode block.
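To confirm which characters a string really contains, its code points can be printed. This is a minimal sketch; the class name CodePointDump and the method dump are hypothetical.

```java
// Hypothetical helper class, for illustration only.
final class CodePointDump {

    // Render each code point of s as U+XXXX, separated by spaces.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // "100µF" written with Unicode escapes.
        System.out.println(dump("100\u00C2\u00B5F"));
        // U+0031 U+0030 U+0030 U+00C2 U+00B5 U+0046
    }
}
```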

If you want to allow µ but disallow Â, you have to do the filtering yourself. There won't be a separate character group ( \\p{} ) for each of the characters.

One way to do it is to define a white-list filter:

String input = "100µF";
String allowedFilter = "[^\\p{ASCII}µ]"; // regular ASCII + µ sign
String output = input.replaceAll(allowedFilter, "");
System.out.println(output); // 100µF

Do note that both µ and Â can be represented in Extended ASCII, so filtering one and not the other is counterintuitive.
