
Convert a Unicode string to ASCII in Java so that it works on Unix/Linux

I have already tried using Normalizer:

String s = "口水雞 hello Ä";

String s1 = Normalizer.normalize(s, Normalizer.Form.NFKD);
String regex = Pattern.quote("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

String s2 = new String(s1.replaceAll(regex, "").getBytes("ascii"), "ascii");

System.out.println(s2);
System.out.println(s.length() == s2.length());

I want it to work on Unix/Linux.

There is an ASCII character class for matching code points in the ASCII set:

String s = "口水雞 hello Ä";

String s1 = Normalizer.normalize(s, Normalizer.Form.NFKD);
String nonAscii = "[^\\p{ASCII}]+";
String s2 = s1.replaceAll(nonAscii, "");

System.out.println(s2);
System.out.println(s.length() == s2.length());

As Joop Eggan notes, Java String and char types are always UTF-16. You can only have ASCII-encoded data in byte form:

byte[] ascii = s2.getBytes(StandardCharsets.US_ASCII);
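
For reference, the snippets above can be put together into one small program. This is only a sketch: the class name AsciiFold and the helper toAscii are made up for illustration, and the input is written with \u escapes so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class AsciiFold {
    // Decompose with NFKD, then drop every code point outside the ASCII range.
    static String toAscii(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        return decomposed.replaceAll("[^\\p{ASCII}]+", "");
    }

    public static void main(String[] args) {
        // "口水雞 hello Ä" written with escapes
        String s = "\u53e3\u6c34\u96de hello \u00c4";
        String folded = toAscii(s);
        // Only the byte form is actually ASCII-encoded; the String stays UTF-16.
        byte[] ascii = folded.getBytes(StandardCharsets.US_ASCII);
        System.out.println("[" + folded + "] " + ascii.length + " bytes");
        // prints "[ hello A] 8 bytes"
    }
}
```

The three CJK characters have no ASCII equivalent and are simply removed, so the result is shorter than the input and the length comparison from the question prints false.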

Explanation

First, text in Java (String/Reader/Writer) is already Unicode. For Java source code (String literals), the editor and the javac compiler should use the same encoding, ideally UTF-8.

The normalizer splits each accented character into a base letter and combining diacritical mark(s), and the regular expression removes those marks. This turns accented text such as ä é ĉ into a e c, and NFKD also expands compatibility ligatures such as ﬁ ﬂ into fi fl. A character with no decomposition, such as œ, is not changed by normalization.

Hence you would get, I think, "??? hello A".

Charset ascii = StandardCharsets.US_ASCII;
String s2 = new String(s1.replaceAll(regex, "").getBytes(ascii), ascii);
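
The decomposition step described above can be checked in isolation. A minimal sketch, assuming only the combining-marks block needs to be removed; the class name StripMarks and the helper stripMarks are hypothetical.

```java
import java.text.Normalizer;

class StripMarks {
    // NFKD splits e.g. U+00E4 into 'a' + U+0308; the regex then deletes the marks.
    static String stripMarks(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // ä é ĉ ﬁ ﬂ written with escapes
        System.out.println(stripMarks("\u00e4 \u00e9 \u0109 \ufb01 \ufb02"));
        // prints "a e c fi fl"
    }
}
```

Note that the character class is passed to replaceAll directly. Wrapping it in Pattern.quote, as in the question, turns it into a literal string to search for, so no marks would be removed.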

To avoid getting question marks (and to tell them apart from a ? that was already in the original string), you can use a CharsetEncoder obtained from Charset.newEncoder(), which can report unmappable characters instead of replacing them.
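
A sketch of that idea follows. Since the direction here is String to bytes, it uses the encoder side of the charset (Charset.newEncoder()); the class name StrictAscii and both helper methods are invented for this example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictAscii {
    // True if every character of the input can be encoded as US-ASCII.
    static boolean isPureAscii(String input) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(input);
    }

    // Encodes to US-ASCII and fails loudly instead of substituting '?'.
    static byte[] encodeStrict(String input) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            ByteBuffer buffer = encoder.encode(CharBuffer.wrap(input));
            byte[] bytes = new byte[buffer.remaining()];
            buffer.get(bytes);
            return bytes;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Not representable in US-ASCII", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isPureAscii("hello?"));       // a genuine '?' is fine
        System.out.println(isPureAscii("hello \u00c4")); // U+00C4 is not ASCII
        System.out.println(encodeStrict("hello?").length);
    }
}
```

With this approach a ? in the output can only come from the input itself, because unmappable characters raise an exception rather than being replaced.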

For ASCII you would still need some transliteration to Latin script.

Answer

Since most recent Linux distributions already use UTF-8 as the operating system default encoding, you can probably simply do:

System.out.println("We are using encoding: " + System.getProperty("file.encoding"));
System.out.println(s);

Here s is converted to the operating system encoding when it is printed.
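
If the default encoding on a given machine turns out not to be UTF-8, one option is to choose the output encoding explicitly instead of relying on the default. The following is only a sketch under that assumption; the class name PrintUtf8 and the helper utf8Writer are hypothetical, and the terminal still has to be able to display UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PrintUtf8 {
    // Wraps a stream in a writer that always encodes as UTF-8, whatever the default is.
    static PrintWriter utf8Writer(OutputStream target) {
        return new PrintWriter(new OutputStreamWriter(target, StandardCharsets.UTF_8), true);
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset());

        // "口水雞 hello Ä" written with escapes; println flushes because autoflush is on.
        PrintWriter out = utf8Writer(System.out);
        out.println("\u53e3\u6c34\u96de hello \u00c4");

        // The same writer on an in-memory stream shows the encoded size:
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintWriter probe = utf8Writer(buffer);
        probe.print("\u00c4");
        probe.flush();
        System.out.println(buffer.size() + " bytes for U+00C4 in UTF-8");
    }
}
```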
