Convert a Unicode string to ASCII in Java in a way that works on Unix/Linux
I have already tried using Normalizer:
String s = "口水雞 hello Ä";
String s1 = Normalizer.normalize(s, Normalizer.Form.NFKD);
String regex = Pattern.quote("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");
String s2 = new String(s1.replaceAll(regex, "").getBytes("ascii"), "ascii");
System.out.println(s2);
System.out.println(s.length() == s2.length());
I want it to work on Unix/Linux.
There is an ASCII character class for matching code points in the ASCII set:
String s = "口水雞 hello Ä";
String s1 = Normalizer.normalize(s, Normalizer.Form.NFKD);
String nonAscii = "[^\\p{ASCII}]+";
String s2 = s1.replaceAll(nonAscii, "");
System.out.println(s2);
System.out.println(s.length() == s2.length());
As Joop Eggan notes, Java String and char types are always UTF-16. You can only have ASCII-encoded data in byte form:
byte[] ascii = s2.getBytes(StandardCharsets.US_ASCII);
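Put together, the steps above can be sketched as a small self-contained class. The class name AsciiFold and the method name toAscii are made up for illustration; the string literal uses Unicode escapes so that the result does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

final class AsciiFold {
    /** Decomposes the text (NFKD), then deletes every code point outside ASCII. */
    static String toAscii(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFKD);
        return decomposed.replaceAll("[^\\p{ASCII}]+", "");
    }

    public static void main(String[] args) {
        // Same text as in the question, written with escapes instead of literals.
        String s = "\u53e3\u6c34\u96de hello \u00c4";
        String folded = toAscii(s);
        // ASCII-encoded data only exists in byte form.
        byte[] ascii = folded.getBytes(StandardCharsets.US_ASCII);
        System.out.println("[" + folded + "] " + ascii.length + " bytes");
    }
}
```

Note that the CJK characters are simply dropped, so the result is shorter than the input and the length comparison from the question prints false.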
Explanation
First, text in Java (String/Reader/Writer) is already Unicode. For Java source code (string literals), the editor and the javac compiler should use the same encoding, ideally UTF-8.
The normalizer splits each character into a base letter and combining diacritical mark(s), and the regular expression removes those marks, converting text with accents like ä é fi fl ĉ œ to ae fi fl c oe, that is, to ASCII.
Hence you would get, I think, "??? hello A".
Charset ascii = StandardCharsets.US_ASCII;
String s2 = new String(s1.replaceAll(regex, "").getBytes(ascii), ascii);
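A runnable sketch of this round trip is shown below. It uses a simplified pattern (only the combining diacritical marks block) and hypothetical names (MarkStripping, roundTrip). It also illustrates one likely reason the code in the question misbehaves: Pattern.quote turns the character class into a literal string, so nothing is removed and the combining mark itself ends up as an extra question mark.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.regex.Pattern;

final class MarkStripping {
    // Simplified pattern for this sketch: combining diacritical marks only.
    static final String MARKS = "[\\p{InCombiningDiacriticalMarks}]+";

    /** Normalizes, removes matches of regex, then round-trips through ASCII bytes. */
    static String roundTrip(String text, String regex) {
        Charset ascii = StandardCharsets.US_ASCII;
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFKD);
        // getBytes replaces every unmappable character with '?'.
        return new String(decomposed.replaceAll(regex, "").getBytes(ascii), ascii);
    }

    public static void main(String[] args) {
        String s = "\u53e3\u6c34\u96de hello \u00c4";
        System.out.println(roundTrip(s, MARKS));                // marks removed
        System.out.println(roundTrip(s, Pattern.quote(MARKS))); // pattern taken literally
    }
}
```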
To avoid receiving question marks (which cannot be distinguished from a ? in the original string), you can use a CharsetEncoder obtained from Charset.newEncoder(), since the substitution happens while encoding.
For ASCII you would still need some transliteration to Latin script.
Answer
As most newer Linux distributions already use UTF-8 as the operating system default encoding, you can probably simply do:
System.out.println("We are using encoding: " + System.getProperty("file.encoding"));
System.out.println(s);
Here s is converted to the operating system encoding.
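To see what that conversion does to the bytes, the following sketch prints the same string through a PrintStream with an explicitly chosen encoding and captures the result in memory. The helper name printAs is made up for illustration; System.out itself uses the platform default rather than an explicit charset name.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

final class ConsoleEncoding {
    /** Returns the bytes a PrintStream with the given encoding writes for the text. */
    static byte[] printAs(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, charsetName)) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String s = "\u00c4";
        System.out.println("Default encoding: " + System.getProperty("file.encoding"));
        System.out.println("UTF-8 bytes:    " + printAs(s, "UTF-8").length);
        System.out.println("US-ASCII bytes: " + printAs(s, "US-ASCII").length);
    }
}
```

On a UTF-8 system the accented letter survives as two bytes, whereas an ASCII output stream would replace it with a single question mark.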