Converting a Unicode string to ASCII in Java so that it works on Unix/Linux

Question

I have already tried using Normalizer:

String s = "口水雞 hello Ä";

String s1 = Normalizer.normalize(s, Normalizer.Form.NFKD);
String regex = Pattern.quote("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

String s2 = new String(s1.replaceAll(regex, "").getBytes("ascii"), "ascii");

System.out.println(s2);
System.out.println(s.length() == s2.length());

I want it to work on Unix/Linux.

Answer 1

There is an ASCII character class for matching code points in the ASCII set:

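// Requires: import java.text.Normalizer;
String s = "口水雞 hello Ä";

// NFKD splits "Ä" into "A" plus a combining diaeresis (U+0308).
String s1 = Normalizer.normalize(s, Normalizer.Form.NFKD);
// Matches any run of characters outside the ASCII range.
String nonAscii = "[^\\p{ASCII}]+";
String s2 = s1.replaceAll(nonAscii, "");

System.out.println(s2);
System.out.println(s.length() == s2.length());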

As Joop Eggan notes, Java String and char types are always UTF-16. You can only have ASCII-encoded data in byte form:

byte[] ascii = s2.getBytes(StandardCharsets.US_ASCII);
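
As a rough, self-contained sketch of the steps above (the class name AsciiStripDemo and the helper toAscii are made up for illustration and are not part of the original answer), the whole pipeline could look like this. The sample string is written with Unicode escapes so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

final class AsciiStripDemo {

    // Hypothetical helper: NFKD-decompose the input, then drop every
    // character that is outside the ASCII range.
    static String toAscii(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        return decomposed.replaceAll("[^\\p{ASCII}]+", "");
    }

    public static void main(String[] args) {
        // The question's sample string, written with Unicode escapes.
        String s = "\u53e3\u6c34\u96de hello \u00c4";
        String s2 = toAscii(s);

        // Only ASCII characters remain, so this encoding is lossless
        // and yields one byte per char.
        byte[] ascii = s2.getBytes(StandardCharsets.US_ASCII);

        System.out.println("[" + s2 + "] -> " + ascii.length + " bytes");
    }
}
```

Note that the CJK characters are simply dropped here, not transliterated, so the result is shorter than the input.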

Answer 2

Explanation

First, in Java, text (String/Reader/Writer) is already Unicode. For Java source code (String literals), the editor and the javac compiler should use the same encoding, ideally UTF-8.

The normalizer splits each character into a base letter plus combining diacritical mark(s), and the regular expression removes those marks. This converts text with accents or ligatures, like ä é ﬁ ﬂ ĉ, into the ASCII a e fi fl c. A character that has no Unicode decomposition, such as œ, is left unchanged by this step and needs separate handling.
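
To make the decompose-then-strip step concrete, here is a minimal sketch. The class name MarkStripDemo and the helper stripMarks are hypothetical, and only the combining-diacritical-marks block is removed, which is a simplification of the question's regular expression.

```java
import java.text.Normalizer;

final class MarkStripDemo {

    // Hypothetical helper: NFKD-decompose the input, then delete the
    // combining diacritical marks (block U+0300..U+036F) left behind.
    static String stripMarks(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // a-umlaut, e-acute, fi ligature, fl ligature, c-circumflex,
        // written as escapes so the source encoding does not matter.
        String sample = "\u00e4 \u00e9 \ufb01 \ufb02 \u0109";
        System.out.println(stripMarks(sample));
    }
}
```

The ligatures expand because NFKD applies compatibility decompositions; the accented letters lose their marks because the marks become separate code points that the regular expression can match.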

Hence you would get, I think, "??? hello A".

Charset ascii = StandardCharsets.US_ASCII;
String s2 = new String(s1.replaceAll(regex, "").getBytes(ascii), ascii);

To prevent receiving the question marks (and to tell them apart from a literal ? in the original string), you can use a CharsetEncoder obtained from Charset.newEncoder(), which can report unmappable characters instead of replacing them.
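
A possible way to detect unmappable characters instead of getting a silent "?" is sketched below. It uses a CharsetEncoder from Charset.newEncoder(), since the question marks are produced when the string is encoded to bytes. The class name StrictAsciiDemo and both helper methods are illustrative assumptions, not code from the original answer.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictAsciiDemo {

    // Hypothetical helper: encode to ASCII and fail loudly on any
    // character that ASCII cannot represent, rather than writing '?'.
    static byte[] encodeStrict(String input) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            ByteBuffer buffer = encoder.encode(CharBuffer.wrap(input));
            byte[] bytes = new byte[buffer.remaining()];
            buffer.get(bytes);
            return bytes;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Not representable in ASCII: " + input, e);
        }
    }

    // Hypothetical helper: true when every character fits in ASCII.
    static boolean isPureAscii(String input) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(input);
    }

    public static void main(String[] args) {
        System.out.println(isPureAscii("what?"));   // a literal '?' is fine
        System.out.println(isPureAscii("\u00c4"));  // A-umlaut is not ASCII
        System.out.println(encodeStrict("hello").length);
    }
}
```

With this approach a "?" in the output can only come from the original string, because anything unmappable raises an exception before bytes are produced.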

For ASCII you would still need some transliteration to Latin script.

Answer

As most recent Linux distributions already use UTF-8 as the operating system default, you can probably simply do:

System.out.println("We are using encoding: " + System.getProperty("file.encoding"));
System.out.println(s);

Here s is converted to the operating system encoding.
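
To see how the output encoding affects what is actually written, the following sketch prints the same text through PrintStream instances with explicitly chosen charsets and inspects the resulting bytes. The class name PrintEncodingDemo and the helper printedBytes are assumptions made for illustration; the original answer relies on the platform default only.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

final class PrintEncodingDemo {

    // Hypothetical helper: print the text through a PrintStream that uses
    // the named charset, and return the bytes that were written.
    static byte[] printedBytes(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, charsetName)) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String s = "\u00c4"; // A-umlaut, written as an escape

        System.out.println("Default encoding: " + System.getProperty("file.encoding"));
        // UTF-8 can represent the character; US-ASCII substitutes '?'.
        System.out.println("UTF-8 bytes: " + printedBytes(s, "UTF-8").length);
        System.out.println("US-ASCII bytes: " + printedBytes(s, "US-ASCII").length);
    }
}
```

If the terminal and the stream agree on UTF-8, the original characters can be printed directly and no conversion to ASCII is needed.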

Answer 1
1 2014-06-26 08:26:53

Answer 2
0 2014-06-26 07:24:14
