Convert Unicode to ASCII without changing the string length (in Java)

What is the best way to convert a string from Unicode to ASCII without changing its length (that is very important in my case)? Also, the characters that convert without problems must be at the same positions as in the original string. So an "Ä" must be converted to "A" and not to something cryptic that has more characters.

Edit:
@novalis - Such symbols (for example, those of Asian languages) should just be converted to some placeholders. I am not too interested in those words or what they mean.

@MtnViewMark - I must preserve the total number of characters and the positions of the ASCII-representable characters under all circumstances.

Here is some more info: I have some text mining tools that can only process ASCII strings. Most of the text that should be processed is in English, but some of it contains non-ASCII characters. I am not interested in those words, but I must be sure that the words I am interested in (those that only contain ASCII characters) are at the same positions after the string conversion.

As stated in this answer, the following code should work:

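    // Requires: import java.text.Normalizer; getBytes(String) may throw UnsupportedEncodingException.
    String s = "口水雞 hello Ä";

    // Decompose so that "Ä" becomes "A" followed by a combining diaeresis.
    String s1 = Normalizer.normalize(s, Normalizer.Form.NFKD);
    // Matches combining diacritical marks, modifier letters (Lm) and modifier symbols (Sk).
    String regex = "[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+";

    // Strip the marks, then round-trip through ASCII; unmappable characters become '?'.
    String s2 = new String(s1.replaceAll(regex, "").getBytes("ascii"), "ascii");

    System.out.println(s2);
    System.out.println(s.length() == s2.length());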

Output is:

??? hello A
true

So you first remove diacritical marks, then convert to ASCII. Non-ASCII characters will become question marks.
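One thing to watch with the snippet above: NFKD can expand a single compatibility character into several (the ligature U+FB01 becomes "f" followed by "i"), and the ASCII encoder may emit a single '?' for a surrogate pair, so the length check is not guaranteed to hold for every input. A minimal sketch of a per-character variant follows. The class name `AsciiFolder`, the method name `toAsciiSameLength` and the '?' placeholder are assumptions made for illustration, not part of the original answer. It uses NFD on one char at a time and always emits exactly one ASCII char per input char.

```java
import java.text.Normalizer;

final class AsciiFolder {
    // Hypothetical placeholder for characters without an ASCII base letter.
    private static final char PLACEHOLDER = '?';

    /** Returns a string with exactly one ASCII char per UTF-16 char of the input. */
    static String toAsciiSameLength(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < 128) {
                out.append(c); // already ASCII: keep it at the same position
            } else if (Character.isHighSurrogate(c) || Character.isLowSurrogate(c)) {
                out.append(PLACEHOLDER); // one placeholder per surrogate half
            } else {
                // Canonical decomposition of this single char, e.g. A-diaeresis -> 'A' + U+0308.
                String decomposed = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
                char base = decomposed.charAt(0);
                out.append(base < 128 ? base : PLACEHOLDER);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Three CJK characters, " hello ", then A with diaeresis (the sample from the question).
        String s = "\u53E3\u6C34\u96DE hello \u00C4";
        String folded = toAsciiSameLength(s);
        System.out.println(folded);
        System.out.println(s.length() == folded.length());
    }
}
```

Because each input char produces exactly one output char, the positions of the ASCII words cannot shift, whatever the input contains. A combining mark that already stands alone in the input becomes a placeholder rather than being dropped.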

Use java.text.Normalizer.normalize() with Normalizer.Form.NFD, then filter out the non-ASCII characters.
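Read literally, that suggestion could look like the sketch below. The class name `NfdFilter` and the method name `stripNonAscii` are made up for illustration, and the regex `[^\p{ASCII}]` is one possible way to do the filtering, not necessarily what the answerer had in mind.

```java
import java.text.Normalizer;

final class NfdFilter {
    /** NFD-normalizes the input, then deletes every char outside the ASCII range. */
    static String stripNonAscii(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // Three CJK characters, " hello ", then A with diaeresis.
        String s = "\u53E3\u6C34\u96DE hello \u00C4";
        String stripped = stripNonAscii(s);
        System.out.println("[" + stripped + "]");
        System.out.println(s.length() + " -> " + stripped.length());
    }
}
```

Note that deleting characters shortens the string: for the sample above the three CJK characters disappear, so 11 chars become 8. Under the length requirement from the question, the non-ASCII characters would have to be replaced by a placeholder instead of being removed.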

As Paul Taylor mentioned, there is an issue with using Normalizer if you need the project to compile and run both on pre-1.6 Java and on Java 1.6 and higher. You will run into trouble because Normalizer lives in different packages (java.text.Normalizer in 1.6, sun.text.Normalizer before 1.6) and has a different method signature.

Usually it is recommended to use reflection to invoke the appropriate Normalizer.normalize() method. (An example can be found here.)
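A rough sketch of the reflection idea, limited to the java.text.Normalizer side (Java 1.6 and later). The class name `ReflectiveNormalizer` and the method name `normalizeNfd` are hypothetical. The pre-1.6 sun.text.Normalizer branch is deliberately left out, because its method signature is not given in this thread.

```java
import java.lang.reflect.Method;

final class ReflectiveNormalizer {
    /** Calls java.text.Normalizer.normalize(text, Form.NFD) without a compile-time reference. */
    static String normalizeNfd(String text) {
        try {
            Class<?> normalizer = Class.forName("java.text.Normalizer");
            Class<?> form = Class.forName("java.text.Normalizer$Form");
            Object nfd = form.getField("NFD").get(null); // the enum constant Form.NFD
            Method normalize = normalizer.getMethod("normalize", CharSequence.class, form);
            return (String) normalize.invoke(null, text, nfd);
        } catch (ClassNotFoundException e) {
            // A pre-1.6 runtime would land here; the sun.text.Normalizer lookup
            // would go in this branch (not shown, signature not given in the thread).
            throw new UnsupportedOperationException("java.text.Normalizer is not available");
        } catch (Exception e) {
            throw new IllegalStateException("Normalizer call failed", e);
        }
    }

    public static void main(String[] args) {
        // A with diaeresis decomposes into 'A' plus a combining diaeresis: 2 chars.
        System.out.println(normalizeNfd("\u00C4").length());
    }
}
```

Since the class is looked up by name at run time, the code compiles even where java.text.Normalizer does not exist; the cost is the extra lookup code and the loss of compile-time checking.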
But if you don't want to put reflection mess in your code, you can use the icu4j library. It contains the com.ibm.icu.text.Normalizer class with a normalize() method that does the same job as java.text.Normalizer/sun.text.Normalizer. The ICU library has (or should have) its own implementation of Normalizer, so you can ship it with your project and the result should be independent of the Java version.
The disadvantage is that the ICU library is quite big.

If you are using the Normalizer class just for removing accents/diacritics from Strings, there is also another way. You can use the Apache Commons Lang library (version 3), whose StringUtils class has a stripAccents() method:

String noAccentsString = org.apache.commons.lang3.StringUtils.stripAccents(s);

The Lang3 library probably uses reflection to invoke the appropriate Normalizer according to the Java version. So the advantage is that you don't have reflection mess in your code.

Caveat: I don't know Java. Just a bit about character sets.

You are not stating exactly which character set you are using.

But no matter which you use, it's impossible to convert a Unicode string to ASCII and retain the original length and character positions, simply because a Unicode character set will use multiple bytes for some characters (obviously).

The only exception I know of would be a UTF-8 string that contains only ASCII characters: this string will already be identical in both UTF-8 and ASCII, because UTF-8 uses multibyte characters only when necessary. (I don't know about the other Unicode flavours; there may be other dynamic ones.)
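For the Java side of this point, a small sketch may help separate the two notions of length. In Java, String.length() counts UTF-16 chars, not encoded bytes, so the multi-byte concern applies to the encoded form of the string. The class name `LengthDemo` and its helper methods are invented for this illustration.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class LengthDemo {
    private static final Charset UTF8 = Charset.forName("UTF-8");
    private static final Charset ASCII = Charset.forName("US-ASCII");

    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8ByteCount(String s) {
        return s.getBytes(UTF8).length;
    }

    /** True if the UTF-8 and US-ASCII encodings of the string are byte-for-byte equal. */
    static boolean sameInAsciiAndUtf8(String s) {
        return Arrays.equals(s.getBytes(ASCII), s.getBytes(UTF8));
    }

    public static void main(String[] args) {
        String city = "G\u00F6teborg"; // "Goteborg" with o-diaeresis as the second letter
        System.out.println(city.length());             // 8 chars
        System.out.println(utf8ByteCount(city));       // 9 bytes: o-diaeresis takes two
        System.out.println(sameInAsciiAndUtf8("hello")); // true: pure ASCII text
        System.out.println(sameInAsciiAndUtf8(city));    // false
    }
}
```

So a char-for-char replacement on a Java String can keep length() stable even though the UTF-8 byte count of the original and the ASCII byte count of the result differ.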

The only workaround I can see is adding a space after any special character that was replaced by an ASCII one, but that will screw up the string (Göteborg in UTF-8 would have to become Go teborg to keep the length).

Maybe you want to elaborate on what you want or need to achieve, so people here can suggest workarounds.

One issue with Normalizer is that before Java 1.6 it is in the sun.text package, whereas in 1.6 it is in the java.text package, and its method signature has changed. So if your application needs to run on both platforms, you'll have to use reflection.

An alternative custom solution is described as technique 3 here.
