How to remove surrogate characters in Java?

I am facing a situation where I get surrogate characters in text that I am saving to MySQL 5.1. As UTF-16 is not supported there, I want to remove these surrogate pairs manually with a Java method before saving the text to the database.

I have written the following method for now, and I am curious to know if there is a more direct and optimal way to handle this.

Thanks in advance for your help.

public static String removeSurrogates(String query) {
    StringBuffer sb = new StringBuffer();
    for (int i = 0; i < query.length() - 1; i++) {
        char firstChar = query.charAt(i);
        char nextChar = query.charAt(i+1);
        if (Character.isSurrogatePair(firstChar, nextChar) == false) {
            sb.append(firstChar);
        } else {
            i++;
        }
    }
    if (Character.isHighSurrogate(query.charAt(query.length() - 1)) == false
            && Character.isLowSurrogate(query.charAt(query.length() - 1)) == false) {
        sb.append(query.charAt(query.length() - 1));
    }

    return sb.toString();
}

Here are a couple of things:

  • Character.isSurrogate(char c):

    A char value is a surrogate code unit if and only if it is either a low-surrogate code unit or a high-surrogate code unit.

  • Checking for pairs seems pointless; why not just remove all surrogates?

  • x == false is equivalent to !x

  • StringBuilder is better in cases where you don't need synchronization (like a variable that never leaves local scope).

I suggest this:

public static String removeSurrogates(String query) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < query.length(); i++) {
        char c = query.charAt(i);
        // On Java 7+, this condition can be written as !Character.isSurrogate(c)
        if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
            sb.append(c);
        }
    }
    return sb.toString();
}
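
As the comment in the method above hints, on Java 7 or later the two checks can be collapsed into Character.isSurrogate. Here is a minimal sketch of that variant with a sample input. The class name SurrogateStripper and the test string are made up for illustration; only the Character and StringBuilder calls come from the JDK.

```java
// Hypothetical demo class, not part of the original answer.
class SurrogateStripper {

    // Drops every surrogate code unit, paired or not (requires Java 7+).
    static String removeSurrogates(String query) {
        StringBuilder sb = new StringBuilder(query.length());
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (!Character.isSurrogate(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F600 is stored as the surrogate pair D83D DE00.
        String input = "a\uD83D\uDE00b";
        System.out.println(removeSurrogates(input)); // prints "ab"
    }
}
```

Because each char is tested on its own, this also drops a stray high or low surrogate that has no partner.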

Breaking down the if statement

You asked about this statement:

if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
    sb.append(c);
}

One way to understand it is to break each operation into its own function, so you can see that the combination does what you'd expect:

static boolean isSurrogate(char c) {
    return Character.isHighSurrogate(c) || Character.isLowSurrogate(c);
}

static boolean isNotSurrogate(char c) {
    return !isSurrogate(c);
}

...

if (isNotSurrogate(c)) {
    sb.append(c);
}

Java strings are stored as sequences of 16-bit chars, but what they represent is sequences of Unicode characters. In Unicode terminology, they are stored as code units, but model code points. Thus, it's somewhat meaningless to talk about removing surrogates, which don't exist in the character / code point representation (unless you have rogue single surrogates, in which case you have other problems).

Rather, what you want to do is to remove any characters which will require surrogates when encoded. That means any character which lies beyond the Basic Multilingual Plane. You can do that with a simple regular expression:

return query.replaceAll("[^\u0000-\uffff]", "");
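
To make the code unit / code point distinction concrete, here is a small sketch that wraps the regular expression above in a method and puts an explicit code-point loop next to it for comparison. The class name BmpFilter and both method names are invented for this example; the loop is my own illustration of the same idea, not something from the original answer.

```java
// Hypothetical demo class; method names are illustrative only.
class BmpFilter {

    // The regular expression from the answer: drop everything outside the BMP.
    static String removeWithRegex(String query) {
        return query.replaceAll("[^\u0000-\uffff]", "");
    }

    // The same idea spelled out: walk the string one code point at a time.
    static String removeWithLoop(String query) {
        StringBuilder sb = new StringBuilder(query.length());
        for (int i = 0; i < query.length(); ) {
            int cp = query.codePointAt(i);
            if (!Character.isSupplementaryCodePoint(cp)) {
                sb.appendCodePoint(cp);
            }
            // Advance by 1 char for BMP code points, 2 for supplementary ones.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        System.out.println(s.length());                      // 4 code units
        System.out.println(s.codePointCount(0, s.length())); // 3 code points
        System.out.println(removeWithLoop(s));               // ab
    }
}
```

Note that both variants leave an unpaired surrogate in place, since on its own it is read as a code point inside the BMP range.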

Why not simply:

for (int i = 0; i < query.length(); i++) {
    char c = query.charAt(i);
    if (!Character.isHighSurrogate(c) && !Character.isLowSurrogate(c)) {
        sb.append(c);
    }
}

You should probably replace them with "?" instead of erasing them outright.
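
A rough sketch of that replacement idea, assuming you want one placeholder per character that cannot be stored (so a surrogate pair becomes a single "?"). The class and method names are hypothetical, and treating a stray unpaired surrogate the same way is my own assumption.

```java
// Hypothetical demo class; names and the lone-surrogate policy are assumptions.
class SupplementaryReplacer {

    // Replaces each supplementary code point (and each unpaired surrogate)
    // with a single placeholder character.
    static String replaceSupplementary(String text, char placeholder) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean loneSurrogate =
                    cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE;
            if (Character.isSupplementaryCodePoint(cp) || loneSurrogate) {
                sb.append(placeholder);
            } else {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceSupplementary("a\uD83D\uDE00b", '?')); // a?b
    }
}
```

Keeping a placeholder preserves the fact that something was there, which can matter if the text is shown to users again later.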

Just curious: if a char is a high surrogate, is there a need to check the next one? It is supposed to be a low surrogate. The modified version would be:

public static String removeSurrogates(String query) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < query.length(); i++) {
        char ch = query.charAt(i);
        if (Character.isHighSurrogate(ch))
            i++; // skip the next char, as it's supposed to be a low surrogate
        else
            sb.append(ch);
    }    
    return sb.toString();
}
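
One caveat with skipping blindly: if the text is malformed and a high surrogate is followed by an ordinary character, that character is lost too. Here is a sketch of a guarded variant that only skips the next char when it really is a low surrogate. The class name PairSkipper is made up, and dropping stray low surrogates as well is my own assumption rather than part of the version above.

```java
// Hypothetical demo class; the guard and the lone-low-surrogate handling
// are assumptions added for illustration.
class PairSkipper {

    static String removeSurrogates(String query) {
        StringBuilder sb = new StringBuilder(query.length());
        for (int i = 0; i < query.length(); i++) {
            char ch = query.charAt(i);
            if (Character.isHighSurrogate(ch)) {
                // Skip the partner only if it really is a low surrogate.
                if (i + 1 < query.length()
                        && Character.isLowSurrogate(query.charAt(i + 1))) {
                    i++;
                }
            } else if (!Character.isLowSurrogate(ch)) {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A high surrogate followed by an ordinary letter: the letter survives.
        System.out.println(removeSurrogates("x\uD83Dy")); // xy
    }
}
```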

If you want to remove them, all of these solutions are useful, but if you want to replace them, the following is better:

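// Method name is illustrative; the original snippet showed only the body.
public static String replaceSurrogates(String s) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (Character.isHighSurrogate(c)) {
            // one '*' stands in for the character that starts here
            sb.append('*');
        } else if (!Character.isLowSurrogate(c)) {
            sb.append(c);
        }
    }
    return sb.toString();
}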
