
Escaping unicode surrogate characters?

I have the following line of text (it also appears in the code below):

TEXT

What I'm trying to do is escape that emoticon (a phone icon) as two \u escapes and then unescape it back to the original phone icon. The first method below works fine, but I essentially want to escape by range so that I can escape any characters like this one. I don't see how that is possible with the first method.

How can I achieve this range-based escaping with UnicodeEscaper so that it gives the same output as StringEscapeUtils (i.e. escape to two \uXXXX sequences, then unescape back to the phone icon)?

import org.apache.commons.lang3.text.translate.UnicodeEscaper;
import org.apache.commons.lang3.text.translate.UnicodeUnescaper;

    String text = "Unicode surrogate here-> 📱<--here";
    // Escape the entire string. Not what I want, because there could be
    // \n, \r or other escape chars that I want left intact (I just want a range).
    String text2 = org.apache.commons.lang.StringEscapeUtils.escapeJava(text);
    System.out.println(text2);   // "Unicode surrogate here-> \uD83D\uDCF1<--here"
    // Unescape the escaped string (text2, not text) back to the phone emoticon.
    text2 = org.apache.commons.lang.StringEscapeUtils.unescapeJava(text2);
    System.out.println(text2); // "Unicode surrogate here-> 📱<--here"

    // How do I do the same as above, but only for a range of chars
    // (i.e. any Unicode surrogate) instead of escaping the entire string?
    text2 = UnicodeEscaper.between(0x10000, 0x10FFFF).translate(text);
    System.out.println(text2); // "Unicode surrogate here-> \u1F4F1<--here"
    // Unescape (I need the phone emoticon here).
    text2 = new UnicodeUnescaper().translate(text2);
    System.out.println(text2); // "Unicode surrogate here-> ὏1<--here"

A late answer, but I've found that you need the

org.apache.commons.lang3.text.translate.JavaUnicodeEscaper

class instead of UnicodeEscaper.

With it, the code prints:

Unicode surrogate here-> \uD83D\uDCF1<--here

And the unescaping works as expected.

Your string:

"Unicode surrogate here-> \u1F4F1<--here"

does not do what you think it does.

A char is basically a UTF-16 code unit, and therefore 16 bits wide. A \u escape takes exactly four hex digits, so what you have here is \u1F4F followed by the character 1, and that explains your output.
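To make this concrete, here is a small sketch using only the JDK. The class name FourDigitEscapeDemo and the helper decodeFirstEscape are made up for illustration. The helper imitates a decoder that reads exactly four hex digits after the backslash-u prefix, which is the behaviour described above; it is not the Commons Lang implementation.

```java
final class FourDigitEscapeDemo {

    // Hypothetical helper: decodes the first backslash-u escape in s by
    // consuming exactly four hex digits, the way a 16-bit reader does.
    // Assumes the four characters after the prefix are valid hex digits.
    static String decodeFirstEscape(String s) {
        int i = s.indexOf("\\u");
        if (i < 0 || i + 6 > s.length()) {
            return s; // no complete escape found, return the input unchanged
        }
        char unit = (char) Integer.parseInt(s.substring(i + 2, i + 6), 16);
        return s.substring(0, i) + unit + s.substring(i + 6);
    }

    public static void main(String[] args) {
        // Five hex digits follow the prefix, but only four are consumed.
        String decoded = decodeFirstEscape("\\u1F4F1");

        // The result is two chars: the code unit 0x1F4F and the leftover '1'.
        System.out.println(decoded.length());
        System.out.println(Integer.toHexString(decoded.charAt(0)) + " then " + decoded.charAt(1));

        // The intended code point U+1F4F1 needs two UTF-16 code units.
        System.out.println(Character.charCount(0x1F4F1));
    }
}
```

The five-digit form cannot round-trip through a four-digit decoder, so a code point above U+FFFF has to be written as two escapes, one per surrogate.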

I don't know what you mean by "escape" here, but if it means replacing a surrogate pair with two "\uXXXX" sequences, then have a look at Character.toChars(). It returns the char sequence necessary to represent one Unicode code point, whether it is in the BMP (one char) or not (two chars).

For code point U+1F4F1, it returns a two-element char array containing 0xd83d and 0xdcf1, in that order. And this is what you want.
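As a rough sketch of how Character.toChars() could drive a range-based escape with no library at all: the class SurrogateEscapeDemo and the helpers escapeRange and unescape are hypothetical names introduced here, not part of Commons Lang or the JDK. The unescape helper is deliberately simplified and treats every backslash-u followed by four hex digits as an escape, so it assumes the input contains no escaped backslashes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SurrogateEscapeDemo {

    // Hypothetical helper: writes every code point in [low, high] as one
    // backslash-u escape per UTF-16 code unit; all other text is copied as-is.
    static String escapeRange(String s, int low, int high) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp >= low && cp <= high) {
                // toChars yields one char in the BMP, two (a surrogate pair) above it.
                for (char unit : Character.toChars(cp)) {
                    out.append(String.format("\\u%04X", (int) unit));
                }
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    // Hypothetical helper: turns each backslash-u plus four hex digits back
    // into a single char. Adjacent surrogate escapes recombine automatically.
    static String unescape(String s) {
        Matcher m = Pattern.compile("\\\\u([0-9A-Fa-f]{4})").matcher(s);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(s, last, m.start());
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(s, last, s.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String phone = new String(Character.toChars(0x1F4F1));
        String text = "Unicode surrogate here-> " + phone + "<--here";

        String escaped = escapeRange(text, 0x10000, 0x10FFFF);
        System.out.println(escaped);

        // Round trip: true if unescaping restores the original string.
        System.out.println(unescape(escaped).equals(text));
    }
}
```

Because only code points inside the requested range are rewritten, characters such as a literal newline or carriage return pass through untouched, which is the behaviour the question asks for.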
