
Transform a UTF-8 string to UCS-2, replacing invalid characters, in Java

I have a string in UTF-8:

"Red🌹🌹Röses"

I need it converted to valid UCS-2 (or fixed-size UTF-16BE without a BOM; they are the same thing) encoding, so the output will be "Red Röses", as "🌹" is out of range for UCS-2.

What I have tried:

@Test
public void testEncodeProblem() throws CharacterCodingException {
    String in = "Red\uD83C\uDF39\uD83C\uDF39Röses";
    ByteBuffer input = ByteBuffer.wrap(in.getBytes());

    CharsetDecoder utf8Decoder = StandardCharsets.UTF_16BE.newDecoder();
    utf8Decoder.onMalformedInput(CodingErrorAction.REPLACE);
    utf8Decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
    utf8Decoder.replaceWith(" ");

    CharBuffer decoded = utf8Decoder.decode(input);

    System.out.println(decoded.toString()); //  剥擰龌맰龌륒쎶獥 
}

Nope.

@Test
public void testEncodeProblem() {
    String in = "Red\uD83C\uDF39\uD83C\uDF39Röses";
    byte[] bytes = in.getBytes(StandardCharsets.UTF_16BE);
    String res = new String(bytes);
    System.out.println(res); //  Red�<�9�<�9Röses
}

Nope.

Note that "ö" is a valid UCS-2 symbol.

Any ideas/libraries?

Unfortunately, neither snippet actually works, and that's because you misunderstand UTF-16 encoding. UTF-16 CAN encode those emojis; it is NOT fixed width. There is no such thing as 'fixed width UTF-16 encoding'. There's.. UCS2. Which is not UTF-16. The BE part doesn't make it 'fixed width', it merely locks in the endianness. That is why, even once corrected, both of these print the roses. Java unfortunately doesn't ship with a UCS2 encoding system, which makes this job harder, and uglier.

Furthermore, both snippets fail because you are calling forbidden methods.

Anytime you convert bytes to characters or vice versa, character conversion IS happening. You can't opt out of that. A bunch of methods nevertheless exist which do not take any parameter to indicate which charset encoding you'd like to use. These are the forbidden methods: they default to 'system default', and look as if somebody waved a magic wand so that we can convert chars to bytes or vice versa without worrying about character encoding.

The solution is to never use the forbidden methods. Better yet, tell your IDE to flag them as errors. The only exceptions are where you KNOW the API defaults not to 'platform default' but to something sane. The only one I know of is the Files.* API, which defaults to UTF-8 and not platform default. So, using the charset-less variants is acceptable there.

If you truly must have platform default (sensible for command line tools only), make it explicit by passing Charset.defaultCharset().

The list of forbidden methods is quite long, but new String(bytes) and string.getBytes() are both on it. Do not use these methods/constructors. Ever.
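To illustrate why the charset has to be named, here is a minimal sketch. The class and method names are made up for this example; only the `getBytes(Charset)` and `new String(byte[], Charset)` overloads come from the JDK. It shows the same one-character string turning into different bytes depending on the charset, and what happens when bytes are decoded with the wrong one.

```java
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
    // Byte count of s when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Byte count of s when encoded as ISO-8859-1.
    static int latin1Length(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    // Encodes as UTF-8 but decodes as ISO-8859-1: the classic mismatch that
    // charset-less methods can cause silently.
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String s = "\u00F6"; // ö
        System.out.println(utf8Length(s));   // 2 (bytes C3 B6)
        System.out.println(latin1Length(s)); // 1 (byte F6)
        System.out.println(misdecode(s).length()); // 2 chars instead of 1
    }
}
```

With the charset-less variants, which of these outcomes you get depends on the machine the code runs on.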

Furthermore, your first snippet is all sorts of confused. You want to ENCODE a string to UTF-16, not decode it. A string is already characters and has no encoding; it is what it is. So why are you making a decoder when there is nothing to decode?

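import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

String in = "Red\uD83C\uDF39\uD83C\uDF39Röses";
CharBuffer input = CharBuffer.wrap(in);
CharsetEncoder utf16Encoder = StandardCharsets.UTF_16BE.newEncoder();
utf16Encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
// An encoder's replacement is a byte sequence; this is a space in UTF-16BE.
utf16Encoder.replaceWith(new byte[] {0x00, 0x20});
ByteBuffer encoded = utf16Encoder.encode(input); // throws CharacterCodingException

System.out.println(StandardCharsets.UTF_16BE.decode(encoded));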

or, for the second snippet:

@Test
public void testEncodeProblem() {
    String in = "Red\uD83C\uDF39\uD83C\uDF39Röses";
    byte[] bytes = in.getBytes(StandardCharsets.UTF_16BE);
    String res = new String(bytes, StandardCharsets.UTF_16BE);
    System.out.println(res);
}

But, as I said, both just print the roses, because those are representable in UTF-16.

So, how to get the job done? Had Java had a UCS2 encoding built in, it'd be as simple as replacing StandardCharsets.UTF_16BE with StandardCharsets.UCS2, but no such luck. So, I guess... probably 'by hand':

String in = "Red\uD83C\uDF39\uD83C\uDF39Röses";
ByteArrayOutputStream out = new ByteArrayOutputStream();
in.codePoints()
    .filter(a -> a < 65536)
    .forEach(a -> {
       out.write(a >> 8);
       out.write(a);
    });

// stream is ugly, but, because codePoints() was added in a time
// when oracle had just invented the shiny hammer, they are using it
// here for smearing butter on their sandwich. Silly geese. Oh well.

byte[] result = out.toByteArray();
// Java has no UCS2 decoder. Decoding as UTF16BE gives the same result as
// long as the bytes contain no surrogate code units (an unpaired surrogate
// in the input passes the filter above but is malformed as UTF-16).
// Let's just print the bytes and check em, by hand:

for (byte r : result) System.out.print(" " + (r & 0xFF));
System.out.println();
// For the roses string, decoding as UTF-16BE does actually work,
// but that won't be true for input containing unpaired surrogates.
System.out.println(new String(result, StandardCharsets.UTF_16BE));

Yay! Success!

NB: codePointAt could work and avoid the ugly stream here, but cPA's input isn't a 'codepoint index' but a 'char index', and that makes matters rather complicated; you'd have to increment by 2 for any surrogate pair.
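For what it's worth, the codePointAt variant can be sketched as below. This is only an illustration: `toUcs2` is a hypothetical helper, not a JDK method, and unlike the stream version above it substitutes a caller-supplied replacement (assumed to be an ordinary non-surrogate char) instead of dropping the offending code points. `Character.charCount` takes care of the "increment by 2" bookkeeping.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class CodePointLoopDemo {
    // Hypothetical helper (not a JDK API): writes each BMP code point as two
    // big-endian bytes and substitutes `replacement` for everything else.
    static byte[] toUcs2(String in, char replacement) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
            // Supplementary code points and unpaired surrogates are not valid UCS-2.
            boolean ok = cp < 0x10000 && !Character.isSurrogate((char) cp);
            int unit = ok ? cp : replacement;
            out.write(unit >> 8); // high byte
            out.write(unit);      // low byte (write(int) keeps the low 8 bits)
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String in = "Red\uD83C\uDF39\uD83C\uDF39R\u00F6ses";
        byte[] ucs2 = toUcs2(in, ' ');
        System.out.println(ucs2.length); // 20: ten 2-byte units
        // Safe to decode as UTF-16BE here because no surrogates remain.
        System.out.println(new String(ucs2, StandardCharsets.UTF_16BE));
    }
}
```

Each rose becomes one replacement character, so the two roses turn into two spaces; collapse them afterwards if a single space is wanted.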


Some background on Unicode, UCS2, and UTF-16:

Unicode is a gigantic table that maps any number between 0 and 1,114,111 (0x10FFFF, which needs 21 bits) to a character, control concept, currency, punctuation, emoji, box drawing, or other characteresque concept.

An encoding like UTF-8 or US_ASCII defines a translation for some, or all, of these numbers into a series of bytes such that it can also be decoded back to a sequence of codepoints. Codepoints are commonly stored in 32 bits, because they don't fit in 16, and no architecture out there meaningfully deals in e.g. 24-bit values.

In order to accommodate UCS2/UTF-16, there are NO characters in the Unicode spec from 0xD800 to 0xDFFF. That is intentional, and there never will be.

This means UCS2 and UTF-16 are more or less the same thing, with one 'trick':

For any Unicode number below 65536 (which fits in 2 bytes), the UTF-16 encoding (which CAN encode emoji and such) is just.. the number. Straight up. As 2 bytes. D800-DFFF can't happen, because those codepoints are intentionally not a thing.

For anything at 65536 or above, that free block of D800 to DFFF is used to produce a so-called surrogate pair: two 'characters' (two blocks of 2 bytes). The first comes from D800-DBFF and the second from DC00-DFFF. Each half carries 10 bits of data, for a total of 10+10 = 20 bits, which is exactly enough to cover the rest (0x10000 through 0x10FFFF).
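The surrogate arithmetic can be worked through for the rose, U+1F339. The sketch below uses hand-written helpers (the class and method names are illustrative) and compares them with the JDK's own `Character.highSurrogate` and `Character.lowSurrogate`. The standard scheme subtracts 0x10000 and then splits the remaining 20 bits into two 10-bit halves.

```java
class SurrogateMathDemo {
    // Top 10 bits of (cp - 0x10000), offset into the D800-DBFF range.
    static char high(int cp) {
        return (char) (0xD800 + ((cp - 0x10000) >> 10));
    }

    // Bottom 10 bits of (cp - 0x10000), offset into the DC00-DFFF range.
    static char low(int cp) {
        return (char) (0xDC00 + ((cp - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int rose = 0x1F339;
        // 0x1F339 - 0x10000 = 0xF339; 0xF339 >> 10 = 0x3C; 0xF339 & 0x3FF = 0x339
        System.out.printf("%04X %04X%n", (int) high(rose), (int) low(rose)); // D83C DF39
        // The JDK helpers agree with the manual computation.
        System.out.println(high(rose) == Character.highSurrogate(rose)
                && low(rose) == Character.lowSurrogate(rose)); // true
    }
}
```

That is where the \uD83C\uDF39 in the question's string literal comes from.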

Thus, UTF-16 will encode any Unicode codepoint as either 2 bytes or 4 bytes.
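A quick way to see the 2-or-4 split is to measure encoded lengths. This is a small sketch with an illustrative class name, using only `String.getBytes(Charset)`, `String.length()` and `String.codePointCount`.

```java
import java.nio.charset.StandardCharsets;

class Utf16WidthDemo {
    // Number of bytes s occupies in UTF-16BE (no BOM is written for this charset).
    static int utf16beLength(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String umlaut = "\u00F6";      // ö, inside the BMP
        String rose = "\uD83C\uDF39";  // U+1F339, outside the BMP
        System.out.println(utf16beLength(umlaut)); // 2
        System.out.println(utf16beLength(rose));   // 4
        System.out.println(rose.length());         // 2 chars
        System.out.println(rose.codePointCount(0, rose.length())); // 1 code point
    }
}
```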

UCS-2 as a term has mostly lost its meaning. Originally, it meant exactly 2 bytes per 'character', no more and no less, and it still means that, but the meaning of 'a character' has been twisted beyond recognition: That rose? It counts as 2 characters. Try it in Java: x.length() returns 2, not 1. A somewhat sane definition of UCS-2 would be: 1 char really means 1 char, each char is represented by 2 bytes, and if you try to store a char that doesn't fit (one that would need a surrogate pair), well, it just cannot be encoded, so crash or apply the unrepresentable-character placeholder instead. Unfortunately, that's not (always) what UCS-2 means, which gets us back to having to write the code that applies this operation ourselves (discard, or replace with a placeholder, any surrogate pairs, so that the length in bytes is exactly 2 * the number of codepoints).

Note that this surrogate pair business gives you a different strategy, based on the fact that Java's char is very close to the ideals of UCS2 (it is a 16-bit number, hardcoded in the Java spec): you can just loop through all characters (as in, Java's char) and discard anything such that c >= 0xD800 && c < 0xE000. That range covers a high surrogate as well as the low surrogate immediately following it, so this gets rid of the roses.
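A minimal sketch of that char-based strategy follows. The helper names are made up for this example. Once every surrogate is gone, encoding with UTF_16BE produces exactly 2 bytes per char, so it can stand in for the missing UCS2 charset.

```java
import java.nio.charset.StandardCharsets;

class DropSurrogatesDemo {
    // Hypothetical helper: removes every UTF-16 surrogate code unit, paired or not.
    static String dropSurrogates(String in) {
        StringBuilder sb = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c >= 0xD800 && c < 0xE000) {
                continue; // high or low surrogate: skip it
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // With no surrogates left, UTF-16BE output is always 2 bytes per char.
    static byte[] toUcs2(String in) {
        return dropSurrogates(in).getBytes(StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        String in = "Red\uD83C\uDF39\uD83C\uDF39R\u00F6ses";
        System.out.println(dropSurrogates(in)); // RedRöses
        System.out.println(toUcs2(in).length);  // 16
    }
}
```

The same range check is available as `Character.isSurrogate(c)` if you prefer not to spell out the constants.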
