
Sorting the characters in a UTF-16 string in Java

TLDR

Java uses two chars (a surrogate pair) to represent some characters in UTF-16. Sorting the char[] with Arrays.sort breaks those pairs apart. Should I convert char[] to int[], or is there a better way?

Details

Java represents strings as UTF-16, but the Character class itself wraps a single char (16 bits). A character outside the Basic Multilingual Plane is therefore stored as two chars (32 bits).
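To make the char versus code point distinction concrete, here is a minimal sketch. The class name SurrogatePairDemo and the helper describe are hypothetical names chosen for illustration; the calls themselves (Character.toChars, String.codePointCount, Character.isHighSurrogate, Character.isLowSurrogate) are standard JDK methods.

```java
class SurrogatePairDemo {
    // Hypothetical helper: reports how many UTF-16 code units (chars)
    // and how many Unicode code points a string contains.
    static String describe(String s) {
        return s.length() + " chars, "
                + s.codePointCount(0, s.length()) + " code point(s)";
    }

    public static void main(String[] args) {
        String letter = "A";
        // 128513 is U+1F601, which lies outside the Basic Multilingual Plane.
        String emoji = new String(Character.toChars(128513));

        System.out.println(describe(letter)); // 1 chars, 1 code point(s)
        System.out.println(describe(emoji));  // 2 chars, 1 code point(s)

        // The two chars of the emoji form a high/low surrogate pair.
        System.out.println(Character.isHighSurrogate(emoji.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(emoji.charAt(1)));  // true
    }
}
```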

Sorting a string that contains such characters with the built-in sort corrupts the data. (Arrays.sort uses dual-pivot quicksort for primitive arrays, and Collections.sort delegates to Arrays.sort to do the heavy lifting.)

To be specific: do you convert char[] to int[], or is there a better way to sort?

import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        int[] utfCodes = {128513, 128531, 128557};
        String emojis = new String(utfCodes, 0, 3);
        System.out.println("Initial String: " + emojis);

        char[] chars = emojis.toCharArray();
        Arrays.sort(chars);
        System.out.println("Sorted String: " + new String(chars));
    }
}

Output:

Initial String: 😁😓😭
Sorted String: ??😁??
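One way to see why the output is garbled is to print the individual UTF-16 code units before and after sorting. The sketch below is only illustrative; CharSortDump and hex are hypothetical names, not part of the question's code.

```java
import java.util.Arrays;

class CharSortDump {
    // Hypothetical helper: formats each UTF-16 code unit as four hex digits.
    static String hex(char[] chars) {
        StringBuilder sb = new StringBuilder();
        for (char c : chars) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] utfCodes = {128513, 128531, 128557};
        char[] chars = new String(utfCodes, 0, 3).toCharArray();

        // Each emoji is a high surrogate (D83D) followed by its low surrogate.
        System.out.println(hex(chars)); // D83D DE01 D83D DE13 D83D DE2D

        // Sorting by char value moves all high surrogates to the front,
        // so only one valid pair (D83D DE01) survives.
        Arrays.sort(chars);
        System.out.println(hex(chars)); // D83D D83D D83D DE01 DE13 DE2D
    }
}
```

After the sort, the first two D83D units and the trailing DE13 and DE2D units are unpaired surrogates, which is why they print as "?".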

I looked around for a bit and couldn't find any clean way to sort an array by groupings of two elements without using a library.

Luckily, the code points of the String are what you used to create the String itself in this example, so you can simply sort those and create a new String from the result.

public static void main(String[] args) {
    int[] utfCodes = {128531, 128557, 128513};
    String emojis = new String(utfCodes, 0, 3);
    System.out.println("Initial String: " + emojis);

    int[] codePoints = emojis.codePoints().sorted().toArray();
    System.out.println("Sorted String: " + new String(codePoints, 0, 3));
}

Initial String: 😓😭😁
Sorted String: 😁😓😭

I switched the order of the characters in your example because they were already sorted.

If you are using Java 8 or later, this is a simple way to sort the characters in a string while respecting (not breaking) multi-char code points:

// IntStream provides sorted(), not sort()
int[] codepoints = someString.codePoints().sorted().toArray();
String sorted = new String(codepoints, 0, codepoints.length);

Prior to Java 8, I think you either need to use a loop to iterate over the code points in the original string, or use a third-party library method.
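A loop-based version might look like the following sketch. It relies only on methods available since Java 5 (String.codePointAt, String.codePointCount, Character.charCount); the class and method names are hypothetical.

```java
import java.util.Arrays;

class LegacyCodePointSort {
    // Hypothetical helper: sorts a string by code point without streams.
    static String sortCodePoints(String s) {
        int[] codePoints = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            codePoints[n++] = cp;
            // Advance by 1 char for BMP characters, 2 for surrogate pairs.
            i += Character.charCount(cp);
        }
        Arrays.sort(codePoints);
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String emojis = new String(new int[] {128531, 128557, 128513}, 0, 3);
        System.out.println("Sorted String: " + sortCodePoints(emojis));
    }
}
```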


Fortunately, sorting the code points in a String is uncommon enough that the clunkiness and relative inefficiency of the solutions above are rarely a concern.

(When was the last time you tested for anagrams of emojis?)

We can't use char for Unicode, because Java's Unicode char handling is broken.

In the early days of Java, every Unicode code point fit in 16 bits (exactly one char). However, the Unicode specification later changed to allow supplementary characters. That means Unicode characters in Java are now variable-width and can take more than one char. Unfortunately, it was too late to change Java's char implementation without breaking a ton of production code.

So the best way to manipulate Unicode characters is to work with code points directly, e.g., using String.codePointAt(index), or the String.codePoints() stream on Java 8 and above.
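As a rough illustration of both approaches, the sketch below lists the code points of a string once with an index-based codePointAt loop and once with the codePoints() stream. The class name and the two helper methods are hypothetical; it assumes Java 8 or later for the stream variant.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

class CodePointWalk {
    // Index-based walk: step by Character.charCount so that a
    // surrogate pair is consumed as one code point.
    static List<String> viaCodePointAt(String s) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result.add(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return result;
    }

    // Stream-based walk (Java 8+): codePoints() yields an IntStream.
    static List<String> viaStream(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String text = "a" + new String(Character.toChars(128513));
        System.out.println(viaCodePointAt(text)); // [U+0061, U+1F601]
        System.out.println(viaStream(text));      // [U+0061, U+1F601]
    }
}
```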

