
How to sort in Unicode code point (UTF-8 or UTF-32) order in Java?

Java's String.compareTo sorts in UTF-16 code unit order.

List<String> inputValues = Arrays.asList("𝐴", "ﬁgure", "ﬂagship", "zion");
Collections.sort(inputValues);

The code above produces the order [zion, 𝐴, ﬁgure, ﬂagship]. However, I want the order to be [zion, ﬁgure, ﬂagship, 𝐴]. Note that some of the characters are ligatures: "ﬁgure" starts with U+FB01 and "ﬂagship" starts with U+FB02.

[Maybe not everyone noticed that what appears as a capital A is actually Mathematical Italic Capital A (U+1D434).]

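Your problem is that in Java, characters beyond the BMP are encoded as two char values (a surrogate pair), and String.compareTo compares char values one by one. Surrogates occupy U+D800..U+DFFF, so a character above U+FFFF sorts before BMP characters such as U+FB01 even though its code point is larger.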
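To make the surrogate-pair point concrete, here is a small sketch that inspects the two strings from the question. The class name SurrogateDemo and the helper codeUnitsHex are made up for illustration; everything else is standard java.lang API.

```java
class SurrogateDemo {
    // Returns the UTF-16 code units of s as space-separated hex values.
    static String codeUnitsHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String mathA = "\uD835\uDC34"; // U+1D434, Mathematical Italic Capital A
        String fi = "\uFB01";          // U+FB01, Latin Small Ligature Fi

        System.out.println(mathA.length());                           // 2 (char values)
        System.out.println(mathA.codePointCount(0, mathA.length()));  // 1 (code point)
        System.out.println(codeUnitsHex(mathA));                      // d835 dc34
        System.out.println(Integer.toHexString(mathA.codePointAt(0))); // 1d434

        // UTF-16 order: 0xD835 < 0xFB01, so the math A sorts first.
        System.out.println(mathA.compareTo(fi) < 0);                  // true
        // Code point order: 0x1D434 > 0xFB01, so the math A sorts last.
        System.out.println(mathA.codePointAt(0) > fi.codePointAt(0)); // true
    }
}
```

The last two lines show the disagreement between the two orders that the question runs into.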

To sort the list according to a code-point-wise lexicographic order, you need to define your own Comparator:

import java.util.Comparator;

public class CodePointComparator implements Comparator<String> {
  @Override
  public int compare(String o1, String o2) {
    int len1 = o1.length();
    int len2 = o2.length();
    int lim = Math.min(len1, len2);
    int k = 0;
    while (k < lim) {
      char c1 = o1.charAt(k);
      char c2 = o2.charAt(k);
      if (c1 != c2) {
        // A high surrogate starts a code point above U+FFFF, so it is
        // greater than any non-surrogate character (assumes well-formed UTF-16)
        if (Character.isHighSurrogate(c1) != Character.isHighSurrogate(c2)) {
          return Character.isHighSurrogate(c1) ? 1 : -1;
        }
        // Otherwise the char values order the same way as the code points
        return c1 - c2;
      }
      k++;
    }
    // One string is a prefix of the other: the shorter one sorts first
    return len1 - len2;
  }
}

Pass an instance of it as the argument to the List#sort method. I operate directly on the surrogates instead of decoding code points to gain some performance.

Sorry, I am not looking for lexicographic sorting but simply sorting based on Unicode code point (UTF-8 or UTF-32).

There is a comment in one of the libraries that I am trying to use:

Input values (keys). These must be provided to Builder in Unicode code point (UTF8 or UTF32) sorted order. Note that sorting by Java's String.compareTo, which is UTF16 sorted order, is not correct and can lead to exceptions while building the FST
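The quoted comment treats UTF-8 and UTF-32 order as the same thing, and that can be checked directly: comparing the UTF-8 encodings byte by byte, with bytes read as unsigned values, orders strings by code point. The sketch below assumes Java 9 or later (for Arrays.compareUnsigned) and well-formed input, since getBytes replaces unpaired surrogates. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class Utf8OrderDemo {
    // Compares two strings by their UTF-8 bytes, treating each byte as unsigned.
    static final Comparator<String> UTF8_ORDER = (a, b) -> Arrays.compareUnsigned(
            a.getBytes(StandardCharsets.UTF_8),
            b.getBytes(StandardCharsets.UTF_8));

    // Returns a sorted copy so the caller's list is left untouched.
    static List<String> sortedByUtf8(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(UTF8_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        // The strings from the question: math A (U+1D434), "ﬁgure", "ﬂagship", "zion".
        List<String> values =
                Arrays.asList("\uD835\uDC34", "\uFB01gure", "\uFB02agship", "zion");
        // Prints zion, then the two ligature words, then the math A.
        System.out.println(sortedByUtf8(values));
    }
}
```

Encoding both strings on every comparison is wasteful for large inputs, so this is better suited to verifying an order than to producing it.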

I was running into issues because I was using Collections.sort, which in Java sorts strings in UTF-16 order. In the end I wrote my own compare function as below, which resolved the issues I was facing. I am surprised that it is not available natively or in some other popular library.

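import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public static void sort(List<String> list) {
    Collections.sort(
            list,
            new Comparator<String>() {
                @Override
                public int compare(String s1, String s2) {
                    int n1 = s1.length();
                    int n2 = s2.length();
                    int min = Math.min(n1, n2);
                    int i = 0;
                    while (i < min) {
                        // codePointAt decodes a surrogate pair into a single code point
                        int c1 = s1.codePointAt(i);
                        int c2 = s2.codePointAt(i);
                        if (c1 != c2) {
                            return Integer.compare(c1, c2);
                        }
                        // Equal code points take the same number of chars in both
                        // strings, so skip the whole code point (1 or 2 chars)
                        i += Character.charCount(c1);
                    }
                    // One string is a prefix of the other: the shorter one sorts first
                    return n1 - n2;
                }
            });
}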
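For comparison, a shorter variant can be sketched with the standard library alone, assuming Java 9 or later: decode each string with String.codePoints() and compare the resulting int arrays with Arrays.compare. The class and method names below are made up for illustration, and this version allocates two arrays per comparison, so it trades speed for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class CodePointOrderDemo {
    // Decodes both strings to code points and compares the arrays lexicographically.
    static final Comparator<String> CODE_POINT_ORDER = (a, b) ->
            Arrays.compare(a.codePoints().toArray(), b.codePoints().toArray());

    // Returns a copy sorted by code point.
    static List<String> sortedByCodePoint(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(CODE_POINT_ORDER);
        return copy;
    }

    // Returns a copy sorted by String.compareTo, i.e. UTF-16 code unit order.
    static List<String> sortedByUtf16(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // The strings from the question: math A (U+1D434), "ﬁgure", "ﬂagship", "zion".
        List<String> values =
                Arrays.asList("\uD835\uDC34", "\uFB01gure", "\uFB02agship", "zion");
        // UTF-16 order: zion, math A, then the two ligature words.
        System.out.println(sortedByUtf16(values));
        // Code point order: zion, the two ligature words, then math A.
        System.out.println(sortedByCodePoint(values));
    }
}
```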

Easiest way (note that this yields a reversed, case-insensitive UTF-16 order, not code point order):

inputValues.sort(String.CASE_INSENSITIVE_ORDER.reversed());



A little more complex, but with more control:

Convert the List into an array:

String[] arr = new String[inputValues.size()];
for (int i = 0; i < inputValues.size(); i++)
    arr[i] = inputValues.get(i);

There are more efficient ways to convert a List to an array, but this one is the simplest to understand!
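One of those other ways is List#toArray, which does the copy in a single call. A minimal sketch, with a made-up class and method name:

```java
import java.util.Arrays;
import java.util.List;

class ListToArrayDemo {
    // Copies the list into a new String[]; the zero-length array only supplies the element type.
    static String[] toArray(List<String> values) {
        return values.toArray(new String[0]);
    }

    public static void main(String[] args) {
        List<String> inputValues = Arrays.asList("figure", "flagship", "zion");
        String[] arr = toArray(inputValues);
        System.out.println(Arrays.toString(arr)); // [figure, flagship, zion]
    }
}
```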

Then use this function:

public static String[] textSort(String[] words) {
    for (int i = 0; i < words.length; i++) {
        for (int j = i + 1; j < words.length; j++) {
            // Swap when words[i] sorts before words[j] (ignoring case), which
            // leaves the array in descending order; change < to > for ascending
            if (words[i].toUpperCase().compareTo(words[j].toUpperCase()) < 0) {
                String temp = words[i];
                words[i] = words[j];
                words[j] = temp;
            }
        }
    }

    return words;
}
