How to pad Strings with Unicode characters in Java

I add right padding to a String to output it in a table format.

for (String[] tuple : testData) {
  System.out.format("%-32s -> %s\n", tuple[0], tuple[1]);
}

The result looks like this (random test data):

znZfmOEQ0Gb68taaNU6HY21lvo       -> Xq2aGqLedQnTSXg6wmBNDVb
frKweMCH8Kvgyk0J                 -> lHJ5r7YDV0jTL
NxtHP                            -> odvPJklwIzZZ
NX2scXjl5dxWmer                  -> wPDlKCKllVKk
x2HKsSHCqDQ                      -> RMuWLZ2vaP9sOF0yHmjVysJ
b0hryXKd6b80xAI                  -> 05MHjvTOxlxq1bvQ8RGe

This approach does not work when the strings contain multi-byte Unicode characters:

0OZot🇨🇳ivbyG🧷hZM1FI👡wNhn6r6cC -> OKDxDV1o2NMqXH3VvE7q3uONwEcY5V
fBHRCjU4K8OCdzACmQZSn6WO         -> gvGBtUO5a4gPMKj9BKqBHFKx1iO7
cDUh🇲🇺b0cXkLWkS                -> SZX
WtP9t                            -> Q0wWOeY3W66mM5rcQQYKpG
va4d🍷u8SS                       -> KI
a71?⚖TZ💣🧜‍♀🕓ws5J              -> b8A

As you can see, the alignment is off.

My idea was to calculate the difference between the length of the String and the number of bytes used and use that to offset the padding, something like this:

int correction = tuple[0].getBytes().length - tuple[0].length();

And then, instead of padding to 32 chars, I would pad to 32 + correction. However, this didn't work either.

Here is my test code (using emoji-java, but the behaviour should be reproducible with any Unicode characters):

import java.util.Collection;
import org.apache.commons.lang3.RandomStringUtils;
import com.vdurmont.emoji.Emoji;
import com.vdurmont.emoji.EmojiManager;

public class Test {

  public static void main(String[] args) {
    // create random test data
    String[][] testData = new String[15][2];
    for (String[] tuple : testData) {
      tuple[0] = RandomStringUtils.randomAlphanumeric(2, 32);
      tuple[1] = RandomStringUtils.randomAlphanumeric(2, 32);
    }

    // add some emojis
    Collection<Emoji> all = EmojiManager.getAll();
    for (String[] tuple : testData) {
      for (int i = 1; i < tuple[0].length(); i++) {
        if (Math.random() > 0.90) {
          Emoji emoji = all.stream().skip((int) (all.size() * Math.random())).findFirst().get();
          tuple[0] = tuple[0].substring(0, i - 1) + emoji.getUnicode() + tuple[0].substring(i + 1);
        }
      }
    }

    // output
    for (String[] tuple : testData) {
      System.out.format("%-32s -> %s\n", tuple[0], tuple[1]);
    }
  }
}

There are actually a few issues here, apart from the fact that some fonts display the flag wider than other characters. I assume that you want to count the Chinese flag as a single character (as it is drawn as a single element on the screen).

The String class reports an incorrect length

The String class works with chars, which are 16-bit UTF-16 code units rather than full Unicode code points. Not every code point fits in 16 bits: only code points from the Basic Multilingual Plane (BMP) fit in a single char, and every other code point is stored as a surrogate pair of two chars. String's length() method returns the number of chars, not the number of code points.

String's codePointCount method may help here: it counts the number of code points in a given index range. Passing 0 as the first argument and string.length() as the second returns the total number of code points in the string.
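As a rough illustration of the difference, the following sketch compares the two counts for a string that contains one emoji outside the BMP. The class name CodePointDemo and the helper codePoints are only there for this example.

```java
final class CodePointDemo {

    // Number of Unicode code points in s (a surrogate pair counts once).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F377 WINE GLASS, written as its UTF-16 surrogate pair
        String s = "va4d\uD83C\uDF77u8SS";
        System.out.println(s.length());     // 10 chars
        System.out.println(codePoints(s));  // 9 code points
    }
}
```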

Combining characters

However, there's another problem. The 🇨🇳 Chinese flag, for example, consists of two Unicode code points: the Regional Indicator Symbol Letters C (🇨, U+1F1E8) and N (🇳, U+1F1F3). Those two code points are rendered together as the flag of China. This is a problem you are not going to solve with the codePointCount method alone.

The Regional Indicator Symbol Letters seem to be a special case. Two of those characters can be combined into a national flag. I am not aware of a standard way to achieve what you want. You may have to take that into account manually.
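To see this for yourself, a small sketch like the following lists the code points of the flag. The class name FlagCodePoints and the helper describe are made up for this illustration.

```java
final class FlagCodePoints {

    // Lists the code points of s in U+XXXX notation, separated by spaces.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Flag of China: two Regional Indicator Symbols, each a surrogate pair
        String flag = "\uD83C\uDDE8\uD83C\uDDF3";
        System.out.println(flag.length());                          // 4
        System.out.println(flag.codePointCount(0, flag.length()));  // 2
        System.out.println(describe(flag));                         // U+1F1E8 U+1F1F3
    }
}
```

So even the code point count is one too high for what is drawn as a single flag.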
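One thing that may be worth trying, assuming Java 9 or later, is the regex construct \X, which matches one extended grapheme cluster. The sketch below (class and method names are hypothetical) only demonstrates it on a combining accent. How flags and other emoji sequences are grouped may depend on the JDK version, so check that on your own runtime before relying on it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GraphemeCount {

    // \X matches one extended grapheme cluster (assumes Java 9 or later).
    private static final Pattern CLUSTER = Pattern.compile("\\X");

    // Counts the grapheme clusters found in s.
    static int clusters(String s) {
        Matcher m = CLUSTER.matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points, one cluster
        System.out.println(clusters("e\u0301")); // 1
        System.out.println(clusters("abc"));     // 3
    }
}
```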

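I've written a small method to get the length of a string, counting each flag as one character.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Counts code points, treating each pair of Regional Indicator Symbols (a flag) as one.
static int length(String str) {
    String a = "\uD83C\uDDE6"; // U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A
    String z = "\uD83C\uDDFF"; // U+1F1FF REGIONAL INDICATOR SYMBOL LETTER Z

    // Matches two consecutive Regional Indicator Symbols, i.e. one flag
    Pattern p = Pattern.compile("[" + a + "-" + z + "]{2}");
    Matcher m = p.matcher(str);
    int count = 0;
    while (m.find()) {
        count++;
    }
    // Each flag was counted as two code points, so subtract one per flag
    return str.codePointCount(0, str.length()) - count;
}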
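For the table output, such a length function could be used to pad manually instead of relying on the format width. This is only a sketch: FlagAwarePadding, displayLength and padRight are names chosen for this example, and it still assumes that every counted character is drawn one column wide, which many terminals and fonts do not guarantee for emoji.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FlagAwarePadding {

    // Two consecutive Regional Indicator Symbols (U+1F1E6..U+1F1FF) form one flag.
    private static final Pattern FLAG =
            Pattern.compile("[\uD83C\uDDE6-\uD83C\uDDFF]{2}");

    // Same idea as length(String) above: code points, minus one per flag.
    static int displayLength(String s) {
        Matcher m = FLAG.matcher(s);
        int flags = 0;
        while (m.find()) {
            flags++;
        }
        return s.codePointCount(0, s.length()) - flags;
    }

    // Appends spaces until s reaches `width` counted characters.
    static String padRight(String s, int width) {
        StringBuilder sb = new StringBuilder(s);
        for (int i = displayLength(s); i < width; i++) {
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"cDUh\uD83C\uDDF2\uD83C\uDDFAb0cXkLWkS", "SZX"}, // contains one flag
            {"WtP9t", "Q0wWOeY3W66mM5rcQQYKpG"},
        };
        for (String[] row : rows) {
            System.out.println(padRight(row[0], 32) + " -> " + row[1]);
        }
    }
}
```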

As discussed in the comments on the question linked by @Xehpuk, in this discussion on kotlinlang.org, and in this blog post by Daniel Lemire, the following seems to be correct:

The problem is that the Java String class stores text as UTF-16. Any Unicode character whose code point needs more than 16 bits is stored as two separate char values (a surrogate pair). Many String methods ignore this; for example, String.length() does not return the number of Unicode characters but the number of 16-bit chars in the String, so some emoji count as two characters.
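This is probably also why the byte-based correction from the question does not line things up. The sketch below (class and helper names are hypothetical, and UTF-8 is assumed as the encoding) prints the three different counts for one character outside the BMP and one inside it.

```java
import java.nio.charset.StandardCharsets;

final class LengthComparison {

    // Number of bytes s occupies when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Unicode code points in s.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String wine = "\uD83C\uDF77"; // U+1F377, outside the BMP
        String scales = "\u2696";     // U+2696, inside the BMP
        for (String s : new String[] {wine, scales}) {
            System.out.println(utf8Bytes(s) + " bytes, "
                    + s.length() + " chars, "
                    + codePoints(s) + " code points");
        }
        // prints:
        // 4 bytes, 2 chars, 1 code points
        // 3 bytes, 1 chars, 1 code points
    }
}
```

In both cases getBytes().length - length() is 2, while the number of surplus chars (length() minus the code point count) is 1 for the first string and 0 for the second. Note also that getBytes() without an argument uses the platform default charset, so its result can vary between machines.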

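The behaviour differs between languages, but in Java it is specified: the documentation of String.length() defines the length as the number of Unicode code units in the string.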

As David mentions in his post, you could try the following to get the correct length:

tuple[0].codePointCount(0, tuple[0].length())

See code point methods from Java SE docs
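Applied to the output loop from the question, this could look roughly like the sketch below. It relies on the format width being measured in chars, so the field is widened by the number of surplus chars. CodePointPadding and formatRow are names invented for this example, and flags or other multi-code-point emoji will still be off by one or more.

```java
final class CodePointPadding {

    // The format width counts chars, so widen the field by the number of
    // extra chars that surrogate pairs contribute to the left column.
    static String formatRow(String left, String right) {
        int extraChars = left.length() - left.codePointCount(0, left.length());
        return String.format("%-" + (32 + extraChars) + "s -> %s", left, right);
    }

    public static void main(String[] args) {
        System.out.println(formatRow("va4d\uD83C\uDF77u8SS", "KI"));
        System.out.println(formatRow("WtP9t", "Q0wWOeY3W66mM5rcQQYKpG"));
    }
}
```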
