
Compare two strings that are lexicographically equivalent but not identical at the byte level

I am looking for a way to compare two Java strings that are lexicographically equivalent but not identical at the byte level.

More precisely, take the file name "baaaé.png". At the byte level it can be represented in two different ways:

[98, 97, 97, 97, -61, -87, 46, 112, 110, 103] --> the "é" is encoded with 2 bytes

[98, 97, 97, 97, 101, -52, -127, 46, 112, 110, 103] --> the "é" is encoded with 3 bytes

    import java.nio.charset.StandardCharsets;

    byte[] ch = {98, 97, 97, 97, -61, -87, 46, 112, 110, 103};       // "é" as U+00E9 (precomposed)
    byte[] ff = {98, 97, 97, 97, 101, -52, -127, 46, 112, 110, 103}; // "e" followed by U+0301 (combining accent)

    String st = new String(ch, StandardCharsets.UTF_8);
    String st2 = new String(ff, StandardCharsets.UTF_8);
    System.out.println(st);
    System.out.println(st2);
    System.out.println(st.equals(st2));

This will generate the following output:

baaaé.png
baaaé.png
false

Is there a way to do the comparison so that the equals method returns true?

You can use the Collator class with an appropriate strength to normalize out things like different accent representations. This will allow you to compare the strings successfully.

In this case, the US locale and TERTIARY strength are enough to get the strings to compare as equal:

    import java.text.Collator;
    import java.util.Locale;

    Collator usCollator = Collator.getInstance(Locale.US);
    usCollator.setStrength(Collator.TERTIARY);
    System.out.println(usCollator.equals(st, st2));

outputs

true

You can also use Java's Normalizer class to convert between the different Unicode normalization forms. This will transform your strings, but they will end up in the same form, allowing you to use the standard string tools for the comparison.
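For example, a minimal sketch (assuming the st and st2 strings from the question) that brings both strings into the same form before comparing:

    import java.text.Normalizer;

    // Normalize both strings to the same form (NFC here) before comparing
    String n1 = Normalizer.normalize(st, Normalizer.Form.NFC);
    String n2 = Normalizer.normalize(st2, Normalizer.Form.NFC);
    System.out.println(n1.equals(n2)); // prints true

Normalizing both to NFD instead would work just as well; the only requirement is that both strings end up in the same form.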

Finally, you might want to take a look at the ICU (International Components for Unicode) project, which provides many tools for working with Unicode strings.

There are two pairs of Unicode normalization forms that you need to look into:

The first pair is NFC vs. NFD. The example you give in your question is an excellent illustration of the difference between NFC and NFD: your first string is in NFC, while your second one is in NFD.

In Unicode, many accented characters can be represented in two different ways: as the base character followed by a combining accent, or as a precomposed accented character. NFC uses precomposed characters when they are available. NFD always uses decomposed forms. For example:
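A small sketch showing the two representations of "é" as Java string literals:

    String precomposed = "\u00E9";  // NFC: single code point LATIN SMALL LETTER E WITH ACUTE
    String decomposed  = "e\u0301"; // NFD: 'e' followed by COMBINING ACUTE ACCENT

    System.out.println(precomposed);                     // é
    System.out.println(decomposed);                      // é (renders identically)
    System.out.println(precomposed.length());            // 1
    System.out.println(decomposed.length());             // 2
    System.out.println(precomposed.equals(decomposed));  // false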

Normally we don't use a mix of NFC and NFD. Most environments specify which is the preferred form. Very briefly: Mac OS X filenames use NFD, and pretty much everything else uses NFC. But if you're given input which might be in the "other" normalization form, you can easily convert it: the process is straightforward (using information provided by the Unicode character database) and lossless (i.e. you can go back and forth between NFC and NFD if you wish without losing information).

Java provides a built-in class, java.text.Normalizer, that can convert a string to a given Unicode normalization form.
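For instance, a short sketch converting a decomposed string to NFC and back, showing that the round trip loses nothing:

    import java.text.Normalizer;

    String nfd = "e\u0301"; // decomposed "é"
    String nfc = Normalizer.normalize(nfd, Normalizer.Form.NFC);  // "\u00E9"
    String back = Normalizer.normalize(nfc, Normalizer.Form.NFD); // "e\u0301" again

    System.out.println(nfc.equals("\u00E9")); // true
    System.out.println(back.equals(nfd));     // true: NFC <-> NFD is lossless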

There are two other normalization forms: NFKC and NFKD. These forms are not intended for general use, but only for lexicographic comparisons. They account for the fact that, for example, ¼ should be considered the same as 1/4 in a search or comparison. But they do not imply that ¼ and 1/4 are the same, or that one should generally be converted into the other.

The conversion from NFC to NFKC and from NFD to NFKD is again straightforward (you need the character database), but this time it is lossy. You need to keep the original NFC/NFD text and use the NFKC/NFKD form only as a search/sort key. For example:
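A sketch of NFKC used to build a search key. Note that ¼ decomposes to the digits 1 and 4 around U+2044 FRACTION SLASH, not the ASCII '/':

    import java.text.Normalizer;

    String quarter = "\u00BC"; // ¼ VULGAR FRACTION ONE QUARTER
    String key = Normalizer.normalize(quarter, Normalizer.Form.NFKC);
    System.out.println(key);          // 1⁄4 (the digits 1 and 4 around U+2044)
    System.out.println(key.length()); // 3

    // The fi ligature (U+FB01) folds to plain ASCII "fi":
    System.out.println(Normalizer.normalize("\uFB01", Normalizer.Form.NFKC));

    // The folding is lossy: there is no way to recover ¼ from the key,
    // which is why you keep the original text and use the key only for matching.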
