hex-Encoding in Java goes wrong

Question

Several experienced Java developers and I have worked on this for about an hour now, and we cannot get it to work. Does anyone have any tips?

Problem: We have text in an Excel file that seems to be encoded completely inconsistently. Sometimes there are special characters, sometimes not, and sometimes they are displayed and interpreted differently.

What I want to do now is write a small Java program that checks the given text from the Excel file and converts all the different character sequences into what we want them to be.

My Code:

        while (iterator.hasNext()) {
            Entity entity = (Entity) iterator.next();
            Dataset dataset = produkt_store.getDataset(entity);
            FormData formdata = dataset.getFormData();
            DomElement dom = (DomElement) formdata.get(lang,
                    "cs_description_short").get();
            String beschreibung = dom.toText(true);

            System.out.println("Before: " + beschreibung);
            String hexBeschreibung = StringToHex(beschreibung);
            String newHexBeschreibung = hexBeschreibung.replaceAll("75 3F", "FC");
            newHexBeschreibung = newHexBeschreibung.replaceAll("75 A8", "FC");
            //beschreibung2 = beschreibung2.replaceAll("75A8", "FC");
            System.out.println("After: " + HexToString(newHexBeschreibung));
            System.out.println(hexBeschreibung.equals(newHexBeschreibung) + "\n");

            // dom.set(beschreibung);
        }

I also have these functions to encode to and decode from hex:

    private static String StringToHex(String s) {
        if (s.length() == 0)
            return "";
        char c;
        StringBuffer buff = new StringBuffer();
        for (int i = 0; i < s.length(); i++) {
            c = s.charAt(i);
            buff.append(Integer.toHexString(c) + " ");
        }
        return buff.toString().trim();
    }

    private static String HexToString(String s) {
        if (s.length() == 0)
            return "";
        String[] arr = s.split(" ");
        StringBuffer buff = new StringBuffer();
        int i;
        for (String str : arr) {
            i = Integer.valueOf(str, 16).intValue();
            String hs = new Character((char) i).toString();
            buff.append(hs);
        }
        return buff.toString();
    }

Example:

Sometimes where there should be an "ü", it is shown as "u?", which we obviously want to avoid. Looking at it in a hex editor, we see it represented sometimes as 753F and sometimes as 75A8. The same goes for "ä", "ö" and "ß". We tried to replace that with "ü". It doesn't work. Does anyone have any tips?

Before that, we tried String.replaceAll(), using something like String.replaceAll("u\\?","ü"), but that didn't work either: nothing was changed at all.

Thanks for any tips on that encoding stuff! :)

EDIT:

This is the solution which works perfectly fine:

            beschreibung = beschreibung.replace("U\u0308", "\u00DC"); // "Ü"
            beschreibung = beschreibung.replace("u\u0308", "\u00FC"); // "ü"
            beschreibung = beschreibung.replace("A\u0308", "\u00C4"); // "Ä"
            beschreibung = beschreibung.replace("a\u0308", "\u00E4"); // "ä"
            beschreibung = beschreibung.replace("O\u0308", "\u00D6"); // "Ö"
            beschreibung = beschreibung.replace("o\u0308", "\u00F6"); // "ö"
            beschreibung = beschreibung.replace("s\u0308", "\u00DF"); // "ß"

Answer 1

Somewhere, ü was represented not as the single char U+00FC (LATIN SMALL LETTER U WITH DIAERESIS) but as LATIN SMALL LETTER U followed by U+0308 (COMBINING DIAERESIS). This is valid Unicode.

Then there was some conversion, maybe to ISO-8859-1 (or even US-ASCII?), and the combining mark got converted separately. There is no such character in ISO-8859-1, so you got a question mark instead.
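
The effect described above can be reproduced in a few lines. This is only a minimal sketch, not the asker's actual pipeline; the class name LossyEncodingDemo and the helper toHex are made up for illustration. It encodes a decomposed "ü" to ISO-8859-1 and shows the combining mark turning into a question mark byte.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class LossyEncodingDemo {

    // Formats bytes as space-separated, two-digit uppercase hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String decomposed = "u\u0308"; // 'u' + COMBINING DIAERESIS
        String precomposed = "\u00FC"; // single char 'ü'

        // U+0308 has no ISO-8859-1 mapping, so getBytes substitutes '?' (0x3F).
        System.out.println(toHex(decomposed.getBytes(StandardCharsets.ISO_8859_1)));  // 75 3F
        // The precomposed letter maps to the single byte 0xFC.
        System.out.println(toHex(precomposed.getBytes(StandardCharsets.ISO_8859_1))); // FC
    }
}
```

The "75 3F" output matches one of the byte pairs reported from the hex editor in the question.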

A repair afterwards would be:

String s = ...
s = s.replace("U?", "\u00DC"); // "Ü"
s = s.replace("u?", "\u00FC"); // "ü"
...

(I have escaped the chars to prevent problems if the Java compiler and the editor use different encodings, which would be an error.)

That can also be done in a slightly more sophisticated way:

s = s.replaceAll("([aouAOU])\\?", "$1\u0308"); // Again ASCII + Umlaut separately
s = Normalizer.normalize(s, Normalizer.Form.NFC); // java.text.Normalizer
// Now single non-ASCII letters.

java.text.Normalizer might be a help here.
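
As a self-contained sketch of this repair, assuming the JDK's java.text.Normalizer (available since Java 6) is the normalizer meant here: the class name UmlautRepair and the method repair are hypothetical. Strings that still contain the real combining mark need only the normalization step; the regex only helps where the mark has already been lost to a literal '?'.

```java
import java.text.Normalizer;

// Hypothetical helper class; the names are illustrative only.
final class UmlautRepair {

    // Turns "u?" style damage back into base letter + U+0308,
    // then composes each pair into a single precomposed letter (NFC).
    static String repair(String s) {
        String withCombiningMark = s.replaceAll("([aouAOU])\\?", "$1\u0308");
        return Normalizer.normalize(withCombiningMark, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Already correct, damaged with '?', and decomposed but intact.
        String fixed = repair("Gr\u00FCn, gru?n, gru\u0308n");
        System.out.println(fixed.equals("Gr\u00FCn, gr\u00FCn, gr\u00FCn")); // true
    }
}
```

Note that the regex cannot tell damage from a genuine question mark: a sentence ending in "du?" would come out as "dü", so this is best applied only to data known to be damaged.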

Caveat: the '?' may also only appear when printing to a console (e.g. from the IDE), as a conversion takes place there too.

Somewhere a conversion was done. This can happen implicitly, wherever specifying the encoding is optional. You might try setting the system property file.encoding to UTF-8 or Cp1252 (Windows Latin-1).
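
To see whether the console output path is where the '?' comes from, a small experiment along these lines may help. It is a sketch under assumptions: the class ConsoleEncodingDemo and the method encodeViaPrintStream are hypothetical, and an in-memory stream stands in for the console.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

// Hypothetical demo class; the names are illustrative only.
final class ConsoleEncodingDemo {

    // Writes text through a PrintStream with the named encoding and returns the raw bytes.
    static byte[] encodeViaPrintStream(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // The encoding used wherever no explicit one is given.
        System.out.println("Default charset: " + Charset.defaultCharset());

        byte[] ascii = encodeViaPrintStream("\u00FC", "US-ASCII");
        byte[] utf8 = encodeViaPrintStream("\u00FC", "UTF-8");
        System.out.println(ascii.length + " byte, value " + ascii[0]); // 1 byte, value 63
        System.out.println(utf8.length + " bytes");                    // 2 bytes
    }
}
```

The value 63 is the '?' character: the String itself was fine, and only the output encoding could not represent it.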

Answer 2

First thing to check: is upper/lower case important? E.g. if your StringToHex produces "75 3f", replaceAll("75 3F", ...) will not match it. Bringing the hex string and the search pattern to the same case, e.g. hexBeschreibung = hexBeschreibung.toUpperCase(), would solve this issue.

Second (more of a hint): "u?" doesn't mean 'u' + '?', but 'u' + <a character that could not be displayed, and definitely not '?'>.
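
One way to check what really follows the 'u' is to print the code points of the String instead of trusting how it is displayed. A minimal sketch, with a hypothetical class CodePointDump and method dump:

```java
// Hypothetical helper class; the names are illustrative only.
final class CodePointDump {

    // Lists every code point as U+XXXX so combining or undisplayable characters become visible.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump("u\u0308")); // U+0075 U+0308
        System.out.println(dump("u?"));      // U+0075 U+003F
    }
}
```

If the dump shows U+0308 rather than U+003F, a replace for a literal "u?" has nothing to match.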

I hope my first suggestion will help :)

--
Sorry, I can't comment, so I have to edit:
Hex editors may show hex values in upper or lower case, because it doesn't matter there. You have to check the String you actually use yourself, because Java may represent hex in Strings with lowercase letters.
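
The case issue can be seen in a short sketch. Integer.toHexString returns lowercase digits, so an uppercase search pattern never matches its output. The class HexCaseDemo and the method hexOf are hypothetical, and the input is chosen only so that the hex matches one of the patterns from the question.

```java
import java.util.Locale;

// Hypothetical demo class; the names are illustrative only.
final class HexCaseDemo {

    // Same idea as the question's StringToHex: one hex number per char, space separated.
    static String hexOf(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String hex = hexOf("u\u00A8");
        System.out.println(hex);                                                 // 75 a8
        // Uppercase pattern against lowercase hex: nothing is replaced.
        System.out.println(hex.replace("75 A8", "FC"));                          // 75 a8
        // After normalizing the case, the pattern matches.
        System.out.println(hex.toUpperCase(Locale.ROOT).replace("75 A8", "FC")); // FC
    }
}
```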

solution1
3 ACCEPTED 2014-07-25 13:16:43

solution2
0 2014-07-25 12:53:52
