Translate UTF-8 character encoding function from PHP to Java

Question

I am trying to translate a PHP encoding function into an Android Java method. Because Java's String.length() counts differently from PHP's strlen() for a UTF-8 string, the translated Java code does not produce the same result as the PHP code for the second string, str2, which contains non-ASCII characters. The first, ASCII-only string str1 does work.

The original PHP code is:

 function myhash_php($string,$key) {
    $strLen = strlen($string);
    $keyLen = strlen($key);
    $j=0 ; $hash = "" ; 
    for ($i = 0; $i < $strLen; $i++) {
        $ordStr = ord(substr($string,$i,1));
        if ($j == $keyLen) { $j = 0; }
        $ordKey = ord(substr($key,$j,1));
        $j++;
        $hash .= strrev(base_convert(dechex($ordStr + $ordKey),16,36));

    }
    return $hash;  
}
$str1 = "good friend" ;
$str2 = "好友" ;    //  strlen($str2) == 6
$key  = "iuyhjf476" ;
echo "php encode str1 '". $str1 ."'=".myhash_php($str1, $key)."<br>";
echo "php encode str2 '". $str2 ."'=".myhash_php($str2, $key)."<br>";

The PHP output is:

    php encode str1 'good friend'=s5c6g6o5u3o5m4g4b4z516
    php encode str2 '好友'=a9u7m899x6p6

The current translated Java code, which produces the wrong result, is:

    public static String   hash_java(String  string, String  key) {
        //Integer strLen  = byteLenUTF8(string) ; // consistent with php strlen("好友")==6
        //Integer keyLen  = byteLenUTF8(key) ;    //   byteLenUTF8("好友") == 6
        Integer strLen  = string.length() ;      //     "好友".length()  ==  2
        Integer keyLen  = key.length() ;
        int j=0 ;
        String  hash = "" ;
        int ordStr, ordKey ;
        for (int i = 0; i < strLen; i++) {
            ordStr = ord_java(string.substring(i,i+1));  //string is String,  php  substr($string,$i,$n)  ==  java string.substring(i, i+n)
            // ordStr = ord_java(string[i]);  //string is byte[], php  substr($string,$i,$n)  ==  java string.substring(i, i+n)
            if (j == keyLen) { j = 0; }
            ordKey = ord_java(key.substring(j,j+1));
            j++;
            hash += strrev(base_convert(dechex(ordStr + ordKey),16,36));
        }
        return hash;
    }
    // return the ASCII code of the first character of str
    public static int      ord_java( String str){
        return( (int)  str.charAt(0)  ) ;
    }
    public static String   dechex(int input  ) {
        String hex  = Integer.toHexString(input ) ;
        return hex ;
    }
    public static String   strrev(String str){
        return  new StringBuilder(str).reverse().toString() ;
    }
    public static String   base_convert(String str, int fromBase, int toBase) {
        return Integer.toString(Integer.parseInt(str, fromBase), toBase);
    }

    String  str1 = "good friend" ;
    String  str2 = "好友" ;
    String  key  = "iuyhjf476" ;
    Log.d(LogTag,"java encode str1 '"+ str1  +"'="+hash_java(str1, key)) ;
    Log.d(LogTag,"java encode str2 '"+ str2  +"'="+hash_java(str2, key)) ;

The Java output is:

    java encode str1 'good friend'=s5c6g6o5u3o5m4g4b4z516
    java encode str2 '好友'=arh4ng

The encoded output for the UTF-8 string str2 from the Java method is not correct. How can I fix the problem?

Answer 1

In Java, convert the string to a byte array, using UTF-8 character encoding. Then, apply your encoding algorithm to this byte array instead of the string.

Your PHP program seems to do the same thing implicitly: it treats e.g. the character 好 as three individual byte values, according to its UTF-8 encoding.

EDIT:

In the comments, you say you receive the string from the user entering it on Android. So, you start with a Java String coming from some UI widget.

And you need that Java String to give the same result that the given PHP function produces when fed the same string in UTF-8. The resulting string only uses ASCII characters, so its character encoding is less problematic (it doesn't matter whether it's e.g. ISO-8859-1 or UTF-8).

The PHP string datatype knows nothing about encoding; it just stores a sequence of bytes. In general it might contain ISO-8859-1 bytes, where one byte represents one character, or UTF-8 byte sequences, where a character often occupies multiple bytes, or any other encoding. A PHP string does not know how its bytes are meant to be interpreted as characters; it just sees and counts bytes.

So what your PHP code treats as "characters" are effectively the bytes of the UTF-8 encoding, and the Java side must emulate this behaviour when running the algorithm.

Java has a String data type that is very different from PHP's: it is not based on byte sequences, but (mainly) sees a string as a sequence of characters. So, if you work with the characters of the Java String, you'll not see the same sequence of elements that PHP sees.

When Java iterates over a String like "好友" , there are two steps, one for each of the two characters (seeing the character's Unicode code point number), while PHP has six steps, one for each byte of the UTF-8 representation, seeing the byte value.
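To make the difference concrete, here is a small sketch that lists the values each side iterates over for "好友". The class and method names (CharsVersusBytes, charValues, utf8ByteValues) are made up for this illustration, and the string is written with \u escapes so that the result does not depend on how the source file is saved.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsVersusBytes {
    // What the question's ord_java(string.substring(i, i+1)) sees: one value per Java char.
    static int[] charValues(String s) {
        int[] out = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = s.charAt(i);
        }
        return out;
    }

    // What PHP's ord(substr($string, $i, 1)) sees: one unsigned value per UTF-8 byte.
    static int[] utf8ByteValues(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int[] out = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = bytes[i] & 0xFF; // Java bytes are signed, so mask to 0..255
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "\u597d\u53cb"; // 好友
        System.out.println(Arrays.toString(charValues(s)));     // [22909, 21451]
        System.out.println(Arrays.toString(utf8ByteValues(s))); // [229, 165, 189, 229, 143, 139]
    }
}
```

Adding the key bytes to 22909 and 21451 instead of to the six byte values is what leads to the short "arh4ng" result in the question.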

So, to emulate PHP, in Java you have to convert the String to a byte[] array using UTF-8 encoding. This way, one Java byte will correspond to one PHP character.
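A minimal sketch of that approach applied to the function from the question could look like this. The names PhpCompatibleHash and hashJava are illustrative, not from the question. The sketch assumes a non-empty key, and it replaces the dechex/base_convert round trip with a direct Integer.toString(value, 36), which should yield the same lowercase base-36 digits.

```java
import java.nio.charset.StandardCharsets;

class PhpCompatibleHash {
    // Byte-wise port of myhash_php: iterates over UTF-8 bytes, as PHP's strlen/substr/ord do.
    // Assumption: key is not empty (an empty key would throw here).
    static String hashJava(String string, String key) {
        byte[] strBytes = string.getBytes(StandardCharsets.UTF_8);
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        StringBuilder hash = new StringBuilder();
        int j = 0;
        for (byte b : strBytes) {
            if (j == keyBytes.length) {
                j = 0; // wrap around the key, like the PHP loop
            }
            int ordStr = b & 0xFF;           // unsigned byte value, like PHP ord()
            int ordKey = keyBytes[j] & 0xFF;
            j++;
            String base36 = Integer.toString(ordStr + ordKey, 36);
            hash.append(new StringBuilder(base36).reverse()); // like PHP strrev()
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        String key = "iuyhjf476";
        System.out.println(hashJava("good friend", key));  // s5c6g6o5u3o5m4g4b4z516
        System.out.println(hashJava("\u597d\u53cb", key)); // 好友 -> a9u7m899x6p6
    }
}
```

Both results match the PHP output quoted in the question.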

Remark

By the way, the wording "UTF-8 string" does not make sense in Java.

That is different from PHP, where e.g. "Maß" as an ISO-8859-1 string (having a length of 3) differs from "Maß" as a UTF-8 string (having a length of 4).

In Java, Strings are sequences of characters, and that's the reason why e.g. "好友" has a length of 2: it's just two characters that happen to come from a non-Latin script. [This is true for most Unicode characters you'll typically encounter, but there are exceptions.] In Java, terms like UTF-8 matter only when converting between strings and byte sequences.
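As a quick illustration of these lengths, the following sketch compares the Java String length with the byte length in two encodings. The class and helper names (EncodingLengths, byteLength) are invented for this example, and the emoji is just one instance of the exceptions mentioned above: a character outside the Basic Multilingual Plane, which Java stores as two chars.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingLengths {
    // Number of bytes the string occupies in the given encoding.
    static int byteLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String mass = "Ma\u00df"; // Maß
        System.out.println(mass.length());                                  // 3
        System.out.println(byteLength(mass, StandardCharsets.ISO_8859_1));  // 3
        System.out.println(byteLength(mass, StandardCharsets.UTF_8));       // 4

        String emoji = "\uD83D\uDE00"; // U+1F600, stored as a surrogate pair
        System.out.println(emoji.length());                                 // 2
        System.out.println(emoji.codePointCount(0, emoji.length()));        // 1
        System.out.println(byteLength(emoji, StandardCharsets.UTF_8));      // 4
    }
}
```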

Answer 2

Do not use string literals for testing: they are prone to yield unexpected results unless you are fully aware of what you are doing and how the source file is encoded. For UTF-8 you should treat everything as raw bytes and never use a String for en/decoding. Example in PHP:

    $test1 = pack( 'H*', '414243' );  // "ABC" in hexadecimal: 2 digits per byte
    $test2 = pack( 'H*', 'e5a5bde58f8b' );  // "好友" in hexadecimal, UTF-8 encoded, 3 bytes per character

Example in Java:

    byte[] test1 = new byte[] { 0x41, 0x42, 0x43 };  // "ABC"
    byte[] test2 = new byte[] { (byte)0xe5, (byte)0xa5, (byte)0xbd, (byte)0xe5, (byte)0x8f, (byte)0x8b };  // "好友"

Only this way can you make sure your test is set up correctly and does not depend on how the source file is encoded. If your Java file were encoded in UTF-8 and your PHP file in UTF-16LE, you would fail even worse, simply because you have not separated definition (raw bytes) from assumption (strings based on the text encoding).
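If you still want to start from a String on the Java side, one possible sanity check is to compare its UTF-8 encoding against the raw bytes you expect before feeding it to the algorithm. This is only a sketch; LiteralCheck and encodesTo are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LiteralCheck {
    // The raw-byte definition of the test input: "好友" in UTF-8.
    static final byte[] EXPECTED = {
        (byte) 0xe5, (byte) 0xa5, (byte) 0xbd, (byte) 0xe5, (byte) 0x8f, (byte) 0x8b
    };

    // True if the string, encoded as UTF-8, is exactly the expected byte sequence.
    static boolean encodesTo(String s, byte[] expected) {
        return Arrays.equals(s.getBytes(StandardCharsets.UTF_8), expected);
    }

    public static void main(String[] args) {
        // \u escapes keep the literal independent of the source file encoding.
        System.out.println(encodesTo("\u597d\u53cb", EXPECTED)); // true
        System.out.println(encodesTo("ABC", EXPECTED));          // false
    }
}
```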

(This is also a big misunderstanding when people want to en/decrypt texts: they operate on (any programming language's) String rather than the actual bytes and then wonder why different results occur with a different programming language.)
