如何从 PHP 中的 utf8mb4 字符（又名“\u{D83D}\u{DE00}”）获取 integer

Question

I want to write a saslprep algorithm with php (I know there is a lib, I want to do it myself).我想用 php 写一个 saslprep 算法（我知道有一个 lib，我想自己做）。 One of my unit tests failes because the test vector "\u{D83D}\u{DE00}" aka我的一个单元测试失败了，因为测试向量"\u{D83D}\u{DE00}"又名fails to convert to code points (array of integer).无法转换为代码点（整数数组）。

echo mb_ord("\u{D83D}\u{DE00}","UTF-32LE");

failes returning false失败返回错误

iconv("UTF-8","UTF-32LE","\u{D83D}\u{DE00}");

failes失败者

The expected result is 128512预期结果是128512

Answer 1

At first, let's analyze php way of encoding:首先我们分析一下php的编码方式：

<?php
$sp = "ař\u{05FF}€😀\u{D83D}\u{DE00}";
echo 'Current PHP version: ' . phpversion() .
     '; internal_encoding: ' . mb_internal_encoding() . PHP_EOL . PHP_EOL;
echo $sp . ' (' . strval(mb_strlen($sp)) . ' chars in ' . 
                  strval(strlen($sp)) . ' bytes)' . PHP_EOL . PHP_EOL;

$sp_array = mb_str_split($sp, 1, mb_internal_encoding() );
echo strval( sizeof($sp_array)) . PHP_EOL;
foreach ($sp_array as $char) {
    $cars = bin2hex($char);
    echo str_pad($cars, 8, ' ', STR_PAD_LEFT) .
        ' (' . strval(strlen($char)) . ' bytes) ' .
        str_pad(implode(",", mb_str_split($cars, 2) ), 12) .   // WTF-8
        str_pad('->' . strval(mb_ord($char)) . '<-', 10) . 
        ' 0x' . str_pad( dechex( mb_ord($char)), 6) .
        $char . "\t" . IntlChar::charName($char) . PHP_EOL;
}
?>

Output : shows that surrogate code points are encoded as WTF-8 (Wobbly Transformation Format − 8-bit) : Output ：显示代理代码点被编码为WTF-8（摇摆变换格式 - 8 位）：

71249878parsing.php

 Current PHP version: 8.1.2; internal_encoding: UTF-8 ař׿€���� (7 chars in 18 bytes) 7 61 (1 bytes) 61 ->97<- 0x61 a LATIN SMALL LETTER A c599 (2 bytes) c5,99 ->345<- 0x159 ř LATIN SMALL LETTER R WITH CARON d7bf (2 bytes) d7,bf ->1535<- 0x5ff ׿ e282ac (3 bytes) e2,82,ac ->8364<- 0x20ac € EURO SIGN f09f9880 (4 bytes) f0,9f,98,80 ->128512<- 0x1f600 GRINNING FACE eda0bd (3 bytes) ed,a0,bd -><- 0x0 �� edb880 (3 bytes) ed,b8,80 -><- 0x0 ��

Now we can write the following functions and combine them to get desired number:现在我们可以编写以下函数并将它们组合起来以获得所需的数字：

function CodepointFromWTF8 : decode from well-formed WTF-8 to code points ; function CodepointFromWTF8 ：从格式良好的WTF-8解码为代码点；
function CodepointFromSurrogates : decode from potentially ill-formed UTF-16 to code points . function CodepointFromSurrogates ：从可能格式错误的UTF-16解码为代码点。 The following formula should suffice for a well-formed UTF-16 surrogate pair : codepoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)以下公式足以满足格式正确的UTF-16代理对： codepoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

BTW: Tested using "ař\u{05FF}€\u{D83D}\u{DE00}" sample string where characters are as follows (column CodePoint contains Unicode ( U+hhhh ) and WTF-8 bytes and column Description contains surrogates in parentheses, if apply):顺便说一句：使用"ař\u{05FF}€\u{D83D}\u{DE00}"示例字符串进行测试，其中字符如下（ CodePoint列包含 Unicode（ U+hhhh ）和WTF-8字节， Description列包含代理项在括号中，如果适用）：

Char CodePoint                      Description
---- ---------                      -----------
   a {U+0061, 0x61}                 Latin Small Letter A
   ř {U+0159, 0xC5,0x99}            Latin Small Letter R With Caron
   ׿ {U+05FF, 0xD7,0xBF}            Undefined
   € {U+20AC, 0xE2,0x82,0xAC}       Euro Sign
  😀 {U+1F600, 0xF0,0x9F,0x98,0x80} GRINNING FACE (0xd83d,0xde00)
     {U+D83D, 0xED,0xA0,0xBD}       Non Private Use High Surrogate
     {U+DE00, 0xED,0xB8,0x80}       Low Surrogate

Edit编辑

Here's full simplified solution (I don't follow Userland Naming Guide arbitrarily mixing snake_case , camelCase and PascalCase rules, sorry):这是完整的简化解决方案（我不遵循Userland Naming Guide任意混合snake_case ， camelCase和PascalCase规则，抱歉）：

<?php

function CodepointFromWTF8($ch) {
    $Bytes = str_split($ch);
    switch (strlen($ch)) {
    case 1:
        $retval = ord($Bytes[0]) & 0x7F;
        break;
    case 2:
        $retval = (ord(($Bytes[0]) & 0x1F) << 6) +
                  (ord($Bytes[1]) & 0x3F);
        break;
    case 3:
        $retval = ((ord($Bytes[0]) & 0x0F) << 12) +
                  ((ord($Bytes[1]) & 0x3F) << 6)  +
                   (ord($Bytes[2]) & 0x3F);
        break;
    case 4:
        $retval = ((ord($Bytes[0]) & 0x07) << 18) +
                  ((ord($Bytes[1]) & 0x3F) << 12) +
                  ((ord($Bytes[2]) & 0x3F) << 6)  +
                   (ord($Bytes[3]) & 0x3F);
        break;
    default:
        $retval = 0;
    }
    return $retval;
}

Function IsHighSurrogate ($cp) {
    return ($cp >= 0xD800 and  $cp <= 0xDBFF);
}

Function IsLowSurrogate ($cp) {
    return ($cp >= 0xDC00 and  $cp <= 0xDFFF);
}

Function CodepointFromSurrogates($Surrogates) {
    $cps = array();
    $cpsc = count($Surrogates);
    for ( $ii = 0; $ii < $cpsc; $ii+=1 ) {
        if ( ( IsHighSurrogate( $Surrogates[$ii]) ) and
             ( 1 + $ii ) < $cpsc and
             ( IsLowSurrogate( $Surrogates[1+$ii]) ) )
        {
            array_push($cps, 0x10000 + (
                ($Surrogates[$ii] - 0xD800) << 10) + ($Surrogates[$ii+1] - 0xDC00) );
            $ii+=1;
        } else {
            array_push($cps, $Surrogates[$ii] );
        }
    }
    return $cps;
}

$sp = "ař\u{05FF}€😀\u{D83D}\u{DE00}";
echo 'Current PHP version: ' . phpversion() .
     '; internal_encoding: ' . mb_internal_encoding() . PHP_EOL . PHP_EOL;
echo $sp . ' (' . strval(mb_strlen($sp)) . ' chars in ' . 
                  strval(strlen($sp)) . ' bytes)' . PHP_EOL . PHP_EOL;

$sp_array = mb_str_split($sp, 1, mb_internal_encoding() );
echo strval( sizeof($sp_array)) . PHP_EOL;
$sp_codepoints = array();
foreach ($sp_array as $char) {
    $cars = bin2hex($char);
    $charord = mb_ord($char);
//     echo str_pad($cars, 8, ' ', STR_PAD_LEFT) .
//         ' (' . strval(strlen($char)) . ' bytes) ' .
//         str_pad(implode(",", mb_str_split($cars, 2) ), 12) .   // WTF-8
//         str_pad('->' . strval($charord) . '<-', 10) . 
//         ' 0x' . str_pad( dechex( $charord), 6) .
//         str_pad( gettype($charord), 8) .
//         $char . "\t" . IntlChar::charName($char) . 
//         PHP_EOL;
    if (gettype($charord) == "boolean") {
        $charord = CodepointFromWTF8($char);
    }
    array_push($sp_codepoints, $charord);
}
print_r($sp_codepoints);

$sp_codepoints_real = CodepointFromSurrogates($sp_codepoints);
print_r($sp_codepoints_real);
?>

Result : 71249878.php结果： 71249878.php

Current PHP version: 8.1.2; internal_encoding: UTF-8

ař׿€😀���� (7 chars in 18 bytes)

7
Array
(
    [0] => 97
    [1] => 345
    [2] => 1535
    [3] => 8364
    [4] => 128512
    [5] => 55357
    [6] => 56832
)
Array
(
    [0] => 97
    [1] => 345
    [2] => 1535
    [3] => 8364
    [4] => 128512
    [5] => 128512
)

如何从 PHP 中的 utf8mb4 字符（又名“\u{D83D}\u{DE00}”）获取 integer

问题描述

1 个解决方案

解决方案1
0 2022-02-26 22:01:12

Edit编辑

如何从 PHP 中的 utf8mb4 字符（又名“\u{D83D}\u{DE00}”）获取 integer

问题描述

1 个解决方案

解决方案1 0 2022-02-26 22:01:12

Edit编辑

解决方案1
0 2022-02-26 22:01:12