[英]How to get an integer from an utf8mb4 character (😀 aka "\u{D83D}\u{DE00}") in PHP
I want to write a saslprep algorithm with php (I know there is a lib, I want to do it myself).我想用 php 写一个 saslprep 算法(我知道有一个 lib,我想自己做)。 One of my unit tests failes because the test vector
"\u{D83D}\u{DE00}"
aka我的一个单元测试失败了,因为测试向量
"\u{D83D}\u{DE00}"
又名fails to convert to code points (array of integer).
无法转换为代码点(整数数组)。
echo mb_ord("\u{D83D}\u{DE00}","UTF-32LE");
failes returning false失败返回错误
iconv("UTF-8","UTF-32LE","\u{D83D}\u{DE00}");
failes失败者
The expected result is 128512
预期结果是
128512
At first, let's analyze php
way of encoding:首先我们分析一下
php
的编码方式:
<?php
$sp = "ař\u{05FF}€😀\u{D83D}\u{DE00}";
echo 'Current PHP version: ' . phpversion() .
'; internal_encoding: ' . mb_internal_encoding() . PHP_EOL . PHP_EOL;
echo $sp . ' (' . strval(mb_strlen($sp)) . ' chars in ' .
strval(strlen($sp)) . ' bytes)' . PHP_EOL . PHP_EOL;
$sp_array = mb_str_split($sp, 1, mb_internal_encoding() );
echo strval( sizeof($sp_array)) . PHP_EOL;
foreach ($sp_array as $char) {
$cars = bin2hex($char);
echo str_pad($cars, 8, ' ', STR_PAD_LEFT) .
' (' . strval(strlen($char)) . ' bytes) ' .
str_pad(implode(",", mb_str_split($cars, 2) ), 12) . // WTF-8
str_pad('->' . strval(mb_ord($char)) . '<-', 10) .
' 0x' . str_pad( dechex( mb_ord($char)), 6) .
$char . "\t" . IntlChar::charName($char) . PHP_EOL;
}
?>
Output : shows that surrogate code points are encoded as WTF-8 (Wobbly Transformation Format − 8-bit) : Output :显示代理代码点被编码为WTF-8(摇摆变换格式 - 8 位) :
71249878parsing.php
Current PHP version: 8.1.2; internal_encoding: UTF-8 ař€���� (7 chars in 18 bytes) 7 61 (1 bytes) 61 ->97<- 0x61 a LATIN SMALL LETTER A c599 (2 bytes) c5,99 ->345<- 0x159 ř LATIN SMALL LETTER R WITH CARON d7bf (2 bytes) d7,bf ->1535<- 0x5ff e282ac (3 bytes) e2,82,ac ->8364<- 0x20ac € EURO SIGN f09f9880 (4 bytes) f0,9f,98,80 ->128512<- 0x1f600 GRINNING FACE eda0bd (3 bytes) ed,a0,bd -><- 0x0 �� edb880 (3 bytes) ed,b8,80 -><- 0x0 ��
Now we can write the following functions and combine them to get desired number:现在我们可以编写以下函数并将它们组合起来以获得所需的数字:
function CodepointFromWTF8
: decode from well-formed WTF-8
to code points ; function CodepointFromWTF8
:从格式良好的WTF-8
解码为代码点;function CodepointFromSurrogates
: decode from potentially ill-formed UTF-16
to code points . function CodepointFromSurrogates
: 从可能格式错误的UTF-16
解码为代码点。 The following formula should suffice for a well-formed UTF-16
surrogate pair : codepoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
UTF-16
代理对: codepoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
BTW: Tested using "ař\u{05FF}€\u{D83D}\u{DE00}"
sample string where characters are as follows (column CodePoint
contains Unicode ( U+hhhh
) and WTF-8
bytes and column Description
contains surrogates in parentheses, if apply):顺便说一句:使用
"ař\u{05FF}€\u{D83D}\u{DE00}"
示例字符串进行测试,其中字符如下( CodePoint
列包含 Unicode( U+hhhh
)和WTF-8
字节, Description
列包含代理项在括号中,如果适用):
Char CodePoint Description
---- --------- -----------
a {U+0061, 0x61} Latin Small Letter A
ř {U+0159, 0xC5,0x99} Latin Small Letter R With Caron
{U+05FF, 0xD7,0xBF} Undefined
€ {U+20AC, 0xE2,0x82,0xAC} Euro Sign
😀 {U+1F600, 0xF0,0x9F,0x98,0x80} GRINNING FACE (0xd83d,0xde00)
{U+D83D, 0xED,0xA0,0xBD} Non Private Use High Surrogate
{U+DE00, 0xED,0xB8,0x80} Low Surrogate
Here's full simplified solution (I don't follow Userland Naming Guide arbitrarily mixing snake_case , camelCase and PascalCase rules, sorry):这是完整的简化解决方案(我不遵循Userland Naming Guide任意混合snake_case , camelCase和PascalCase规则,抱歉):
<?php
function CodepointFromWTF8($ch) {
$Bytes = str_split($ch);
switch (strlen($ch)) {
case 1:
$retval = ord($Bytes[0]) & 0x7F;
break;
case 2:
$retval = (ord(($Bytes[0]) & 0x1F) << 6) +
(ord($Bytes[1]) & 0x3F);
break;
case 3:
$retval = ((ord($Bytes[0]) & 0x0F) << 12) +
((ord($Bytes[1]) & 0x3F) << 6) +
(ord($Bytes[2]) & 0x3F);
break;
case 4:
$retval = ((ord($Bytes[0]) & 0x07) << 18) +
((ord($Bytes[1]) & 0x3F) << 12) +
((ord($Bytes[2]) & 0x3F) << 6) +
(ord($Bytes[3]) & 0x3F);
break;
default:
$retval = 0;
}
return $retval;
}
Function IsHighSurrogate ($cp) {
return ($cp >= 0xD800 and $cp <= 0xDBFF);
}
Function IsLowSurrogate ($cp) {
return ($cp >= 0xDC00 and $cp <= 0xDFFF);
}
Function CodepointFromSurrogates($Surrogates) {
$cps = array();
$cpsc = count($Surrogates);
for ( $ii = 0; $ii < $cpsc; $ii+=1 ) {
if ( ( IsHighSurrogate( $Surrogates[$ii]) ) and
( 1 + $ii ) < $cpsc and
( IsLowSurrogate( $Surrogates[1+$ii]) ) )
{
array_push($cps, 0x10000 + (
($Surrogates[$ii] - 0xD800) << 10) + ($Surrogates[$ii+1] - 0xDC00) );
$ii+=1;
} else {
array_push($cps, $Surrogates[$ii] );
}
}
return $cps;
}
$sp = "ař\u{05FF}€😀\u{D83D}\u{DE00}";
echo 'Current PHP version: ' . phpversion() .
'; internal_encoding: ' . mb_internal_encoding() . PHP_EOL . PHP_EOL;
echo $sp . ' (' . strval(mb_strlen($sp)) . ' chars in ' .
strval(strlen($sp)) . ' bytes)' . PHP_EOL . PHP_EOL;
$sp_array = mb_str_split($sp, 1, mb_internal_encoding() );
echo strval( sizeof($sp_array)) . PHP_EOL;
$sp_codepoints = array();
foreach ($sp_array as $char) {
$cars = bin2hex($char);
$charord = mb_ord($char);
// echo str_pad($cars, 8, ' ', STR_PAD_LEFT) .
// ' (' . strval(strlen($char)) . ' bytes) ' .
// str_pad(implode(",", mb_str_split($cars, 2) ), 12) . // WTF-8
// str_pad('->' . strval($charord) . '<-', 10) .
// ' 0x' . str_pad( dechex( $charord), 6) .
// str_pad( gettype($charord), 8) .
// $char . "\t" . IntlChar::charName($char) .
// PHP_EOL;
if (gettype($charord) == "boolean") {
$charord = CodepointFromWTF8($char);
}
array_push($sp_codepoints, $charord);
}
print_r($sp_codepoints);
$sp_codepoints_real = CodepointFromSurrogates($sp_codepoints);
print_r($sp_codepoints_real);
?>
Result : 71249878.php
结果:
71249878.php
Current PHP version: 8.1.2; internal_encoding: UTF-8
ař€😀���� (7 chars in 18 bytes)
7
Array
(
[0] => 97
[1] => 345
[2] => 1535
[3] => 8364
[4] => 128512
[5] => 55357
[6] => 56832
)
Array
(
[0] => 97
[1] => 345
[2] => 1535
[3] => 8364
[4] => 128512
[5] => 128512
)
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.