Convert Unicode to HTML entities (hex)

Question

How do I convert a Unicode string to HTML entities? (Hex, not decimal.)

For example, convert Français to Fran&#xE7;ais.

Answer 1

For the hex encoding that is missing from the related question:

$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
    list($utf8) = $match;
    $binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
    $entity = vsprintf('&#x%X;', unpack('N', $binary));
    return $entity;
}, $input);

This is similar to @Baba's answer in that it uses UTF-32BE, followed by unpack and vsprintf to handle the formatting.

If you prefer iconv over mb_convert_encoding, the code is similar:

$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
    list($utf8) = $match;
    $binary = iconv('UTF-8', 'UTF-32BE', $utf8);
    $entity = vsprintf('&#x%X;', unpack('N', $binary));
    return $entity;
}, $input);

I find this string manipulation a bit clearer than the one in "Get hexcode of html entity".
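On PHP 7.2 or later, the same conversion can be sketched with mb_ord(), which returns the code point directly and skips the UTF-32BE round trip. The function name to_hex_entities is hypothetical and not part of the answer above; the sketch assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper (the name is not from the original answer).
// Assumes PHP >= 7.2 with mbstring. Returns null if $input is not valid UTF-8,
// because preg_replace_callback() fails on such input under the /u modifier.
function to_hex_entities(string $input): ?string
{
    return preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
        // mb_ord() gives the Unicode code point of the matched character.
        return sprintf('&#x%X;', mb_ord($match[0], 'UTF-8'));
    }, $input);
}

echo to_hex_entities('Français'), "\n"; // Fran&#xE7;ais
```

ASCII characters are outside the matched range, so they pass through unchanged.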

Answer 2

Your string looks like it is UCS-4 encoded; you can try

$first = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($m) {
    $char = current($m);
    $utf = iconv('UTF-8', 'UCS-4', $char);
    return sprintf("&#x%s;", ltrim(strtoupper(bin2hex($utf)), "0"));
}, $string);

Output

string 'Fran&#xE7;ais' (length=13)
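To make the steps inside the callback concrete, here is a sketch that traces a single character through them. It assumes the iconv extension is available, and it spells the encoding as 'UCS-4BE' to make the big-endian byte order explicit, whereas the answer above uses plain 'UCS-4'.

```php
<?php
// Trace one character through the conversion steps (assumes ext/iconv).
$char  = 'ç';                                // U+00E7, two bytes in UTF-8
$ucs4  = iconv('UTF-8', 'UCS-4BE', $char);   // four bytes, big-endian
$hex   = bin2hex($ucs4);                     // "000000e7"
$short = ltrim(strtoupper($hex), '0');       // "E7"

echo sprintf('&#x%s;', $short), "\n";        // &#xE7;
```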

Answer 3

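First, when I recently ran into this problem, I solved it by making sure my code files, database connection, and database tables were all UTF-8. After that, simply echoing the text works. If you must escape output from the database, use htmlspecialchars() rather than htmlentities(), so that no attempt is made to escape UTF-8 symbols.

I want to document an alternative solution because it solved a similar problem for me. I was using PHP's utf8_encode() to escape "special" characters.

I wanted to convert them into HTML entities for display. I wrote this code because I wanted to avoid iconv and similar functions as far as possible, since not every environment necessarily has them (please correct me if that is not the case!).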
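The difference between the two functions shows up in a short sketch. It assumes the string is UTF-8 and that the page is served as UTF-8.

```php
<?php
$text = 'Français <b>';

// htmlspecialchars() escapes only &, <, > and quotes; "ç" passes through.
echo htmlspecialchars($text, ENT_QUOTES, 'UTF-8'), "\n"; // Français &lt;b&gt;

// htmlentities() also replaces characters that have a named entity.
echo htmlentities($text, ENT_QUOTES, 'UTF-8'), "\n";     // Fran&ccedil;ais &lt;b&gt;
```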

$foo = 'This is my test string \u03b50';
echo unicode2html($foo);

function unicode2html($string) {
    // Match a literal backslash, "u", and exactly four hexadecimal digits.
    return preg_replace('/\\\\u([0-9a-f]{4})/i', '&#x$1;', $string);
}

Hope this helps someone who needs it :-)
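As a sketch of where such \uXXXX escapes commonly come from: json_encode() produces them for non-ASCII characters by default. The helper name escapes_to_entities is hypothetical. Note that json_encode() writes characters outside the Basic Multilingual Plane as surrogate pairs, which this simple pattern would turn into two invalid entities; that case is not handled here.

```php
<?php
// Hypothetical helper: rewrite \uXXXX escapes as hexadecimal entities.
function escapes_to_entities(string $string): string
{
    return preg_replace('/\\\\u([0-9a-fA-F]{4})/', '&#x$1;', $string);
}

// json_encode() escapes non-ASCII characters as \uXXXX by default.
$json    = json_encode('tchüß');   // "tch\u00fc\u00df", including the quotes
$escaped = trim($json, '"');

echo escapes_to_entities($escaped), "\n"; // tch&#x00fc;&#x00df;
```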

Answer 4

See "How to get the character from unicode code point in PHP?" for some code that lets you do the following:

Example use:

echo "Get string from numeric DEC value\n";
var_dump(mb_chr(50319, 'UCS-4BE'));
var_dump(mb_chr(271));

echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0xC48F, 'UCS-4BE'));
var_dump(mb_chr(0x010F));

echo "\nGet numeric value of character as DEC int\n";
var_dump(mb_ord('ď', 'UCS-4BE'));
var_dump(mb_ord('ď'));

echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('ď', 'UCS-4BE')));
var_dump(dechex(mb_ord('ď')));

echo "\nEncode / decode to DEC based HTML entities\n";
var_dump(mb_htmlentities('tchüß', false));
var_dump(mb_html_entity_decode('tch&#252;&#223;'));

echo "\nEncode / decode to HEX based HTML entities\n";
var_dump(mb_htmlentities('tchüß'));
var_dump(mb_html_entity_decode('tch&#xFC;&#xDF;'));

echo "\nUse JSON encoding / decoding\n";
var_dump(codepoint_encode("tchüß"));
var_dump(codepoint_decode('tch\u00fc\u00df'));

Output:

Get string from numeric DEC value
string(4) "ď"
string(2) "ď"

Get string from numeric HEX value
string(4) "ď"
string(2) "ď"

Get numeric value of character as DEC int
int(50319)
int(271)

Get numeric value of character as HEX string
string(4) "c48f"
string(3) "10f"

Encode / decode to DEC based HTML entities
string(15) "tch&#252;&#223;"
string(7) "tchüß"

Encode / decode to HEX based HTML entities
string(15) "tch&#xFC;&#xDF;"
string(7) "tchüß"

Use JSON encoding / decoding
string(15) "tch\u00fc\u00df"
string(7) "tchüß"
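Since PHP 7.2, mbstring ships its own mb_ord() and mb_chr(), which work on code points rather than on raw encoded bytes. The other helpers used above, such as mb_htmlentities and codepoint_encode, are not built in. A minimal sketch with the built-ins only, assuming PHP 7.2+ and mbstring:

```php
<?php
$cp = mb_ord('ď', 'UTF-8');           // code point as an int
var_dump($cp);                         // int(271)
var_dump(dechex($cp));                 // string(3) "10f"
var_dump(mb_chr(0x010F, 'UTF-8'));     // string(2) "ď"

// Building a hexadecimal entity from the code point:
var_dump(sprintf('&#x%X;', $cp));      // string(7) "&#x10F;"
```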

Answer 5

You can also use mb_encode_numericentity, which is supported by PHP 4.0.6+ (see the PHP documentation).

function unicode2html($value) {
    return mb_encode_numericentity($value, [
    //  start codepoint
    //  |       end codepoint
    //  |       |       offset
    //  |       |       |       mask
        0x0000, 0x001F, 0x0000, 0xFFFF,
        0x0021, 0x002C, 0x0000, 0xFFFF,
        0x002E, 0x002F, 0x0000, 0xFFFF,
        0x003C, 0x003C, 0x0000, 0xFFFF,
        0x003E, 0x003E, 0x0000, 0xFFFF,
        0x0060, 0x0060, 0x0000, 0xFFFF,
        0x0080, 0xFFFF, 0x0000, 0xFFFF
    ], 'UTF-8', true);
}

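This way you can also specify which character ranges are converted to hexadecimal entities and which are left as plain characters.

Usage example: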

$input = array(
    '"Meno più, PIÙ o meno"',
    '\'ÀÌÙÒLÈ PERCHÉ perché è sempre così non si sà\'',
    '<script>alert("XSS");</script>',
    '"`'
);

$output = array();
foreach ($input as $str) {
    $output[] = unicode2html($str);
}

Result:

$output = array(
    '&#x22;Meno pi&#xF9;&#x2C; PI&#xD9; o meno&#x22;',
    '&#x27;&#xC0;&#xCC;&#xD9;&#xD2;L&#xC8; PERCH&#xC9; perch&#xE9; &#xE8; sempre cos&#xEC; non si s&#xE0;&#x27;',
    '&#x3C;script&#x3E;alert&#x28;&#x22;XSS&#x22;&#x29;;&#x3C;&#x2F;script&#x3E;',
    '&#x22;&#x60;'
);
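If only non-ASCII characters need encoding, the conversion map can shrink to a single range. The sketch below is a variant under stated assumptions, not the answer's own code: the name non_ascii_to_hex is hypothetical, and the range is extended to U+10FFFF so that characters beyond U+FFFF, which the map above leaves untouched, are covered as well.

```php
<?php
// Hypothetical variant: encode every code point from U+0080 to U+10FFFF.
function non_ascii_to_hex(string $value): string
{
    // start, end, offset, mask (the mask is wide enough for 21-bit code points)
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($value, $map, 'UTF-8', true);
}

// ASCII, including the markup characters, is left as-is.
echo non_ascii_to_hex('Français <b>'), "\n";
```

Because ASCII is left alone here, this variant does not escape markup characters such as < and >; combine it with htmlspecialchars() if the output is untrusted.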

Answer 6

This is a solution similar to @hakre's (Nov 8, 2012 at 0:35), but it produces HTML entity names:

$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
    list($utf8) = $match;
    $char = htmlentities($utf8, ENT_HTML5 | ENT_IGNORE);
    if ($char[0]!=='&' || (strlen($char)<2)) {
        $binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
        $char = vsprintf('&#x%X;', unpack('N', $binary));
    } // (else $char is "&entity;", which is better)
    return $char;
}, $input);

$input = "Ob\xC3\xB3z w\xC4\x99drowny Ko\xC5\x82a";
// => $output: "Ob&oacute;z w&eogon;drowny Ko&lstrok;a"
// whereas both @hakre's and @Baba's code give:
// => $output: "Ob&#xF3;z w&#x119;drowny Ko&#x142;a"

But there is always a problem with invalid UTF-8, for example:

$input = "Ob\xC3\xB3z w\xC4\x99drowny Ko\xC5\x82a - ok\xB3adka";
// means "Ob&oacute;z w&eogon;drowny Ko&lstrok;a - ok&lstrok;adka" in html ("\xB3" is ISO-8859-2/windows-1250)

but here the result is:

// => $output: (empty)

And the same happens with @hakre's code... :(
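The empty result comes from preg_replace_callback(), which returns null when the subject is not valid UTF-8 under the /u modifier. One possible workaround, sketched here on the assumption that PHP 7.2+ with mbstring is available, is to scrub the input first with mb_scrub(), which replaces invalid byte sequences with a substitute character. The stray byte is lost rather than reinterpreted as ISO-8859-2, so this is damage control and not a real fix. The function name safe_hex_entities is hypothetical.

```php
<?php
// Hypothetical wrapper: scrub invalid UTF-8 before converting.
function safe_hex_entities(string $input): string
{
    if (!mb_check_encoding($input, 'UTF-8')) {
        // Replace invalid byte sequences with the substitute character.
        $input = mb_scrub($input, 'UTF-8');
    }
    return preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
        return sprintf('&#x%X;', mb_ord($match[0], 'UTF-8'));
    }, $input);
}

$input = "Ob\xC3\xB3z w\xC4\x99drowny Ko\xC5\x82a - ok\xB3adka";
echo safe_hex_entities($input), "\n";
```

The valid part of the string is converted as before, and only the lone "\xB3" byte is replaced.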

6 answers

Answer 1: 11, 2012-11-08 00:35:44
Answer 2: 8, accepted, 2012-11-08 00:15:58
Answer 3: 4, 2013-02-09 06:30:07
Answer 4: 0, 2014-07-15 17:08:27
Answer 5: 0, 2021-07-21 15:01:11
Answer 6: 0, 2023-01-11 05:36:57
