
Convert ALL symbols to HTML entities

In PHP, the built-in functions don't seem to cover special and new symbols. I mean ALL of them, including the ones released 3 months ago. I'm looking to turn a string with mixed symbols such as:

𝕃𝕆𝕃 𝔯𝔬𝔠𝔰 𝓂𝓎 δϱж ☎

into

&#120131;&#120134;&#120131; &#120111;&#120108;&#120096;&#120112; &#120002;&#120014; &#948;&#1009;&#1078; &#9742;

(which the browser will render the same way)

I see this being done on the fly. We're talking about countless symbols here, and who knows how many more in the future.

How are they achieving this? Surely they don't really have a 1000+ key array of every single symbol and its entity?

I've been through all the related questions, with no luck so far.

This function converts every character (current and future) except [0-9A-Za-z ] to a numeric entity. The input is assumed to be UTF-8 encoded:

function html_entity_encode_all($s) {
    $out = '';
    for ($i = 0; isset($s[$i]); $i++) {
        // read UTF-8 bytes and decode to a Unicode codepoint value:
        $x = ord($s[$i]);
        if ($x < 0x80) {
            // single byte codepoints
            $codepoint = $x;
        } else {
            // multibyte codepoints
            if ($x >= 0xC2 && $x <= 0xDF) {
                $codepoint = $x & 0x1F;
                $length = 2;
            } else if ($x >= 0xE0 && $x <= 0xEF) {
                $codepoint = $x & 0x0F;
                $length = 3;
            } else if ($x >= 0xF0 && $x <= 0xF4) {
                $codepoint = $x & 0x07;
                $length = 4;
            } else {
                // invalid byte
                $codepoint = 0xFFFD;
                $length = 1;
            }
            // read continuation bytes of multibyte sequences:
            for ($j = 1; $j < $length; $j++, $i++) {
                if (!isset($s[$i + 1])) {
                    // invalid: string truncated in middle of multibyte sequence
                    $codepoint = 0xFFFD;
                    break;
                }
                $x = ord($s[$i + 1]);
                if (($x & 0xC0) != 0x80) {
                    // invalid: not a continuation byte
                    $codepoint = 0xFFFD;
                    break;
                }
                $codepoint = ($codepoint << 6) | ($x & 0x3F);
            }
            if (($codepoint > 0x10FFFF) ||
                ($length == 2 && $codepoint < 0x80) ||
                ($length == 3 && $codepoint < 0x800) ||
                ($length == 4 && $codepoint < 0x10000)) {
                // invalid: overlong encoding or out of range
                $codepoint = 0xFFFD;
            }
        }

        // have codepoint, now output:
        if (($codepoint >= 48 && $codepoint <= 57) ||
            ($codepoint >= 65 && $codepoint <= 90) ||
            ($codepoint >= 97 && $codepoint <= 122) ||
            ($codepoint == 32)) {
            // leave plain 0-9, A-Z, a-z, and space unencoded
            $out .= $s[$i];
        } else {
            // all others as numeric entities
            $out .= '&#' . $codepoint . ';';
        }
    }
    return $out;
}

For decoding, the standard function html_entity_decode can be used.
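As a rough illustration of the decoding step, the sketch below feeds two numeric entities back through html_entity_decode. The flags (ENT_QUOTES | ENT_HTML5) and the explicit 'UTF-8' charset are assumptions made for this example; the answer itself only names the function.

```php
<?php
// Numeric entities for U+1D543 (double-struck L), a space, and U+260E (telephone).
$encoded = '&#120131; &#9742;';

// The flags and the charset are choices made for this sketch, not taken from the answer.
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Prints the two original symbols separated by a space.
echo $decoded, "\n";
```

The decoded string is UTF-8, so the astral-plane character comes back as a four-byte sequence.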

"How are they achieving this? Surely they don't really have a 1000+ key array of every single symbol and its entity?"

They do in fact have a translation table, and it does contain all the symbols in your question (the table has more than 1500 entries :) ).
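A minimal way to look at such a table, assuming the one meant here is the table PHP exposes through get_html_translation_table (the answer's link did not survive, so this is a guess). The HTML5 flag set is also an assumption; other flag sets yield a smaller table.

```php
<?php
// Assumption: the "translation table" is the one returned by
// get_html_translation_table(); the flags below are illustrative choices.
$table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Number of character => named-entity pairs in the table.
echo count($table), " entries\n";

// Look up one symbol from the question.
echo $table['δ'] ?? '(not in table)', "\n";

// htmlentities() consults the same table when given the same flags.
echo htmlentities('δ ☎', ENT_QUOTES | ENT_HTML5, 'UTF-8'), "\n";
```

Note that this route only produces named entities: a character with no named entity is passed through unchanged, which is where the numeric approach differs.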


Simple: the encoding doesn't need any special knowledge. The input is a numerical character value (the code point), and the output is &#<decimal-value>;.
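That mapping can be sketched in a few lines. This is not the code from any of the answers: numeric_entities is a hypothetical helper name, and it assumes the mbstring extension and PHP 7.2 or later for mb_ord. It keeps [0-9A-Za-z ] as-is, mirroring the rule used earlier on this page.

```php
<?php
// Hypothetical helper; needs mbstring and PHP >= 7.2 (for mb_ord()).
function numeric_entities(string $s): string
{
    $result = preg_replace_callback(
        '/[^0-9A-Za-z ]/u',                     // one UTF-8 character per match
        static function (array $m): string {
            // Code point of the matched character, written in decimal.
            return '&#' . mb_ord($m[0], 'UTF-8') . ';';
        },
        $s
    );

    // preg_replace_callback() returns null on error, e.g. invalid UTF-8 input.
    return $result ?? '';
}

echo numeric_entities("δ ☎!"), "\n"; // &#948; &#9742;&#33;
```

Unlike the hand-written decoder above, this sketch returns an empty string for malformed UTF-8 rather than substituting U+FFFD, so it suits input that is already known to be valid.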
