Is it true that string literals in PHP can only be encoded in an encoding which is a compatible superset of ASCII, such as UTF-8 or ISO-8859-1?

I came across the following text on the Details of the String Type page of the PHP Manual:

Given that PHP does not dictate a specific encoding for strings, one might wonder how string literals are encoded. String will be encoded in whatever fashion it is encoded in the script file. Thus, if the script is written in ISO-8859-1, the string will be encoded in ISO-8859-1 and so on. However, this does not apply if Zend Multibyte is enabled; in that case, the script may be written in an arbitrary encoding (which is explicitly declared or is detected) and then converted to a certain internal encoding, which is then the encoding that will be used for the string literals. Note that there are some constraints on the encoding of the script (or on the internal encoding, should Zend Multibyte be enabled) – this almost always means that this encoding should be a compatible superset of ASCII, such as UTF-8 or ISO-8859-1.

So my doubt is, is it true that string literals in PHP can only be encoded in an encoding which is a compatible superset of ASCII, such as UTF-8 or ISO-8859-1 and not in an encoding which is not a compatible superset of ASCII?

Is it possible to encode string literals in PHP in some non-ASCII compatible encoding like UTF-16, UTF-32 or some other such non-ASCII compatible encoding? If yes then will the strings literals encoded in such one of the non-ASCII compatible encoding work with mb_string_* functions? If no, then what's the reason?

Suppose, Zend Multibyte is enabled and I've set the internal encoding to a compatible superset of ASCII, such as UTF-8 or ISO-8859-1 or some other non-ASCII compatible encoding. Now, can I declare the encoding which is not a compatible superset of ASCII, such as UTF-16 or UTF-32 in the script file?

If yes, then in this case what encoding the string literals would get encoded in? If no, then what's the reason?

Also, explain me how does this encoding thing work for string literals if Zend Multibyte is enabled?

How to enable the Zend Multibyte? What's the main intention behind turning it On? When it is required to turn it On?

It would be better if you could clear my doubts accompanied by suitable examples.

Thank You.

String literals in PHP source code files are taken literally as the raw bytes which are present in the source code file. If you have bytes in your source code which represent a UTF-16 string or anything else really, then you can use them directly:

$ echo -n '<?php echo "' > test.php
$ echo -n 日本語 | iconv -t UTF-16 >> test.php 
$ echo '";' >> test.php 
$ cat test.php 
<?php echo "??e?g,??";
$ cat test.php | xxd
00000000: 3c3f 7068 7020 6563 686f 2022 feff 65e5  <?php echo "..e.
00000010: 672c 8a9e 223b 0a                        g,..";.
$ php test.php 
??e?g,??$ 
$ php test.php | iconv -f UTF-16
日本語

This demonstrates a source code file ostensibly written in ASCII, but containing a UTF-16 string literal in the middle, which is output as is.

The bigger problem with this kind of source code is that it's difficult to work with. It's somewhere between a pain in the neck and impossible to get a text editor to treat the PHP code in one encoding and string literals in another. So typically, you want to keep the entire source code, including string literals, in one and the same encoding throughout.

You can also easily get into trouble:

$ echo -n '<?php echo "' > test.php
$ echo -n 漢字 | iconv -t UTF-16 >> test.php 
$ echo '";' >> test.php 
$ cat test.php | xxd
00000000: 3c3f 7068 7020 6563 686f 2022 feff 6f22  <?php echo "..o"
00000010: 5b57 223b 0a                             [W";.

"漢字" here is encoded to feff 6f22 5b57, which contains the byte 22, i.e. ", the string literal terminator. The literal therefore ends early and you now have a syntax error.

By default the PHP interpreter expects the PHP code to be ASCII compatible, so if you want to keep your string literals and the rest of the source code in the same encoding, you're pretty much limited to ASCII compatible encodings. However, the Zend Multibyte feature allows you to use other encodings if you declare the used encoding accordingly (in php.ini if it's not ASCII compatible). So you could write your source code in, say, Shift-JIS throughout; probably even with string literals in some other encoding*.

* (At which point I'll quit going into details because what is wrong with you?!)

Summary:

  • PHP must understand all the PHP code; by default it understands ASCII, with Zend Multibyte it can understand other encodings as well.
  • The string literals in your source code can contain any bytes you want, as long as PHP doesn't interpret them as special characters in the string literal (e.g. the 22 example above), in which case you need to escape them (with a backslash in the encoding of the general source code).
  • The string value at runtime will be the raw byte sequence PHP read from the string literal.
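
The second point can be checked mechanically by scanning the encoded bytes for values that PHP treats specially inside a double-quoted literal. What follows is a minimal sketch in Python rather than PHP, so it runs without a PHP build. The helper name special_bytes and the choice of three bytes (", $ and \) are assumptions made for this illustration, not an exhaustive list of what the PHP lexer reacts to.

```python
# Bytes that PHP gives special meaning inside a double-quoted literal
# (an illustrative subset, not the full lexer rules).
SPECIAL = {0x22: '"', 0x24: "$", 0x5C: "\\"}


def special_bytes(text, encoding="utf-16-be"):
    """Return (offset, byte) pairs for bytes that collide with PHP string syntax."""
    raw = text.encode(encoding)
    return [(i, b) for i, b in enumerate(raw) if b in SPECIAL]


print(special_bytes("漢字"))    # [(1, 34)]  -> byte 0x22 inside U+6F22
print(special_bytes("日本語"))  # []         -> safe to paste as raw UTF-16
```

An empty result only means that none of these three bytes occur; it does not make mixed-encoding source files any easier to edit.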

Having said all this, it is typically a pain in the neck to diverge from ASCII compatible encodings. It's a pain in text editors and easily leads to mojibake if some tool in your workflow is treating the file incorrectly. I'd advise sticking to ASCII-compatible encodings, e.g.:

echo "日本語";  // UTF-8 encoded (let's hope)

If you must have a non-ASCII-compatible string literal, you should use byte notation:

echo "\xfe\xff\x65\xe5\x67\x2c\x8a\x9e";

Or conversion:

echo iconv('UTF-8', 'UTF-16', '日本語');
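
As a sanity check on the byte notation, this short Python sketch (Python is used only for illustration) confirms that the escaped bytes are the big-endian UTF-16 encoding of 日本語 preceded by a byte order mark. The byte order and BOM that iconv emits for plain 'UTF-16' may differ between platforms, so the converted form is not guaranteed to be byte-identical to the escaped form.

```python
# The bytes written with \x escapes in the PHP literal.
literal = b"\xfe\xff\x65\xe5\x67\x2c\x8a\x9e"

# BOM (FE FF) followed by the big-endian UTF-16 code units of the text.
expected = b"\xfe\xff" + "日本語".encode("utf-16-be")

print(literal == expected)       # True
print(literal.decode("utf-16"))  # 日本語 (the BOM selects big-endian and is dropped)
```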

[..] will the strings literals encoded in such one of the non-ASCII compatible encoding work with mb_string_* functions?

Sure, strings in PHP are raw byte arrays for all intents and purposes. It doesn't matter how you obtained that string. If you have a UTF-16 string obtained with any of the methods demonstrated above, including by hardcoding it in UTF-16 into the source code, you have a UTF-16 encoded string and you can put that through any and all string functions that know how to deal with it.
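
The same idea can be illustrated outside PHP. In this Python sketch (an analogy only; describe is a hypothetical helper), the same six bytes yield a different character count depending on which encoding the caller declares. That is why encoding-aware functions such as mb_strlen have to be told the right encoding.

```python
def describe(raw, encoding):
    """Return (byte length, character count when the bytes are read as `encoding`)."""
    return len(raw), len(raw.decode(encoding))


raw = "日本語".encode("utf-16-be")   # 65 e5 67 2c 8a 9e

print(describe(raw, "utf-16-be"))  # (6, 3)  correct hint: three characters
print(describe(raw, "latin-1"))    # (6, 6)  wrong hint: every byte counted as a character
```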

So my doubt is, is it true that string literals in PHP can only be encoded in an encoding which is a compatible superset of ASCII, such as UTF-8 or ISO-8859-1 and not in an encoding which is not a compatible superset of ASCII?

It's not true.

Is it possible to encode string literals in PHP in some non-ASCII compatible encoding like UTF-16, UTF-32 or some other such non-ASCII compatible encoding? If yes then will the strings literals encoded in such one of the non-ASCII compatible encoding work with mb_string_* functions? If no, then what's the reason?

As @deceze says, you can easily convert a string to the encoding you want via mb_convert_encoding or iconv.

According to the Details of the String Type page in the PHP Manual, a string will be encoded in whatever fashion it is encoded in the script file. PHP built with Zend Multibyte support and the mbstring extension can parse and run PHP files encoded in a non-ASCII-compatible encoding such as UTF-16; see the tests in Zend/tests/multibyte.

Zend/tests/multibyte/multibyte_encoding_003.phpt demonstrates running a source file encoded in UTF-16 LE that outputs Hello World correctly.

Zend/tests/multibyte/multibyte_encoding_003.phpt

--TEST--
Zend Multibyte and UTF-16 BOM
--SKIPIF--
<?php
if (!in_array("zend.detect_unicode", array_keys(ini_get_all()))) {
  die("skip Requires configure --enable-zend-multibyte option");
}
if (!extension_loaded("mbstring")) {
  die("skip Requires mbstring extension");
}
?>
--INI--
zend.multibyte=1
mbstring.internal_encoding=iso-8859-1
--FILE--
<?php
print "Hello World\n";
?>
===DONE===

--EXPECT--
Hello World
===DONE===

$ run-tests.php --keep-php --show-out --show-php Zend/tests/multibyte/multibyte_encoding_003.phpt

 ... skip some trivial message ...
Running selected tests.
TEST 1/1 [multibyte_encoding_003.phpt]
========TEST========
<?php
print "Hello World\n";
?>
===DONE===
========DONE========

========OUT========
Hello World
===DONE===
========DONE========
PASS Zend Multibyte and UTF-16 BOM [multibyte_encoding_003.phpt]
=====================================================================
Number of tests :    1                 1
Tests skipped   :    0 (  0.0%) --------
Tests warned    :    0 (  0.0%) (  0.0%)
Tests failed    :    0 (  0.0%) (  0.0%)
Expected fail   :    0 (  0.0%) (  0.0%)
Tests passed    :    1 (100.0%) (100.0%)
---------------------------------------------------------------------
Time taken      :    0 seconds
=====================================================================

$ file multibyte_encoding_003.php

multibyte_encoding_003.php: PHP script text, Little-endian UTF-16 Unicode text

Another example is Zend/tests/multibyte/multibyte_encoding_004.phpt, which runs a source file encoded in Shift JIS.

Zend/tests/multibyte/multibyte_encoding_004.phpt (Note: some Japanese characters are not displayed correctly because encodings are mixed in one file and LC_MESSAGE is set to UTF-8)

--TEST--
test for mbstring script_encoding for flex unsafe encoding (Shift_JIS)
--SKIPIF--
<?php
if (!in_array("zend.detect_unicode", array_keys(ini_get_all()))) {
  die("skip Requires configure --enable-zend-multibyte option");
}
if (!extension_loaded("mbstring")) {
  die("skip Requires mbstring extension");
}
?>
--INI--
zend.multibyte=1
zend.script_encoding=Shift_JIS
mbstring.internal_encoding=Shift_JIS
--FILE--
<?php
        function \\\($)
        {
                echo $;
        }

        \\\("h~t@\");
?>
--EXPECT--
h~t@\

$ run-tests.php --keep-php --show-out --show-php ./multibyte_encoding_004.phpt

 ... skip some trivial message ...
Running selected tests.
TEST 1/1 [multibyte_encoding_004.phpt]
========TEST========
<?php
        function \\\($)
        {
                echo $;
        }

        \\\("h~t@\");
?>
========DONE========

========OUT========
h~t@\
========DONE========
PASS test for mbstring script_encoding for flex unsafe encoding (Shift_JIS) [multibyte_encoding_004.phpt]
=====================================================================
Number of tests :    1                 1
Tests skipped   :    0 (  0.0%) --------
Tests warned    :    0 (  0.0%) (  0.0%)
Tests failed    :    0 (  0.0%) (  0.0%)
Expected fail   :    0 (  0.0%) (  0.0%)
Tests passed    :    1 (100.0%) (100.0%)
---------------------------------------------------------------------
Time taken      :    0 seconds
=====================================================================

$ file Zend/tests/multibyte/multibyte_encoding_004.php

multibyte_encoding_004.php: PHP script text, Non-ISO extended-ASCII text

$ cat Zend/tests/multibyte/multibyte_encoding_004.php | iconv -f SJIS -t utf-8

<?php
        function 予蚕能($引数)
        {
                echo $引数;
        }

        予蚕能("ドレミファソ");
?>

Is it possible to encode string literals in PHP in some non-ASCII compatible encoding like UTF-16, UTF-32 or some other such non-ASCII compatible encoding? If yes then will the strings literals encoded in such one of the non-ASCII compatible encoding work with mb_string_* functions? If no, then what's the reason?

The answer to the first question is yes; the Zend Multibyte tests demonstrate this convincingly. The answer to the second question is also yes, provided you give the correct encoding hints to the mb_string_* functions.

Suppose, Zend Multibyte is enabled and I've set the internal encoding to a compatible superset of ASCII, such as UTF-8 or ISO-8859-1 or some other non-ASCII compatible encoding. Now, can I declare the encoding which is not a compatible superset of ASCII, such as UTF-16 or UTF-32 in the script file?

If yes, then in this case what encoding the string literals would get encoded in? If no, then what's the reason?

Yes. The output generated by the second command below is UTF-32 encoded (each character is represented by 4 bytes):

$ echo -e '<?php\necho "Hello 中文";' | php  | hexdump -C
00000000  48 65 6c 6c 6f 20 e4 b8  ad e6 96 87              |Hello ......|
0000000c

$ echo -e '<?php\necho "Hello 中文";' | iconv -t utf-16 | php -d zend.multibyte=1 -d zend.script_encoding=UTF-16 -d mbstring.internal_encoding=UTF-32 | hexdump -C
00000000  00 00 00 48 00 00 00 65  00 00 00 6c 00 00 00 6c  |...H...e...l...l|
00000010  00 00 00 6f 00 00 00 20  00 00 4e 2d 00 00 65 87  |...o... ..N-..e.|
00000020
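
The two dumps can be reproduced without PHP. The following Python sketch only mimics the conversion step (script bytes decoded from the script encoding, then re-encoded in the internal encoding); it is not how the Zend engine implements it, and big-endian UTF-32 is assumed because that is what the hexdump shows.

```python
text = "Hello 中文"

# First command: the literal stays in the script's own encoding (UTF-8).
utf8 = text.encode("utf-8")
print(utf8.hex())                     # 48656c6c6f20e4b8ade69687

# Second command: the script arrives as UTF-16 and the literal ends up as UTF-32.
script_bytes = text.encode("utf-16")  # includes a byte order mark
internal = script_bytes.decode("utf-16").encode("utf-32-be")

print(len(utf8), len(internal))       # 12 32
print(internal[-8:].hex())            # 00004e2d00006587
```

The lengths match the offsets at the end of each hexdump (0x0c and 0x20), and the last eight bytes are the code points U+4E2D and U+6587 padded to four bytes each.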

Also, explain me how does this encoding thing work for string literals if Zend Multibyte is enabled?

The Zend Multibyte feature is implemented in Zend/zend_multibyte.c. It lets the Zend engine handle source encodings other than ASCII and UTF-8, but it is only an interface for encoding operations: the default implementation consists of dummy functions, and the real implementation is provided by the mbstring extension. The mbstring extension must therefore be loaded to get multibyte support.

$ php -m | grep mbstring
mbstring
$ php -n -m | grep mbstring # -n: no configuration (ini) files are used, so mbstring is not loaded here
$ echo -e '<?php\n echo "Hello 中文\n"; ' | iconv -t utf-16 | php -n -d zend.multibyte=1

Fatal error: Could not convert the script from the detected encoding "UTF-32LE" to a compatible encoding in Unknown on line 0

How to enable the Zend Multibyte? What's the main intention behind turning it On? When it is required to turn it On?

Declaring zend.multibyte=1 in php.ini enables parsing of source files in multibyte encodings. You can also pass -d zend.multibyte=1 to the PHP CLI executable, as in the examples above, to enable multibyte support in the Zend engine.

How to enable the Zend Multibyte?

Compile PHP with the --enable-zend-multibyte flag (only needed before PHP 5.4) and activate the zend.multibyte setting in php.ini.

Cf. https://secure.php.net/manual/en/ini.core.php#ini.zend.multibyte and https://secure.php.net/manual/en/configure.about.php#configure.options.php
