UTF-8 all the way through

I'm setting up a new server and want to support UTF-8 fully in my web application. I have tried this in the past on existing servers and always seem to end up having to fall back to ISO-8859-1.

Where exactly do I need to set the encoding/charsets? I'm aware that I need to configure Apache, MySQL, and PHP to do this. Is there some standard checklist I can follow, or a way to troubleshoot where the mismatches occur?

This is for a new Linux server running MySQL 5, PHP 5, and Apache 2.

Data Storage:

  • Specify the utf8mb4 character set on all tables and text columns in your database. This makes MySQL physically store and retrieve values encoded natively in UTF-8. Note that MySQL will implicitly use utf8mb4 encoding if a utf8mb4_* collation is specified (without any explicit character set).

  • In older versions of MySQL (< 5.5.3), you'll unfortunately be forced to use simply utf8, which only supports a subset of Unicode characters (those in the Basic Multilingual Plane, at most three bytes each). I wish I were kidding.
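As a rough sketch of the storage step, the helpers below build the statements that would switch an existing database and table to utf8mb4. The function names, the database name app_db, the table name articles and the utf8mb4_unicode_ci collation are all assumptions for illustration; the code only builds SQL strings and leaves running them to whatever database layer you use.

```php
<?php
// Hypothetical helpers: they only build SQL strings, they do not run them.
// Identifiers are wrapped in backticks; pass trusted names only.

function utf8mb4DatabaseSql(string $database, string $collation = 'utf8mb4_unicode_ci'): string
{
    return sprintf(
        'ALTER DATABASE `%s` CHARACTER SET utf8mb4 COLLATE %s',
        $database,
        $collation
    );
}

function utf8mb4TableSql(string $table, string $collation = 'utf8mb4_unicode_ci'): string
{
    // CONVERT TO also converts the existing text columns of the table.
    return sprintf(
        'ALTER TABLE `%s` CONVERT TO CHARACTER SET utf8mb4 COLLATE %s',
        $table,
        $collation
    );
}

echo utf8mb4DatabaseSql('app_db'), ";\n";
echo utf8mb4TableSql('articles'), ";\n";
```

Converting a live table rewrites its data, so this is something to try on a backup first.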

Data Access:

  • In your application code (e.g. PHP), in whatever DB access method you use, you'll need to set the connection charset to utf8mb4. This way, MySQL does no conversion from its native UTF-8 when it hands data off to your application and vice versa.

  • Some drivers provide their own mechanism for configuring the connection character set, which both updates the driver's own internal state and informs MySQL of the encoding to be used on the connection; this is usually the preferred approach. In PHP:

    • If you're using the PDO abstraction layer with PHP ≥ 5.3.6, you can specify charset in the DSN:

       $dbh = new PDO('mysql:charset=utf8mb4');
    • If you're using mysqli, you can call set_charset():

       $mysqli->set_charset('utf8mb4');       // object oriented style
       mysqli_set_charset($link, 'utf8mb4');  // procedural style
    • If you're stuck with plain mysql but happen to be running PHP ≥ 5.2.3, you can call mysql_set_charset.

  • If the driver does not provide its own mechanism for setting the connection character set, you may have to issue a query to tell MySQL how your application expects data on the connection to be encoded: SET NAMES 'utf8mb4'.

  • The same consideration regarding utf8mb4 / utf8 applies as above.
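For the PDO case, a slightly fuller sketch might look like the following. The helper name buildMysqlDsn, the host, the database name and the credentials are placeholders I made up; the actual connection is left commented out because it needs a real server.

```php
<?php
// Hypothetical helper that assembles a PDO MySQL DSN with an explicit charset.
function buildMysqlDsn(string $host, string $dbname, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

$dsn = buildMysqlDsn('localhost', 'app_db');
echo $dsn, "\n";

// Connecting needs a real server and credentials, so it is left commented out:
// $dbh = new PDO($dsn, 'app_user', 'secret', [
//     PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
// ]);
// // Ask the server which charset the connection ended up with:
// $row = $dbh->query("SHOW VARIABLES LIKE 'character_set_connection'")->fetch();
// echo $row['Value'], "\n";
```

Checking character_set_connection after connecting is a quick way to confirm the DSN setting actually took effect.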

Output:

  • UTF-8 should be set in the HTTP header, such as Content-Type: text/html; charset=utf-8. You can achieve that either by setting default_charset in php.ini (preferred), or manually using the header() function.
  • If your application transmits text to other systems, they will also need to be informed of the character encoding. With web applications, the browser must be informed of the encoding in which data is sent (through HTTP response headers or HTML metadata).
  • When encoding the output using json_encode(), add JSON_UNESCAPED_UNICODE as a second parameter so that non-ASCII characters are emitted as literal UTF-8 rather than \uXXXX escapes.
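To illustrate the json_encode() point, here is a small sketch comparing the default output with the JSON_UNESCAPED_UNICODE output. The sample data is made up, and the header() call is guarded so the snippet also runs from the command line.

```php
<?php
$data = ['name' => "\u{65E5}\u{672C}\u{8A9E}"]; // "日本語"

// Default: non-ASCII characters are escaped as \uXXXX sequences.
$escaped = json_encode($data);

// With the flag: characters are emitted as literal UTF-8 bytes.
$literal = json_encode($data, JSON_UNESCAPED_UNICODE);

// In a web context, announce the encoding before sending any body.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: application/json; charset=utf-8');
}

echo $escaped, "\n";
echo $literal, "\n";
```

Both forms are valid JSON and decode to the same value; the unescaped form is simply shorter and easier to read in logs.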

Input:

  • Browsers will submit data in the character set specified for the document, hence nothing particular has to be done on the input.
  • In case you have doubts about request encoding (in case it could be tampered with), you may verify every received string as being valid UTF-8 before you try to store it or use it anywhere. PHP's mb_check_encoding() does the trick, but you have to use it religiously. There's really no way around this, as malicious clients can submit data in whatever encoding they want, and I haven't found a trick to get PHP to do this for you reliably.
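One way to apply mb_check_encoding() consistently is to run it over the whole request array at once. The helper below is only a sketch; the name isValidUtf8 and the 400 response in the commented usage are my own choices, not something PHP provides.

```php
<?php
// Hypothetical helper: returns true only if every string in $value
// (including nested array keys and values) is valid UTF-8.
function isValidUtf8($value): bool
{
    if (is_array($value)) {
        foreach ($value as $key => $item) {
            if (is_string($key) && !mb_check_encoding($key, 'UTF-8')) {
                return false;
            }
            if (!isValidUtf8($item)) {
                return false;
            }
        }
        return true;
    }
    // Non-string scalars (ints, null, ...) carry no encoding to check.
    return !is_string($value) || mb_check_encoding($value, 'UTF-8');
}

// Typical use at the top of a request handler:
// if (!isValidUtf8($_GET) || !isValidUtf8($_POST)) {
//     http_response_code(400);
//     exit('Invalid encoding');
// }
```

Rejecting the request outright is the simplest policy; trying to repair invalid input tends to hide problems.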

Other Code Considerations:

  • Obviously enough, all files you'll be serving (PHP, HTML, JavaScript, etc.) should be encoded in valid UTF-8.

  • You need to make sure that every time you process a UTF-8 string, you do so safely. This is, unfortunately, the hard part. You'll probably want to make extensive use of PHP's mbstring extension.

  • PHP's built-in string operations are not by default UTF-8 safe. There are some things you can safely do with normal PHP string operations (like concatenation), but for most things you should use the equivalent mbstring function.

  • To know what you're doing (read: not mess it up), you really need to know UTF-8 and how it works on the lowest possible level. Check out any of the links from utf8.com for some good resources to learn everything you need to know.
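The difference between the byte-oriented built-ins and their mbstring equivalents can be seen in a few lines. This is just a minimal demonstration with a made-up sample word:

```php
<?php
$word = "h\u{E9}llo"; // "héllo": the "é" takes two bytes in UTF-8

$bytes = strlen($word);             // counts bytes
$chars = mb_strlen($word, 'UTF-8'); // counts code points

// substr() cuts on byte offsets and can split a character in half;
// mb_substr() cuts on character boundaries.
$byteCut = substr($word, 0, 2);
$charCut = mb_substr($word, 0, 2, 'UTF-8');

printf("bytes=%d chars=%d\n", $bytes, $chars);
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // the byte cut is no longer valid UTF-8
var_dump($charCut === "h\u{E9}");
```

Concatenation stays safe because joining two valid UTF-8 strings always yields a valid UTF-8 string.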

I'd like to add one thing to chazomaticus' excellent answer:

Don't forget the META tag either (like this, or the HTML4 or XHTML version of it):

<meta charset="utf-8">

That seems trivial, but IE7 has given me problems with that before.

I was doing everything right; the database, database connection and Content-Type HTTP header were all set to UTF-8, and it worked fine in all other browsers, but Internet Explorer still insisted on using the "Western European" encoding.

It turned out the page was missing the META tag. Adding that solved the problem.

Edit:

The W3C actually has a rather large section dedicated to I18N. They have a number of articles related to this issue, describing the HTTP, (X)HTML and CSS side of things.

They recommend using both the HTTP header and HTML meta tag (or XML declaration in case of XHTML served as XML).

In addition to setting default_charset in php.ini, you can send the correct charset using header() from within your code, before any output:

header('Content-Type: text/html; charset=utf-8');

Working with Unicode in PHP is easy as long as you realize that most of the string functions don't work with Unicode, and some might mangle strings completely. PHP considers "characters" to be 1 byte long. Sometimes this is okay (for example, explode() only looks for a byte sequence and uses it as a separator, so it doesn't matter what actual characters you look for). But other times, when the function is actually designed to work on characters, PHP has no idea that your text contains multi-byte characters.
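A small sketch of both cases: explode() behaving correctly on UTF-8 text, and strrev() mangling it. The sample strings are made up, and the preg_split() approach is just one character-aware alternative among several.

```php
<?php
$list = "caf\u{E9},na\u{EF}ve,\u{FC}ber"; // "café,naïve,über"

// explode() only searches for the separator's byte sequence, and UTF-8
// guarantees an ASCII comma never appears inside a multi-byte character.
$parts = explode(',', $list);

// strrev() reverses bytes, so multi-byte sequences come out backwards.
$brokenReverse = strrev($parts[0]);

// One character-aware alternative: split into characters with the /u
// (UTF-8) regex modifier, then reverse the resulting array.
$chars = preg_split('//u', $parts[0], -1, PREG_SPLIT_NO_EMPTY);
$safeReverse = implode('', array_reverse($chars));

var_dump(mb_check_encoding($brokenReverse, 'UTF-8'));
echo $safeReverse, "\n";
```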

A good library to check into is phputf8. This rewrites all of the "bad" functions so you can safely work on UTF-8 strings. There are extensions like mbstring that try to do this for you, too, but I prefer using the library because it's more portable (I write mass-market products, so that's important for me). phputf8 can use mbstring behind the scenes anyway, to increase performance.

Warning: This answer applies to PHP 5.3.5 and lower. Do not use it for PHP version 5.3.6 (released in March 2011) or later.

Compare with Palec's answer to PDO + MySQL and broken UTF-8 encoding.

I found an issue with someone using PDO, and the answer was to use this for the PDO connection string:

$pdo = new PDO(
    'mysql:host=mysql.example.com;dbname=example_db',
    "username",
    "password",
    array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8"));

In my case, I was using mb_split, which uses regular expressions. Therefore I also had to manually make sure the regular expression encoding was UTF-8 by calling mb_regex_encoding('UTF-8').

As a side note, I also discovered by running mb_internal_encoding() that the internal encoding wasn't UTF-8, and I changed that by running mb_internal_encoding("UTF-8").
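Put together, the two calls might be used like this. The sample text and the choice of the ideographic comma as separator are mine, purely for illustration; mb_split() also assumes mbstring was built with regex support, which is the usual default.

```php
<?php
// Set both encodings once, early in the request (e.g. in a bootstrap file).
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

// mb_split() takes a regular expression pattern without delimiters.
// Here the separator is the ideographic comma U+3001.
$fields = mb_split(
    "\u{3001}",
    "\u{6771}\u{4EAC}\u{3001}\u{5927}\u{962A}\u{3001}\u{540D}\u{53E4}\u{5C4B}" // "東京、大阪、名古屋"
);

echo mb_internal_encoding(), "\n";
echo mb_regex_encoding(), "\n";
print_r($fields);
```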

First of all, if you are on a PHP version before 5.3, then no: you've got a ton of problems to tackle.

I am surprised that no one has mentioned the intl library, which has good support for Unicode, graphemes, string operations, localisation and much more; see below.

I will quote some information about Unicode support in PHP from Elizabeth Smith's slides at PHPBenelux'14:

INTL

Good:

  • Wrapper around the ICU library
  • Standardised locales, set locale per script
  • Number formatting
  • Currency formatting
  • Message formatting (replaces gettext)
  • Calendars, dates, time zone and time
  • Transliterator
  • Spoofchecker
  • Resource bundles
  • Converters
  • IDN support
  • Graphemes
  • Collation
  • Iterators

Bad:

  • Does not support zend_multibyte
  • Does not support HTTP input/output conversion
  • Does not support function overloading
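To give a feel for the grapheme, normalization and collation features listed above, here is a minimal sketch. It assumes the intl extension is installed (hence the guard), and the sample strings and the en_US locale are arbitrary choices of mine.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points, one visible character.
$decomposed = "e\u{301}";

echo 'code points: ', mb_strlen($decomposed, 'UTF-8'), "\n";

// The intl extension is optional, so guard its use.
if (extension_loaded('intl')) {
    // grapheme_strlen() counts user-perceived characters.
    echo 'graphemes: ', grapheme_strlen($decomposed), "\n";

    // Normalizer composes the pair into the single code point U+00E9.
    $composed = Normalizer::normalize($decomposed, Normalizer::FORM_C);
    echo 'composed code points: ', mb_strlen($composed, 'UTF-8'), "\n";

    // Collator sorts according to locale rules instead of byte order.
    $words = ["z\u{E9}bre", 'zoo', 'apple'];
    (new Collator('en_US'))->sort($words);
    echo implode(', ', $words), "\n";
}
```

A plain sort() would compare bytes and place "zoo" before the accented word, which is rarely what users expect.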

mbstring

  • Enables zend_multibyte support
  • Supports transparent HTTP in/out encoding
  • Provides some wrappers for functionality such as strtoupper

ICONV

  • Primarily for charset conversion
  • Output buffer handler
  • MIME encoding functionality
  • Conversion
  • Some string helpers (len, substr, strpos, strrpos)
  • Stream filter: stream_filter_append($fp, 'convert.iconv.ISO-2022-JP/EUC-JP')
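The two most common iconv uses from the list above, one-shot conversion and the stream filter, could be sketched like this. The ISO-8859-1 sample bytes are made up, and php://memory stands in for a real file so the snippet is self-contained.

```php
<?php
// "café" encoded as ISO-8859-1: the "é" is the single byte 0xE9.
$latin1 = "caf\xE9";

// One-shot conversion of a string.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

// On-the-fly conversion while reading from a stream. php://memory stands in
// for a real file opened with fopen().
$fp = fopen('php://memory', 'w+');
fwrite($fp, $latin1);
rewind($fp);
stream_filter_append($fp, 'convert.iconv.ISO-8859-1/UTF-8', STREAM_FILTER_READ);
$streamed = stream_get_contents($fp);
fclose($fp);

echo bin2hex($utf8), "\n";
echo bin2hex($streamed), "\n";
```

The stream filter is the more convenient option for large files, since nothing has to be held in memory as a whole.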

DATABASES

  • MySQL: set the charset and collation on tables, and the charset (not the collation) on the connection. Also, don't use the mysql extension; use mysqli or PDO
  • PostgreSQL: pg_set_client_encoding
  • sqlite(3): make sure it was compiled with Unicode and intl support

Some other gotchas

  • You cannot use Unicode filenames with PHP on Windows unless you use a third-party extension.
  • Send everything in ASCII if you are using exec, proc_open and other command line calls
  • Plain text is not plain text; files have encodings
  • You can convert files on the fly with the iconv filter

The only thing I would add to these amazing answers is to emphasize saving your files in UTF-8 encoding. I have noticed that browsers go by the file's actual encoding rather than by a UTF-8 declaration in your code. Any decent text editor will show you the encoding. For example, Notepad++ has a menu option for file encoding; it shows you the current encoding and lets you change it. For all my PHP files I use UTF-8 without a BOM.
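A BOM matters in PHP files because the three bytes sit in front of the opening tag and get sent as output. If you want to check for one programmatically rather than in an editor, something like the following would do; the helper name stripUtf8Bom is my own.

```php
<?php
// Hypothetical helper: removes a leading UTF-8 byte order mark (EF BB BF), if any.
function stripUtf8Bom(string $contents): string
{
    $bom = "\xEF\xBB\xBF";
    if (strncmp($contents, $bom, 3) === 0) {
        return substr($contents, 3);
    }
    return $contents;
}

$withBom = "\xEF\xBB\xBF<html>";
echo bin2hex(substr($withBom, 0, 3)), "\n"; // the three BOM bytes as hex
echo stripUtf8Bom($withBom), "\n";
```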

Some time ago someone asked me to add UTF-8 support to a PHP and MySQL application designed by someone else. I noticed that all files were encoded in ANSI, so I had to use iconv to convert all files, change the database tables to use the UTF-8 character set and the utf8_general_ci collation, add 'SET NAMES utf8' to the database abstraction layer after the connection (if using PHP 5.3.5 or earlier; from 5.3.6 on, use charset=utf8 in the connection string instead), and change string functions to their PHP multibyte string equivalents.

I recently discovered that using strtolower() can cause issues where the data is truncated after a special character.

The solution was to use

mb_strtolower($string, 'UTF-8');

The mb_ functions come from the mbstring (multibyte string) extension. They support more characters but are generally a little slower.
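A quick illustration of the multibyte version with a made-up sample string. How plain strtolower() treats the same input depends on the PHP version and locale, so only the mbstring call is shown:

```php
<?php
$shout = "\u{C0}\u{C9}\u{CE} CAF\u{C9}"; // "ÀÉÎ CAFÉ"

// mb_strtolower() knows about non-ASCII letters when told the encoding.
$lower = mb_strtolower($shout, 'UTF-8');

echo $lower, "\n";
```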

In PHP, you'll need to either use the multibyte functions, or turn on mbstring.func_overload. That way things like strlen will work if you have characters that take more than one byte. (Note that mbstring.func_overload was deprecated in PHP 7.2 and removed in PHP 8.0.)

You'll also need to identify the character set of your responses. You can either use Apache's AddDefaultCharset directive, or write PHP code that returns the header. (Or you can add a META tag to your HTML documents.)

I have just gone through the same issue and found a good solution in the PHP manual.

I changed all my files' encoding to UTF-8 and then the default encoding on my connection. This solved all the problems.

if (!$mysqli->set_charset("utf8")) {
    printf("Error loading character set utf8: %s\n", $mysqli->error);
} else {
    printf("Current character set: %s\n", $mysqli->character_set_name());
}

View Source

Unicode support in PHP is still a huge mess. While it's capable of converting an ISO 8859 string (which it uses internally) to UTF-8, it lacks the capability to work with Unicode strings natively, which means all the string processing functions will mangle and corrupt your strings.

So you have to either use a separate library for proper UTF-8 support, or rewrite all the string handling functions yourself.

The easy part is just specifying the charset in HTTP headers and in the database and such, but none of that matters if your PHP code doesn't output valid UTF-8. That's the hard part, and PHP gives you virtually no help there. (I think PHP 6 is supposed to fix the worst of this, but that's still a while away.)

If you want the MySQL server to decide the character set, and not PHP as a client (old behaviour; preferred, in my opinion), try adding skip-character-set-client-handshake to your my.cnf, under [mysqld], and restart MySQL.

This may cause trouble in case you're using anything other than UTF-8.

The top answer is excellent. Here is what I had to do on a regular Debian, PHP, and MySQL setup:

// Storage
// Debian. Apparently already UTF-8

// Retrieval
// The MySQL database was stored in UTF-8,
// but apparently PHP was requesting ISO 8859-1. This worked:
// ***notice "utf8", without dash, this is a MySQL encoding***
mysql_set_charset('utf8');

// Delivery
// File *php.ini* did not have a default charset,
// (it was commented out, shared host) and
// no HTTP encoding was specified in the Apache headers.
// This made Apache send out a UTF-8 header
// (and perhaps made PHP actually send out UTF-8)
// ***notice "utf-8", with dash, this is a php encoding***
ini_set('default_charset','utf-8');

// Submission
// This worked in all major browsers once Apache
// was sending out the UTF-8 header. I didn’t add
// the accept-charset attribute.

// Processing
// Changed a few commands in PHP, like substr(),
// to mb_substr()

That was all.

If you want a MySQL solution: I had similar issues with two of my projects after a server migration. After searching and trying a lot of solutions, I came across this one (nothing before this one worked):

mysqli_set_charset($con,"utf8");

Just a note:

You are facing the problem of your non-Latin characters showing as ?????????. You asked a question, and it got closed with a reference to this canonical question. You tried everything, and no matter what you do you still get ?????????? from MySQL.

That is mostly because you are testing on your old data, which was inserted into the database using the wrong charset and was converted and stored as actual question mark characters (?). That means you lost your original text forever, and no matter what you try you will get ???????.

Reapplying what you have learned from the answers to this question on fresh data could solve your problem.
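If you are unsure whether a stored value is really lost or only displayed wrongly, looking at its raw bytes settles it. The sketch below assumes you have already fetched the value into a PHP string; the helper name is made up, and the same check can be done in SQL with SELECT HEX(column).

```php
<?php
// Hypothetical check: does a stored value consist of literal question
// marks (byte 0x3F), i.e. was the original text already lost on insert?
function isLiteralQuestionMarks(string $stored): bool
{
    return $stored !== '' && strspn($stored, '?') === strlen($stored);
}

$lost   = '???';                       // what a bad insert leaves behind
$intact = "\u{65E5}\u{672C}\u{8A9E}";  // "日本語" stored correctly

echo bin2hex($lost), "\n";   // only 3f bytes: the text is gone
echo bin2hex($intact), "\n"; // multi-byte UTF-8 sequences: the text is there
```

Note that a value which legitimately contains only question marks would also match, so treat a hit as a hint rather than proof.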

In connection.php, call mysqli_set_charset($con, "utf8"); and in SQL, use a utf8 collation.
