Special ä ö characters break UTF-8 encoding

A user on my site entered special characters into a text field: ä ö

These are apparently not the same ä ö characters I can type on my keyboard, because when I paste them into Programmer's Notepad, they split into two: a¨ o¨

On my site's server side I have a PHP script that identifies illegal special characters in user input and highlights them in an HTML error message with preg_replace.

The character splitting happens there too, so I get a normal letter a and o followed by a weird lone \xCC byte that breaks the UTF-8 string encoding, and the json_encode function fails as a result.

What would be the best way to handle these characters? Should I try to replace the special ä ö characters with the regular ones, or can I somehow catch the broken UTF-8 characters and remove or replace them?

It's not that these characters have broken the encoding; it's just that Unicode is really complicated.

Commonly used accented letters have their own code points in the Unicode standard, in this case:

  • U+00E4 "LATIN SMALL LETTER A WITH DIAERESIS"
  • U+00F6 "LATIN SMALL LETTER O WITH DIAERESIS"

However, to avoid encoding every possibility, particularly when multiple diacritics (accents) need to be placed on the same letter, Unicode includes "combining diacritics", such as:

  • U+0308 "COMBINING DIAERESIS"

When placed after the code point for a normal letter, these code points add a diacritic to that letter when it is displayed.
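To make the difference concrete, here is a minimal sketch that dumps the UTF-8 bytes of both spellings of "ä". It assumes PHP 7 or later for the `\u{...}` escape syntax, and the variable names are purely illustrative.

```php
<?php
// Precomposed form: a single code point, U+00E4.
$precomposed = "\u{00E4}";
// Decomposed form: plain "a" (U+0061) followed by U+0308 COMBINING DIAERESIS.
$decomposed = "a\u{0308}";

// Both render as "ä", but the underlying UTF-8 bytes differ.
echo bin2hex($precomposed), "\n"; // c3a4
echo bin2hex($decomposed), "\n";  // 61cc88

// A byte-wise comparison therefore treats them as different strings.
var_dump($precomposed === $decomposed); // bool(false)
```

The `cc` byte is the first half of the two-byte sequence for U+0308. A byte-oriented operation that cuts this sequence in half would plausibly explain the lone \xCC seen in the question.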

As you've seen, this means there are two different ways to represent the same letter. To help with this, Unicode includes "normalization forms", defined in an annex to the Unicode standard:

  • Normalization Form D (NFD): Canonical Decomposition
  • Normalization Form C (NFC): Canonical Decomposition, followed by Canonical Composition
  • Normalization Form KD (NFKD): Compatibility Decomposition
  • Normalization Form KC (NFKC): Compatibility Decomposition, followed by Canonical Composition

Ignoring the "Compatibility" forms for now, we have two options:

  • Decomposition, which uses combining diacritics as often as possible
  • Composition, which uses specific code points as often as possible

So one possibility is to convert your input into NFC, which in PHP can be achieved with the Normalizer class in the intl extension.
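As a rough sketch of that conversion, assuming the intl extension is installed and PHP 7 or later, something like the following should turn the decomposed input into its composed form. The sample string and variable names are made up for illustration.

```php
<?php
// Requires the intl extension; the Normalizer class is not available without it.
$input = "a\u{0308} o\u{0308}"; // decomposed "ä ö", as in the question

$nfc = Normalizer::normalize($input, Normalizer::FORM_C);
if ($nfc === false) {
    // normalize() returns false on failure, e.g. for invalid UTF-8 input.
    throw new RuntimeException('Input could not be normalized');
}

echo bin2hex($input), "\n"; // 61cc88206fcc88
echo bin2hex($nfc), "\n";   // c3a420c3b6

var_dump(Normalizer::isNormalized($nfc, Normalizer::FORM_C)); // bool(true)
```

Normalizing as early as possible, before any validation or highlighting, means the rest of the script only ever has to deal with one spelling of each letter.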

However, not all combinations can be normalised to a form with no separate diacritics, so this doesn't solve all your problems. You'll also need to look at exactly which characters you want to allow, probably by matching Unicode character properties.
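One way this could look is sketched below, assuming PHP 7.4 or later (for the arrow function). The helper name `highlightDisallowed`, the `<mark>` wrapper, and the allowed set (letters, combining marks, numbers, spaces) are all assumptions for illustration, not the asker's actual rules.

```php
<?php
// Hypothetical helper: wraps every run of disallowed characters in <mark>.
// The allowed set (letters, marks, numbers, space separators) is an assumption.
function highlightDisallowed(string $input): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // so multi-byte sequences are matched whole rather than byte by byte.
    $pattern = '/[^\p{L}\p{M}\p{N}\p{Zs}]+/u';

    $result = preg_replace_callback(
        $pattern,
        fn(array $m): string => '<mark>' . htmlspecialchars($m[0]) . '</mark>',
        $input
    );

    if ($result === null) {
        // preg_* functions return null on error, e.g. malformed UTF-8 with /u.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $result;
}

// The decomposed "ä" passes untouched; the angle brackets are highlighted.
echo highlightDisallowed("a\u{0308}bc <b>"), "\n";
```

Including `\p{M}` in the allowed set keeps a combining diaeresis attached to its base letter instead of flagging it as illegal, and the `/u` modifier is what stops the regex from cutting a multi-byte sequence in half.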

You might also want to learn about "grapheme clusters" and use the relevant PHP functions. A "grapheme cluster", or just "grapheme", is what most readers will think of as "a character", e.g. a letter with all its diacritics, or a full ideogram.
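The sketch below contrasts byte-based and grapheme-based string functions. It assumes the intl extension (which provides the `grapheme_*` functions) and PHP 7 or later; the variable names are illustrative only.

```php
<?php
// grapheme_* functions come from the intl extension.
$decomposed = "a\u{0308}"; // "ä" as letter + combining diaeresis

echo strlen($decomposed), "\n";          // 3 (bytes)
echo grapheme_strlen($decomposed), "\n"; // 1 (grapheme cluster)

// Cutting by bytes can split the combining mark's two-byte sequence,
// which leaves invalid UTF-8 that json_encode() rejects.
$brokenCut = substr($decomposed, 0, 2); // "a" plus a lone 0xCC byte
var_dump(json_encode($brokenCut));      // bool(false)

// Cutting by grapheme keeps the letter and its diacritic together.
$safeCut = grapheme_substr("a\u{0308}bc", 0, 2);
echo bin2hex($safeCut), "\n"; // 61cc8862
```

The byte-based cut reproduces the kind of failure described in the question, while the grapheme-based cut returns "ä" and "b" intact.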
