Replace Unicode-formatted special characters

I need to replace special characters inside a string with other characters. For example, "ä" can be replaced by either "a" or "ae", and "à" by "a" as well. Normally this is pretty easy to do in PHP, and there are lots of functions on Stack Overflow that already do exactly that.

Unfortunately, my string looks like this: "u\u0308 a\u0302 a\u0308 o\u0300.zip" (ü â ä ò.zip). As you can see, my strings are file names, and OS X seems to convert the characters to Unicode (at least that is what I think).

I know that I could use a very long array with all the special characters to replace them in PHP:

$str = "u\u0308 a\u0302 a\u0308 o\u0300.zip";

$ch = array("u\u0308", "a\u0302", "a\u0308", "o\u0300");
$chReplace = array("u", "a", "a", "o");

// str_replace() returns the new string, so keep the result
$str = str_replace($ch, $chReplace, $str);

But I'm wondering if there is an easier way, so that I don't have to do this manually for every character.

You can solve this problem by dividing it into multiple steps:

  • Convert the Unicode code points to actual entities. This can be easily achieved using preg_replace(): the pattern matches a literal backslash, a "u", and four hexadecimal digits, and rewrites each match as a numeric entity of the form &#xXXXX;.

  • Now you will have a set of characters like &#x0308;. These are HTML entities. To convert them into their corresponding character forms, use html_entity_decode().

  • You will now have a UTF-8 string. You need to convert it into ISO-8859-1 (official ISO 8-bit Latin-1). The //TRANSLIT part enables transliteration: when a character can't be represented in the target charset, iconv() will try to approximate it.

Code:

// Set the locale to something that's UTF-8 capable
setlocale(LC_ALL, 'en_US.UTF-8');

$str = "u\u0308 a\u0302 a\u0308 o\u0300";

// Convert the codepoints to entities
$str = preg_replace("/\\\\u([0-9a-fA-F]{4})/", "&#x\\1;", $str);

// Convert the entities to a UTF-8 string
$str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');

// Convert the UTF-8 string to an ISO-8859-1 string
echo iconv("UTF-8", "ISO-8859-1//TRANSLIT", $str);

Output:

u a a o
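
If the result should stay in UTF-8 rather than ISO-8859-1, one possible variation is to drop the combining marks directly instead of relying on iconv() transliteration, whose output can differ between platforms. The following is only a sketch under stated assumptions: strip_combining_marks() is a hypothetical helper name that is not part of the answer above, and it assumes the accents are stored in decomposed form (base letter followed by a combining mark), as in the file names from the question. It also runs html_entity_decode() over the whole string, so any other entities in the input would be decoded as well.

```php
<?php
// Hypothetical helper (not from the original answer): removes combining
// marks from a string whose accents are stored in decomposed form.
function strip_combining_marks(string $str): string
{
    // Turn literal "\uXXXX" escapes into real UTF-8 characters first,
    // reusing the first two steps described above.
    $str = preg_replace('/\\\\u([0-9a-fA-F]{4})/', '&#x$1;', $str);
    $str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');

    // \p{Mn} matches nonspacing marks such as U+0308 (combining diaeresis);
    // the "u" modifier makes PCRE treat the subject as UTF-8.
    $result = preg_replace('/\p{Mn}+/u', '', $str);

    // preg_replace() returns null on failure (e.g. invalid UTF-8 input);
    // fall back to the decoded string in that case.
    return $result ?? $str;
}

// Literal backslash-u escapes, as in the question (single quotes keep them literal)
echo strip_combining_marks('u\u0308 a\u0302 a\u0308 o\u0300.zip'), "\n"; // u a a o.zip

// Real combining characters (the "\u{...}" syntax needs PHP 7 or later)
echo strip_combining_marks("u\u{0308} a\u{0302}.zip"), "\n"; // u a.zip
```

This does not touch precomposed characters such as a single-code-point "ü". If those can occur, the string would presumably need to be decomposed first, for example with Normalizer::normalize($str, Normalizer::FORM_D) from the intl extension, before the marks are stripped.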
