How to replace/remove 4(+)-byte characters from a UTF-8 string in PHP?
It seems like MySQL does not support characters with more than 3 bytes in its default UTF-8 charset.
So, in PHP, how can I get rid of all 4(-and-more)-byte characters in a string and replace them with some other character?
NOTE: you should not just strip these characters, but replace them with the replacement character U+FFFD to avoid Unicode attacks, mostly XSS:
http://unicode.org/reports/tr36/#Deletion_of_Noncharacters
preg_replace('/[\x{10000}-\x{10FFFF}]/u', "\xEF\xBF\xBD", $value);
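As a minimal sketch of how the one-liner above might be wrapped, the following uses a hypothetical helper name (`replace_non_bmp`, not part of the original note). It assumes PHP 7 or later and falls back to an empty string when `preg_replace` returns `null`, which happens in `/u` mode when the input is not valid UTF-8; that fallback is an illustrative choice, not a recommendation.

```php
<?php
// Hypothetical helper name; only the preg_replace() call comes from the note above.
function replace_non_bmp(string $value): string
{
    // In /u mode, \x{10000}-\x{10FFFF} matches exactly the code points
    // that UTF-8 encodes in four bytes. "\xEF\xBF\xBD" is U+FFFD in UTF-8.
    $result = preg_replace('/[\x{10000}-\x{10FFFF}]/u', "\xEF\xBF\xBD", $value);

    // preg_replace() returns null on error, e.g. for malformed UTF-8 input.
    // Returning '' here is an assumption made for this sketch.
    return $result ?? '';
}

$input  = "a\xF0\x9F\x98\x80b";      // "a", U+1F600, "b"
$output = replace_non_bmp($input);
echo bin2hex($output), "\n";         // 61efbfbd62
```

The hex dump shows the 4-byte sequence replaced by the 3-byte sequence EF BF BD, while the ASCII characters around it are untouched.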
Since 4-byte UTF-8 sequences always start with a byte in the range 0xF0-0xF7 (0xF0-0xF4 in well-formed UTF-8), the following should work:
$str = preg_replace('/[\xF0-\xF7].../s', '', $str);
Alternatively, you could use preg_replace in UTF-8 mode, but this will probably be slower:
$str = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $str);
This works because 4-byte UTF-8 sequences are used for code points in the supplementary Unicode planes, starting from 0x10000.
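To make the equivalence of the two patterns concrete, here is a small sketch that runs both on the same well-formed UTF-8 string and compares the results. The variable names are illustrative; on malformed input the two patterns can behave differently, since the byte-mode pattern consumes whatever three bytes follow the lead byte.

```php
<?php
$str = "qu\xC3\xA9 \xF0\x9D\x92\xB3 tal";   // "qué", U+1D4B3, "tal"

// Byte mode: a lead byte 0xF0-0xF7 followed by any three bytes.
$byteMode = preg_replace('/[\xF0-\xF7].../s', '', $str);

// UTF-8 mode: any code point in the supplementary planes.
$utf8Mode = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $str);

var_dump($byteMode === $utf8Mode);                    // bool(true)
echo strlen($str), ' -> ', strlen($byteMode), "\n";   // 13 -> 9
```

Both calls drop exactly the four bytes of the supplementary-plane character and leave the 2-byte "é" alone.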
Here's an example:
<?php
mb_internal_encoding("UTF-8");
// UTF-8 string: 13 bytes, 9 characters (7 ASCII, 1 in Latin-1, 1 outside the BMP)
$str = "qué \xF0\x9D\x92\xB3 tal";
$array = mbStringToArray($str);
print "str: [$str] strlen:" . strlen($str) . " chars:" . count($array) . "\n";
$str1 = "";
foreach ($array as $c) {
    // print "$c : " . strlen($c) . "\n";
    // Keep characters of up to 3 bytes; replace longer ones with '?'
    $str1 .= strlen($c) <= 3 ? $c : '?';
}
print "[$str1]\n";

// Splits a multibyte string into an array of single characters
function mbStringToArray($str) {
    // Return an empty array (not false) so callers can always count() the result
    if ($str === '') return array();
    $len = mb_strlen($str);
    $array = array();
    for ($i = 0; $i < $len; $i++) {
        $array[] = mb_substr($str, $i, 1);
    }
    return $array;
}
Or, a little more compact and efficient:
<?php
mb_internal_encoding("UTF-8");
// UTF-8 string: 13 bytes, 9 characters (7 ASCII, 1 in Latin-1, 1 outside the BMP)
$str = "qué \xF0\x9D\x92\xB3 tal";
$str1 = trimOutsideBMP($str);
print "original: [$str]\n";
print "trimmed: [$str1]\n";

// Replaces non-BMP characters in the UTF-8 string with a '?' character.
// Assumes UTF-8 default encoding (if not sure, call mb_internal_encoding("UTF-8") first).
function trimOutsideBMP($str) {
    if (empty($str)) return $str;
    $len = mb_strlen($str);
    $str1 = '';
    for ($i = 0; $i < $len; $i++) {
        $c = mb_substr($str, $i, 1);
        $str1 .= strlen($c) <= 3 ? $c : '?';
    }
    return $str1;
}
I came across this question while trying to solve my own issue (Facebook emits certain emoticons as 4-byte characters, and Amazon Mechanical Turk does not accept 4-byte characters).
I ended up using this, which doesn't require the mbstring extension:
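function remove_4_byte($string) {
    // Split into individual UTF-8 characters
    $char_array = preg_split('/(?<!^)(?!$)/u', $string);
    for ($x = 0; $x < sizeof($char_array); $x++) {
        if (strlen($char_array[$x]) > 3) {
            $char_array[$x] = "";
        }
    }
    // Separator first: the legacy implode($array, $glue) order is rejected by PHP 8
    return implode("", $char_array);
}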
The function below replaces 3- and 4-byte characters in a UTF-8 string with '#':
function remove3and4bytesCharFromUtf8Str($str) {
    return preg_replace('/([\xF0-\xF7]...)|([\xE0-\xEF]..)/s', '#', $str);
}
Here is my implementation to filter out 4-byte chars:
$string = preg_replace_callback(
    '/./u',
    function (array $match) {
        return strlen($match[0]) >= 4 ? null : $match[0];
    },
    $string
);
You could tweak it and replace null (which removes the char) with some substitute string. You can also replace >= 4 with some other byte-length check.
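A sketch of that tweak, with both the substitute string and the byte-length threshold turned into parameters. The function name `replace_long_chars` and its parameter names are hypothetical, and the `?? ''` fallback for malformed input is an assumption made for this example.

```php
<?php
// Hypothetical wrapper around the callback above; names are illustrative only.
function replace_long_chars(string $string, int $minBytes = 4, string $substitute = "\xEF\xBF\xBD"): string
{
    $result = preg_replace_callback(
        '/./u',
        function (array $match) use ($minBytes, $substitute) {
            // strlen() counts bytes, so this is the encoded length of one character
            return strlen($match[0]) >= $minBytes ? $substitute : $match[0];
        },
        $string
    );

    // preg_replace_callback() returns null on error (e.g. malformed UTF-8)
    return $result ?? '';
}

echo replace_long_chars("ok \xF0\x9F\x98\x80", 4, '?'), "\n";   // ok ?
echo replace_long_chars("qu\xC3\xA9", 2, '#'), "\n";            // qu#
```

With the defaults, 4-byte characters become U+FFFD; lowering `$minBytes` to 2 replaces every non-ASCII character.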
Another filter implementation, more complex.
It tries to transliterate to ASCII characters, and otherwise inserts the Unicode replacement character to avoid XSS, e.g.: <a href='java\script:alert("XSS")'>
$tr = preg_replace_callback('/([\x{10000}-\x{10FFFF}])/u', function($m){
    $c = iconv('ISO-8859-2', 'UTF-8', iconv('utf-8', 'ISO-8859-2//TRANSLIT//IGNORE', $m[1]));
    if ($c == '')
        return '�';
    return $c;
}, $s);