Replacing low ASCII characters in UTF-16-encoded string using PHP's str_replace function

I have some PHP code that I use for text filtering. During filtering, some ASCII characters such as ampersand (&) and tilde (~) are temporarily converted to low ASCII characters (such as decimal code-points 4 and 5). Just before the final filtered output is generated, the conversion is reverted.

$temp = str_replace(array('&', '~'), array("\x04", "\x05"), $input);
// ... some filtering code that works with $temp ...
$out = str_replace(array("\x04", "\x05"), array('&', '~'), $temp);

This works well with input text of character encodings that use 8-bit code units such as UTF-8 and ISO 8859-1. But I am not sure about input encoded in larger code units, such as UTF-16 or UTF-32. Will the first conversion step mangle the well-formedness of the input text? Will there be some conflict during the reversion step because of some pre-existing characters of the input? The PHP setup does not overload multi-byte string functions.

Can anyone comment? Thanks.

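str_replace works fine, as long as all strings passed to it (subject, search and replacement) are in the same encoding. It does a byte-wise compare/replace of the data and knows nothing about character encodings. For UTF-16 or UTF-32 input, that means the search and replacement strings must themselves be encoded as UTF-16 or UTF-32; the single-byte '&' and "\x04" in your code are not.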

That's why there is no mb_str_replace among PHP's multibyte string functions.
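To make the byte-wise behaviour concrete, here is a minimal sketch of what can happen when the single-byte search strings from the question meet UTF-16 data. The sample input is hand-written UTF-16BE bytes chosen for illustration, and the round trip through UTF-8 is only one possible workaround; it assumes the mbstring extension is available for mb_convert_encoding. Encoding the search strings as UTF-16 helps, but a byte-wise match can still land on a misaligned offset (for example, U+2600 followed by U+0426 is the bytes 26 00 04 26, which contains 00 04 straddling the two characters), so converting to UTF-8 first is probably the safer route.

```php
<?php
// "Ц&" in UTF-16BE: U+0426 is the bytes 04 26, U+0026 is the bytes 00 26.
$input = "\x04\x26\x00\x26";

// Single-byte search strings, exactly as in the question.
$temp = str_replace(array('&', '~'), array("\x04", "\x05"), $input);
// ... filtering would happen here ...
$out = str_replace(array("\x04", "\x05"), array('&', '~'), $temp);

// The 04 byte that belonged to "Ц" was turned into 26 on the way back.
echo bin2hex($input), "\n"; // 04260026
echo bin2hex($out), "\n";   // 26260026

// Possible workaround (assumes mbstring): do the work in UTF-8,
// where the bytes 04 and 05 can only ever mean U+0004 and U+0005.
$utf8 = mb_convert_encoding($input, 'UTF-8', 'UTF-16BE');
$temp = str_replace(array('&', '~'), array("\x04", "\x05"), $utf8);
// ... filtering would happen here ...
$back = str_replace(array("\x04", "\x05"), array('&', '~'), $temp);
$safe = mb_convert_encoding($back, 'UTF-16BE', 'UTF-8');

var_dump($safe === $input); // bool(true)
```

Even in UTF-8, an input that already contains a literal U+0004 or U+0005 would be turned into & or ~ by the reversion step, so such characters would need to be rejected or escaped beforehand.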
