Replacing low ASCII characters in a UTF-16-encoded string using PHP's str_replace function

I have some PHP code that I use for text filtering. During filtering, some ASCII characters such as ampersand (&) and tilde (~) are temporarily converted to low ASCII characters (such as decimal code points 4 and 5). Just before the final filtered output is generated, the conversion is reverted.

$temp = str_replace(array('&', '~'), array("\x04", "\x05"), $input);
// ... some filtering code to work with $temp ...
$out = str_replace(array("\x04", "\x05"), array('&', '~'), $temp);

This works well for input text in character encodings that use 8-bit code units, such as UTF-8 and ISO 8859-1. But I am not sure about input encoded in larger code units, such as UTF-16 or UTF-32. Will the first conversion step break the well-formedness of the input text? Will pre-existing characters in the input cause conflicts during the reversion step? The PHP setup does not overload string functions with their multi-byte equivalents.

Can anyone comment? Thanks.

str_replace works fine, as long as all strings passed to it are in the same encoding. It just does a binary compare/replace of the data, so the actual encoding doesn't really matter.

That's why there's no mb_str_replace in this list.
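To illustrate the "same encoding" condition, here is a minimal sketch that assumes the input is UTF-16LE. The helper ascii_to_utf16le and the sample input are hypothetical and not part of the original code. The sketch compares the single-byte search strings from the question with search strings encoded the same way as the input.

```php
<?php
// Hypothetical helper: encode an ASCII string as UTF-16LE by appending a
// zero high byte to each character (only valid for code points below 0x80).
function ascii_to_utf16le(string $ascii): string
{
    $out = '';
    for ($i = 0, $n = strlen($ascii); $i < $n; $i++) {
        $out .= $ascii[$i] . "\x00";
    }
    return $out;
}

// Sample input "a&Є" in UTF-16LE; Є is U+0404, whose bytes are 04 04.
$input = "a\x00&\x00\x04\x04";

// Single-byte search strings, as in the question: the reversion step also
// rewrites the two 0x04 bytes that belong to Є.
$temp  = str_replace(array('&', '~'), array("\x04", "\x05"), $input);
$naive = str_replace(array("\x04", "\x05"), array('&', '~'), $temp);

// Search and replacement strings in the same encoding as the input.
$from  = array(ascii_to_utf16le('&'), ascii_to_utf16le('~'));
$to    = array(ascii_to_utf16le("\x04"), ascii_to_utf16le("\x05"));
$temp2 = str_replace($from, $to, $input);
$safe  = str_replace($to, $from, $temp2);

var_dump($naive === $input); // bool(false): Є was corrupted
var_dump($safe === $input);  // bool(true): round trip preserved the input
```

Even with matching encodings, the comparison is still byte-wise and knows nothing about code-unit boundaries, so a match could in principle straddle two adjacent UTF-16 code units. Treat this as a sketch of the idea, not a guarantee for arbitrary input.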
