
Replacing non UTF8 characters

In PHP, I need to replace all non-UTF-8 characters in a string, not with some equivalent (as the iconv function does with //TRANSLIT) but with a character of my choosing (such as "_" or "*").

Typically, I want the user to be able to see the positions where the invalid characters were found.

I didn't find any function that does this, so I was going to:

  • use iconv with //IGNORE
  • diff the two strings and insert the chosen character where the non-UTF-8 ones were

Do you see a better way to do this? Are there PHP functions that can be combined to get this behavior?

Thanks for your help.

Here are two preg_replace calls that get you close to what you want:

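// Replace most ASCII control characters, stray continuation bytes, overlong or
// malformed 2- and 3-byte sequences, and every sequence led by \xF0-\xFF
// (U+10000 and above, valid 4-byte characters included) with ?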
$some_string = preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'.
 '|[\x00-\x7F][\x80-\xBF]+'.
 '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'.
 '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'.
 '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S',
 '?', $some_string );

// Replace overlong 3-byte sequences and UTF-16 surrogates with ?
$some_string = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.
 '|\xED[\xA0-\xBF][\x80-\xBF]/S','?', $some_string );

Note that you can change the replacement (currently '?') to anything else by changing the second argument of each call, i.e. the '?' in preg_replace('blablabla', '?', $some_string).
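
Since the patterns above replace a whole run of bytes with a single '?' (and also touch some valid input), here is a minimal sketch of a variant that keeps every well-formed UTF-8 sequence and substitutes one chosen character per invalid byte, so the positions stay visible. The function name replace_invalid_utf8 and its default '_' are hypothetical, not from the original article; the pattern is my own reading of the well-formed byte ranges in RFC 3629, so treat it as a starting point rather than a definitive implementation.

```php
<?php
// Hypothetical helper (an assumption, not part of the original answer):
// keep well-formed UTF-8 sequences, replace every other byte.
function replace_invalid_utf8(string $input, string $replacement = '_'): string
{
    // Well-formed UTF-8: no overlong forms, no surrogates, nothing above
    // U+10FFFF. No /u flag, so the pattern is matched against raw bytes.
    $valid = '[\x00-\x7F]'                       // ASCII
        . '|[\xC2-\xDF][\x80-\xBF]'              // 2-byte, non-overlong
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'          // 3-byte, excluding overlongs
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'   // 3-byte, straight
        . '|\xED[\x80-\x9F][\x80-\xBF]'          // 3-byte, excluding surrogates
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'       // 4-byte, planes 1-3
        . '|[\xF1-\xF3][\x80-\xBF]{3}'           // 4-byte, planes 4-15
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}';      // 4-byte, plane 16

    $result = preg_replace_callback(
        '/(' . $valid . ')|./s',
        function (array $m) use ($replacement): string {
            // Group 1 is only present when a valid sequence matched;
            // otherwise "." consumed exactly one invalid byte.
            return (isset($m[1]) && $m[1] !== '') ? $m[0] : $replacement;
        },
        $input
    );

    if ($result === null) {
        throw new RuntimeException('preg_replace_callback failed');
    }
    return $result;
}

// "café" is kept; the lone \xC3 and the truncated \xE2\x82 are replaced.
echo replace_invalid_utf8("caf\xC3\xA9 \xC3(ok) \xE2\x82", '*'), "\n";
// prints: café *(ok) **
```

Because every invalid byte becomes exactly one replacement character, a truncated 3-byte sequence shows up as two marks rather than one. If you would rather have one mark per broken sequence, you would need to group the bytes differently.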

Original article: http://magp.ie/2011/01/06/remove-non-utf8-characters-from-string-with-php/


 