
PHP trim with special characters corrupts another special character

I'm using this function to clean strings for Elasticsearch:

function cleanString($string){
    $string = mb_convert_encoding($string, "UTF-8");
    $string = str_ireplace(array('<', '>'), array(' <', '> '), $string);
    $string = strip_tags($string);
    $string = filter_var($string, FILTER_SANITIZE_STRING);
    $string = str_ireplace(array("\t", "\n", "\r", "&nbsp;"," &shy;",":"), ' ', $string);
    $string = str_ireplace(array("&shy;","&laquo;","&raquo;","&pound;"), '', $string);
    return trim($string, ",;.:-_*+~#'\"´`!§$%&/()=?«»");
}

It does several things, but the problem I am facing is with the trim call at the very end. It is supposed to trim away whitespace and special characters, and it worked fine until recently, when I added two more special characters to trim from the string: « and ». This caused problems with another special character:

When I pass the word België into the function, the ë gets corrupted and Elasticsearch throws an error.

  • Why does trim corrupt a completely different character?
  • How can I fix that, so that I parse out « and » and preserve ë ?

trim is not encoding-aware and only looks at individual bytes. If you tell it to trim '«»' and that string is encoded in UTF-8, it treats the character list as the bytes C2 AB C2 BB (the second C2 is redundant, so the bytes actually being trimmed are C2, AB and BB). "ë" in UTF-8 is C3 AB, so its trailing AB byte is removed from the end of "België", leaving a lone C3 and a broken character.
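The byte-level behaviour described above can be reproduced with a short script. This is only a sketch: it assumes the source file is saved as UTF-8 and that the mbstring extension is loaded, and the variable name $trimmed is purely illustrative.

```php
<?php
// Inspect the UTF-8 bytes of the characters involved.
echo bin2hex('«'), "\n"; // c2ab
echo bin2hex('»'), "\n"; // c2bb
echo bin2hex('ë'), "\n"; // c3ab

// trim() treats the character list as a set of single bytes: C2, AB, BB.
$trimmed = trim('België', '«»');

// The trailing AB byte of "ë" (C3 AB) is stripped, leaving a dangling C3.
echo bin2hex($trimmed), "\n"; // 42656c6769c3

// The result is no longer valid UTF-8.
var_dump(mb_check_encoding($trimmed, 'UTF-8')); // bool(false)
```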

You'll need to use an encoding-aware function to safely remove multibyte characters, e.g.:

preg_replace('/^[«»]+|[«»]+$/u', '', $str)
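One way this pattern might be extended to cover the whole character list from the question is sketched below. The helper name trimUtf8 is hypothetical, and the choice to also strip whatever \s matches is an assumption about the intended behaviour, not something the answer specifies.

```php
<?php
// Hypothetical helper: strips whitespace and the given characters from both
// ends of a UTF-8 string, working on code points instead of single bytes.
function trimUtf8(string $string, string $chars): string
{
    // preg_quote() escapes regex metacharacters so each character in $chars
    // is taken literally; the second argument also escapes the "/" delimiter.
    $class = '[\s' . preg_quote($chars, '/') . ']+';

    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $result = preg_replace('/^' . $class . '|' . $class . '$/u', '', $string);

    // preg_replace() returns null on error (for example invalid UTF-8 input);
    // in that case fall back to the untouched input.
    return $result ?? $string;
}

$mask = ",;.:-_*+~#'\"´`!§$%&/()=?«»";
echo trimUtf8('«België»', $mask), "\n"; // België
```

In cleanString, the final line could then become return trimUtf8($string, $mask); in place of the trim call, with the same character list as before.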


 