
PHP unexpected output when replacing character code 8217 in string

I'm running into an unexpected character replacement problem. The character is ’ (character code 8217).

I've tried escaping the character with a backslash, but it didn't make a difference.

php > $a = preg_replace('/([.,\'"’:?!])<\/a>/', '</a>\1', 'letter">Evolution’</a> </li>');
php > echo($a);
// => letter">Evolution/a> </li>

// Just to show that it works if the character is different
php > $a = preg_replace('/([.,\'"’:?!])<\/a>/', '</a>\1', 'letter">Evolution"</a> </li>');
php > echo($a);
letter">Evolution</a>" </li>

I would expect it to output

letter">Evolution</a>’ </li>

instead of

letter">Evolution/a> </li>

By default, PCRE (the PHP regex engine) treats your pattern as a sequence of single-byte characters. So when you write [’], you get a character class containing the three bytes that encode RIGHT SINGLE QUOTATION MARK (U+2019) in UTF-8, i.e. \xE2, \x80, \x99.

In other words, writing "/[’]/" in this default mode is like writing "/[\xE2\x80\x99]/" or "/[\x80\xE2\x99]/" or "/[\x99\xE2\x80]/", etc. The regex engine doesn't see a sequence of bytes that represents the character ’, only three separate bytes.

This is why you get a strange result: in the subject, only the last byte of ’ (\x99) is immediately followed by </a>, so [.,\'"’:?!] matches just that byte. The replacement then moves \x99 after </a> and leaves \xE2\x80 in front of it, which splits the character into two invalid UTF-8 fragments.
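The byte-level behaviour can be checked directly. The following is a minimal sketch rather than part of the original question: it writes the character as byte escapes so the result does not depend on how the file is saved, and the variable names are made up for the example.

```php
<?php
// U+2019 written as its three UTF-8 bytes, so the file encoding does not matter.
$quote = "\xE2\x80\x99";
$subject = 'Evolution' . $quote . '</a>';

// No u modifier: the class holds three independent bytes, not one character.
$pattern = '/([' . $quote . '])<\/a>/';

preg_match($pattern, $subject, $m);
$capturedHex = bin2hex($m[1]);   // only the byte right before </a> is captured

$broken = preg_replace($pattern, '</a>\1', $subject);
$brokenHex = bin2hex($broken);   // the character is now split around </a>

echo $capturedHex, "\n", $brokenHex, "\n";
```

The first line printed should be 99, and the second should show e280 and 99 on either side of the hex for </a>.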

To solve the problem, you have to force the regex engine to read your pattern as a UTF-8 encoded string. You can do that in one of these ways:

  • preg_replace('~(*UTF)([.,\'"’:?!])</a>~', '</a>\1', 'letter">Evolution’</a> </li>');
  • preg_replace('~([.,\'"’:?!])</a>~u', '</a>\1', 'letter">Evolution’</a> </li>');

This time the three bytes \xE2\x80\x99 are treated as one atomic sequence representing the character ’.
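As a quick check of the "atomic sequence" claim, here is a small sketch using the u modifier. The variable names are illustrative, and the character is again written as byte escapes to avoid depending on the file encoding.

```php
<?php
$quote = "\xE2\x80\x99";   // U+2019 as UTF-8 bytes
$subject = 'letter">Evolution' . $quote . '</a> </li>';

// Same character class as in the question, plus the u modifier.
$pattern = '~([.,\'"' . $quote . ':?!])</a>~u';

preg_match($pattern, $subject, $m);
$capturedHex = bin2hex($m[1]);   // the whole three-byte character is captured

$fixed = preg_replace($pattern, '</a>\1', $subject);
echo $capturedHex, "\n", $fixed, "\n";
```

Here the capture group holds all three bytes (e28099), so the quotation mark moves after </a> intact.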

Note: (*UTF) only affects how the pattern is read, whereas the u modifier does more: it extends the shorthand character classes (such as \s, \w, \d) to Unicode characters and checks that the subject string is valid UTF-8.
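The two extra effects of the u modifier can be observed with a short sketch. The test letter and variable names below are arbitrary choices for illustration, not something from the question.

```php
<?php
$letter = "\xD0\xB6";   // U+0436 CYRILLIC SMALL LETTER ZHE, two bytes in UTF-8

// Byte mode: \w matches a single byte, so it cannot span the two-byte letter.
$byteMode = preg_match('/^\w$/', $letter);

// With u, \w matches the Unicode letter as one whole character.
$unicodeMode = preg_match('/^\w$/u', $letter);

// With u, a subject that is not valid UTF-8 makes the call fail.
$invalid = "\xE2\x80";  // truncated three-byte sequence
$result = preg_match('/./u', $invalid);
$isUtf8Error = (preg_last_error() === PREG_BAD_UTF8_ERROR);

var_dump($byteMode, $unicodeMode, $result, $isUtf8Error);
```

Checking preg_last_error() after a false return is a reasonable habit when using u on input whose encoding is not guaranteed.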

Just add the Unicode flag (u) to the regex:

$a = preg_replace('/([.,\'"’:?!])<\/a>/u', '</a>\1', 'letter">Evolution’</a> </li>');
#                              here ___^
echo($a); 
