
PHP substr UTF-8 issue

When I run this code:

$string = '<p>Şelamiİnnşşasdüğ213,123wqeq.weqw.rqasd</p><p>Şelamiİnnşşasdüğ213,123wqeq.weqw.rqasd</p><p>Şelamiİnnşşasdüğ213,123wqeq.weqw.rqasd</p>';

echo substr(strip_tags(trim(html_entity_decode($string, ENT_COMPAT, 'UTF-8'))), 0, 14);

I get this result:

Şelamiİnnş

What is my mistake?

Firstly, always break your problem down into smaller parts to see where it's going wrong:

$string = html_entity_decode($string, ENT_COMPAT, 'UTF-8');
echo $string, "\n";
$string = trim($string);
echo $string, "\n";
$string = strip_tags($string);
echo $string, "\n";
$string = substr($string, 0, 14);
echo $string, "\n";

If you run that, you'll see that the problem has nothing to do with strip_tags; it has to do with substr.

The reason is very simple: strings in PHP are just a series of bytes; functions like substr don't count "characters" in any meaningful way. So substr($string, 0, 14) simply takes the first 14 bytes of the string, which in this case happens to split a "character" which was encoded as more than one byte, using UTF-8.
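To make the byte-versus-character distinction concrete, here is a minimal sketch using the first 14 characters of the text from the question. It assumes the file is saved as UTF-8 and that the mbstring extension is loaded; the variable names are only illustrative.

```php
<?php
// Assumes this file is saved as UTF-8 and the mbstring extension is available.
$text = 'Şelamiİnnşşasd';

// strlen() counts bytes; mb_strlen() counts code points in the given encoding.
$byteCount = strlen($text);              // 18
$charCount = mb_strlen($text, 'UTF-8');  // 14

// "Ş" is one character but occupies two bytes in UTF-8 (0xC5 0x9E).
$firstCharHex = bin2hex('Ş');            // "c59e"

// Byte 14 is the first byte of the second "ş", so cutting there
// leaves half a character and the result is no longer valid UTF-8.
$cut = substr($text, 0, 14);
$cutIsValid = mb_check_encoding($cut, 'UTF-8'); // false

echo $byteCount, ' bytes, ', $charCount, " characters\n";
echo 'Valid UTF-8 after substr: ', var_export($cutIsValid, true), "\n";
```

The dangling byte at the end of the cut string is what browsers and terminals usually render as a replacement character or drop entirely.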

The most common solution to this is to use mb_substr (part of the "mbstring" PHP extension) which counts "characters" according to some encoding:

$string = mb_substr($string, 0, 14, 'UTF-8');
echo $string, "\n"; 
// Şelamiİnnşşasd

Note that this will truncate to 14 Unicode code points, so it can still do odd things like chop an accent off a letter if that letter has been encoded using a "combining diacritic".

An alternative in some cases would be to use grapheme_substr (part of the "intl" extension) which splits on "graphemes", which are intended to be roughly what people would think of as a "character" or "letter". In this case, it gives the same result:

$string = grapheme_substr($string, 0, 14); // grapheme_substr takes no encoding argument; it expects UTF-8
echo $string, "\n"; 
// Şelamiİnnşşasd

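But in other cases, it might not, for example when the "ë" is stored as an "e" followed by a combining diaeresis (U+0308) rather than as a single precomposed code point:

$string = "noe\u{0308}l"; // 'noël' with a combining diaeresis; \u{...} needs PHP 7+
echo mb_substr($string, 0, 3, 'UTF-8'), "\n"; // noe
echo grapheme_substr($string, 0, 3), "\n"; // noë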
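The two functions only disagree when one visible character is stored as more than one code point. The sketch below, assuming mbstring and PHP 7+ for the `\u{...}` escape, builds both encodings of "noël" explicitly so the difference does not depend on how an editor saved the file; the variable names are illustrative.

```php
<?php
// Two ways to encode the same visible word "noël".
$precomposed = "no\u{00EB}l";   // ë as a single code point, U+00EB
$decomposed  = "noe\u{0308}l";  // e followed by U+0308 COMBINING DIAERESIS

$precomposedLength = mb_strlen($precomposed, 'UTF-8'); // 4
$decomposedLength  = mb_strlen($decomposed, 'UTF-8');  // 5

// Taking 3 code points keeps the accent only in the precomposed form.
$precomposedCut = mb_substr($precomposed, 0, 3, 'UTF-8'); // "noë"
$decomposedCut  = mb_substr($decomposed, 0, 3, 'UTF-8');  // "noe", accent lost

echo $precomposedLength, ' vs ', $decomposedLength, " code points\n";
echo $precomposedCut, ' vs ', $decomposedCut, "\n";
```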

You should use the multibyte version of substr(), mb_substr().

Try:

<?php
$string = '<p>Şelamiİnnşşasdüğ213,123wqeq.weqw.rqasd</p><p>Şelamiİnnşşasdüğ213,123wqeq.weqw.rqasd</p><p>Şelamiİnnşşasdüğ213,123wqeq.weqw.rqasd</p>';

// Pass the encoding explicitly rather than relying on mb_internal_encoding()
echo mb_substr(strip_tags(trim(html_entity_decode($string, ENT_COMPAT, 'UTF-8'))), 0, 14, 'UTF-8');

?>
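If this truncation is needed in more than one place, the steps above could be wrapped in a small function. The following is only a sketch: `plain_text_excerpt` is a hypothetical name, not something from either answer, and it assumes UTF-8 input and PHP 7+ with mbstring loaded. It uses grapheme_substr when the intl extension happens to be available.

```php
<?php
/**
 * Hypothetical helper (not part of the original answers): decode entities,
 * strip tags, then truncate without splitting a character.
 */
function plain_text_excerpt(string $html, int $length): string
{
    $text = strip_tags(trim(html_entity_decode($html, ENT_COMPAT, 'UTF-8')));

    // Prefer grapheme clusters when the intl extension is loaded.
    if (function_exists('grapheme_substr')) {
        $cut = grapheme_substr($text, 0, $length);
        if ($cut !== false) {
            return $cut;
        }
    }

    // Otherwise count Unicode code points with mbstring.
    return mb_substr($text, 0, $length, 'UTF-8');
}

$excerpt = plain_text_excerpt('<p>Şelamiİnnşşasdüğ213,123wqeq.weqw.rqasd</p>', 14);
echo $excerpt, "\n"; // Şelamiİnnşşasd
```

For this input both code paths return the same 14 characters, since none of them uses a combining mark.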



 