PHP: strpos & substr with UTF-8

Question

Say I have a long UTF-8 encoded string.

And say I want to detect if $var exists in this string.

Assuming $var will always consist of plain ASCII letters or digits (e.g. "hello123"), I shouldn't need to use mb_strpos or iconv_strpos, right? It shouldn't matter whether the position is correct in terms of characters, as long as it's consistent with the other functions.

Example:

$var = 'hello123';
$pos = strpos($utf8string, $var); // byte offset, or false if not found
if ($pos !== false) {
    $uptohere = substr($utf8string, 0, $pos);
}

Am I correct that the above code will extract everything up to 'hello123', regardless of whether the string contains fancy UTF-8 characters? My reasoning is that strpos and substr are consistent with each other (even if both are consistently wrong in terms of characters), so it should still work.

Answer 1

Yes, you are correct. There's no ambiguity about the characters themselves, i.e. hello123 can't possibly be anything else in UTF-8. The way you're slicing it, it doesn't matter whether you're slicing by character or by byte number.

So yes, this is safe, as long as your string is UTF-8 and thereby ASCII-compatible.

See here for quick test: http://3v4l.org/XnM8s
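To illustrate why the ASCII-compatibility condition matters, here is a minimal sketch that runs the same search against a UTF-16LE copy of the string. It assumes the mbstring extension is available and that the source file is saved as UTF-8; the variable names are only illustrative.

```php
<?php
// Assumption: mbstring is loaded and this file is saved as UTF-8.
$utf8  = "漢字hello123";
$utf16 = mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');

// In UTF-8 the ASCII needle appears as the same contiguous bytes.
$posUtf8 = strpos($utf8, 'hello123');
var_dump($posUtf8);  // int(6)

// In UTF-16LE every ASCII letter is followed by a 00 byte,
// so the raw ASCII needle is not found.
$posUtf16 = strpos($utf16, 'hello123');
var_dump($posUtf16); // bool(false)

// The needle has to be in the same encoding as the haystack.
$needle16 = mb_convert_encoding('hello123', 'UTF-16LE', 'UTF-8');
$posBoth16 = strpos($utf16, $needle16);
var_dump($posBoth16); // int(4): two characters of 2 bytes each
```

The byte-wise approach depends on the needle and the haystack sharing an encoding in which the needle's bytes mean only one thing, which is what UTF-8 guarantees for ASCII text.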

Why this works:

The string "漢字hello123" in UTF-8 looks like this as bytes (I hope this aligns correctly):

e6 | bc | a2 | e5 | ad | 97 | 68 | 65 | 6c | 6c | 6f | 31 | 32 | 33
     漢      |      字      | h  | e  | l  | l  | o  | 1  | 2  | 3

strpos will look for the byte sequence 68 65 6c 6c 6f 31 32 33, returning 6 as the offset of the first byte of "hello123". substr will then slice 6 bytes starting at byte 0, returning "漢字". There is no ambiguity: you're finding and slicing by bytes, so it doesn't matter how many characters there are.
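The byte layout above can be checked with a short script. This is only a sketch, assuming the source file is saved as UTF-8 so that the string literal holds the bytes shown in the table.

```php
<?php
// Assumption: this file is saved as UTF-8.
$utf8string = "漢字hello123";
$var = 'hello123';

// Dump the raw bytes as space-separated hex pairs.
echo trim(chunk_split(bin2hex($utf8string), 2, ' ')), "\n";
// e6 bc a2 e5 ad 97 68 65 6c 6c 6f 31 32 33

// strpos returns a byte offset, not a character offset.
$pos = strpos($utf8string, $var);
var_dump($pos); // int(6)

// substr also counts bytes, so the two functions agree with each other.
$uptohere = substr($utf8string, 0, $pos);
var_dump($uptohere === "漢字"); // bool(true)
```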

You need to either work entirely in characters, in which case the string functions must be encoding-aware, or work entirely in bytes, in which case the only requirement is that the bytes aren't ambiguous. (Say "hello123" could match "中国" encoded in BIG5 because the bytes were the same. They aren't; it's just an example.) UTF-8 is self-synchronizing, meaning there's no such ambiguity: every byte of a multi-byte character has its high bit set, so an ASCII needle can never match inside one.
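The "entirely in characters or entirely in bytes" rule can be sketched as follows. The example assumes the mbstring extension is available and the file is saved as UTF-8; the variable names are illustrative. It shows that each consistent pair gives the same result, while mixing a character offset with a byte function (or the reverse) does not.

```php
<?php
// Assumption: mbstring is loaded and this file is saved as UTF-8.
$s = "漢字hello123";
$needle = 'hello123';

$bytePos = strpos($s, $needle);                 // 6 (bytes)
$charPos = mb_strpos($s, $needle, 0, 'UTF-8');  // 2 (characters)

// Consistent pairs: both yield "漢字".
$byBytes = substr($s, 0, $bytePos);
$byChars = mb_substr($s, 0, $charPos, 'UTF-8');
var_dump($byBytes === $byChars); // bool(true)

// Mixed units: a character offset fed to a byte function cuts
// the first character in half and leaves invalid UTF-8.
$broken = substr($s, 0, $charPos);
var_dump(bin2hex($broken));                     // string(4) "e6bc"
var_dump(mb_check_encoding($broken, 'UTF-8'));  // bool(false)

// Mixed units the other way: a byte offset fed to a character
// function takes too many characters.
$tooLong = mb_substr($s, 0, $bytePos, 'UTF-8');
var_dump($tooLong); // string(10) "漢字hell"
```

For the question's use case either consistent pair works; the byte pair simply avoids the dependency on mbstring.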

Answer 2

In UTF-8 you must use the mb_* functions. In your case, you need to replace substr with:

mb_substr($var, 0, N, 'UTF-8');

mb_substr()

solution1
9 ACCEPTED 2013-02-24 10:24:30

solution2
3 2013-02-24 10:21:20