
Strange behavior of grapheme_strlen function with some line endings

Can anyone explain this weird behavior of the Unicode strlen function in PHP's intl extension?

var_dump(grapheme_strlen("a\r\n")); // (ASCII 'a') length: 3
var_dump(grapheme_strlen("の\r\n")); // length: 2
var_dump(grapheme_strlen("\r\n")); // length: 2

It seems that grapheme_strlen counts "\r\n" (CR LF, two separate code points used as the line ending on Windows) as a single grapheme, which would be quite reasonable given the name of the function, but it does so only when the line ending is preceded by a non-ASCII character. Why?

This is a bug. grapheme_strlen should follow the Grapheme Cluster Boundaries defined in Unicode Standard Annex #29 (Unicode Text Segmentation). The standard clearly says not to break between CR and LF (rule GB3).

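If you have a look at the PHP source, grapheme_strlen simply returns the number of characters for ASCII-only strings, without applying the grapheme cluster rules. That is why "a\r\n" and "\r\n" are reported as 3 and 2, while "の\r\n", which contains a non-ASCII character and therefore goes through the real segmentation, is correctly reported as 2.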
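As a possible workaround, here is a minimal sketch that counts grapheme clusters with PCRE's \X escape instead of grapheme_strlen. The helper name count_graphemes is hypothetical (it is not part of the intl extension), and the sketch assumes a PHP build whose PCRE library has Unicode support and follows the UAX #29 rule of not breaking between CR and LF.

```php
<?php
// Hypothetical helper, not part of the intl extension: counts extended
// grapheme clusters with PCRE's \X instead of grapheme_strlen().
function count_graphemes(string $s): ?int
{
    // /u makes PCRE treat $s as UTF-8; \X matches one extended grapheme cluster.
    $n = preg_match_all('/\X/u', $s);

    // preg_match_all() returns false on failure, e.g. for malformed UTF-8.
    return $n === false ? null : $n;
}

// The three strings from the question.
var_dump(count_graphemes("a\r\n"));
var_dump(count_graphemes("の\r\n"));
var_dump(count_graphemes("\r\n"));
```

Under UAX #29 the three strings contain 2, 2 and 1 grapheme clusters respectively, so this should give the same count whether or not the text before the line ending is ASCII. Since it does not take the ASCII shortcut, it will probably be slower than grapheme_strlen on long ASCII strings.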
