preg_replace_callback: including curly braces in a pattern: { is captured, } isn't
I have this function, which uses preg_replace_callback to split a sentence into "chains" of blocks belonging to different categories (alphabetic, Han characters, everything else).
The function also tries to treat the characters ', {, and } as "alphabetic".
function String_SplitSentence($string)
{
    $res = array();
    preg_replace_callback("~\b(?<han>\p{Han}+)\b|\b(?<alpha>[a-zA-Z0-9{}']+)\b|(?<other>[^\p{Han}A-Za-z0-9\s]+)~su",
        function($m) use (&$res)
        {
            if (!empty($m["han"]))
            {
                $t = array("type" => "han", "text" => $m["han"]);
                array_push($res, $t);
            }
            else if (!empty($m["alpha"]))
            {
                $t = array("type" => "alpha", "text" => $m["alpha"]);
                array_push($res, $t);
            }
            else if (!empty($m["other"]))
            {
                $t = array("type" => "other", "text" => $m["other"]);
                array_push($res, $t);
            }
        },
        $string);
    return $res;
}
However, something seems to be wrong with the curly braces.
print_r(String_SplitSentence("Many cats{1}, several rats{2}"));
As the output shows, the function treats { as an alphabetic character, as intended, but stops at } and treats it as "other" instead.
Array
(
    [0] => Array
        (
            [type] => alpha
            [text] => Many
        )
    [1] => Array
        (
            [type] => alpha
            [text] => cats{1
        )
    [2] => Array
        (
            [type] => other
            [text] => },
        )
    [3] => Array
        (
            [type] => alpha
            [text] => several
        )
    [4] => Array
        (
            [type] => alpha
            [text] => rats{2
        )
    [5] => Array
        (
            [type] => other
            [text] => }
        )
)
What am I doing wrong?
I can't be completely sure, because your sample input doesn't include any Chinese characters and I don't know what edge cases you may be trying to handle, but this is how I would write the pattern:
~(?<han>\p{Han}+)|(?<alpha>[a-z\d{}']+)|(?<other>\S+)~ui
The trouble with \b is that it depends on \w characters: it only matches at a position between a \w character and a non-\w character (or at the start or end of the string). \w represents uppercase letters, lowercase letters, numbers, and underscores. Reference: https://stackoverflow.com/a/11874899/2943403
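To make the word-boundary behaviour concrete, here is a small sketch. The sample string and variable names are made up for illustration; only the last pattern is taken from the question's alpha branch.

```php
<?php
// Illustrative sample text (an assumption, shortened from the question's input).
$text = "cats{1}, rats";

// The position after "}" sits between "}" and ",": both are non-word
// characters, so \b cannot match there.
$afterBrace = preg_match('~\}\b~', $text);
var_dump($afterBrace); // int(0)

// The position after "1" sits between a word character ("1") and a
// non-word character ("}"), so \b matches.
$afterDigit = preg_match('~1\b~', $text);
var_dump($afterDigit); // int(1)

// A greedy character class followed by \b therefore backtracks until it
// ends on a word character, dropping the trailing "}".
preg_match("~\b[a-zA-Z0-9{}']+\b~", $text, $m);
echo $m[0], PHP_EOL; // cats{1
```

This is why { appears to be "captured" while } is not: the { sits in the middle of the match, whereas the } would have to be the last character before a \b.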
Also, your pattern doesn't contain any . (dot) metacharacters, so you can remove the s pattern modifier.
Also, your function call seems to be abusing preg_replace_callback(). I mean, you aren't actually replacing anything, so it is an inappropriate call. Perhaps you could consider this rewrite:
function String_SplitSentence($string){
    if(!preg_match_all("~(?<han>\p{Han}+)|(?<alpha>[a-z\d{}']+)|(?<other>\S+)~ui",$string,$out)){
        return [];  // or $string or false
    }else{
        foreach($out as $group_key=>$group){
            if(!is_numeric($group_key)){  // disregard the indexed groups (which are unavoidably generated)
                foreach($group as $i=>$v){
                    if(strlen($v)){  // only store the value in the subarray that has a string length
                        $res[$i]=['type'=>$group_key,'text'=>$v];
                    }
                }
            }
        }
        ksort($res);
        return $res;
    }
}
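As a possible variation on the same idea (not part of the original answer), preg_match_all() can be asked for PREG_SET_ORDER, which returns one array per match in input order and so makes the ksort() step unnecessary. The function name splitSentenceSetOrder is hypothetical; the pattern is the one suggested above.

```php
<?php
// Hypothetical variant; the name and structure are illustrative assumptions.
function splitSentenceSetOrder(string $string): array
{
    $pattern = "~(?<han>\p{Han}+)|(?<alpha>[a-z\d{}']+)|(?<other>\S+)~ui";

    // PREG_SET_ORDER groups the results per match rather than per capture group.
    if (!preg_match_all($pattern, $string, $matches, PREG_SET_ORDER)) {
        return [];
    }

    $res = [];
    foreach ($matches as $m) {
        foreach (['han', 'alpha', 'other'] as $type) {
            // Only the branch that actually matched holds a non-empty string;
            // trailing unmatched groups may be missing from the array entirely.
            if (isset($m[$type]) && $m[$type] !== '') {
                $res[] = ['type' => $type, 'text' => $m[$type]];
                break;
            }
        }
    }
    return $res;
}

print_r(splitSentenceSetOrder("Many cats{1}, several rats{2}"));
```

With the question's sample input this should give five entries, with cats{1} and rats{2} kept whole as "alpha" and the comma on its own as "other".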
A demonstration of your pattern: https://regex101.com/r/6EUaSM/1
The \b after your character class was fouling it all up. } is not included in the \w class. Regex wants to do a good job for you: it captured "greedily" until it couldn't anymore, then backtracked to the last position where \b could match. The } was getting left out because of the word boundary.