
How to correctly replace multiple white spaces with a single white space in PHP?

I was scouring SO answers and found that the solution most of them give for replacing multiple spaces is:

$new_str = preg_replace("/\s+/", " ", $str);

But in many cases the whitespace in a string goes beyond the plain space and includes characters such as line feed, form feed, carriage return, non-breaking space, etc. This wiki describes twenty-five characters that Unicode defines as whitespace.

So how do we replace all these characters as well using regular expressions?

When the u modifier is passed, \s becomes Unicode-aware. So a simple solution is to use:

$new_str = preg_replace("/\s+/u", " ", $str);
                             ^^

See the PHP online demo.
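As a quick check of the u-modifier approach, here is a minimal sketch. The helper name `collapse_whitespace` and the sample string are made up for illustration, and the `"\u{...}"` string syntax assumes PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the original answer); it only wraps the one-liner above.
function collapse_whitespace(string $str): string
{
    // With /u the pattern and subject are treated as UTF-8, and \s also
    // matches Unicode whitespace such as U+00A0 (non-breaking space).
    $result = preg_replace('/\s+/u', ' ', $str);

    // preg_replace() returns null on failure (for example, invalid UTF-8 input),
    // so fall back to the unchanged input in that case.
    return $result ?? $str;
}

// Two non-breaking spaces, a normal space, then CR, LF and a tab.
$input = "a\u{00A0}\u{00A0} b\r\n\tc";
echo collapse_whitespace($input), PHP_EOL; // prints "a b c"
```

Falling back to the original string on failure is a design choice for this sketch; depending on the application, raising an error for invalid UTF-8 may be more appropriate.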

The first thing to do is to read this explanation of how Unicode can be treated in regex. Coming specifically to PHP, we first need to add the PCRE modifier 'u' so that the engine treats the pattern and the subject string as UTF-8. So this would be:

$pattern = "/<our-pattern-here>/u";

The next thing to note is that in a PHP regex a Unicode character is written as \x{00A0}, where 00A0 is the hexadecimal code point of the non-breaking space. So if we want to replace consecutive non-breaking spaces with a single space we would have:

$pattern = "/\x{00A0}+/u";
$new_str = preg_replace($pattern," ",$str);
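A small sketch of that pattern in use. The sample string is invented. The pattern is written in single quotes here so that PHP passes \x{00A0} to the regex engine untouched, and `"\u{00A0}"` assumes PHP 7 or later.

```php
<?php
// Match one or more consecutive non-breaking spaces (U+00A0).
$pattern = '/\x{00A0}+/u';

// Sample input (an assumption for this sketch): three non-breaking spaces between words.
$str = "foo\u{00A0}\u{00A0}\u{00A0}bar";

$new_str = preg_replace($pattern, ' ', $str);
echo $new_str, PHP_EOL; // prints "foo bar"
```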

And if we were to include other types of spaces mentioned in the wiki, like:

  • \x{000D} carriage return
  • \x{000C} form feed
  • \x{0085} next line

Our pattern becomes:

$pattern = "/[\x{00A0}\x{000D}\x{000C}\x{0085}]+/u";
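To illustrate what this character class does, here is a short sketch with an invented input containing each of the four characters:

```php
<?php
// Character class covering NBSP, carriage return, form feed and next line.
$pattern = '/[\x{00A0}\x{000D}\x{000C}\x{0085}]+/u';

// Sample input (assumed): NBSP + CR + NEL between "a" and "b", form feed before "c".
$str = "a\u{00A0}\r\u{0085}b\fc";

// Each run of listed characters collapses to one normal space.
$new_str = preg_replace($pattern, ' ', $str);
echo $new_str, PHP_EOL; // prints "a b c"
```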

This works, and a character class followed by + is matched in a single pass over the string, so the pattern is not inherently slow.

An alternative is to split the job into two steps: first replace every occurrence of each of these characters with a normal space, and then replace runs of spaces with a single normal space. For the first step we drop the [ ]+ and instead separate the characters with the alternation operator |:

$pattern = "/\x{00A0}|\x{000D}|\x{000C}|\x{0085}/u";
$new_str = preg_replace($pattern," ",$str); // we have one-to-one replacement of character by a normal space, so 5 unicode chars give 5 normal spaces
$final_str = preg_replace("/\s+/", " ", $new_str); // multiple normal spaces now become single normal space
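The two steps can be wrapped in one function, sketched below. The function name `normalize_spaces_two_step` and the sample input are hypothetical, and the null checks are an addition that the snippet above does not have.

```php
<?php
// Hypothetical wrapper around the two-step replacement shown above.
function normalize_spaces_two_step(string $str): string
{
    // Step 1: map each listed character to exactly one normal space.
    $spaced = preg_replace('/\x{00A0}|\x{000D}|\x{000C}|\x{0085}/u', ' ', $str);
    if ($spaced === null) {
        // preg_replace() failed (for example, invalid UTF-8); return input unchanged.
        return $str;
    }

    // Step 2: collapse runs of whitespace into a single normal space.
    $collapsed = preg_replace('/\s+/', ' ', $spaced);
    return $collapsed ?? $spaced;
}

// Two NBSPs, a normal space and a next-line character between "a" and "b".
echo normalize_spaces_two_step("a\u{00A0}\u{00A0} \u{0085}b"), PHP_EOL; // prints "a b"
```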

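A pattern that covers all Unicode whitespace is [\pZ\pC]. Here \pZ is the separator category and \pC is the "other" category, which contains control characters such as tab and line feed. Note that \pC also contains characters that are not whitespace, such as format characters, so the pattern matches more than whitespace alone. Here is a unit test to prove it.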

If you're parsing user input in UTF-8 and need to normalize it, it's important to base your match on that list. So the answer to your question would be:

$new_str = preg_replace("/[\\pZ\\pC]+/u", " ", $str);
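A brief sketch of how this pattern behaves, using invented inputs. The second example is there to show that the C category also catches invisible format characters that are not whitespace, which may or may not be what you want.

```php
<?php
// Separators (\pZ) plus the "other" category (\pC: control, format, etc.).
$pattern = '/[\pZ\pC]+/u';

// EM SPACE (U+2003, a separator) followed by ZERO WIDTH SPACE (U+200B, a format
// character), then a tab (a control character).
$a = preg_replace($pattern, ' ', "a\u{2003}\u{200B}b\tc");
echo $a, PHP_EOL; // prints "a b c"

// SOFT HYPHEN (U+00AD) is a format character, so it is replaced as well.
$b = preg_replace($pattern, ' ', "co\u{00AD}op");
echo $b, PHP_EOL; // prints "co op"
```

If only true whitespace should be replaced, the \s pattern with the u modifier is the narrower choice.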

