
PHP Regular expression - Remove all non-alphanumeric characters

I use PHP.

My string can look like this:

This is a string-test width åäö and some über+strange characters: _like this?

Question

Is there a way to remove non-alphanumeric characters and replace them with a space? Here are some non-alphanumeric characters:

  • -
  • +
  • :
  • _
  • ?

I've read many threads about this, but the solutions don't support other languages, like this one:

preg_replace("/[^A-Za-z0-9 ]/", '', $string);

Requirements

  • My list of non-letter characters might not be complete.
  • My content contains characters from different languages, like åäöü. There could be many more.
  • The non-alphanumeric characters should be replaced with a space. Otherwise the words would be glued to each other.

You can try this:

preg_replace('~[^\p{L}\p{N}]++~u', ' ', $string);

\p{L} stands for all alphabetic characters (whatever the alphabet).

\p{N} stands for numbers.

With the u modifier, the pattern and the subject string are treated as UTF-8 (Unicode) strings.
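As a minimal sketch, the pattern above can be applied to the string from the question as follows. The helper name replaceNonAlnum is hypothetical, and the code assumes that both the source file and the input are UTF-8 encoded.

```php
<?php
// Hypothetical helper name; assumes $string is valid UTF-8.
function replaceNonAlnum(string $string): string
{
    // [^\p{L}\p{N}]++ matches a run of characters that are neither letters nor numbers.
    $result = preg_replace('~[^\p{L}\p{N}]++~u', ' ', $string);
    if ($result === null) {
        // With the u modifier, preg_replace() returns null for malformed UTF-8.
        throw new InvalidArgumentException('preg_replace failed (is the input valid UTF-8?)');
    }
    return $result;
}

$input = 'This is a string-test width åäö and some über+strange characters: _like this?';
echo replaceNonAlnum($input), "\n";
// Prints: This is a string test width åäö and some über strange characters like this
// (followed by one trailing space where the "?" was)
```

Wrapping the call in trim() would remove the trailing space if it is not wanted.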

Or this:

preg_replace('~\P{Xan}++~u', ' ', $string);

\p{Xan} contains Unicode letters and digits.

\P{Xan} contains everything that is not a Unicode letter or digit. (Be careful: it contains white space too, which you can preserve with ~[^\p{Xan}\s]++~u )
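The difference between the two Xan patterns shows up on input that contains white space other than a plain space. The following sketch uses a made-up sample string with a tab and assumes UTF-8 input.

```php
<?php
// Made-up sample; the "\t" is a real tab character.
$input = "foo:\tbar_baz";

// \P{Xan} also matches the tab, so the run ":\t" collapses into one space.
$all = preg_replace('~\P{Xan}++~u', ' ', $input);

// Excluding \s from the negated class leaves the tab in place.
$keepWhitespace = preg_replace('~[^\p{Xan}\s]++~u', ' ', $input);

var_dump($all === 'foo bar baz');              // bool(true)
var_dump($keepWhitespace === "foo \tbar baz"); // bool(true)
```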

If you want a more specific set of allowed letters, you must replace \p{L} with ranges from the Unicode table.

Example:

preg_replace('~[^a-zÀ-ÖØ-öÿŸ\d]++~ui', ' ', $string);
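To illustrate what restricting the ranges does, here is a small sketch with a made-up input that mixes Greek and accented Latin letters. It assumes UTF-8 for both the source file and the input; the Greek word falls outside the listed ranges and is replaced, while the accented Latin letters are kept.

```php
<?php
// Pattern copied from the example above; the input is made up for illustration.
$pattern = '~[^a-zÀ-ÖØ-öÿŸ\d]++~ui';
$input   = 'Ελληνικά déjà vu 42';

// The Greek word and the space after it form one run of disallowed characters.
$result = preg_replace($pattern, ' ', $input);

var_dump($result === ' déjà vu 42'); // bool(true)
```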

Why use a possessive quantifier (++) here?

~\P{Xan}+~u gives the same result as ~\P{Xan}++~u . The difference is that with the first the engine records each backtracking position (which we don't need), whereas with the second it doesn't (as in an atomic group). The result is a small performance gain.

I think it's good practice to use possessive quantifiers and atomic groups whenever possible.

However, the PCRE regex engine automatically makes a quantifier possessive in obvious situations (example: a+b => a++b ), unless this optimization has been disabled with the PCRE_NO_AUTO_POSSESS option. ( http://www.pcre.org/pcre.txt )
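A short sketch of both points: for this task the greedy and possessive forms produce the same output, while a contrived pattern (a+a versus a++a, chosen only for illustration) shows what giving up backtracking positions means. The sample string is made up and assumed to be UTF-8.

```php
<?php
$input = 'über+strange characters: _like this?';

// Same replacement result with either quantifier.
$greedy     = preg_replace('~\P{Xan}+~u', ' ', $input);
$possessive = preg_replace('~\P{Xan}++~u', ' ', $input);
var_dump($greedy === $possessive); // bool(true)

// Where the two differ: a possessive quantifier never gives characters back.
$greedyMatch     = preg_match('~a+a~', 'aaa');  // a+ backtracks one "a", so the final "a" can match
$possessiveMatch = preg_match('~a++a~', 'aaa'); // a++ keeps every "a", so the final "a" cannot match
var_dump($greedyMatch);     // int(1)
var_dump($possessiveMatch); // int(0)
```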

More information about possessive quantifiers and atomic groups: here (possessive quantifiers) and here (atomic groups) or here

Are you perhaps looking for \W ?

Something like:

/[\W_]/

Matches any non-alphanumeric character or underscore.

\w matches any word character (letters, digits, underscore).

\W matches anything not in \w .

So \W matches any non-alphanumeric character, and you add the underscore to the class since \W doesn't match underscores.

EDIT: This makes your line of code become:

preg_replace("/[\W_]/", ' ', $string);

The ' ' means that every matching character (anything that is neither a letter nor a number) becomes a space.

reEDIT: You might additionally want to use another preg_replace to collapse consecutive spaces into a single space, otherwise you'll end up with:

This is a string test width     and some  ber strange characters   like this 

You can use:

preg_replace("/\s+/", ' ', $string);

And lastly, trim any leading and trailing spaces.
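The three steps (replace, collapse, trim) could be combined roughly as in the sketch below. The function name normalizeToWords is hypothetical. The u modifier is an addition that is not part of the answer above: it assumes UTF-8 input and relies on PHP making \w and \W Unicode-aware when u is set, so that letters such as å and ü are kept instead of being replaced.

```php
<?php
// Hypothetical helper; assumes UTF-8 input. Malformed UTF-8 yields an empty string here.
function normalizeToWords(string $string): string
{
    // Step 1: every non-word character or underscore becomes a space.
    $string = preg_replace('/[\W_]/u', ' ', $string);
    // Step 2: collapse each run of whitespace into a single space.
    $string = preg_replace('/\s+/u', ' ', (string) $string);
    // Step 3: remove leading and trailing spaces.
    return trim((string) $string);
}

$input = 'This is a string-test width åäö and some über+strange characters: _like this?';
echo normalizeToWords($input), "\n";
// Prints: This is a string test width åäö and some über strange characters like this
```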

I am not entirely sure which variety of regex you are using. However, POSIX regexes allow you to express an alphabetic class, where [:alpha:] represents any alphabetic character.

So try:

preg_replace("/[^[:alpha:]0-9 ]/", '', $string);

Actually, I forgot about [:alnum:] - that makes it simpler:

preg_replace("/[^[:alnum:] ]/", '', $string);
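For the multilingual input in the question, a variant of this pattern might look like the sketch below. Two things here are assumptions beyond the answer: the u modifier is added (in PHP's PCRE this is expected to make POSIX classes such as [:alnum:] Unicode-aware, given UTF-8 input), and the replacement is a space with a + quantifier, as the question asks for.

```php
<?php
// Made-up sample in UTF-8; without the u modifier, [:alnum:] would only cover ASCII
// (or the current locale), and the bytes of å, ä, ö, ü would be stripped.
$input  = 'string-test åäö über+strange';
$result = preg_replace('/[^[:alnum:] ]+/u', ' ', $input);

var_dump($result === 'string test åäö über strange'); // bool(true)
```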

I believe \p{xx} is what you are looking for; see here

So, try:

preg_replace("/\P{L}+/u", ' ', $string);
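Note that \P{L} matches everything that is not a letter, which includes digits. The sketch below, using a made-up UTF-8 sample, contrasts it with a class that also keeps numbers via \p{N}.

```php
<?php
$input = 'route 66: über_cool';

// \P{L}+ treats the digits as part of the run to replace.
$lettersOnly = preg_replace('/\P{L}+/u', ' ', $input);

// Adding \p{N} to the negated class keeps the digits.
$lettersAndDigits = preg_replace('/[^\p{L}\p{N}]+/u', ' ', $input);

var_dump($lettersOnly === 'route über cool');         // bool(true)
var_dump($lettersAndDigits === 'route 66 über cool'); // bool(true)
```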
