
Regex: remove all non-alphanumeric characters except emoticons

I need to remove all non-alphanumeric characters except spaces and allowed emoticons.

Allowed emoticons are :), :(, :P, etc. (the most popular ones).

I have a string:

$string = 'Hi! Glad # to _ see : you :)';

So I need to process this string and get the following:

$string = 'Hi Glad to see  you :)';

Also, please note that emoticons can contain spaces,

e.g.

: ) instead of :)

or

: P instead of :P

Does anyone have a function to do this?

If someone could help me, that would be great :)

UPDATE

Thank you very much for your help.

buckley offered a ready solution,

but if the string contains emoticons with spaces,

e.g. Hi! Glad # to _ see : you : )

the result is Hi Glad to see you

As you can see, the emoticon : ) was cut off.

I don't "speak" PHP ;) but this does it in JS. Maybe you can convert it.

var sIn = 'Hi! Glad # to _ see : you :)',
    sOut;

// [a-zA-Z0-9\s] rather than [\w\s]: \w also matches "_", which must be removed
sOut = sIn.match(/([a-zA-Z0-9\s]|: ?\)|: ?\(|: ?P)*/g).join('');

It works the other way around from your attempt: it finds all "legal" characters/combinations and joins them together.

Regards

Edit: Updated the regex to handle optional spaces in emoticons (as commented earlier).
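Since the answer invites a conversion, here is a minimal PHP sketch of the same match-and-join idea. The function name keepAllowed is hypothetical, and the character class is written out as [a-zA-Z0-9\s] on the assumption that underscores should be dropped (\w would keep them).

```php
<?php
// Hypothetical helper name; mirrors the JavaScript match-and-join idea.
function keepAllowed(string $input): string
{
    // Match either one letter, digit or whitespace character, or one emoticon
    // with an optional space between the eyes and the mouth.
    preg_match_all('/[a-zA-Z0-9\s]|: ?[)(P]/', $input, $matches);

    // $matches[0] holds every full match in order; glue them back together.
    return implode('', $matches[0]);
}

echo keepAllowed('Hi! Glad # to _ see : you : )'), "\n";
// prints: Hi Glad  to  see  you : )
```

The doubled spaces are what remains around each removed character; they can be collapsed in a separate step if needed.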

Ha! This one was interesting.

Replace

(?!(:\)|:\(|:P))[^a-zA-Z0-9 ](?<!(:\)|:\(|:P))

With nothing

The idea is that you sandwich the illegal character between the same emoticon pattern twice: once as a negative lookahead and once as a negative lookbehind. A character is removed only if it neither starts nor ends an emoticon.

The result will have consecutive spaces in it. AFAIK a regex cannot fix this in the same sweep, because it can't look at multiple matches at once.

To eliminate the consecutive spaces, replace \s+ with a single space in a second pass.
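As a rough PHP sketch of these two passes (the variable names are illustrative and the emoticon list is limited to the three two-character emoticons from the pattern above):

```php
<?php
// Emoticons to protect; each alternative is exactly two characters long,
// so every lookbehind alternative has a fixed length, as PCRE requires.
$emoticons = ':\)|:\(|:P';

// Remove a disallowed character unless an emoticon starts at it (lookahead)
// or ends at it (lookbehind).
$pattern = '/(?!' . $emoticons . ')[^a-zA-Z0-9 ](?<!' . $emoticons . ')/';

$string = 'Hi! Glad # to _ see : you :)';
$stripped = preg_replace($pattern, '', $string);

// Second pass: collapse runs of whitespace left behind by the removals.
$result = preg_replace('/\s+/', ' ', $stripped);

echo $result, "\n"; // prints: Hi Glad to see you :)
```

This variant does not handle emoticons that contain a space, such as : ).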

Here is an updated answer that meets the new requirement that an emoticon can contain a space.

Replace

((:\))|(:\()|(:P)|(: \))|: P)|[^0-9a-zA-Z\r\n ]

With

$1

Formatted in free-spacing mode, this becomes:

(?x)
(
  (?::\))|
  (?::\()|
  (?::P)|
  (?::\ \))|
  :\ P
)|
[^0-9a-zA-Z\r\n ]

In PHP:

$result = preg_replace('/((:\))|(:\()|(:P)|(: \))|: P)|[^0-9a-zA-Z\r\n ]/', '$1', $subject);

The idea is that the regex first tries the emoticons, which are multi-character sequences whose individual characters would otherwise be illegal.

This group is captured and used as the replacement, $1.

Then, after the alternation, a negated character class (our whitelist, negated) matches every illegal character. Because that branch lies outside the capturing group, $1 is empty and the character is replaced with nothing.

Everything that is not matched (our whitelist) is left in the result unchanged, as usual.

One thing to note is that listing the emoticons this way creates a lot of capturing groups, which can hinder performance. To prevent this, we can use non-capturing groups, at the cost of a slightly more verbose regex:

 ((?::\))|(?::\()|(?::P)|(?:: \))|: P)|[^0-9a-zA-Z\r\n ]

The multiple consecutive spaces remain and, AFAIK, can't be removed in the same sweep.
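One way to put the capture-and-replace pattern and the space cleanup together is sketched below. The function name stripKeepingEmoticons is hypothetical, and the second pass is an addition for illustration rather than part of the regex itself.

```php
<?php
// Hypothetical helper name; the pattern is the non-capturing variant above.
function stripKeepingEmoticons(string $subject): string
{
    $pattern = '/((?::\))|(?::\()|(?::P)|(?:: \))|: P)|[^0-9a-zA-Z\r\n ]/';

    // Emoticons land in group 1 and are written back via $1; any other
    // match leaves group 1 empty, so it is replaced with nothing.
    $kept = preg_replace($pattern, '$1', $subject);

    // Second pass: squeeze runs of spaces down to one.
    return preg_replace('/ {2,}/', ' ', $kept);
}

echo stripKeepingEmoticons('Hi! Glad # to _ see : you : )'), "\n";
// prints: Hi Glad to see you : )
```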

Here is a string formatter that could do the job, assuming that emoticons are generally 2 characters long:

<?php

class StringFormatter
{
  private $blacklist;
  private $whitelist;

  public function __construct(array $blacklist, array $whitelist)
  {
    $this->blacklist = $blacklist;
    $this->whitelist = $whitelist;
  }

  public function format($str)
  {
    $strLen = strlen($str);

    $result = '';
    $counter = 0;
    while ($counter < $strLen) {
      // get a character from the string
      $char = substr($str, $counter, 1);

      // if not blacklisted, allow it in the result
      if (!in_array($char, $this->blacklist)) {
        $result .= $char;
        $counter++;
        continue;
      }

      // if we reached the last letter, break out of the loop
      if ($counter >= $strLen - 1) {
        break;
      }

      // we assume all whitelisted entries have same length (e.g. 2
      // for emoticons)
      if (in_array(substr($str, $counter, 2), $this->whitelist)) {
        $result .= substr($str, $counter, 2);
        $counter += 2;
      } else {
        $counter++;
      }
    }

    return $result;
  }
}

// example usage
// $whitelist is not a complete whitelist: it lists the exceptions to the
// blacklist, i.e. longer strings containing blacklisted characters that should be allowed
$formatter = new StringFormatter(['#', '_', ':', '!'], [':)', ':(']);
echo $formatter->format('Hi! Glad # to _ see : you :)');

The code above could be refactored further to be cleaner, but you get the picture.
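If the fixed two-character assumption is too strict (for example, to allow : ) with a space), a variant could compare each exception at its own length. This is only a sketch: the function formatWithExceptions and its parameters are hypothetical, not part of the class above, and it assumes every exception is a non-empty string.

```php
<?php
// Hypothetical helper, not part of the StringFormatter class.
function formatWithExceptions(string $str, array $blacklist, array $exceptions): string
{
    $result = '';
    $i = 0;
    $len = strlen($str);

    while ($i < $len) {
        // Keep any exception that starts at this position, whatever its length.
        foreach ($exceptions as $exception) {
            if (substr($str, $i, strlen($exception)) === $exception) {
                $result .= $exception;
                $i += strlen($exception);
                continue 2; // resume the outer while loop
            }
        }

        // Otherwise keep the character unless it is blacklisted.
        if (!in_array($str[$i], $blacklist, true)) {
            $result .= $str[$i];
        }
        $i++;
    }

    return $result;
}

echo formatWithExceptions(
    'Hi! Glad # to _ see : you : )',
    ['#', '_', ':', '!'],
    [':)', ':(', ': )']
), "\n";
// prints: Hi Glad  to  see  you : )
```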

I'd use this regex:

(?i)(:\s*[)p(])(*SKIP)(*FAIL)|[^a-z0-9 ]

Demo: https://regex101.com/r/nW6iL3/2

PHP usage:

$string = ':     ) instead of :)

or

:     P instead of :P

Hi! Glad # to _ see : you :)';

echo preg_replace('~(?i)(:\s*[)p(])(*SKIP)(*FAIL)|[^a-z0-9 ]~', '', $string);

Output:

:     ) instead of :)or:     P instead of :PHi Glad  to  see  you :)

Demo: https://eval.in/416394

If the closing part of the emoticon changes, or you have others, you can add them inside this character class: [)p(].

You could also vary the eyes by changing the : to a character class, for example

(?i)([:;]\s*[)p(])(*SKIP)(*FAIL)|[^a-z0-9 ] 

if you also want to allow winking faces (I think the semicolon is the wink).

Update

Bit-by-bit explanation:

(?i) = make the regex case-insensitive

: = match the eyes (a colon)

\s* = match zero or more whitespace characters (the * means zero or more of the preceding element). \h might be better here, since \s also matches newlines, whereas \h matches only horizontal whitespace such as spaces and tabs

[)p(] = a character class that matches any one of the characters inside it, so ), p, or ( is allowed here

(*SKIP)(*FAIL) = if the preceding part matched, discard that match and resume searching after it, so the emoticon is never replaced; see www.rexegg.com/regex-best-trick.html

| = or

[^a-z0-9 ] = a negated character class: match any character not in this list

The regex101 demo also has documentation on the regex.
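To illustrate the two suggested tweaks together (\h instead of \s, and [:;] for the eyes), here is a small sketch. The function name removeExceptEmoticons is hypothetical; the pattern is otherwise the one explained above.

```php
<?php
// Hypothetical helper name. Uses \h so an emoticon may contain spaces or
// tabs but may not span a line break, and [:;] so winks are kept too.
function removeExceptEmoticons(string $text): string
{
    $pattern = '~(?i)[:;]\h*[)p(](*SKIP)(*FAIL)|[^a-z0-9 ]~';
    return preg_replace($pattern, '', $text);
}

echo removeExceptEmoticons('Hi! Glad # to _ see : you ; )'), "\n";
// prints: Hi Glad  to  see  you ; )
```

With \h, a colon and a parenthesis separated by a newline are no longer treated as one emoticon, so both are removed like any other punctuation.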
