
Special character issue in regular expression

I'm trying to create a regular expression based on a list of banned words. It will be compared against a string to find the banned words. Sub-words should not be matched.

The banned words will also be modified to include other characters that could be substituted for a letter, such as "@" or "!" in "viagra", giving "v!@gra".

So I have a string, and I search it for a word. I then write the regular expression using a word boundary and include all the possible substitute characters.

This works until I need to find a special character. I realize that a word boundary will not treat a special character the same way it treats a regular character, but I'm not sure of a good alternative.

Pseudocode:

string ReviewText = "$uck";
string BannedWord = "suck";
string regexInput = "";

// Replace "s"/"S" with a character class of look-alike characters.
// Replace() does nothing when there is no match, so no Contains() check is needed.
BannedWord = BannedWord.Replace("s", "[$s25]");
BannedWord = BannedWord.Replace("S", "[$s25]");

// Wrap the generated pattern in word boundaries.
regexInput = @"\b" + BannedWord + @"\b";

That should create \b[$s25]uck\b.

I realize that this is bad, since it puts a word boundary next to a special character, but I'm not sure how to get the behavior I want for all normal characters without it.
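To make the failure concrete, here is a minimal sketch in Python rather than C# (for these ASCII inputs, `\b` behaves the same way in both engines). The pattern is the one generated by the pseudocode; the sample strings are only illustrative.

```python
import re

# Pattern as generated by the pseudocode above.
pattern = re.compile(r"\b[$s25]uck\b")

# "s" is a word character, so \b matches at the start of the string.
plain = pattern.search("suck")

# "$" is not a word character. At the start of the string there is no
# word/non-word transition in front of it, so the leading \b cannot match.
disguised = pattern.search("$uck")

print(plain is not None)      # True
print(disguised is not None)  # False
```

The trailing `\b` is fine here because the word still ends in a letter; the same problem would appear at the end if the last letter were replaced by a symbol.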

Is there a combination of things somehow that I can do in order to fix this issue? I've tried everything I can think of.

Basically, I'm trying to create a moderation tool based on a word list and generate the regular expression on the fly. Now I just need it to work in cases with special characters as well.

The problem is that the number of special characters and sub-phrases is nearly limitless. Multi-character representations are also problematic.

For example: |-|acking or \/iagra

(Doubly difficult because the string lengths don't match)

Also, the requirement that no sub-words should be found means that you will not block interesting new phrases either. For example, calling someone a "pigf**ker" will be very offensive, but it will not be picked up by your algorithm.

The family or complexity of regexes you'll need is going to grow considerably. You may want to think about a primitive (or not so primitive) tokenization / normalization approach. Otherwise, you'll have no chance of catching things like "f * * k".
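A rough sketch of what such a normalization step might look like, in Python. The substitution table, the word list, and the function names are all assumptions made for illustration; a real tool would need a much larger table and its own tokenization rules.

```python
import re
from typing import List

# Hypothetical substitution table: look-alike sequence -> plain letter.
# Multi-character sequences come first so they are undone before any
# single-character rule could touch their parts.
SUBSTITUTIONS = [
    ("|-|", "h"),
    ("\\/", "v"),
    ("$", "s"),
    ("@", "a"),
    ("!", "i"),
]

BANNED_WORDS = {"hacking", "viagra", "suck"}  # placeholder list


def normalize(text: str) -> str:
    """Lower-case the text and undo the known substitutions."""
    result = text.lower()
    for disguise, letter in SUBSTITUTIONS:
        result = result.replace(disguise, letter)
    return result


def find_banned(text: str) -> List[str]:
    """Return banned words that appear as whole tokens after normalization."""
    tokens = re.findall(r"[a-z]+", normalize(text))
    return [token for token in tokens if token in BANNED_WORDS]


print(find_banned("|-|acking"))             # ['hacking']
print(find_banned("buy \\/iagra or v!@gra"))  # ['viagra', 'viagra']
```

Because the text is normalized first, the length mismatch between "|-|" and "h" stops mattering and the word list stays plain. The sketch is deliberately naive: a sentence-ending "!" is also rewritten to "i", and spaced-out spellings such as "f * * k" are not handled at all.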

This type of problem is more art than science, and while you'll be able to help admins, I'm not sure you'll be able to automate it 100%. Be sure to leave room in your project for a reporting system. They're hard to get away from.

Is there a combination of things somehow that I can do in order to fix this issue?

Yes.

.NET regular expressions support yes/no conditionals. Using that, you can still construct your regexInput string the same way; just replace each \b with the appropriate conditional.

This way you are free to replace any character in BannedWord with anything else without ever worrying about boundary conditions.

Example regex string result:

 # (?(?=\w)\b|\B)[$s25]uck(?(?<=\w)\b|\B)

 (?(?= \w )   # Conditional: is the next character a word character?
      \b      # yes: require a word boundary
   |  \B      # no: require a non-word boundary
 )
 [$s25] uck

 (?(?<= \w )  # Conditional: was the previous character a word character?
      \b      # yes: require a word boundary
   |  \B      # no: require a non-word boundary
 )

Just change your pseudocode to:

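string ReviewText = "$uck";
string BannedWord = "suck";
string regexInput = "";

// Replace "s"/"S" with a character class of look-alike characters.
// Replace() does nothing when there is no match, so no Contains() check is needed.
BannedWord = BannedWord.Replace("s", "[$s25]");
BannedWord = BannedWord.Replace("S", "[$s25]");

// Use conditionals instead of \b so the edges also work next to non-word characters.
regexInput = @"(?(?=\w)\b|\B)" + BannedWord + @"(?(?<=\w)\b|\B)";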
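For engines without lookaround conditionals, the same edges can be written with plain lookarounds. This is a swapped-in technique, not the .NET construct above: in both branches the leading conditional ends up requiring that the character before the match is not a word character, which is `(?<!\w)`, and the trailing one requires the same of the character after the match, which is `(?!\w)`. The Python sketch below assumes a hypothetical `LOOKALIKES` table and helper name.

```python
import re

# Hypothetical table of look-alike characters per letter.
LOOKALIKES = {"s": "$s25", "a": "@a4", "i": "!i1"}


def build_pattern(banned_word: str) -> re.Pattern:
    """Build a regex for one banned word with lookaround-based edges."""
    parts = []
    for ch in banned_word.lower():
        if ch in LOOKALIKES:
            # Escape each alternative so it is literal inside the class.
            escaped = "".join(re.escape(c) for c in LOOKALIKES[ch])
            parts.append("[" + escaped + "]")
        else:
            parts.append(re.escape(ch))
    # No word character directly before or after the match.
    return re.compile(r"(?<!\w)" + "".join(parts) + r"(?!\w)", re.IGNORECASE)


pattern = build_pattern("suck")
results = {text: bool(pattern.search(text))
           for text in ["$uck", "suck", "5UCK", "$ucker", "a$uck"]}
# "$uck", "suck" and "5UCK" match; "$ucker" and "a$uck" do not.
print(results)
```

Escaping every generated character keeps symbols such as "$" literal, which matters once the substitution table is built from data rather than typed by hand.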
