Special character issue in regular expression

I'm trying to create a regular expression based on a list of Banned Words. This will be compared against a string to find the banned words. No sub-words should be found.

The banned words will also be expanded to cover characters that can stand in for a letter, such as "!" or "@" in viagra: "v!@gra".

So I have a string and I search it for a word. I build the regular expression from that word, allowing the possible substitute characters for each letter, and wrap it in word boundaries.

This works until I need to find a special character. I realize that a word boundary does not treat a special character the way it treats a regular one, but I'm not sure what a good alternative is.

Pseudocode:

string reviewText = "$uck";
string bannedWord = "suck";

// Replace each "s" with a class of characters that can stand in for it.
bannedWord = bannedWord.Replace("s", "[$s25]").Replace("S", "[$s25]");

// Wrap the result in word boundaries.
string regexInput = @"\b" + bannedWord + @"\b";

That should create \b[$s25]uck\b.

I realize that this is bad since it puts a word boundary next to a special character, but I'm not sure how to accomplish what I want for all normal characters without it.
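To make the failure concrete, here is a minimal C# sketch that runs the generated pattern against a few inputs. The class and method names (`BoundaryDemo`, `IsFlagged`) are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    // The pattern the pseudocode above produces for "suck".
    public const string Pattern = @"\b[$s25]uck\b";

    // Returns true if the pattern matches anywhere in the text.
    public static bool IsFlagged(string text)
    {
        return Regex.IsMatch(text, Pattern, RegexOptions.IgnoreCase);
    }
}
```

With this sketch, `BoundaryDemo.IsFlagged("suck")` is true and `BoundaryDemo.IsFlagged("$uck")` is false, because \b needs a word character on exactly one side and "$" is not one. `BoundaryDemo.IsFlagged("x$uck")` is true, since the boundary now falls between "x" and "$", so the pattern also lets a sub-word through.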

Is there a combination of things somehow that I can do in order to fix this issue? I've tried all I can think of.

Basically I'm trying to create a moderation tool based on a word-list, and generate the regular expression on the fly. Now I just need it to work in cases of special characters as well.

The problem is, the number of special characters and sub-phrases is near limitless. Multi-character representations are also problematic.

For example: |-|acking or \/iagra

(Doubly difficult because the string lengths don't match)

Also, the requirement that no sub-words be found means that you won't block interesting new phrases either. For example, calling someone a "pigf**ker" will be very offensive, but it will not be picked up by your algorithm.

The number and complexity of the regexes you'll need is going to grow considerably. You may want to think about a primitive (or not so primitive) tokenization / normalization approach. Otherwise, you'll have no chance of catching things like "f * * k".
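As a rough illustration of the normalization idea, the sketch below maps look-alike sequences back to letters and then compares whole tokens against the word list. Everything here is an assumption made for the example: the names `NormalizingFilter`, `Normalize`, and `ContainsBannedWord` are hypothetical, and the substitution table is a tiny made-up sample, not a recommended list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class NormalizingFilter
{
    // Hypothetical substitution table, for illustration only.
    // Multi-character sequences are listed first so they are replaced first.
    private static readonly KeyValuePair<string, string>[] Substitutions =
    {
        new KeyValuePair<string, string>("|-|", "h"),
        new KeyValuePair<string, string>(@"\/", "v"),
        new KeyValuePair<string, string>("$", "s"),
        new KeyValuePair<string, string>("5", "s"),
        new KeyValuePair<string, string>("@", "a"),
    };

    // Lowercase the text and map look-alike sequences back to plain letters.
    public static string Normalize(string text)
    {
        string result = text.ToLowerInvariant();
        foreach (var pair in Substitutions)
        {
            result = result.Replace(pair.Key, pair.Value);
        }
        return result;
    }

    // True if any whole token of the normalized text is in the banned list.
    public static bool ContainsBannedWord(string text, ISet<string> bannedWords)
    {
        // Split on anything that is not a lowercase letter; empty tokens are harmless.
        string[] tokens = Regex.Split(Normalize(text), "[^a-z]+");
        return tokens.Any(token => bannedWords.Contains(token));
    }
}
```

With a banned list of "suck", "hacking", and "viagra", this flags "that $uck" and "|-|acking" but not "sucker", since only whole tokens are compared. It does nothing for spaced-out spellings like "f * * k", and blanket replacements such as "5" to "s" will also rewrite ordinary numbers, so treat it as a starting point only.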

This type of problem is more art than science, and while you'll be able to help admins, I'm not sure you'll be able to do it all 100% automatically. Be sure to leave room in your project for a reporting system. They're hard to get away from.

Is there a combination of things somehow that I can do in order to fix this issue?

Yes.

.NET regular expressions support yes/no conditional expressions. With that, you can
still construct your regexInput string the same way; just replace each \b with the
appropriate conditional.

This way you are free to replace any character in the banned word with anything else
without ever worrying about boundary conditions.

Example regex string result:

 # (?(?=\w)\b|\B)[$s25]uck(?(?<=\w)\b|\B)

 (?(?= \w )       # Conditional: is the next character a word character?
      \b          # yes: require a word boundary
   |  \B          # no: require a non-boundary
 )
 [$s25] uck

 (?(?<= \w )      # Conditional: was the previous character a word character?
      \b          # yes: require a word boundary
   |  \B          # no: require a non-boundary
 )

Just change your pseudocode to:

string reviewText = "$uck";
string bannedWord = "suck";

// Replace each "s" with a class of characters that can stand in for it.
bannedWord = bannedWord.Replace("s", "[$s25]").Replace("S", "[$s25]");

// Conditional boundaries instead of a plain \b on each side.
string regexInput = @"(?(?=\w)\b|\B)" + bannedWord + @"(?(?<=\w)\b|\B)";
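To try the conditional pattern on a few inputs, a small helper might look like the following. This is only a sketch: `BannedWordRegex`, `Build`, and the `LookAlikes` table are hypothetical names, and the table entries other than `[$s25]` are assumptions made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class BannedWordRegex
{
    // Hypothetical look-alike table; only the 's' entry comes from the question.
    private static readonly Dictionary<char, string> LookAlikes = new Dictionary<char, string>
    {
        { 's', "[$s25]" },
        { 'a', "[a@]" },
        { 'i', "[i!1]" },
    };

    // Conditional boundaries from the answer above.
    private const string Prefix = @"(?(?=\w)\b|\B)";
    private const string Suffix = @"(?(?<=\w)\b|\B)";

    // Build a regex for one banned word, letter by letter.
    public static Regex Build(string bannedWord)
    {
        var body = new StringBuilder();
        foreach (char c in bannedWord.ToLowerInvariant())
        {
            string characterClass;
            // Use the look-alike class if there is one, else the escaped letter itself.
            body.Append(LookAlikes.TryGetValue(c, out characterClass)
                ? characterClass
                : Regex.Escape(c.ToString()));
        }
        return new Regex(Prefix + body.ToString() + Suffix, RegexOptions.IgnoreCase);
    }
}
```

With this helper, `BannedWordRegex.Build("suck").IsMatch("$uck")` is true, while "x$uck" and "sucker" do not match, and `BannedWordRegex.Build("viagra").IsMatch("v!@gra")` is true. The two conditionals should behave the same as the plain lookarounds `(?<!\w)` and `(?!\w)`, since each one only ever requires the character on the outer side of the match to be a non-word character; that form may be easier to port to engines without conditionals.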
