
Regex for catching word with special characters between letters

I am new to regex, and I'm programming an advanced profanity filter for a commenting feature (in C#). Just to save time: I know that all filters can be fooled, no matter how good they are, so you don't have to tell me that. I'm just trying to make it a bit more advanced than basic word replacement. I've split the task into several separate approaches, and this is one of them.

What I need is a specific piece of regex that catches strings such as these:

s_h_i_t
s h i t
S<>H<>I<>T
s_/h_/i_/t
s***h***i***t

You get the idea. I guess what I'm looking for is a regex that says "one or more characters that are not alphanumeric". This should include both spaces and all special characters that you can type on a standard (Western) keyboard. If possible, it should also include line breaks, so it would catch things like

s
h
i
t

There should always be at least one of these characters present, to avoid likely false positives such as in

Finish it.

This will of course mean that things like

sh_it

will not be caught, but as I said, that doesn't matter; it doesn't have to be perfect. All I need is the regex; I can do the splitting of words and inserting the regex myself. I have the RegexOptions.IgnoreCase option set in my C# code, so character case in the actual word is not an issue. Also, this regex shouldn't worry about "leetspeak", i.e. some of the actual letters of the word being replaced by other characters:

sh1t

I have a different approach that deals with that. Thank you in advance for your help.

Let's see if this regex works for you:

/\w(?:_|\W)+/

\bs[\W_]*h[\W_]*i[\W_]*t[\W_]*(?!\w)

  • [\W_]* matches the characters between letters that are not word characters, plus the character _; this includes whitespace and line breaks

  • \b (word boundary) ensures that Finish it won't match

  • (?!\w) ensures that sh ituuu won't match. You may want to remove or modify it, as s_hittt will not match either. \bs[\W_]*h[\W_]*i[\W_]*t+[\W_]*(?!\w) will match the word with a repeated last character

  • The modification \bs[\W_]*h[\W_]*i[\W_]*t[\W_]*?(?!\w) makes the last character class non-greedy, so in sh it&&& only sh it will match

  • \bs[\W\d_]*h[\W\d_]*i[\W\d_]*t+[\W\d_]*?(?!\w) will match sh1i444t (digits between characters)
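The patterns above are written out by hand for a single word. As a rough sketch only (the helper names `BuildPattern` and `IsHit` and the class name are my own, not part of the answer), the same pattern can be generated for any word and checked against the strings from the question:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ObfuscationPattern
{
    // Hypothetical helper: for "shit" it produces
    // \bs[\W_]*h[\W_]*i[\W_]*t[\W_]*(?!\w)
    public static string BuildPattern(string word)
    {
        // Escape each character in case the word contains regex metacharacters.
        var letters = word.Select(c => Regex.Escape(c.ToString()));
        return @"\b" + string.Join(@"[\W_]*", letters) + @"[\W_]*(?!\w)";
    }

    public static bool IsHit(string text, string word)
    {
        return Regex.IsMatch(text, BuildPattern(word), RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string[] samples =
        {
            "s_h_i_t", "s h i t", "S<>H<>I<>T", "s_/h_/i_/t",
            "s***h***i***t", "s\nh\ni\nt", "Finish it.", "sh ituuu"
        };
        foreach (var sample in samples)
        {
            // Show line breaks visibly so each result stays on one line.
            var shown = sample.Replace("\n", "\\n");
            Console.WriteLine(shown + " -> " + IsHit(sample, "shit"));
        }
    }
}
```

Because `[\W_]*` also allows zero separators, this pattern matches the plain word and `sh_it` as well. Using `[\W_]+` between the letters instead would require at least one separator, which is what the question asks for.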

EDIT:

(?!\w) is a negative lookahead. It checks that your match is not followed by a word character (in ASCII terms, word characters are [A-Za-z0-9_]). It has a length of 0, which means it won't be included in the match. If you want to catch words like "s h i*tface", you'll have to remove it. ( http://www.regular-expressions.info/lookaround.html )

A word boundary (\b) matches a place where a word starts or ends. Its length is 0, which means that it matches between characters.

[\W] is a negated character class, equivalent to [^\w]; for ASCII input that is [^a-zA-Z0-9_].
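To make the zero-width behaviour concrete, here is a small sketch (the class and method names are illustrative, not from the answer). It shows the lookahead rejecting "s h i*tface", the match found once it is removed, and that the character inspected by the lookahead never becomes part of the match. It also shows that in .NET `\w` is Unicode-aware by default and only narrows to the ASCII set under `RegexOptions.ECMAScript`:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    public const string WithLookahead = @"\bs[\W_]*h[\W_]*i[\W_]*t[\W_]*(?!\w)";
    public const string WithoutLookahead = @"\bs[\W_]*h[\W_]*i[\W_]*t";

    // Returns the matched text, or "(no match)" when the pattern fails.
    public static string Find(string text, string pattern)
    {
        var m = Regex.Match(text, pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Value : "(no match)";
    }

    // True when the single character in s counts as a word character.
    public static bool IsWordChar(string s, bool asciiOnly)
    {
        var options = asciiOnly ? RegexOptions.ECMAScript : RegexOptions.None;
        return Regex.IsMatch(s, @"^\w$", options);
    }

    public static void Main()
    {
        // The "f" after the "t" makes the negative lookahead fail.
        Console.WriteLine(Find("s h i*tface", WithLookahead));    // (no match)
        // Without the lookahead, the obfuscated prefix is found.
        Console.WriteLine(Find("s h i*tface", WithoutLookahead)); // s h i*t
        // The lookahead only inspects the "."; it is not part of the match.
        Console.WriteLine(Find("it.", @"t(?!\w)"));               // t

        Console.WriteLine(IsWordChar("\u00E9", false)); // True: Unicode letter
        Console.WriteLine(IsWordChar("\u00E9", true));  // False: ASCII-only \w
    }
}
```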

Alright, HamZa's answer worked. However, I ran into a programmatic problem while working on the solution. When I was replacing just the words, I always knew the length of the word, so I knew exactly how many asterisks to replace it with. If I'm matching shit, I know I need to put 4 asterisks. But if I'm matching s[^a-z0-9]+h[^a-z0-9]+i[^a-z0-9]+t, I might catch s#h#i#t or I might catch s------h------i--------t. In both cases the length of the matched text will differ wildly from that of the pattern. How can I get the actual length of the matched string?

You want to match words where each letter is separated with the identical non-word char(s).您想匹配每个字母用相同的非单词字符分隔的单词。

You can use

\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b

See the regex demo. (In the demo, I added (?!\n) to make the regex work for each line as if it were a separate string.) Details:

  • \b - word boundary
  • \p{L} - a letter
  • (?=([\W_]+)) - a positive lookahead that matches a location that is immediately followed by one or more non-word or _ chars (captured into Group 1)
  • (?:\1\p{L})+ - one or more repetitions of the same char sequence captured into Group 1, followed by a letter
  • \b - word boundary.

To check if there is such a pattern in a string, you can use

var HasSpamWords = Regex.IsMatch(text, @"\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b");

To return all occurrences in a string, you can use

var results = Regex.Matches(text, @"\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b")
    .Cast<Match>()
    .Select(x => x.Value)
    .ToList();

See the C# demo.
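Since the linked demos are not reproduced here, the following is a stand-in sketch rather than the original demo. It runs the pattern over the samples from the question and prints the separator captured into Group 1; the class and method names are my own:

```csharp
using System;
using System.Text.RegularExpressions;

public static class SeparatorDemo
{
    public const string Pattern = @"\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b";

    // Returns the separator captured by the lookahead (Group 1),
    // or "(no match)" when the text contains no such word.
    public static string Separator(string text)
    {
        var m = Regex.Match(text, Pattern);
        return m.Success ? m.Groups[1].Value : "(no match)";
    }

    public static void Main()
    {
        string[] samples =
        {
            "s_h_i_t", "s h i t", "S<>H<>I<>T", "s_/h_/i_/t",
            "s***h***i***t", "Finish it.", "sh_it"
        };
        foreach (var sample in samples)
        {
            Console.WriteLine(sample + " -> [" + Separator(sample) + "]");
        }
    }
}
```

Note that the pattern is not tied to any particular word: it flags every run of single letters joined by one consistent separator, so the letters it extracts would presumably still need to be checked against a word list.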

Getting the length of each match is easy: use Match.Length, i.e. .Select(x => x.Length). If you need the length of the string with all special chars removed, simply use .Select(x => x.Value.Count(c => char.IsLetter(c))) (see this C# demo).
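For the asterisk replacement described in the follow-up question, one possible sketch is to pass a match evaluator to `Regex.Replace`, so that each replacement is sized from the match itself. The class and method names below are illustrative assumptions, not part of the answer:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class Censor
{
    const string Pattern = @"\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b";

    // One asterisk per matched character, separators included.
    public static string MaskAll(string text)
    {
        return Regex.Replace(text, Pattern, m => new string('*', m.Length));
    }

    // One asterisk per letter, so s#h#i#t becomes ****.
    public static string MaskLetters(string text)
    {
        return Regex.Replace(text, Pattern,
            m => new string('*', m.Value.Count(c => char.IsLetter(c))));
    }

    public static void Main()
    {
        Console.WriteLine(MaskAll("oh s--h--i--t!"));     // oh **********!
        Console.WriteLine(MaskLetters("oh s--h--i--t!")); // oh ****!
        Console.WriteLine(MaskLetters("s#h#i#t"));        // ****
    }
}
```

`MaskAll` keeps the comment the same length as the original text, while `MaskLetters` gives the same four asterisks a plain word replacement would produce.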
