Regex to catch words with special characters between the letters

Question

I am new to regex, and I'm programming an advanced profanity filter for a commenting feature (in C#). Just to save time: I know that all filters can be fooled, no matter how good they are, so you don't have to tell me that. I'm just trying to make it a bit more advanced than basic word replacement. I've split the task into several separate approaches, and this is one of them.

What I need is a specific piece of regex that catches strings such as these:

s_h_i_t
s h i t
S<>H<>I<>T
s_/h_/i_/t
s***h***i***t

You get the idea. I guess what I'm looking for is a regex that says "one or more characters that are not alphanumeric". This should include both spaces and all special characters that you can type on a standard (Western) keyboard. If possible, it should also include line breaks, so it would catch things like

s
h
i
t

There should always be at least one of these characters between the letters, to avoid likely false positives such as in

Finish it.

This will of course mean that things like

sh_it

will not be caught, but as I said, that doesn't matter; it doesn't have to be perfect. All I need is the regex. I can do the splitting of words and inserting the regex myself. I have the RegexOptions.IgnoreCase option set in my C# code, so character case in the actual word is not an issue. Also, this regex shouldn't worry about "leetspeak", i.e. some of the actual letters of the word being replaced by other characters:

sh1t

I have a different approach that deals with that. Thank you in advance for your help.

Answer 1

Let's see if this regex works for you:

/\w(?:_|\W)+/
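As a minimal sketch of how the separator part of this pattern, (?:_|\W)+, could be inserted between the letters of a word: the helper name BuildSpacedPattern and the sample strings are illustrative assumptions, not part of the answer, and the surrounding slashes are regex delimiters that C# does not use.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: joins the letters of a word with the separator
// "one or more characters that are either _ or a non-word character".
string BuildSpacedPattern(string word)
{
    const string separator = @"(?:_|\W)+";
    // Regex.Escape guards against characters that are regex metacharacters.
    return string.Join(separator, word.Select(c => Regex.Escape(c.ToString())));
}

var pattern = BuildSpacedPattern("shit");
var regex = new Regex(pattern, RegexOptions.IgnoreCase);

var samples = new[]
{
    "s_h_i_t", "s h i t", "S<>H<>I<>T", "s_/h_/i_/t",
    "s***h***i***t", "s\nh\ni\nt", "Finish it.", "sh_it"
};

foreach (var sample in samples)
{
    // Show line breaks as a visible marker so each result stays on one line.
    var shown = sample.Replace("\n", "<LF>");
    Console.WriteLine(shown + " -> " + regex.IsMatch(sample));
}
```

Under these assumptions, the first six samples match, while "Finish it." and "sh_it" do not, because every pair of adjacent letters needs at least one separator character.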

Answer 2

\bs[\W_]*h[\W_]*i[\W_]*t[\W_]*(?!\w)

[\W_]* matches any run of characters between the letters that are not word characters, plus the _ character; this covers whitespace, including line breaks
\b (word boundary) ensures that Finish it won't match
(?!\w) ensures that sh ituuu won't match; you may want to remove or modify it, as s_hittt will not match either. \bs[\W_]*h[\W_]*i[\W_]*t+[\W_]*(?!\w) will match the word with a repeated last character
the modification \bs[\W_]*h[\W_]*i[\W_]*t[\W_]*?(?!\w) makes the last character class non-greedy, so in sh it&&& only sh it will match
\bs[\W\d_]*h[\W\d_]*i[\W\d_]*t+[\W\d_]*?(?!\w) will match sh1i444t (digits between the letters)
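The behaviour described in the list above can be checked with a short sketch. The variable names are illustrative, and RegexOptions.IgnoreCase is assumed because the question says it is set.

```csharp
using System;
using System.Text.RegularExpressions;

var strict = new Regex(@"\bs[\W_]*h[\W_]*i[\W_]*t[\W_]*(?!\w)", RegexOptions.IgnoreCase);
var repeatedT = new Regex(@"\bs[\W_]*h[\W_]*i[\W_]*t+[\W_]*(?!\w)", RegexOptions.IgnoreCase);
var lazyTail = new Regex(@"\bs[\W_]*h[\W_]*i[\W_]*t[\W_]*?(?!\w)", RegexOptions.IgnoreCase);

// \b rejects an "s" in the middle of a word.
Console.WriteLine(strict.IsMatch("Finish it"));      // False

// (?!\w) rejects matches that are followed by more word characters.
Console.WriteLine(strict.IsMatch("sh ituuu"));       // False
Console.WriteLine(strict.IsMatch("s_hittt"));        // False

// t+ accepts a repeated last letter.
Console.WriteLine(repeatedT.IsMatch("s_hittt"));     // True

// Greedy tail swallows trailing symbols; the lazy tail stops after the word.
Console.WriteLine(strict.Match("sh it&&&").Value);   // sh it&&&
Console.WriteLine(lazyTail.Match("sh it&&&").Value); // sh it
```

Note that because these patterns use * rather than + between the letters, they also match the plain word with no separators at all.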

EDIT:

(?!\w) is a negative lookahead. It basically checks that your match is not followed by a word character (word characters are [A-Za-z0-9_]; in .NET, \w also matches Unicode letters and digits unless RegexOptions.ECMAScript is set). It has a length of 0, which means it won't be included in the match. If you want to catch words like "s h i*tface" you'll have to remove it. ( http://www.regular-expressions.info/lookaround.html )

A word boundary (\b) matches a position where a word starts or ends. Its length is 0, which means that it matches between characters.

\W is a negated character class; I think it's equal to [^a-zA-Z0-9_], i.e. [^\w].

Answer 3

Alright, HamZa's answer worked. However, I ran into a programmatic problem while working on the solution. When I was replacing just the words, I always knew the length of the word, so I knew exactly how many asterisks to replace it with. If I'm matching shit, I know I need to put 4 asterisks. But if I'm matching s[^a-z0-9]+h[^a-z0-9]+i[^a-z0-9]+t, I might catch s#h#i#t or I might catch s------h------i--------t. In both cases the length of the matched text will differ wildly from that of the pattern. How can I get the actual length of the matched string?

Answer 4

You want to match words where each letter is separated by the identical non-word char(s).

You can use

\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b

See the regex demo. (In the demo, I added (?!\n) to make the regex work for each line as if it were a separate string.) Details:

\b - word boundary
\p{L} - a letter
(?=([\W_]+)) - a positive lookahead that matches a location that is immediately followed by one or more non-word or _ chars (captured into Group 1)
(?:\1\p{L})+ - one or more repetitions of the same char sequence as captured into Group 1, followed by a letter
\b - word boundary.
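A small sketch of what the Group 1 backreference implies in practice. The sample strings and variable names are assumptions chosen for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var regex = new Regex(@"\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b");

// Same separator throughout: the whole string is a single match.
var same = regex.Match("s***h***i***t");
Console.WriteLine(same.Value);            // s***h***i***t
Console.WriteLine(same.Groups[1].Value);  // ***

// Mixed separators: \1 only repeats the first separator, so the match
// stops where the separator changes and a second match starts after it.
var mixed = regex.Matches("s_h-i_t")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToList();
Console.WriteLine(string.Join(", ", mixed)); // s_h, i_t
```

Because the pattern is not tied to a particular word, it flags any letters separated by an identical separator, so it is better suited to detecting the obfuscation style than to identifying a specific word.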

To check if there is such a pattern in a string, you can use

var HasSpamWords = Regex.IsMatch(text, @"\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b");

To return all occurrences in a string, you can use

var results = Regex.Matches(text, @"\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b")
    .Cast<Match>()
    .Select(x => x.Value)
    .ToList();

See the C# demo.

Getting the length of each match is easy: Match.Length gives it directly, so you can use .Select(x => x.Length). If you need the length of the string with all special chars removed, simply use .Select(x => x.Value.Count(c => char.IsLetter(c))) (see this C# demo).
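For the masking problem raised in Answer 3, the two lengths can be applied directly in a replacement. This is a minimal sketch, assuming a word-specific pattern with [\W_]+ between the letters; the pattern, the sample text, and the variable names are illustrative and not taken from the answers.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var regex = new Regex(@"\bs[\W_]+h[\W_]+i[\W_]+t\b", RegexOptions.IgnoreCase);
var text = "well s------h------i--------t happens";

// One asterisk per matched character: the overall text length is preserved.
var maskedFull = regex.Replace(text, m => new string('*', m.Length));

// One asterisk per letter: separators are dropped from the count.
var maskedLetters = regex.Replace(
    text, m => new string('*', m.Value.Count(c => char.IsLetter(c))));

Console.WriteLine(maskedFull);
Console.WriteLine(maskedLetters); // well **** happens
```

The MatchEvaluator overload of Regex.Replace receives each Match, so the replacement can be sized per match instead of per pattern.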


4 answers

Answer 1
2 2013-06-20 09:52:18

Answer 2
2 2013-06-20 10:20:52

Answer 3
2 2013-06-20 11:51:03

Answer 4
0 2022-05-09 08:09:46
