
Java & Regex: Matching a substring that is not preceded by specific characters

This is one of those questions that has been asked and answered hundreds of times over, but I'm having a hard time adapting other solutions to my needs.

In my Java application I have a method for censoring bad words in chat messages. It works for most of my words, but there is one particular (and popular) curse word that I can't seem to get rid of. The word is "faen" (which is simply modern slang for "satan" in the language in question).

Using the pattern "fa+e+n" for matching multiple A's and E's actually works. However, in this language, the word for "that couch" or "that sofa" is "sofaen". I've tried a lot of different approaches, using variations of [^so] and (?!=so), but so far I haven't been able to find a way to match one and not the other.
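For reference, here is a small sketch of what the constructs mentioned above actually do in java.util.regex. The class name LookaroundDemo and the helper found are made up for illustration. The point is that [^so] is a character class that consumes one character, (?!=so) is a negative lookahead for the literal text "=so", and the negative lookbehind syntax is (?<!so).

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original application.
class LookaroundDemo {
    // True if the regex matches anywhere in the input.
    public static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // [^so] must consume ONE character that is neither 's' nor 'o',
        // so a message that starts with the word is missed.
        System.out.println(found("[^so]fa+e+n", "faen"));

        // (?!=so) asserts that the text ahead is not "=so", which is
        // always true at the 'f', so "sofaen" is still matched.
        System.out.println(found("(?!=so)fa+e+n", "sofaen"));

        // (?<!so) asserts that the text just before the 'f' is not "so".
        System.out.println(found("(?<!so)fa+e+n", "faen"));
        System.out.println(found("(?<!so)fa+e+n", "sofaen"));
    }
}
```

The lookbehind only rules out the exact prefix "so". Any other word that happens to contain the letters would still be matched.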

The real goal here is to be able to match the bad words, regardless of the number of vowels, and regardless of any non-letters in between the components of the word.

Here are a few examples of what I'm trying to do:

"String containing faen"                        Should match
"String containing sofaen"                      Should not match
"Non-letter-censored string with f-a@a-e.n"     Should match
"Non-letter-censored string with sof-a@a-e.n"   Should not match

Any tips to set me off in the right direction on this?

You want something like \bf[^\s]*a[^\s]*e[^\s]*n\b . The * quantifiers allow zero or more non-whitespace characters between the letters, and the leading \b keeps the match from starting in the middle of a word such as "sofaen". Note that this is the regular expression itself; in a Java string literal every backslash has to be doubled, so you need to use "\\bf[^\\s]*a[^\\s]*e[^\\s]*n\\b" .

Note also that this isn't perfect, but does handle the situations that you have suggested.
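As a rough sketch, the four examples from the question could be checked like this. It uses a stricter variant of the boundary idea rather than the exact pattern above: letter-based lookarounds instead of \b, and separators limited to characters that are neither letters nor whitespace. The class name CensorSketch, the method names and the "****" replacement are all assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names and replacement text are illustrative only.
class CensorSketch {
    // Zero or more characters that are neither letters nor whitespace
    // (for example '-', '@', '.').
    private static final String SEP = "[^\\p{L}\\s]*";

    // f, one or more a's, one or more e's, then n, with optional separators
    // between the letters. The lookarounds reject a match that is embedded
    // in a longer run of letters, such as "sofaen".
    private static final Pattern BAD_WORD = Pattern.compile(
            "(?<!\\p{L})f(?:" + SEP + "a)+(?:" + SEP + "e)+" + SEP + "n(?!\\p{L})",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    public static boolean containsBadWord(String message) {
        return BAD_WORD.matcher(message).find();
    }

    public static String censor(String message) {
        return BAD_WORD.matcher(message).replaceAll("****");
    }

    public static void main(String[] args) {
        String[] samples = {
            "String containing faen",
            "String containing sofaen",
            "Non-letter-censored string with f-a@a-e.n",
            "Non-letter-censored string with sof-a@a-e.n"
        };
        for (String sample : samples) {
            System.out.println(containsBadWord(sample) + "  " + censor(sample));
        }
    }
}
```

Because the separator class excludes letters, an unrelated word that merely contains f, a, e and n in that order with other letters in between is not matched. Digits still count as separators here, which may or may not be what you want.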

It's a terrible idea to begin with. Do you think your users would write something like "f-aeen" to avoid your filter, but would not come up with "ffaen" or "-faen" or whatever variation you did not prepare for? This is a race you cannot win, and the real loser is usability.
