I can't find a regex that matches repeated letters. I want to use a regex to filter out spam mail. For example, I want to detect "spam" and "viagra" in these strings: "xxxSpAmyyy", "xxxSPAMyyy", "xxxvI a Gr AA yyy", "xxxV iiA gR a xxx"
Do you have any suggestions for a good way to do this?
How about searching for something like this?
"v.{0,3}i.{0,3}a.{0,3}g.{0,3}r.{0,3}a"
This leaves room for 0 to 3 arbitrary characters between consecutive letters. See the documentation for java.util.regex.Pattern.
Code (I did not compile the following, but it should work):
// Requires: import java.util.regex.Pattern;
String[] strings = new String[] { "xxxV iiA gR a xxx" };
final Pattern spamPattern = makePattern("viagra");
for (String s : strings) {
    boolean isSpam = spamPattern.matcher(s).find();
    if (isSpam) {
        System.out.println("Spam: " + s);
    }
}
...
Pattern makePattern(String cusWord) {
    cusWord = cusWord.toLowerCase();
    StringBuilder sb = new StringBuilder();
    sb.append("(?i)"); // Case-insensitive setting.
    // Iterate by code point so that characters outside the BMP are handled.
    for (int i = 0; i < cusWord.length(); ) {
        int cp = cusWord.codePointAt(i);
        i += Character.charCount(cp);
        if ('o' == cp) {
            sb.append("[o0]"); // Also accept the digit 0 for the letter o.
        } else if ('l' == cp) {
            sb.append("[l1]"); // Also accept the digit 1 for the letter l.
        } else {
            sb.appendCodePoint(cp);
        }
        sb.append(".{0,3}"); // 0 - 3 occurrences of any char.
    }
    return Pattern.compile(sb.toString());
}
This pattern ignores case and matches the letters whether they are directly adjacent or separated by other characters:
"(?i).{0,}v.{0,}i.{0,}a.{0,}g.{0,}r.{0,}a.{0,}"
If you know the maximum number of characters that can appear between the letters, you can use .{0,max_distance} instead of .{0,}
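To illustrate why a bounded gap can matter, here is a minimal sketch that compares the unbounded pattern with a bounded one. The class name GapPatternDemo, the helper withMaxGap, and the harmless sample sentence are made up for this example, and the helper assumes the word consists of plain letters with no regex metacharacters.

```java
import java.util.regex.Pattern;

class GapPatternDemo {
    // Hypothetical helper: joins the letters of `word` with ".{0,maxGap}".
    // Assumes `word` contains only plain letters (nothing that needs escaping).
    static Pattern withMaxGap(String word, int maxGap) {
        StringBuilder sb = new StringBuilder("(?i)"); // case-insensitive
        for (int i = 0; i < word.length(); i++) {
            if (i > 0) {
                sb.append(".{0,").append(maxGap).append('}');
            }
            sb.append(word.charAt(i));
        }
        return Pattern.compile(sb.toString());
    }

    public static void main(String[] args) {
        String spam = "xxxV iiA gR a xxx";
        String harmless = "Have I already got your mail?";

        Pattern unbounded =
            Pattern.compile("(?i).{0,}v.{0,}i.{0,}a.{0,}g.{0,}r.{0,}a.{0,}");
        Pattern bounded = withMaxGap("viagra", 3);

        System.out.println(unbounded.matcher(spam).matches());     // true
        System.out.println(unbounded.matcher(harmless).matches()); // true (false positive)
        System.out.println(bounded.matcher(spam).find());          // true
        System.out.println(bounded.matcher(harmless).find());      // false
    }
}
```

With unlimited gaps, any text that happens to contain v, i, a, g, r, a in that order is flagged, so a small maximum distance is probably the safer choice for longer messages.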
UPDATE:
It also works when letters are duplicated; I tried it:
String str = "xxxV iiA gR a xxx";
if (str.matches("(?i).{0,}v.{0,}i.{0,}a.{0,}g.{0,}r.{0,}a.{0,}")) {
    System.out.println("Yes");
} else {
    System.out.println("No");
}
This prints Yes
I think you're on the wrong track. Spam filtering is closely related to machine learning. I'd suggest reading about Bayesian spam filtering.
If you expect spam mail with misspelled words (and other kinds of garbage), I'd suggest filtering based not on entire words but on n-grams.
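As a rough illustration of the n-gram idea, the sketch below extracts character n-grams that could serve as features for a Bayesian classifier. The class name NGramDemo, the method charNGrams, and the normalisation step (lower-casing and dropping everything except a-z and 0-9) are assumptions made for this example, not part of any particular spam-filtering library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class NGramDemo {
    // Hypothetical helper: returns all overlapping character n-grams of the
    // normalised text. Normalisation here is an assumption for the example:
    // lower-case the text and remove every character outside a-z and 0-9.
    static List<String> charNGrams(String text, int n) {
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= cleaned.length(); i++) {
            grams.add(cleaned.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Obfuscated and plain spellings share most of their trigrams.
        System.out.println(charNGrams("vI a Gr AA", 3)); // [via, iag, agr, gra, raa]
        System.out.println(charNGrams("viagra", 3));     // [via, iag, agr, gra]
    }
}
```

Because the obfuscated spelling produces nearly the same trigrams as the plain word, a classifier trained on n-gram counts can pick it up without a hand-written pattern per word.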
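Did you try any regex?
Something like \\w*[sSpPaAmM]+\\w*
is a starting point, but note that the character class matches any run of those letters in any order (it also matches "mama" or "pass"), not the sequence s, p, a, m specifically.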
You can test your regex on this site: http://www.regexplanet.com/advanced/java/index.html
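To see what the character-class pattern above actually accepts, here is a small check. The class name CharClassDemo and the ORDERED pattern are assumptions for this example; ORDERED is one possible way to require the letters in sequence while tolerating repeats, not the pattern from the answer.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // The pattern from the answer: a run of any of the letters s, p, a, m.
    static final Pattern LOOSE = Pattern.compile("\\w*[sSpPaAmM]+\\w*");

    // Hypothetical alternative: the letters in order, each repeated 1+ times.
    static final Pattern ORDERED = Pattern.compile("(?i)s+p+a+m+");

    public static void main(String[] args) {
        System.out.println(LOOSE.matcher("xxxSpAmyyy").find());   // true
        System.out.println(LOOSE.matcher("mama").find());         // true (not spam)
        System.out.println(ORDERED.matcher("xxxSpAmyyy").find()); // true
        System.out.println(ORDERED.matcher("ssPPaamm").find());   // true
        System.out.println(ORDERED.matcher("mama").find());       // false
    }
}
```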
You could try using positive look-aheads:
(?=.*v)(?=.*i)(?=.*a)(?=.*g)(?=.*r)(?=.*a).*
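Edit: the pattern above only checks that each letter occurs somewhere in the string. To require the letters in order, use a single look-ahead: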
(?=.*v.*i.*a.*g.*r.*a.*).*
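The difference between the two look-ahead variants can be checked with a short sketch. The class name LookaheadDemo and the sample word "gravity" are made up for this example, and the (?i) flag is added here as an assumption so that mixed-case input matches; the patterns in the answer do not include it.

```java
import java.util.regex.Pattern;

class LookaheadDemo {
    // One look-ahead per letter: each letter must occur somewhere, in any order.
    static final Pattern ANY_ORDER = Pattern.compile(
        "(?i)(?=.*v)(?=.*i)(?=.*a)(?=.*g)(?=.*r)(?=.*a).*");

    // A single look-ahead: the letters must occur in this order.
    static final Pattern IN_ORDER = Pattern.compile(
        "(?i)(?=.*v.*i.*a.*g.*r.*a.*).*");

    public static void main(String[] args) {
        String spam = "xxxV iiA gR a xxx";
        String harmless = "gravity"; // contains v, i, a, g, r, but not in order

        System.out.println(ANY_ORDER.matcher(spam).matches());     // true
        System.out.println(IN_ORDER.matcher(spam).matches());      // true
        System.out.println(ANY_ORDER.matcher(harmless).matches()); // true (false positive)
        System.out.println(IN_ORDER.matcher(harmless).matches());  // false
    }
}
```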