Java regular expression for repeated letters

I can't find a regex that matches repeated letters. I want to use regex to filter out spam mail. For example, I want to detect "spam" and "viagra" in strings such as: "xxxSpAmyyy", "xxxSPAMyyy", "xxxvI a Gr AA yyy", "xxxV iiA gR a xxx"

Do you have any suggestions for a good way to do this?

How about searching for something like this?

"v.{0,3}i.{0,3}a.{0,3}g.{0,3}r.{0,3}a"

See java.util.regex.Pattern.


Code:

This leaves space for 0 to 3 characters between characters. I did not compile the following, but it "should work."

import java.util.regex.Pattern;

String[] strings = new String[] { "xxxV iiA gR a xxx" };
final Pattern spamPattern = makePattern("viagra");
for (String s : strings) {
    boolean isSpam = spamPattern.matcher(s).find();
    if (isSpam) {
        System.out.println("Spam: " + s);
    }
}
...
Pattern makePattern(String cusWord) {
    cusWord = cusWord.toLowerCase();
    StringBuilder sb = new StringBuilder();
    sb.append("(?i)"); // Case-insensitive setting.
    for (int i = 0; i < cusWord.length(); ) {
        int cp = cusWord.codePointAt(i);
        i += Character.charCount(cp);
        if ('o' == cp) {
            sb.append("[o0]");
        } else if ('l' == cp) {
            sb.append("[l1]");
        } else {
            sb.appendCodePoint(cp);
        }
        sb.append(".{0,3}"); // 0 - 3 occurrences of any char.
    }
    return Pattern.compile(sb.toString());
}

This ignores case and matches the letters whether they are adjacent or separated by other characters:

"(?i).{0,}v.{0,}i.{0,}a.{0,}g.{0,}r.{0,}a.{0,}"

If you know the maximum number of characters that can appear between the letters, use .{0,max_distance} instead of .{0,} (which is equivalent to .*).

UPDATE:

It also works when letters are duplicated. I tried it:

    String str = "xxxV iiA gR a xxx";

    if(str.matches("(?i).{0,}v.{0,}i.{0,}a.{0,}g.{0,}r.{0,}a.{0,}")){
        System.out.println("Yes");
    }
    else{
        System.out.println("No");
    }

This prints Yes.
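As a rough illustration of why a bounded gap may be preferable, the sketch below compares the unbounded pattern with a bounded one. The class name GapDemo, the helper hit, the limit of 3, and the harmless sample sentence are all assumptions made for this example. It uses find() rather than matches(), so the leading and trailing .{0,} are not needed.

```java
import java.util.regex.Pattern;

class GapDemo {
    // Unbounded gaps between the letters (.{0,} behaves like .*).
    static final Pattern UNBOUNDED =
            Pattern.compile("(?i)v.{0,}i.{0,}a.{0,}g.{0,}r.{0,}a");

    // Bounded gaps; 3 is an assumed value for max_distance.
    static final Pattern BOUNDED =
            Pattern.compile("(?i)v.{0,3}i.{0,3}a.{0,3}g.{0,3}r.{0,3}a");

    // find() searches for the pattern anywhere in the input.
    static boolean hit(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        String spam = "xxxV iiA gR a xxx";
        String harmless = "Very nice day for a great party again";

        System.out.println("unbounded / spam:     " + hit(UNBOUNDED, spam));
        System.out.println("bounded   / spam:     " + hit(BOUNDED, spam));
        System.out.println("unbounded / harmless: " + hit(UNBOUNDED, harmless));
        System.out.println("bounded   / harmless: " + hit(BOUNDED, harmless));
    }
}
```

Both patterns flag the spam sample, but only the unbounded one also flags the ordinary sentence, because the letters v, i, a, g, r, a happen to occur in that order somewhere in it.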

I think you're on the wrong track. Spam filtering is closely related to machine learning. I'd suggest reading about Bayesian spam filtering.

If you expect spam with misspelled words (and other kinds of garbage), I'd suggest filtering based on n-grams rather than on whole words.
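To make the n-gram idea concrete, here is a minimal sketch of extracting character trigrams. It covers only the feature-extraction step, not a Bayesian classifier. The class name NGramDemo, the method charNGrams, and the normalisation (lower-casing and dropping everything except ASCII letters and digits) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class NGramDemo {
    // Returns the overlapping character n-grams of the text after
    // lower-casing it and removing everything except a-z and 0-9.
    static List<String> charNGrams(String text, int n) {
        String normalized = text.toLowerCase(Locale.ROOT)
                                .replaceAll("[^a-z0-9]", "");
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= normalized.length(); i++) {
            grams.add(normalized.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Obfuscated word from the question.
        System.out.println(charNGrams("V iiA gR a", 3));
        // Clean word for comparison.
        System.out.println(charNGrams("viagra", 3));
    }
}
```

The obfuscated and the clean spelling share the trigrams iag, agr and gra, which is the kind of overlap an n-gram based filter could learn from even when the whole word never appears verbatim.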

Did you try any regex?

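Something like \\w*[sSpPaAmM]+\\w* could be a starting point. Note that [sSpPaAmM] is a character class, so this matches any word containing at least one of those letters, not only "spam".

You can test your regex on this site: http://www.regexplanet.com/advanced/java/index.html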
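The sketch below shows how the character-class pattern behaves on an ordinary word, next to one possible tighter alternative that requires the letters in order, each repeated one or more times. The alternative pattern (?i)s+p+a+m+ is not from the answer above; it, the class name RepeatDemo, and the helper contains are assumptions for illustration.

```java
import java.util.regex.Pattern;

class RepeatDemo {
    // Character-class pattern as suggested in the answer.
    static final Pattern CHAR_CLASS =
            Pattern.compile("\\w*[sSpPaAmM]+\\w*");

    // Assumed alternative: s, p, a, m in order, each one or more times,
    // case-insensitive.
    static final Pattern REPEATED = Pattern.compile("(?i)s+p+a+m+");

    // find() searches for the pattern anywhere in the input.
    static boolean contains(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(contains(CHAR_CLASS, "map"));           // ordinary word
        System.out.println(contains(REPEATED, "map"));
        System.out.println(contains(REPEATED, "xxxSpAmyyy"));      // from the question
        System.out.println(contains(REPEATED, "xxxSSPPAAMMyyy"));  // repeated letters
    }
}
```

The sequence pattern handles mixed case and repeated letters but not characters inserted between the letters, so it would still need something like the gap approach from the other answers for inputs such as "vI a Gr AA".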

You could try using positive lookaheads:

(?=.*v)(?=.*i)(?=.*a)(?=.*g)(?=.*r)(?=.*a).*

Edit:

(?=.*v.*i.*a.*g.*r.*a.*).*
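The two lookahead patterns above behave differently: the first only checks that each letter occurs somewhere, in any order, while the edited one requires them in sequence. A small sketch of the difference follows. The (?i) flag, the class name LookaheadDemo, and the sample sentence "a giraffe video" are assumptions added for this example.

```java
import java.util.regex.Pattern;

class LookaheadDemo {
    // One lookahead per letter: each letter must occur somewhere,
    // but the order is not checked.
    static final Pattern ANY_ORDER = Pattern.compile(
            "(?i)(?=.*v)(?=.*i)(?=.*a)(?=.*g)(?=.*r)(?=.*a).*");

    // Single lookahead: the letters must occur in this order.
    static final Pattern IN_ORDER = Pattern.compile(
            "(?i)(?=.*v.*i.*a.*g.*r.*a.*).*");

    // matches() requires the pattern to cover the whole input.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String spam = "xxxV iiA gR a xxx";
        String harmless = "a giraffe video";

        System.out.println(matches(ANY_ORDER, spam));
        System.out.println(matches(IN_ORDER, spam));
        System.out.println(matches(ANY_ORDER, harmless));
        System.out.println(matches(IN_ORDER, harmless));
    }
}
```

The harmless sentence contains all five distinct letters, so the any-order pattern accepts it, while the in-order pattern rejects it because no "i" followed by "a" comes after the "v".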
