
Why isn't this regexp backtrack working?

I have tried to use the following kind of regex

([_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4}))|(FakeEmail:)|(Email:)|(\1\2)|(\1\3)

(Pretend that \1 is the email regex group, \2 is FakeEmail:, and \3 is Email:, because I didn't count the parens to figure out the real grouping.)

What I am trying to do is say "Find the word email: and if you find it, pick up any email address following the word."

I got that email regex from another question on Stack Overflow.

My test string could be something like:

    "This guy is spamming me from
FakeEmail: fakeemailAdress@someplace.com
 but here is is real info:
Email: testemail@someplace.com"

Any tips? Thanks

I'm either quite confused as to what you're trying to do, or your Regex is just very wrong. In particular:

Why do you have Email: at the end, instead of the beginning - to match your example?

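Why are Email: and \1\2 separated by pipe characters, almost as if they were fields? The pipe compiles the pattern as an alternation (OR): find the email pattern, OR the word "Email:", OR whatever \1\2 ends up meaning, since it is out of context here. Note also that a backreference such as \1 matches the exact text that group 1 captured earlier in the same match; it does not re-run the group's pattern.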
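To make the difference between alternation and sequence concrete, here is a minimal sketch. The class name, the helper `findAll`, and the shortened `EMAIL` stand-in pattern are all made up for illustration; only `Pattern` and `Matcher` are real JDK classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and the shortened pattern are illustrative only.
class AlternationDemo {
    // Shortened stand-in for the question's long email pattern (an assumption).
    static final String EMAIL = "[_a-z0-9.-]+@[a-z0-9.-]+";

    // Collects the full text of every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "FakeEmail: a@x.com\nEmail: b@y.com";
        // Alternation: a label OR an address, each reported as a separate match.
        System.out.println(findAll("(" + EMAIL + ")|(FakeEmail:)|(Email:)", input));
        // Sequence: a label followed by an address, reported as one match.
        System.out.println(findAll("Email: (" + EMAIL + ")", input));
    }
}
```

With the alternation, the labels and the addresses come back as four unrelated matches, so nothing ties an address to the label in front of it. The sequence form returns each label together with its address. It also matches inside "FakeEmail:", which the later answers deal with.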

If all you're trying to do is match something like Email: testemail@someplace.com , you don't need any backreferences (what the question calls backtracking).

Something like this is probably all you need:

Email:\s+([_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4}))

Also, I'd strongly advise against trying to validate an email address so strictly. You may want to read http://haacked.com/archive/2007/08/21/i-knew-how-to-validate-an-email-address-until-i.aspx . I'd simplify the pattern to something more along the lines of:

Email:\s+(\S+)@(\S+\.\S+)
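As a rough sketch of how a loose pattern like this behaves in Java, the example below uses a variant with a single capture group around the whole address. The class name, the helper `extract`, and the sample input are assumptions for illustration, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names and the sample input are illustrative only.
class LooseEmailDemo {
    // Label, whitespace, then "something@something.something" with no spaces.
    static final Pattern LOOSE = Pattern.compile("Email:\\s+(\\S+@\\S+\\.\\S+)");

    // Returns every address captured after an "Email:" label.
    static List<String> extract(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = LOOSE.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "Email: testemail@someplace.com\nEmail: not-an-address";
        // The second line has no '@', so only the first address is captured.
        System.out.println(extract(input));
    }
}
```

The pattern makes no attempt to validate the address. It only requires an @ and a dot with no whitespace in between, which is usually enough for pulling addresses out of labeled text.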

Try:

(Fake)?Email: *([_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4}))

Captured group \1 will be unset (Matcher.group(1) returns null in Java) if it's a real email and will contain "Fake" if it's a fake email, while \2 will be the email itself.
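Here is a small sketch of reading those two groups in Java. The class name, the helper `classify`, and the all-lowercase sample addresses are assumptions chosen so that the pattern works without any flags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names and the sample input are illustrative only.
class FakeFlagDemo {
    static final Pattern LABELED = Pattern.compile(
        "(Fake)?Email: *([_a-z0-9-]+(\\.[_a-z0-9-]+)*@[a-z0-9-]+(\\.[a-z0-9-]+)*(\\.[a-z]{2,4}))");

    // Tags each address as "fake" or "real" depending on whether group 1 took part in the match.
    static List<String> classify(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = LABELED.matcher(input);
        while (m.find()) {
            // group(1) is null when the optional (Fake)? group matched nothing.
            String kind = (m.group(1) == null) ? "real" : "fake";
            out.add(kind + ": " + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "FakeEmail: spam@someplace.com\nEmail: testemail@someplace.com";
        System.out.println(classify(input));
    }
}
```

Checking for null rather than an empty string matters here, because calling a method such as isEmpty() on the unset group would throw a NullPointerException.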

Do you actually want to capture it if it's FakeEmail though? If you want to capture all Email but ignore all FakeEmail then do:

\bEmail: *([_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4}))

The word boundary prevents the Email bit from matching "FakeEmail".
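The effect of \b can be checked by running the same address pattern with and without it. This is only a sketch: the class name, the helper `extract`, and the sample input are made up for the comparison.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names and the sample input are illustrative only.
class WordBoundaryDemo {
    // The address part of the pattern, shared by both variants.
    static final String ADDRESS =
        "([_a-z0-9-]+(\\.[_a-z0-9-]+)*@[a-z0-9-]+(\\.[a-z0-9-]+)*(\\.[a-z]{2,4}))";

    // Returns every address found after the given label prefix.
    static List<String> extract(String prefix, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(prefix + ADDRESS).matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "FakeEmail: spam@someplace.com\nEmail: testemail@someplace.com";
        // Without \b, "Email:" is also found inside "FakeEmail:".
        System.out.println(extract("Email: *", input));
        // With \b, the position between "Fake" and "Email" is rejected,
        // because both neighbouring characters are word characters.
        System.out.println(extract("\\bEmail: *", input));
    }
}
```

The first call returns both addresses and the second returns only the real one.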

UPDATE: note that your regex only matches lowercase letters, since the character classes contain a-z everywhere but never A-Z. Make sure you compile your regex in Java with the ignore-case flag, i.e.:

Pattern.compile("(Fake)?Email: .....", Pattern.CASE_INSENSITIVE)
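To see what the flag changes, the sketch below compiles one pattern twice, once with no flags and once with Pattern.CASE_INSENSITIVE. The class name, the helper `firstEmail`, and the mixed-case sample address are assumptions for illustration. The pattern is the \bEmail: variant from above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names and the sample input are illustrative only.
class CaseFlagDemo {
    static final String REGEX =
        "\\bEmail: *([_a-z0-9-]+(\\.[_a-z0-9-]+)*@[a-z0-9-]+(\\.[a-z0-9-]+)*(\\.[a-z]{2,4}))";

    // Returns the first captured address, or null when nothing matches.
    static String firstEmail(String input, int flags) {
        Matcher m = Pattern.compile(REGEX, flags).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "Email: TestEmail@SomePlace.com";
        // No flags: the uppercase letters in the address prevent a match.
        System.out.println(firstEmail(input, 0));
        // CASE_INSENSITIVE: a-z in the character classes also matches A-Z.
        System.out.println(firstEmail(input, Pattern.CASE_INSENSITIVE));
    }
}
```

Keep in mind that the flag applies to the whole pattern, so the label itself would then also match as "email:" or "EMAIL:".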

You can use the following code to match a wider range of email addresses:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String text = "This guy is spamming me from\n" +
    "FakeEmail: fakeemail+Adress@someplace.com\n" +
    "fakeEmail: \n" +
    "fakeemail@someplace.com\n" +
    "but here is is real info:\n" +
    "Email: test.email+info@someplace.com\n";

// (?i) ignores case; \\s* after the label also matches line breaks.
Matcher m = Pattern.compile("(?i)Email:\\s*([_a-z\\d\\+-]+(\\.[_a-z\\d\\+-]+)*@[a-z\\d-]+(\\.[a-z\\d-]+)*(\\.[a-z]{2,4}))").matcher(text);
while (m.find())
    System.out.printf("Email is [%s]%n", m.group(1));

This will match an email address:

  • appearing on a different line from the label, because \s* also matches line breaks (the (?s) flag is not needed, since the pattern contains no unescaped .)
  • regardless of case, by using (?i)
  • with a period . in the local part
  • with a plus sign + in the local part

Because Email: also matches inside FakeEmail:, the fake addresses are reported as well.

Output of the above code:

Email is [fakeemail+Adress@someplace.com]
Email is [fakeemail@someplace.com]
Email is [test.email+info@someplace.com]


 