
Regex for detecting emails in text

I have a regex in C# that detects emails in text, and I then wrap each match in an anchor tag with a mailto: href to make it clickable. But if the email is already inside an anchor tag, the regex still detects it and the subsequent code wraps another anchor tag around it. Is there any way in regex to skip emails that are already in an anchor tag?

The regex code in C# is:

string sRegex = @"([\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?)";

Regex Regx = new Regex(sRegex, RegexOptions.IgnoreCase);

and the sample text is:

string sContent = "ttt <a href='mailto:someone@example.com'>someemail@mail.com</a> abc email@email.com";

and the desired output is:

"ttt <a href='mailto:someone@example.com'>someemail@mail.com</a> abc <a href='mailto:email@email.com'>email@email.com</a>";

So the whole point is that the regex should only detect valid emails that are not already clickable: it should match neither the text inside an anchor tag nor the anchor tag's href value.

The regex above detects every email in the text, which is not what I want.

Could you use a negative lookbehind to test for mailto: ?

(?<!mailto\\:)([\\w-]+(.[\\w-]+)@([a-z0-9-]+(.[a-z0-9-]+)?.[a-z]{2,6}|(\\d{1,3}.){3}\\d{1,3})(:\\d{4})?)

This should match any email that is not immediately preceded by mailto:

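I think what is happening is that the unescaped . in ([\\w\\-]+(.[\\w-])+) is matching too much, because an unescaped . matches any character. Did you mean to use \\. rather than . ?

After escaping the . and anchoring the match at a word boundary with \b, the following code produces: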

someemail@mail.com
email@email.com


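using System.Diagnostics;               // Debug.WriteLine
using System.Text.RegularExpressions;   // Regex, Match, MatchCollection

public void Test()
{
    // \b anchors the match at the start of a word, so the lookbehind cannot be bypassed by starting one character later.
    // Every literal dot is escaped as \. so that it no longer matches arbitrary characters.
    Regex pattern = new Regex(@"\b(?<!mailto:)([\w\-]+(\.[\w\-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?)", RegexOptions.IgnoreCase);
    MatchCollection matchCollection = pattern.Matches("ttt <a href='mailto:someone@example.com'>someemail@mail.com</a> abc email@email.com");
    foreach (Match match in matchCollection)
    {
        Debug.WriteLine(match);
    }
}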

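A real-world implementation of what you seem to be trying to do might look more like this. It also rejects an email that is immediately followed by </a, so the anchor text is skipped as well and only email@email.com is matched:

Regex pattern = new Regex(@"(?<!mailto\:)\b[\w\-]+@[a-z0-9-]+(\.[a-z0-9\-]+)*\.[a-z]{2,8}\b(?!\<\/a)");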
MatchCollection matchCollection = pattern.Matches("ttt <a href='mailto:so1meone@example.com'>someemail@mail.com</a> abc email@email.com");
foreach (Match match in matchCollection)
{
    Debug.WriteLine(match);
}
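
To get from matching to the desired output in the question, the match can be wrapped in an anchor tag with Regex.Replace. The following is only a sketch under assumptions: the class name EmailLinker and the method name Linkify are made up for illustration, the email pattern is a simplified one rather than the asker's exact regex, and the extra lookbehind (?<![\w.\-]) is an addition so that a match cannot start in the middle of a dotted local part such as first.last@example.com. Regular expressions do not parse HTML, so an email nested more deeply inside an anchor (for example inside a <b> within the <a>) would still be wrapped.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original question or answers.
public static class EmailLinker
{
    // (?<![\w.\-])  the match must begin at the start of the local part
    // (?<!mailto:)  skip addresses that are an href value
    // (?!\s*</a)    skip addresses that are directly followed by a closing anchor tag
    private static readonly Regex EmailPattern = new Regex(
        @"(?<![\w.\-])(?<!mailto:)[\w\-]+(\.[\w\-]+)*@[a-z0-9\-]+(\.[a-z0-9\-]+)*\.[a-z]{2,8}\b(?!\s*</a)",
        RegexOptions.IgnoreCase);

    public static string Linkify(string content)
    {
        // $0 is the .NET substitution for the whole match.
        return EmailPattern.Replace(content, "<a href='mailto:$0'>$0</a>");
    }
}
```

Calling EmailLinker.Linkify on the sample text from the question should leave the existing anchor untouched and wrap only the trailing email@email.com.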

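Sorry, you are correct. I hadn't considered that the negative lookbehind on its own isn't enough: the engine can simply start the match one character later (at omeone@example.com), where the preceding text is no longer mailto:. Requiring a word boundary before the lookbehind prevents that:

\\b(?<!mailto\\:)([\\w-]+(.[\\w-]+)@([a-z0-9-]+(.[a-z0-9-]+)?.[a-z]{2,6}|(\\d{1,3}.){3}\\d{1,3})(:\\d{4})?)

should work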
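
The difference the word boundary makes can be checked with a small experiment. This is a sketch only: LookbehindDemo and Find are hypothetical names, and the email pattern is a simplified stand-in for the one in the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class, used only to compare the two prefixes.
public static class LookbehindDemo
{
    // Simplified email pattern for illustration.
    private const string Email = @"[\w-]+(\.[\w-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*\.[a-z]{2,6}";

    // Returns every match of (prefix + Email) in text.
    public static string[] Find(string prefix, string text)
    {
        return Regex.Matches(text, prefix + Email, RegexOptions.IgnoreCase)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

With the text <a href='mailto:someone@example.com'>, Find(@"(?<!mailto:)", text) returns omeone@example.com, because the lookbehind only rules out the position directly after mailto:. Find(@"\b(?<!mailto:)", text) returns nothing, since no later position inside someone is a word boundary.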

