How to extract consecutive email addresses from text in C#

I have the following three examples of strings:

string1 = "abcd@efg.com this is just some text. these are just some numbers 123456 xyz@xyz.com asdasd asdad"

string2 = "abcd@efg.com mnop@qrs.com This is just some text. these are just some numbers 123456 xyz@xyz.com asdasd asd"

string3 = "abcd@efg.com mnop@qrs.com uvw@xyz.com This is just some text. these are just some numbers 123456 xyz@xyz.com asdad"

The final output should be a List consisting of all the emails that appear consecutively at the beginning of the string.

Output for string1 - one email address

Output for string3 - three email addresses

The address "xyz@xyz.com" should be ignored, as it appears between some other text. Is there any solution for this? The existing method returns all the addresses:

    private List<string> ExtractEmails(string strStringGoesHere)
    {
        List<string> lstExtractedEmails = new List<string>();
        Regex reg = new Regex(@"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}", RegexOptions.IgnoreCase);
        Match match;
        for (match = reg.Match(strStringGoesHere); match.Success; match = match.NextMatch())
        {
            if (!(lstExtractedEmails.Contains(match.Value)))
            {
                lstExtractedEmails.Add(match.Value);
            }
        }
        return lstExtractedEmails;
    }

You may use the \G anchor, which matches only at the start of the string and then at the position where the previous successful match ended:

@"(?i)\G\s*([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6})"

See this demo.

Details

  • (?i) - inline case-insensitive flag
  • \G - anchor that matches only at the start of the string and at the end of each successful match
  • \s* - zero or more whitespace characters
  • ([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}) - Group 1, matching an email-like substring (other patterns can be used here; generally, it is something like \S+@\S+\.\S+).
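To make the effect of \G concrete, here is a minimal sketch that runs the same email-like pattern with and without the anchor. The class and method names (AnchorContrast, AllEmails, LeadingEmails) are hypothetical and only serve the illustration; the pattern itself is the one from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only to contrast the two patterns.
public static class AnchorContrast
{
    // Email-like pattern taken from the question.
    private const string Email = @"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}";

    // Without \G: the regex engine may skip over any text between matches.
    public static string[] AllEmails(string input) =>
        Regex.Matches(input, Email, RegexOptions.IgnoreCase)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    // With \G: every match must begin exactly where the previous one ended
    // (or at the start of the input), so only the leading run is returned.
    public static string[] LeadingEmails(string input) =>
        Regex.Matches(input, @"\G\s*(" + Email + ")", RegexOptions.IgnoreCase)
             .Cast<Match>()
             .Select(m => m.Groups[1].Value)
             .ToArray();
}
```

For string1 from the question, AllEmails should pick up both abcd@efg.com and xyz@xyz.com, while LeadingEmails should stop after abcd@efg.com because the next token, "this", is not an email. If the input starts with non-email text, LeadingEmails returns nothing at all.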

C# demo:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

var strs = new List<string> {"abcd@efg.com this is just some text. these are just some numbers 123456 xyz@xyz.com asdasd asdad",
    "abcd@efg.com mnop@qrs.com This is just some text. these are just some numbers 123456 xyz@xyz.com asdasd asd",
    "abcd@efg.com mnop@qrs.com uvw@xyz.com This is just some text. these are just some numbers 123456 xyz@xyz.com asdad" };
foreach (var s in strs)
{
    // \G keeps each match contiguous with the previous one,
    // so matching stops at the first token that is not an email.
    var results = Regex.Matches(s, @"(?i)\G\s*([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6})")
        .Cast<Match>()
        .Select(x => x.Groups[1].Value);
    Console.WriteLine(string.Join(", ", results));
}

Results:

abcd@efg.com
abcd@efg.com, mnop@qrs.com
abcd@efg.com, mnop@qrs.com, uvw@xyz.com
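Since the question asks for a List, the same pattern can also be dropped into a method shaped like the original ExtractEmails. The following is a sketch under that assumption, not the answerer's code: the names EmailExtractor and ExtractLeadingEmails are hypothetical, and the duplicate check is carried over from the method in the question.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical wrapper; mirrors the shape of ExtractEmails from the question.
public static class EmailExtractor
{
    // \G forces each match to start where the previous one ended,
    // so only the run of emails at the start of the text is matched.
    private static readonly Regex LeadingEmail = new Regex(
        @"\G\s*([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6})",
        RegexOptions.IgnoreCase);

    public static List<string> ExtractLeadingEmails(string text)
    {
        var emails = new List<string>();
        for (Match m = LeadingEmail.Match(text); m.Success; m = m.NextMatch())
        {
            // Group 1 holds the address without the leading whitespace.
            string email = m.Groups[1].Value;

            // Skip duplicates, as the original method does.
            if (!emails.Contains(email))
            {
                emails.Add(email);
            }
        }
        return emails;
    }
}
```

Calling EmailExtractor.ExtractLeadingEmails(string3) should then give a list with the three leading addresses, leaving out the later xyz@xyz.com. If duplicates at the start of the string should be kept, the Contains check can simply be removed.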
