Parse email header with Regex in C#

I've got a webhook posting to a form on my web application and I need to parse out the email header addresses.

Here is the source text:

Thread-Topic: test subject
Thread-Index: AcwE4mK6Jj19Hgi0SV6yYKvj2/HJbw==
From: "Lastname, Firstname" <firstname_lastname@domain.com>
To: <testto@domain.com>, testto1@domain.com, testto2@domain.com
Cc: <testcc@domain.com>, test3@domain.com
X-OriginalArrivalTime: 27 Apr 2011 13:52:46.0235 (UTC) FILETIME=[635226B0:01CC04E2]

I'm looking to pull out the following:

<testto@domain.com>, testto1@domain.com, testto2@domain.com

I've been struggling with Regex all day without any luck.

Contrary to some of the posts here, I have to agree with mmutz: you cannot parse emails with a regex. See this section of the RFC:

http://tools.ietf.org/html/rfc2822#section-3.4.1

3.4.1. Addr-spec specification

An addr-spec is a specific Internet identifier that contains a locally interpreted string followed by the at-sign character ("@", ASCII value 64) followed by an Internet domain.

The idea of "locally interpreted" means that only the receiving server is expected to be able to parse it.

If I were going to try and solve this, I would find the "To" line contents, break them apart at the commas, and attempt to parse each segment with System.Net.Mail.MailAddress.

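using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string input = @"Thread-Topic: test subject
Thread-Index: AcwE4mK6Jj19Hgi0SV6yYKvj2/HJbw==
From: ""Lastname, Firstname"" <firstname_lastname@domain.com>
To: <testto@domain.com>, ""Yes, this is valid""@[emails are hard to parse!], testto1@domain.com, testto2@domain.com
Cc: <testcc@domain.com>, test3@domain.com
X-OriginalArrivalTime: 27 Apr 2011 13:52:46.0235 (UTC) FILETIME=[635226B0:01CC04E2]";

        // Capture everything after "To:" on its own line (case-insensitive, multiline).
        Regex toline = new Regex(@"(?im-:^To\s*:\s*(?<to>.*)$)");
        string to = toline.Match(input).Groups["to"].Value;

        int from = 0;   // where to start searching for the next comma
        int pos = 0;    // start of the candidate address that has not parsed yet
        int found;
        string test;

        while(from < to.Length)
        {
            // Position of the next comma, or the end of the line if there is none.
            found = (found = to.IndexOf(',', from)) > 0 ? found : to.Length;
            from = found + 1;
            test = to.Substring(pos, found - pos);

            try
            {
                System.Net.Mail.MailAddress addy = new System.Net.Mail.MailAddress(test.Trim());
                Console.WriteLine(addy.Address);
                pos = found + 1;
            }
            catch (FormatException)
            {
                // Not a complete address yet (for example, the comma was inside a
                // quoted string): keep pos and extend the candidate to the next comma.
            }
        }
    }
}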

Output from the above program:

testto@domain.com
"Yes, this is valid"@[emails are hard to parse!]
testto1@domain.com
testto2@domain.com

The RFC 2822-compliant email regex is:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

Just run it over your text and you'll get the email addresses.

Of course, there's always the option of not using regex where regex isn't the best option. But up to you!
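As a rough sketch of what "run it over your text" could look like in C#, the snippet below feeds the pattern above to Regex.Matches. The class and method names (AddressFinder, FindAll) are made up for illustration, and the sketch assumes the pattern is meant to be used case-insensitively, since it only lists lowercase letters. The two double quotes inside the pattern are doubled because of the C# verbatim string.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class AddressFinder
{
    // The pattern from the answer, unchanged apart from "" escaping for the verbatim string.
    const string Rfc2822Pattern = @"(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|""(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*"")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])";

    // Hypothetical helper: returns every substring of text that the pattern matches.
    public static List<string> FindAll(string text)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(text, Rfc2822Pattern, RegexOptions.IgnoreCase))
        {
            found.Add(m.Value);
        }
        return found;
    }
}

class Program
{
    static void Main()
    {
        string header = "From: \"Lastname, Firstname\" <firstname_lastname@domain.com>\n" +
                        "To: <testto@domain.com>, testto1@domain.com, testto2@domain.com\n" +
                        "Cc: <testcc@domain.com>, test3@domain.com";

        foreach (string address in AddressFinder.FindAll(header))
        {
            Console.WriteLine(address);
        }
    }
}
```

Note that this lists every address in the header, including the From and Cc ones, so to get only the recipients you would still need to isolate the To line first.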

You cannot use regular expressions to parse RFC2822 mails, because their grammar contains a recursive production (off the top of my head, it was the one for comments, which can nest: (a (nested) comment)), and that makes the grammar non-regular. Regular expressions (as the name suggests) can only parse regular grammars.

See also "RegEx match open tags except XHTML self-contained tags" for more information.
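To make the nesting point concrete, here is a minimal sketch of how nested comments can be handled with a depth counter, which is exactly the kind of unbounded counting a regular grammar cannot express. The names (CommentStripper, StripComments) are invented for this example, and it deliberately ignores quoted strings and backslash escapes, which a real RFC 2822 parser would have to respect.

```csharp
using System;
using System.Text;

static class CommentStripper
{
    // Removes parenthesised comments, including nested ones, from a header value.
    // Assumption: parentheses inside quoted strings and escaped parentheses are
    // not treated specially in this sketch.
    public static string StripComments(string headerValue)
    {
        var result = new StringBuilder();
        int depth = 0; // current comment nesting level

        foreach (char c in headerValue)
        {
            if (c == '(')
            {
                depth++;            // entering a (possibly nested) comment
            }
            else if (c == ')' && depth > 0)
            {
                depth--;            // leaving one nesting level
            }
            else if (depth == 0)
            {
                result.Append(c);   // outside any comment: keep the character
            }
        }
        return result.ToString();
    }
}

class Program
{
    static void Main()
    {
        string value = "testto@domain.com (a (nested) comment)";
        Console.WriteLine(CommentStripper.StripComments(value).Trim());
    }
}
```

.NET's regex engine does offer balancing groups, an extension that goes beyond regular languages, but a small loop like this is usually easier to read and to get right.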

As Blindy suggests, sometimes you can just parse it out the old-fashioned way.

If you prefer to do that, here is a quick approach assuming the email header text is called 'header':

int start = header.IndexOf("To: ") + "To: ".Length;        // skip past the "To: " prefix
int end = header.IndexOf("Cc: ");
string x = header.Substring(start, end - start).Trim();   // Trim drops the trailing line break

I may be off by a byte on the subtraction but you can very easily test and modify this. Of course you will also have to be certain you always will have a Cc: row in your header or this won't work.
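If relying on a Cc: row is a concern, one possible variation is to walk the header line by line and pick the line that starts with "To:". This is only a sketch: the names (HeaderReader, GetToLine) are hypothetical, and it assumes the To header is not folded across several lines.

```csharp
using System;

static class HeaderReader
{
    // Returns the value of the first "To:" line, or null if there is none.
    // Assumption: the header value sits on a single line (no folding).
    public static string GetToLine(string header)
    {
        string[] lines = header.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);
        foreach (string line in lines)
        {
            if (line.StartsWith("To:", StringComparison.OrdinalIgnoreCase))
            {
                // Drop the "To:" prefix and surrounding whitespace.
                return line.Substring("To:".Length).Trim();
            }
        }
        return null;
    }
}

class Program
{
    static void Main()
    {
        string header = "From: \"Lastname, Firstname\" <firstname_lastname@domain.com>\n" +
                        "To: <testto@domain.com>, testto1@domain.com, testto2@domain.com\n" +
                        "X-OriginalArrivalTime: 27 Apr 2011 13:52:46.0235 (UTC)";

        Console.WriteLine(HeaderReader.GetToLine(header));
    }
}
```

Matching on the start of the line also avoids picking up "To: " inside another header such as Reply-To.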

There's a breakdown of validating emails with regex here, which references a more practical implementation of RFC 2822 with:

[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?

It also looks like you only want the email addresses out of the "To" field, and you've got the <> to worry about as well, so something like the following would likely work:

^To: ((?:\<?[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\>?,?(?:\s*))*)

Again, as others have mentioned, you might not want to do this. But if you want a regex that will turn that input into <testto@domain.com>, testto1@domain.com, testto2@domain.com, that'll do it.
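For what it's worth, here is a sketch of how the To pattern above might be applied in C#. It assumes two things the pattern itself does not state: RegexOptions.Multiline, so that ^ can match at the start of the To line rather than only at the start of the text, and RegexOptions.IgnoreCase, because the character classes only list lowercase letters. The names (ToLineRegex, ExtractTo) are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class ToLineRegex
{
    // The pattern from the answer, unchanged.
    const string Pattern = @"^To: ((?:\<?[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\>?,?(?:\s*))*)";

    // Hypothetical helper: returns the address list of the To line, or "" if no match.
    public static string ExtractTo(string header)
    {
        Match m = Regex.Match(header, Pattern, RegexOptions.IgnoreCase | RegexOptions.Multiline);

        // The trailing \s* in the pattern also swallows the line break after the
        // last address, so the captured group is trimmed.
        return m.Success ? m.Groups[1].Value.Trim() : string.Empty;
    }
}

class Program
{
    static void Main()
    {
        string header = "From: \"Lastname, Firstname\" <firstname_lastname@domain.com>\n" +
                        "To: <testto@domain.com>, testto1@domain.com, testto2@domain.com\n" +
                        "Cc: <testcc@domain.com>, test3@domain.com";

        Console.WriteLine(ToLineRegex.ExtractTo(header));
    }
}
```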
