Trying to extract information from an email with .NET regex

I am trying to extract some information from the "disclaimer" area of stock promotion "tout" emails (junk mail to most people).

Typically a tout will have a disclaimer to the effect of:

Company XYZ has been compensated fifty thousand dollars for a two week promotion of stock ABC.

I have a regex that handles cases like this (it may not be the most efficient as it stands) and it seems to work most of the time. However, when the disclaimer uses a web address to refer to the promoting company (i.e., www.companyxyz.com instead of Company XYZ), my regex grabs the ".com" and the rest of the phrase I am trying to capture, but not the "www.companyxyz" part.

Here is my regex method:

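    // Requires: using System.Text.RegularExpressions;
    public string ExtractCompensationLine(string message)
    {
        string compensationLine = string.Empty;
        // Flatten the message to one line so a sentence can span line breaks.
        string messageLine = Regex.Replace(message, "[\n\r\t]", " ");
        // Everything from the preceding period up to the phrase...
        string leftPrefix = @"\.((\w|\s|\d|\,)+";
        // ...and from the phrase up to the next period.
        string rightPrefix = @"(\w|\s|\d|\,)+\.)";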

        string[] phrases = 
        {
            @"has been compensated",
            @"we were also paid",
            @"has been previously compensated",
            @"currently being compensated",
            @"the company has compensated",
            @"has agreed to be compensated",
            @"have been compensated up to",
            @"dollars from a third party",
            @"the company will compensate us"
        };

        foreach (string phrase in phrases)
        {
            string pattern = leftPrefix + phrase + rightPrefix;
            Regex compensationRegex = new Regex(pattern, RegexOptions.IgnoreCase);
            Match match = compensationRegex.Match(messageLine);

            if (match.Success)
            {
                compensationLine += match.Groups[1].Value;
            }
        }

        return compensationLine;
    }

So, the regex captures the whole phrase from the first word of the sentence (by finding the previous period) up until the last period of the sentence. But these web addresses don't play nicely with my regex.
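A minimal reproduction of the behavior described above may help. It uses the same building blocks as the method, reduced to one phrase; the class name `PeriodDemo` and the sample sentences are made up for illustration. None of the alternatives in `(\w|\s|\d|\,)` can match a period, so the only place the leading `\.` can succeed is at the last period of the web address.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PeriodDemo
{
    // Hypothetical helper: the question's pattern for a single phrase.
    public static string CaptureWithOriginalPattern(string text)
    {
        string pattern = @"\.((\w|\s|\d|\,)+has been compensated(\w|\s|\d|\,)+\.)";
        Match match = Regex.Match(text, pattern, RegexOptions.IgnoreCase);

        // Group 1 is everything after the leading period, as in the original method.
        return match.Success ? match.Groups[1].Value : string.Empty;
    }
}
```

With the made-up input `"Read this. www.companyxyz.com has been compensated fifty thousand dollars for a two week promotion of stock ABC."`, group 1 starts at `com has been compensated`, because the periods inside the address stop the left-hand group from reaching back any further.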

If I understand your problem correctly, given a sentence that contains one of the given phrases, you want to capture from the beginning of that sentence to its end, or to the end of the line. Your challenge is to find the end of the sentence that precedes the one you want to match. So you need to match on ". " (a period followed by whitespace), and then the rest.

I don't understand why you have "(\w|\s|\d|\,)" instead of just ".". That group will not give the result I describe above, but I'll leave it as is and just focus on the problem you described.

So try this:

    leftPrefix = @"(.*\.\s+)*?((\w|\s|\d|\,)+";

(.*\.\s+)*? : lazily match any characters followed by a period followed by whitespace.

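Since I use parentheses to group this new subexpression, you get a new capture group ahead of the existing ones. The sentence therefore moves from group 1 to group 2, so read it from the Groups collection of the Match object (match.Groups[2].Value) rather than from match.Groups[1].Value.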
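As a rough sketch of the idea in this answer (a sentence ends at a period followed by whitespace, or at the end of the input), the pattern below lets periods inside a web address count as ordinary sentence characters. It also uses a named group so that adding or removing parentheses does not renumber the capture. The names `SentenceSketch`, `ExtractSentence` and `SentenceChar` are hypothetical and not part of the original method, and the sketch assumes sentences end with a period, so an abbreviation such as "Inc. " would still be treated as a sentence end.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SentenceSketch
{
    // One character of a sentence: anything except a period, or a period that is
    // NOT followed by whitespace or end of input (so "www.companyxyz.com" is allowed).
    private const string SentenceChar = @"(?:[^.]|\.(?!\s|$))";

    // Hypothetical helper: returns the sentence containing the phrase, or "" if none.
    public static string ExtractSentence(string message, string phrase)
    {
        // Same flattening step as the question's method.
        string line = Regex.Replace(message, "[\n\r\t]", " ");

        string pattern =
            @"(?<=^|\.\s)" +                        // start of input, or just after ". "
            "(?<sentence>" + SentenceChar + "*?" +  // lazily scan up to the phrase
            Regex.Escape(phrase) +                  // the phrase itself, taken literally
            SentenceChar + @"*\.)";                 // up to the sentence-ending period

        Match match = Regex.Match(line, pattern, RegexOptions.IgnoreCase);
        return match.Success ? match.Groups["sentence"].Value.Trim() : string.Empty;
    }
}
```

Calling `SentenceSketch.ExtractSentence(text, "has been compensated")` inside the existing `foreach` loop would replace the `leftPrefix + phrase + rightPrefix` concatenation; whether the sentence-boundary rule is strict enough for real tout emails is something to check against actual samples.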
