
C# regex to match specific text

I'm looking to match all text in the format foo:12345 that is not contained within an HTML anchor. For example, I'd like to match lines 1 and 3 from the following:

foo:123456

<a href="http://www.google.com">foo:123456</a>

foo:123456

I've tried these regexes with no success:

Negative lookahead attempt (incorrectly matches, but doesn't include the last digit)

foo:(\\d+)(?!</a>)

Negative lookahead with non-capturing grouping

(?:foo:(\\d+))(?!</a>)

Negative lookbehind attempt (wildcards don't seem to be supported)

(?<!<a[^>]>)foo:(\\d+)

If you want to start analysing HTML like this, you probably want to actually parse the HTML instead of using regular expressions. The HTML Agility Pack is the usual first port of call. With regular expressions it becomes hard to deal with things like <a></a>foo:123456<a></a>, where the middle bit should of course be pulled out, but it's extremely hard to write a regex that will do that.

I should add that I am assuming you do in fact have a block of HTML rather than just individual short strings like each line in your example. I partly ruled that case out because matching foo:123456 when it is the only thing on the line is pretty easy, so I figured you'd have got it already if that was what you wanted. :)

Regex is usually not the best tool for this job, but if your case is as specific as in your example, you could use:

foo:((?>\d+))(?!</a>)

Your first expression didn't work because \\d+ would backtrack until (?!</a>) matches: it gives up the last digit, so the lookahead sees a digit instead of </a> and succeeds. This can be fixed by not allowing \\d+ to backtrack, as above with the help of an atomic (non-backtracking) group, or you could make the lookahead fail in case \\d+ backtracks, like:

foo:(\d+)(?!</a>|\d)

Although that is not as efficient.
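To see the difference concretely, here is a minimal C# sketch that runs the question's first pattern, the atomic-group pattern, and a lookahead that also rejects a following digit. The class and member names (FooPatterns, Digits and the three pattern constants) are made up for illustration and are not part of any library.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FooPatterns
{
    // The question's first attempt: \d+ may give back a digit to satisfy the lookahead.
    public const string Naive = @"foo:(\d+)(?!</a>)";

    // Atomic group: once \d+ has matched, it cannot give digits back.
    public const string Atomic = @"foo:((?>\d+))(?!</a>)";

    // Backtracking is still allowed, but the lookahead fails if a digit follows.
    public const string GuardedLookahead = @"foo:(\d+)(?!</a>|\d)";

    // Returns the captured digits (group 1) of every match of pattern in input.
    public static List<string> Digits(string input, string pattern)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}
```

Calling FooPatterns.Digits on the three sample lines from the question with FooPatterns.Naive captures 12345 from inside the anchor, while the other two patterns skip the anchored occurrence entirely and return only the two plain ones.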

Note that in many regex engines a lookbehind cannot contain a variable-length pattern (.NET is an exception), so you may prefer to work it out differently, for example:

  1. Find and mark all foo-s that are contained in an anchor
  2. Find all the other foo-s and do what you need with them
  3. Remove the marks
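The three steps above could be sketched roughly like this in C#. Everything here is an assumption for illustration: the names FooMarker and ReplaceOutsideAnchors are hypothetical, the control character U+0001 is assumed never to occur in the input, and anchors are assumed not to be nested.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FooMarker
{
    // Marker character; assumed never to appear in real input.
    private const string Mark = "\u0001";

    // Applies transform to every foo:<digits> that is not inside an <a>...</a> element.
    public static string ReplaceOutsideAnchors(string html, Func<string, string> transform)
    {
        // 1. Mark every "foo:" that sits inside an anchor element.
        string marked = Regex.Replace(
            html,
            @"<a\b[^>]*>.*?</a>",
            m => m.Value.Replace("foo:", Mark + "foo:"),
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        // 2. Process only the occurrences that are not preceded by the marker.
        string processed = Regex.Replace(
            marked,
            @"(?<!\u0001)foo:\d+",
            m => transform(m.Value));

        // 3. Remove the marks again.
        return processed.Replace(Mark, "");
    }
}
```

For example, FooMarker.ReplaceOutsideAnchors(text, s => "[" + s + "]") wraps the unanchored occurrences in brackets and leaves the one inside the link untouched. It is still regex-on-HTML, so it shares the usual fragility with malformed markup.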

This is probably a long-winded way of doing it, but you could simply bring back all occurrences of foo: followed by digits, then exclude the unwanted ones afterwards:

// Match foo:<digits>, plus a trailing "<" when a tag follows directly
string pattern = @"foo:\d+<?";

Then use a MatchCollection:

 MatchCollection m0 = Regex.Matches(file, pattern, RegexOptions.Singleline);

Then loop through each occurrence:

foreach (Match m in m0)
{
    // Exclude the matches that contain the "<": a tag such as </a> follows them
    if (m.Value.EndsWith("<"))
        continue;

    // ... use m.Value here
}

I would use LINQ to XML and treat the HTML like XML, for example:

var query = MyHtml.Descendants().ToArray();
foreach (XElement result in query)
{
    if (Regex.IsMatch(result.Value, @"foo:123456") && result.Name.ToString() != "a")
    {
        // do something...
    }
}

Perhaps there's a better way, but I don't know it... this seems pretty straightforward to me :P
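A slightly fuller sketch of the same LINQ to XML idea is below. It is a variation, not the code above: it walks the text nodes rather than the elements, because an element's Value also includes the text of its descendants, so a parent of an anchor would otherwise match too. It assumes the input is well-formed XML (real-world HTML often is not), and the names FooXml and FindOutsideAnchors are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class FooXml
{
    // Returns every foo:<digits> found in text that has no <a> ancestor.
    public static List<string> FindOutsideAnchors(string xhtml)
    {
        // Wrap the fragment in a dummy root so that it parses as a single element.
        XElement root = XElement.Parse("<root>" + xhtml + "</root>");
        var results = new List<string>();

        foreach (XText text in root.DescendantNodes().OfType<XText>())
        {
            // Skip text nodes nested (at any depth) inside an anchor.
            bool insideAnchor = text.Ancestors().Any(e =>
                string.Equals(e.Name.LocalName, "a", StringComparison.OrdinalIgnoreCase));
            if (insideAnchor)
                continue;

            foreach (Match m in Regex.Matches(text.Value, @"foo:\d+"))
            {
                results.Add(m.Value);
            }
        }
        return results;
    }
}
```

XElement.Parse throws an XmlException on markup that is not well-formed, which is the main reason a dedicated HTML parser is usually the safer choice for arbitrary pages.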

