
C# regex to match specific text

I'm looking to match all text in the format foo:12345 that is not contained within an HTML anchor. For example, I'd like to match lines 1 and 3 from the following:

foo:123456

<a href="http://www.google.com">foo:123456</a>

foo:123456

I've tried these regexes with no success:

Negative lookahead attempt (incorrectly matches, but doesn't include the last digit)

foo:(\\d+)(?!</a>)

Negative lookahead with non-capturing grouping

(?:foo:(\\d+))(?!</a>)

Negative lookbehind attempt (wildcards don't seem to be supported)

(?<!<a[^>]>)foo:(\\d+)

If you want to start analysing HTML like this then you probably want to actually parse the HTML instead of using regular expressions. The HTML Agility Pack is the usual first port of call. With regular expressions it becomes hard to deal with things like <a></a>foo:123456<a></a>, where of course the middle bit should be pulled out, but it's extremely hard to write a regex that will do that.

I should add that I am assuming that you do in fact have a block of HTML rather than just individual short strings such as each of your lines above. I ruled that case out partly because matching it when it is the only thing on the line is pretty easy, so I figured you'd have got it already if that was what you wanted. :)

Regex is usually not the best tool for this job, but if your case is very specific, like in your example, you could use:

foo:((?>\d+))(?!</a>)

Your first expression didn't work because \d+ would backtrack until (?!</a>) matches. This can be fixed by not allowing \d+ to backtrack, as above with the help of an atomic (non-backtracking) group, or you could also make the lookahead fail in case \d+ backtracks, like:

foo:(\d+)(?!</a>|\d)

Although that is not as efficient.
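To see the backtracking in action, here is a minimal sketch that runs the question's first pattern and the two fixes against the same input. The sample string and the FindAll helper are made up for illustration; only the patterns come from the thread.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BacktrackingDemo
{
    // Hypothetical helper: returns every full match of `pattern` in `input`, in order.
    public static string[] FindAll(string pattern, string input) =>
        Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        // Made-up input: one bare occurrence and one inside an anchor.
        string html = "foo:111 <a href=\"http://www.google.com\">foo:222</a>";

        // Plain \d+ gives back its last digit so that the lookahead can succeed.
        Console.WriteLine(string.Join(", ", FindAll(@"foo:(\d+)(?!</a>)", html)));
        // foo:111, foo:22

        // The atomic group keeps every digit, so the anchored occurrence is rejected.
        Console.WriteLine(string.Join(", ", FindAll(@"foo:((?>\d+))(?!</a>)", html)));
        // foo:111

        // \d+ may backtrack here, but the lookahead then fails on the digit it gave back.
        Console.WriteLine(string.Join(", ", FindAll(@"foo:(\d+)(?!</a>|\d)", html)));
        // foo:111
    }
}
```

Note that all three patterns only look at what directly follows the digits, so they assume the number is the last thing inside the anchor.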

Note that in many regex engines lookbehind does not work with patterns of variable length (.NET is an exception), so you may want to approach it differently, for example:

  1. Find and mark all foo-s that are contained in an anchor
  2. Find all the other foo-s and do what you need with them
  3. Remove the marks
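The three steps above could be sketched roughly like this. The marker character, the ProcessUnanchored name and the bracket-wrapping callback are all placeholders I picked for the example, and the marking regex only handles anchors whose whole content is foo: followed by digits.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MarkAndProcess
{
    // Hypothetical marker; assumed never to occur in the real input.
    private const string Mark = "\u0001";

    public static string ProcessUnanchored(string html, Func<string, string> process)
    {
        // 1. Mark: split up every foo:NNN that is the whole content of an anchor.
        string marked = Regex.Replace(
            html, @"(<a\b[^>]*>foo)(:\d+</a>)", "$1" + Mark + "$2");

        // 2. Process every occurrence that is still intact (the unmarked ones).
        string processed = Regex.Replace(marked, @"foo:\d+", m => process(m.Value));

        // 3. Remove the marks again.
        return processed.Replace(Mark, "");
    }

    public static void Main()
    {
        string html = "foo:1 <a href=\"#\">foo:2</a> foo:3";
        Console.WriteLine(ProcessUnanchored(html, s => "[" + s + "]"));
        // [foo:1] <a href="#">foo:2</a> [foo:3]
    }
}
```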

This is probably a long-winded way of doing it, but you could simply bring back all occurrences of foo: followed by some digits and then exclude the unwanted ones afterwards:

string pattern = @"foo:\d+ |" +
                 @"foo:\d+[<]";

Then use a MatchCollection:

 MatchCollection m0 = Regex.Matches(file, pattern, RegexOptions.Singleline);

Then loop through each occurrence:

foreach (Match m in m0)
{
    // Skip the matches that end with "<": those sit inside an anchor.
    if (m.Value.EndsWith("<"))
        continue;

    // ... handle the remaining matches here
}
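Put together, that idea might look like the sketch below. It is not exactly the pattern above: I've assumed you also want an occurrence at the very end of the input, so the terminator is written as whitespace, "<" or end of input. FindBare is a made-up name.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchThenExclude
{
    public static List<string> FindBare(string text)
    {
        // Match foo:NNN together with the character that ends it
        // (assumption: whitespace, "<" or end of input).
        string pattern = @"foo:\d+(?:\s|<|$)";
        var results = new List<string>();

        foreach (Match m in Regex.Matches(text, pattern, RegexOptions.Singleline))
        {
            // Exclude the matches that run straight into a tag.
            if (m.Value.EndsWith("<"))
                continue;

            results.Add(m.Value.TrimEnd());
        }
        return results;
    }

    public static void Main()
    {
        string file = "foo:1 <a href=\"#\">foo:2</a> foo:3";
        Console.WriteLine(string.Join(", ", FindBare(file)));
        // foo:1, foo:3
    }
}
```

Like the original, this treats any "<" right after the digits as "inside an anchor", so it would also drop an occurrence that is followed by some other tag.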

I would use LINQ and treat the HTML like XML, for example:

var query = MyHtml.Descendants().ToArray();
foreach (XElement result in query)
{
    if (Regex.IsMatch(result.Value, @"foo:123456") && result.Name.ToString() != "a")
    {
        // do something...
    }
}

Perhaps there's a better way, but I don't know it... this seems pretty straightforward to me :P
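A slightly more careful variation on the same idea is to look at the text nodes rather than the elements, since an element's Value also includes the text of its children. This is only a sketch and it assumes the markup is well-formed XML, which real-world HTML often isn't; FindOutsideAnchors and the sample markup are made up.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class XmlTextNodes
{
    public static string[] FindOutsideAnchors(string xhtml)
    {
        // Assumption: the input parses as XML (a single well-formed root element).
        XElement root = XElement.Parse(xhtml);

        return root.DescendantNodes()
            .OfType<XText>()                          // only the text nodes
            .Where(t => !t.Ancestors("a").Any())      // not inside any <a> element
            .SelectMany(t => Regex.Matches(t.Value, @"foo:\d+").Cast<Match>())
            .Select(m => m.Value)
            .ToArray();
    }

    public static void Main()
    {
        string xhtml = "<div>foo:1 <a href=\"http://www.google.com\">foo:2</a> foo:3</div>";
        Console.WriteLine(string.Join(", ", FindOutsideAnchors(xhtml)));
        // foo:1, foo:3
    }
}
```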
