
Regular expression matching all characters returns too few matches

I'm trying to parse an HTML page, and I use the following regular expression:

var regex = new Regex(@"<tag1 id=.id1.>.*<tag2>", RegexOptions.Singleline);

"tag1 id=.id1." occurs in the document only once. "tag2" occurs nearly 50 times after the occurrence of "tag1". But when I try to match the page code with my regular expression, it returns only 1 match. Moreover, when I change RegexOptions to "None" or "Multiline", no matches are returned. I'm very confused about this and would appreciate any help.

Parsing HTML with regex is a very bad idea, and it's unreliable because there is still a lot of "broken HTML" in the world. To parse HTML, I would suggest using the HTML Agility Pack. It is an excellent library for parsing HTML, and I have never had an issue with any HTML I've fed into it.

Leaving aside the obvious exhortations about not using regex to parse HTML, I can explain why you're seeing what you're seeing.

If tag1 occurs in your text only once, then the regex can only match it once, so there can never be more than one match. Regular expression matches "consume" the text they have matched, so the next match attempt starts at the end of the last successful match.
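To make the "only one match" point concrete, here is a minimal sketch. The sample input string is made up for illustration (one tag1 followed by three tag2 elements), and the second half shows one possible workaround, not necessarily the best one: locate tag1 once, then search for tag2 starting from the end of that match.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample input: one <tag1>, then three <tag2> elements.
string html = "<tag1 id='id1'>\n<tag2>a</tag2>\n<tag2>b</tag2>\n<tag2>c</tag2>";

// The pattern from the question: tag1 can only be matched once,
// so the whole pattern can only match once.
var regex = new Regex(@"<tag1 id=.id1.>.*<tag2>", RegexOptions.Singleline);
MatchCollection matches = regex.Matches(html);
Console.WriteLine(matches.Count); // 1

// Possible workaround: find tag1 once, then look for every <tag2>
// that starts after the end of that match.
Match start = Regex.Match(html, @"<tag1 id=.id1.>");
var tag2 = new Regex(@"<tag2>");
MatchCollection after = tag2.Matches(html, start.Index + start.Length);
Console.WriteLine(after.Count); // 3
```

The overload Matches(string, int) takes the position at which to start searching, so the tag1 part never has to be matched again.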

This leads to the next problem: .* is greedy, so (with RegexOptions.Singleline) it first matches all the way to the end of the string and then backtracks to the last <tag2> it can find in order to allow a successful match. That is another reason why you get only one match: the single match already spans everything from tag1 to the final <tag2>.
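The difference between the greedy .* and the lazy .*? can be checked with a small sketch. The input string is again a made-up example, and the lazy variant is shown only to illustrate the backtracking behaviour; it still yields a single match because tag1 occurs once.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample input: one <tag1>, then two <tag2> elements.
string html = "<tag1 id='id1'>\n<tag2>a</tag2>\n<tag2>b</tag2>";

// Greedy: .* runs to the end of the string, then backtracks to the LAST <tag2>.
Match greedy = Regex.Match(html, @"<tag1 id=.id1.>.*<tag2>", RegexOptions.Singleline);

// Lazy: .*? expands one character at a time and stops at the FIRST <tag2>.
Match lazy = Regex.Match(html, @"<tag1 id=.id1.>.*?<tag2>", RegexOptions.Singleline);

// Both matches start at index 0, so their lengths equal their end positions.
int firstTag2End = html.IndexOf("<tag2>", StringComparison.Ordinal) + "<tag2>".Length;
int lastTag2End = html.LastIndexOf("<tag2>", StringComparison.Ordinal) + "<tag2>".Length;

Console.WriteLine(greedy.Length == lastTag2End); // True
Console.WriteLine(lazy.Length == firstTag2End);  // True
```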

As for your second question: why do the matches go away if you don't use RegexOptions.Singleline? Simple: without that option, the dot . cannot match newlines, and there appears to be at least one newline between tag1 and the first tag2. RegexOptions.Multiline does not help either, because it only changes how ^ and $ behave, not what the dot matches.
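A quick way to see the effect of the options is to run the same pattern against a string that has a newline between the two tags. This is a sketch with an invented input string; the real page will differ, but the behaviour of the dot is the same.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample input with a newline between tag1 and tag2.
string html = "<tag1 id='id1'>\n<tag2>a</tag2>";
string pattern = @"<tag1 id=.id1.>.*<tag2>";

// Singleline: the dot also matches '\n', so .* can cross the line break.
bool withSingleline = Regex.IsMatch(html, pattern, RegexOptions.Singleline);

// None: the dot stops at '\n', so .* can never reach <tag2>.
bool withNone = Regex.IsMatch(html, pattern, RegexOptions.None);

// Multiline only affects ^ and $; the dot still stops at '\n'.
bool withMultiline = Regex.IsMatch(html, pattern, RegexOptions.Multiline);

Console.WriteLine(withSingleline); // True
Console.WriteLine(withNone);       // False
Console.WriteLine(withMultiline);  // False
```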


 