
C# Regex Problem

I want to extract all table rows from an HTML page, but the pattern @"<tr>([\w\W]*)</tr>" is not working. It gives a single result that runs from the first occurrence of <tr> to the last occurrence of </tr>. I want every individual <tr>...</tr> value instead. Can anyone tell me how to do this?

[\w\W]* matches greedily, so it matches from the first <tr> to the last </tr>.

A regex approach won't work well because HTML is not a regular language. If you really want to try, you could use a lazy quantifier, such as "<tr>(.*?)</tr>" with the RegexOptions.Singleline flag; however, this isn't guaranteed to work in all cases.
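To make the difference between the greedy and lazy quantifiers concrete, here is a minimal sketch. The sample HTML string and the class and method names are made up for illustration; only Regex.Matches and RegexOptions.Singleline come from .NET itself.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVersusLazy
{
    // Hypothetical sample input. The line break inside the first row is
    // deliberate: without RegexOptions.Singleline, '.' would not match '\n'.
    public const string Html =
        "<table>\n<tr><td>1</td>\n<td>2</td></tr>\n<tr><td>3</td></tr>\n</table>";

    // Counts how many matches the given pattern produces on the sample input.
    public static int CountMatches(string pattern)
    {
        return Regex.Matches(Html, pattern, RegexOptions.Singleline).Count;
    }

    public static void Main()
    {
        // Greedy: one match spanning the first <tr> to the last </tr>.
        Console.WriteLine(CountMatches(@"<tr>(.*)</tr>"));   // prints 1

        // Lazy: stops at the nearest </tr>, so each row matches separately.
        Console.WriteLine(CountMatches(@"<tr>(.*?)</tr>"));  // prints 2
    }
}
```

This only shows the quantifier behaviour on well-formed input; it does not make the regex robust against arbitrary HTML.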

For parsing HTML you need an HTML parser. Try HTML Agility Pack.
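A rough sketch of what the parser route might look like, assuming the HtmlAgilityPack NuGet package is referenced in the project. The class name, method name, and sample markup are hypothetical; HtmlDocument, LoadHtml, DocumentNode, SelectNodes, and OuterHtml are part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class AgilityPackRows
{
    // Returns the outer HTML of every <tr> element found in the input.
    public static List<string> ExtractRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<string>();

        // "//tr" is an XPath query selecting every <tr> in the document.
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//tr");
        if (nodes == null)
        {
            return rows;
        }

        foreach (HtmlNode node in nodes)
        {
            rows.Add(node.OuterHtml);
        }
        return rows;
    }

    public static void Main()
    {
        // Hypothetical input: the first row carries an attribute, which a
        // literal "<tr>" regex would miss.
        string html =
            "<table><tr class=\"odd\"><td>1</td></tr><tr><td>2</td></tr></table>";

        foreach (string row in ExtractRows(html))
        {
            Console.WriteLine(row);
        }
    }
}
```

Unlike the regex, this approach does not depend on how the opening tag is written, so rows with attributes are found as well.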

I agree with Mark: you should use the HTML Agility Pack library.

As for your regex, you should go with something like:

@"<tr>([\s\S]*?)</tr>"

That's a non-greedy pattern, so you should get one match for every TR.
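As an illustration, the pattern above could be used roughly like this to collect the contents of each row. The RowExtractor class, the ExtractRows method, and the sample input are made up for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RowExtractor
{
    // Returns the text between each <tr> and its nearest </tr>.
    public static List<string> ExtractRows(string html)
    {
        var rows = new List<string>();

        // [\s\S] matches any character, including newlines, so no
        // RegexOptions.Singleline flag is needed with this pattern.
        foreach (Match m in Regex.Matches(html, @"<tr>([\s\S]*?)</tr>"))
        {
            // Groups[1] is the capturing group, i.e. the row contents
            // without the surrounding <tr> and </tr> tags.
            rows.Add(m.Groups[1].Value);
        }
        return rows;
    }

    public static void Main()
    {
        // Hypothetical sample input with a line break between the rows.
        string html = "<table><tr><td>a</td></tr>\n<tr><td>b</td></tr></table>";

        foreach (string row in ExtractRows(html))
        {
            Console.WriteLine(row);
        }
        // prints:
        // <td>a</td>
        // <td>b</td>
    }
}
```

Note that the pattern only matches a bare <tr>; a row written as <tr class="..."> or <TR> would be skipped, which is one of the reasons the parser is the safer choice.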
