C# regular expression problem

Question

I want to extract all table rows from an HTML page, but the pattern @"<tr>([\w\W]*)</tr>" is not working. It gives a single result that runs from the first occurrence of <tr> to the last occurrence of </tr>, whereas I want every <tr>...</tr> occurrence as its own match. Can anyone tell me how I can do this?

Answer 1

[\w\W]* matches greedily, so it will match from the first <tr> to the last </tr>.

A regex approach won't work well because HTML is not a regular language. If you really want to try, you could use a lazy quantifier such as "<tr>(.*?)</tr>" with the RegexOptions.Singleline flag; however, this isn't guaranteed to work in all cases. For example, it misses rows whose opening tag carries attributes, such as <tr class="odd">, and it breaks on nested tables.
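To make the greedy versus lazy difference concrete, here is a minimal sketch. The class name, method names, and the two-row sample string are made up for illustration; only Regex.Matches and RegexOptions.Singleline come from .NET itself.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Hypothetical sample input: two simple rows on separate lines.
    public const string Html =
        "<table>\n<tr><td>1</td></tr>\n<tr><td>2</td></tr>\n</table>";

    // Greedy: [\w\W]* grabs as much as possible, so the match runs
    // from the first <tr> to the last </tr>.
    public static int CountGreedy(string html) =>
        Regex.Matches(html, @"<tr>([\w\W]*)</tr>").Count;

    // Lazy: .*? stops at the nearest </tr>. Singleline lets '.' match '\n'.
    public static int CountLazy(string html) =>
        Regex.Matches(html, @"<tr>(.*?)</tr>", RegexOptions.Singleline).Count;

    public static void Main()
    {
        Console.WriteLine(CountGreedy(Html)); // prints 1
        Console.WriteLine(CountLazy(Html));   // prints 2
    }
}
```

The lazy version only behaves this way on simple input like the sample; it is not a general HTML solution.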

For parsing HTML you need an HTML parser. Try HTML Agility Pack.
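A rough sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is referenced in the project. The class name, method name, and sample markup are invented for illustration; HtmlDocument.LoadHtml, DocumentNode.SelectNodes, and OuterHtml are the library calls being relied on.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, not part of the .NET base library

public static class TableRows
{
    // Returns the outer HTML of every <tr> element found in the input.
    public static List<string> GetRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//tr");
        if (nodes != null)
        {
            foreach (HtmlNode tr in nodes)
            {
                result.Add(tr.OuterHtml);
            }
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical sample: the first row has an attribute on its opening tag.
        string html =
            "<table><tr class=\"odd\"><td>1</td></tr><tr><td>2</td></tr></table>";

        foreach (string row in GetRows(html))
        {
            Console.WriteLine(row);
        }
    }
}
```

Because the XPath query selects elements rather than matching text, a row with attributes on its opening tag is found just like a plain one.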

Answer 2

I agree with Mark: you should use the HTML Agility Pack library.

As for your regex, you should go with something like:

@"<tr>([\s\S]*?)</tr>"

That's a non-greedy pattern, and you should get one match for every TR.
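As a sketch of how that pattern might be used to collect each row's contents: the class name, method name, and sample input below are hypothetical, and the pattern is the one given above. Since [\s\S] already matches newlines, no RegexOptions.Singleline flag is needed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RowExtractor
{
    // Returns the text between <tr> and </tr> for every match.
    public static List<string> ExtractRows(string html)
    {
        var rows = new List<string>();
        foreach (Match m in Regex.Matches(html, @"<tr>([\s\S]*?)</tr>"))
        {
            // Group 1 is the lazily captured content of the row.
            rows.Add(m.Groups[1].Value);
        }
        return rows;
    }

    public static void Main()
    {
        // Hypothetical sample: the first row spans several lines.
        string html = "<table><tr>\n<td>a</td>\n</tr><tr><td>b</td></tr></table>";

        foreach (string row in ExtractRows(html))
        {
            Console.WriteLine(row.Trim());
        }
    }
}
```

Note that the match is case-sensitive by default, so an uppercase <TR> would be skipped unless RegexOptions.IgnoreCase is passed.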

Answer 1
5 2011-02-04 22:55:52

Answer 2
2 Accepted 2011-02-04 23:00:10