C# Regex Problem
I want to extract all table rows from an HTML page, but the pattern @"<tr>([\\w\\W]*)</tr>" is not working. It gives a single result that runs from the first occurrence of <tr> to the last occurrence of </tr>. I want every occurrence of <tr>...</tr> as a separate value. Can anyone please tell me how I can do this?
[\\w\\W]* matches greedily, so it will match from the first <tr> to the last </tr>.
A regex approach won't work well because HTML is not a regular language. If you really want to try, use a lazy quantifier such as "<tr>(.*?)</tr>" with the RegexOptions.Singleline flag; however, this isn't guaranteed to work in all cases.
For parsing HTML you need an HTML parser. Try HTML Agility Pack.
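As a rough sketch of the parser route, the snippet below loads a string into HTML Agility Pack and selects every tr element with an XPath query. It assumes the HtmlAgilityPack NuGet package is referenced; the class name TableRowParserDemo and the method ExtractRows are made up for illustration.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class TableRowParserDemo
{
    // Hypothetical helper: returns the inner HTML of every <tr> element in the input.
    public static List<string> ExtractRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<string>();

        // "//tr" selects every tr element anywhere in the document.
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//tr");
        if (nodes == null)
        {
            return rows;
        }

        foreach (HtmlNode tr in nodes)
        {
            rows.Add(tr.InnerHtml);
        }
        return rows;
    }
}
```

Because the parser works on elements rather than raw text, rows written as <tr class="..."> or <TR> should be found as well, which a literal <tr> pattern would miss.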
I do agree with Mark: you should use the HTML Agility Pack library.
About your regex, you should go with something like:
@"<tr>([\s\S]*?)</tr>"
That's a non-greedy pattern, and you should get one match for every TR.
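To make the greedy versus non-greedy difference concrete, here is a minimal sketch that runs both forms of the pattern through Regex.Matches and counts the results. The class name TableRowRegexDemo, the helper ExtractRows, and the sample HTML are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TableRowRegexDemo
{
    // Hypothetical helper: returns capture group 1 of every match of the pattern.
    public static List<string> ExtractRows(string html, string pattern)
    {
        var rows = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern))
        {
            rows.Add(m.Groups[1].Value);
        }
        return rows;
    }

    public static void Main()
    {
        // Made-up sample input with two rows on separate lines.
        string html = "<table>\n<tr><td>1</td></tr>\n<tr><td>2</td></tr>\n</table>";

        // Greedy: [\s\S]* runs to the last </tr>, so both rows end up in one match.
        List<string> greedy = ExtractRows(html, @"<tr>([\s\S]*)</tr>");

        // Non-greedy: [\s\S]*? stops at the nearest </tr>, giving one match per row.
        List<string> lazy = ExtractRows(html, @"<tr>([\s\S]*?)</tr>");

        Console.WriteLine(greedy.Count); // 1
        Console.WriteLine(lazy.Count);   // 2
    }
}
```

[\s\S] matches any character including newlines, so no RegexOptions are needed here; "<tr>(.*?)</tr>" with RegexOptions.Singleline behaves the same way. Either pattern only matches a bare, lowercase <tr> with no attributes, which is one of the cases where the regex approach falls short.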