C# regular expression to extract the content of a div

Question

I've seen some related questions and tried their answers, but they don't work. I want to match the content of a div with the id "thumbs", but regex.Success returns false :(

Match regex = Regex.Match(html, @"<div[^>]*id=""thumbs"">(.+?)</div>");

Answer 1

Regex is not a good choice for parsing HTML files.

HTML is not strict, nor is its format regular.

Use HtmlAgilityPack.

Why use a parser?

Consider your regex. There are countless cases that could break your code:

Your regex won't work if there are nested divs.
Some divs in real-world HTML don't have an ending tag (XHTML does not allow this).
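
The nested-div case can be checked with a small sketch. It runs the pattern from the question, unchanged, against a sample string; the class name, method name, and sample markup are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class NestedDivDemo
{
    // The pattern from the question, unchanged.
    const string Pattern = @"<div[^>]*id=""thumbs"">(.+?)</div>";

    // Returns what group 1 captures, or an empty string when there is no match.
    public static string Capture(string html)
    {
        Match m = Regex.Match(html, Pattern);
        return m.Success ? m.Groups[1].Value : "";
    }

    static void Main()
    {
        // Hypothetical sample markup with one nested div.
        string html = "<div id=\"thumbs\">a<div>b</div>c</div>";

        // The lazy (.+?) stops at the first </div>, which belongs to the inner div.
        Console.WriteLine(Capture(html)); // prints: a<div>b
    }
}
```

The capture ends at the inner closing tag, so the trailing "c" of the outer div is lost.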

You can use this code to retrieve it with HtmlAgilityPack:

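using System.Linq;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);

// This XPath selects every div whose id is "thumbs".
// Note: SelectNodes returns null when nothing matches, so check for that in real code.
var itemList = doc.DocumentNode.SelectNodes("//div[@id='thumbs']")
                  .Select(p => p.InnerText)
                  .ToList();

// itemList now contains the text content of every div whose id is "thumbs".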

Answer 2

No, I don't think he needs escapes. He has @ in front of the pattern. I think this is correct:

<div[^>]*id="thumbs">(.+?)</div>

So no doubled double quotes.
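
For reference, here is a small sketch of how quoting behaves in C# source. Inside a verbatim string (the @ prefix) a literal double quote is still written by doubling it, so the literal in the question already yields the pattern text shown above. The second half illustrates one possible reason for Success being false: without RegexOptions.Singleline, `.` does not match a newline. The class name, method name, and multi-line sample are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class VerbatimQuoteDemo
{
    // Verbatim literal, as written in the question: "" stands for one ".
    public const string Verbatim = @"<div[^>]*id=""thumbs"">(.+?)</div>";

    // Regular literal for the same pattern: \" stands for one ".
    public const string Regular = "<div[^>]*id=\"thumbs\">(.+?)</div>";

    public static bool Matches(string html, RegexOptions options)
    {
        return Regex.IsMatch(html, Verbatim, options);
    }

    static void Main()
    {
        // Both literals produce the same pattern text.
        Console.WriteLine(Verbatim == Regular); // prints: True

        // Hypothetical sample whose content spans several lines.
        string html = "<div id=\"thumbs\">\nfirst\nsecond</div>";
        Console.WriteLine(Matches(html, RegexOptions.None));       // prints: False
        Console.WriteLine(Matches(html, RegexOptions.Singleline)); // prints: True
    }
}
```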

Answer 3

Try this:

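// Balancing groups: each nested <div pushes onto "depth", each </div> pops it,
// and the final </div> is accepted only when "depth" is empty again.
Regex r = new Regex(@"(?<text><div\b[^>]*?\bid=(\""|&quot;|&\#34;)"
    + @"thumbs(\""|&quot;|&\#34;)[^>]*>(?>(?!</?div\b).|<div\b"
    + @"(?<depth>)|</div>(?<-depth>))*(?(depth)(?!))</div>)",
    RegexOptions.Singleline);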
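
As a usage illustration, here is a minimal sketch of a balancing-group pattern of this kind applied to a nested sample. It is not the answer's exact regex: the pattern is simplified to plain double quotes, and the class name, method name, and sample markup are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class BalancedDivDemo
{
    // Sketch of a balancing-group pattern (an assumption, not the answer's exact regex):
    // every nested "<div" pushes onto the "depth" stack, every "</div>" pops it,
    // and the final "</div>" is only accepted once the stack is empty.
    static readonly Regex Thumbs = new Regex(
        @"<div\b[^>]*\bid=""thumbs""[^>]*>" +
        @"(?<text>(?>(?!</?div\b).|<div\b(?<depth>)|</div>(?<-depth>))*)" +
        @"(?(depth)(?!))</div>",
        RegexOptions.Singleline);

    // Returns the markup inside the matched div, or an empty string when there is no match.
    public static string InnerHtml(string html)
    {
        Match m = Thumbs.Match(html);
        return m.Success ? m.Groups["text"].Value : "";
    }

    static void Main()
    {
        // Hypothetical sample: a nested div inside "thumbs", followed by an unrelated div.
        string html = "<div id=\"thumbs\">a<div class=\"x\">b</div>c</div><div>tail</div>";
        Console.WriteLine(InnerHtml(html)); // prints: a<div class="x">b</div>c
    }
}
```

This only copes with well-formed nesting; markup with missing closing tags still calls for a real HTML parser.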

3 answers

Answer 1: score 8, accepted, 2013-07-04 12:45:27
Answer 2: score 1, 2013-07-04 12:46:00
Answer 3: score 0, 2013-07-04 12:46:20
