C# Regex extract content of a div

I've seen some questions related to mine and tried their answers, but they don't work. I want to match the content of a div with the id "thumbs", but regex.Success returns false :(

Match regex = Regex.Match(html, @"<div[^>]*id=""thumbs"">(.+?)</div>");

Regex is not a good choice for parsing HTML files.

HTML is not strict, nor is its format regular.

Use HtmlAgilityPack.


Why use a parser?

Consider your regex. There are countless cases that could break your code:

  • Your regex won't work if there are nested divs.
  • Some divs don't have an ending tag (except in XHTML).
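
The nested-div case is easy to reproduce. The following is a minimal sketch that uses only System.Text.RegularExpressions; the class name NestedDivDemo, the helper ExtractThumbs, and the sample HTML are made up for illustration and are not from the question.

```csharp
using System;
using System.Text.RegularExpressions;

class NestedDivDemo
{
    // Hypothetical helper: applies the pattern from the question and
    // returns capture group 1, or null when there is no match.
    public static string ExtractThumbs(string html)
    {
        Match m = Regex.Match(html, @"<div[^>]*id=""thumbs"">(.+?)</div>");
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        // Assumed sample input: the "thumbs" div contains two nested divs.
        string html = "<div id=\"thumbs\"><div class=\"a\">one</div>"
                    + "<div class=\"b\">two</div></div>";

        // The lazy (.+?) stops at the first </div>, which closes the inner
        // div, so only part of the outer div's content is captured.
        Console.WriteLine(ExtractThumbs(html));
    }
}
```

With this input the capture ends at the first closing tag, so the result is `<div class="a">one` rather than the full content of the outer div.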

You can retrieve it with HtmlAgilityPack using this code:

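using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);

// This XPath selects every div whose id is "thumbs".
// SelectNodes returns null when nothing matches, so guard against that.
var nodes = doc.DocumentNode.SelectNodes("//div[@id='thumbs']");
var itemList = nodes == null
    ? new List<string>()
    : nodes.Select(p => p.InnerText).ToList();

// itemList now contains the text content of every div whose id is "thumbs".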

No, I don't think he needs backslash escapes. He has @ in front of the pattern, and inside a verbatim string "" stands for a single " character. So the pattern the regex engine actually receives is:

<div[^>]*id="thumbs">(.+?)</div>

So the pattern itself contains no doubled double quotes.
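
The quoting point can be checked directly. The sketch below compares the verbatim literal from the question with an ordinary escaped literal. It also tries one possible cause of the failed match, which is an assumption because the question does not show the HTML: if the div content spans several lines, `.` will not cross the line breaks unless RegexOptions.Singleline is set. The class name VerbatimQuoteDemo and the sample input are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

class VerbatimQuoteDemo
{
    // The pattern exactly as written in the question (verbatim string, doubled quotes).
    public const string Verbatim = @"<div[^>]*id=""thumbs"">(.+?)</div>";

    // The same pattern as a regular string literal with backslash-escaped quotes.
    public const string Regular = "<div[^>]*id=\"thumbs\">(.+?)</div>";

    // Assumed input (not from the question): the div content spans several lines.
    public const string MultiLineHtml = "<div id=\"thumbs\">\n  <img src=\"a.png\">\n</div>";

    static void Main()
    {
        // Both literals produce the same pattern text.
        Console.WriteLine(Verbatim == Regular);
        Console.WriteLine(Verbatim);

        // Without Singleline, '.' does not match '\n', so (.+?) cannot reach </div>.
        Console.WriteLine(Regex.IsMatch(MultiLineHtml, Verbatim));

        // With Singleline, '.' also matches '\n' and the pattern can match.
        Console.WriteLine(Regex.IsMatch(MultiLineHtml, Verbatim, RegexOptions.Singleline));
    }
}
```

If the real page puts the closing tag on a later line than the opening tag, adding RegexOptions.Singleline would explain the difference between Success being false and true. The nested-div limitation still applies either way.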

Try this: 尝试这个:

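// Balancing groups (depth) track nested <div> tags so the match ends at the matching </div>.
Regex r = new Regex(@"<div\b[^>]*\bid=(?:""|&quot;|&\#34;)thumbs(?:""|&quot;|&\#34;)[^>]*>"
    + @"(?<text>(?>(?!</?div\b).|<div\b[^>]*>(?<depth>)|</div>(?<-depth>))*)"
    + @"(?(depth)(?!))</div>",
    RegexOptions.Singleline);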
