Extract Content from <div class=" "> </div> Tag C# RegEx

I have this code:

string tag = "div";
string pattern = string.Format(@"\<{0}.*?\>(?<tegData>.+?)\<\/{0}\>", tag.Trim());
Regex regex = new Regex(pattern, RegexOptions.ExplicitCapture);
MatchCollection matches = regex.Matches(data);

and I need to get the content between the <div class="in"> .... </div> tags:

   <div class="in">
        <a href="/a/show/7184569" class="mm">ВАЗ 2121</a> <span class="for">за</span>    <span class="price">2 700 $</span></span><br/><span class="year">1990 г.</span><br/><div style="margin: 3px 0 3px 0">1.6 л, бензин, КПП механика, с пробегом, белый, литые диски, тонировка, спойлер, ветровики, противотуманки, Движок после капитального ремонта!</div><div>
     <span style="display:block; padding: 4px 0 0 0;"><span class="region">Костанай</span><span class="adv-phones">, +7 (777) 4464451</span></span>

            <small class="gray air">24 просмотра</small>


            <small class="gray air">13 июня</small>
    </div>
    <div class="selectItem" title="Выбрать" id="fv_sic_7184569">
        <a href="#" class="fav-button" id="fav_7184569">&nbsp;</a>           </div>
</div>

How can I do it? My code doesn't work.

Here's a regex that might extract simple div tags:

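// <div[^>]*>(.+?)</div>

string tag = "div";
string pattern = string.Format(@"<{0}[^>]*>(?<tegData>.+?)</{0}>", tag.Trim());
// Singleline lets '.' match newlines, so content that spans several lines is captured.
Regex regex = new Regex(pattern, RegexOptions.ExplicitCapture | RegexOptions.Singleline);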

However, using RegEx to parse HTML is almost always inappropriate and tends to break on real-world markup. That is because markup languages such as HTML are not regular languages: elements can nest to any depth, and a plain regular expression cannot keep track of that nesting.
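To make the nesting problem concrete, here is a minimal sketch. The class and method names (NestedDivDemo, FirstDivContent) are made up for illustration and are not part of the original answer. It applies the simple lazy pattern to a div that contains another div.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
static class NestedDivDemo
{
    // Returns what the simple lazy pattern captures for the first <div>,
    // or an empty string when nothing matches.
    public static string FirstDivContent(string html)
    {
        Match m = Regex.Match(
            html,
            @"<div[^>]*>(?<content>.+?)</div>",
            RegexOptions.Singleline | RegexOptions.ExplicitCapture);
        return m.Success ? m.Groups["content"].Value : string.Empty;
    }
}
```

For the input `<div class="in">a<div>b</div>c</div>` the capture is `a<div>b`: the lazy `.+?` stops at the first `</div>` it meets, which belongs to the inner div, so the rest of the outer div is lost.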

That being said, you would be much better off using an XML parser to parse the document or fragment (provided it is well-formed XML) and then extracting what you need. In fact, a forward-only parser would probably even be faster than RegEx.

You should look at the XmlReader class in .NET.
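A minimal sketch of the XmlReader approach might look like the following. The names DivReader and ReadDivByClass are hypothetical, and the sketch assumes the input is well-formed XML. The page fragment in the question is not: it has a stray `</span>`, and the undeclared `&nbsp;` entity would raise an XmlException, so real pages would need cleaning up first.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper, for illustration only.
static class DivReader
{
    // Scans forward through the document and returns the inner markup of the
    // first <div> whose class attribute equals className, or an empty string.
    public static string ReadDivByClass(string xhtml, string className)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xhtml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element
                    && reader.Name == "div"
                    && reader.GetAttribute("class") == className)
                {
                    // ReadInnerXml returns everything up to the matching end tag,
                    // nested elements included.
                    return reader.ReadInnerXml();
                }
            }
        }
        return string.Empty;
    }
}
```

Because the reader follows the element structure, nested div elements inside the target are returned as part of its content instead of cutting it short.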

If it doesn't have to be server-side, you could use some JavaScript to make this happen, such as:

 <script language="javascript">
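     function getData(){
          // getElementsByTagName (plural) returns every <div> in the document
          var divs = document.getElementsByTagName('div');
          var data;
          var x;
          for(x = 0; x < divs.length; x++)
          {
            if(divs[x].className == 'in') 
            {
                data = divs[x].innerHTML;
            }
          }
          // undefined when no <div class="in"> exists
          return data;
     }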
 </script>

To get nested tags, try using this function:

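// Returns every <tag ...>...</tag> element whose name, attributes and content match the given patterns.
public static MatchCollection ParseTag(string str, string tagpat, string argpat, string valpat) {
    if (null == tagpat) tagpat = @"\w+";    // default: any tag name
    if (null == argpat) argpat = @"[^>]*";  // default: any attributes
    // Default content: the balancing group 'nst' counts nested tags of the same name.
    if (null == valpat) valpat = @"(?><\k'tag'\b[^>]*>(?'nst')|</\k'tag'>(?'-nst')|.?)*?(?(nst)(?!))";
    return Regex.Matches(str, @"(?><(?'tag'" + tagpat + @"\b)\s*(?'arg'" + argpat + @")>)(?'val'" + valpat + @")</\k'tag'>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);
}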

The parameters are simple regexes that filter the target tag. Here are some examples:

ParseTag(page, "div", @"id=""content""\s+class=""mw-body""", null);
ParseTag(wikipage, "span", @"class=""bday""", @"\d{4}-\d{2}-\d{2}");

This variant handles opening and closing tags and nested tags of the same type (other nested tags may be broken and are ignored).

The other variant checks nested tags more strictly and does not match if some of them are mis-opened or mis-closed:

if (null == valpat) valpat = @"(?><(?'itag'\w+)\b[^>]*>(?'nst')|</\k'itag'>(?'-nst')|.?)*?(?(nst)(?!))";
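The balancing-group idea behind these patterns may be easier to follow in a stripped-down form. The sketch below is not the ParseTag function above. It is a simplified illustration hard-coded to div, and the names BalancedDiv, OuterContent and depth are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
static class BalancedDiv
{
    private static readonly Regex Pattern = new Regex(@"
        <div\b[^>]*>
        (?<content>
          (?>
              <div\b[^>]*> (?<depth>)   # nested opening tag: push onto 'depth'
            | </div> (?<-depth>)        # nested closing tag: pop, fails when 'depth' is empty
            | (?!</?div\b) .            # any character that does not start a div tag
          )*
          (?(depth)(?!))                # fail if a nested div is still open
        )
        </div>",
        RegexOptions.Singleline
        | RegexOptions.IgnoreCase
        | RegexOptions.IgnorePatternWhitespace
        | RegexOptions.ExplicitCapture);

    // Returns the content of the first balanced <div>...</div>, or an empty string.
    public static string OuterContent(string html)
    {
        Match m = Pattern.Match(html);
        return m.Success ? m.Groups["content"].Value : string.Empty;
    }
}
```

For `<div class="in">a<div>b</div>c</div>` this returns `a<div>b</div>c`: each nested opening tag pushes a capture onto `depth`, each nested closing tag pops one, and the final `</div>` is only accepted once the stack is empty. Balancing groups are specific to the .NET regex engine, so the pattern does not carry over to most other regex flavours.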

It's much easier for me to use XPath. Maybe you will find it useful.

// The style value ends with ';' so that it matches the XPath string below.
textBox2.Text = "<div style=\"padding: 5px; width: 212px;\"><div>more text</div></div>";
string x = "//div[contains(@style,'padding: 5px; width: 212px;')]";
XmlDocument doc = new XmlDocument();
doc.LoadXml(textBox2.Text);

// Query with the XPath expression defined above.
XmlNodeList nodes = doc.SelectNodes(x);
foreach(XmlNode node in nodes)
{
    textBox3.Text = node.InnerXml;
}
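Applied to the question's case, the same XPath approach could select the div by its class attribute. This is only a sketch: XPathDiv and InnerXmlOfDivWithClass are hypothetical names, the input must be well-formed XML for LoadXml to accept it, and className is assumed to contain no single quote because it is pasted directly into the XPath string.

```csharp
using System;
using System.Xml;

// Hypothetical helper, for illustration only.
static class XPathDiv
{
    // Returns the inner markup of the first <div class="..."> with the given
    // class, or an empty string when there is none.
    public static string InnerXmlOfDivWithClass(string xhtml, string className)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);
        // Assumes className contains no single quote.
        XmlNode node = doc.SelectSingleNode("//div[@class='" + className + "']");
        return node == null ? string.Empty : node.InnerXml;
    }
}
```

Calling `XPathDiv.InnerXmlOfDivWithClass(page, "in")` on a well-formed page would return everything between the opening and closing tags of the first matching div.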

The RegEx code that worked for me finds the first inner div:

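// Assumption: s was not defined in the original; here it is the HTML held in textBox2.
string s = textBox2.Text;
// Assumption: the original pattern had no capture group, so Groups[1] was always empty.
// The group added here captures the first inner div.
string r = @"<div style=""padding: 5px; width: 212px;""[^>]*>\s*(<div[^>]*>.*?</div>)";
Regex rg = new Regex(r);

var matches = rg.Matches(s);
foreach (Match m in matches)
{
    textBox3.Text += m.Groups[1].Value;
}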
