
Extract content from a <div class=" "> </div> tag with C# RegEx

I have this code:

string tag = "div";
string pattern = string.Format(@"\<{0}.*?\>(?<tegData>.+?)\<\/{0}\>", tag.Trim());
Regex regex = new Regex(pattern, RegexOptions.ExplicitCapture);
MatchCollection matches = regex.Matches(data);


and I need to get the content between the <div class="in"> ... </div> tags:

   <div class="in">
        <a href="/a/show/7184569" class="mm">ВАЗ 2121</a> <span class="for">за</span>    <span class="price">2 700 $</span></span><br/><span class="year">1990 г.</span><br/><div style="margin: 3px 0 3px 0">1.6 л, бензин, КПП механика, с пробегом, белый, литые диски, тонировка, спойлер, ветровики, противотуманки, Движок после капитального ремонта!</div><div>
     <span style="display:block; padding: 4px 0 0 0;"><span class="region">Костанай</span><span class="adv-phones">, +7 (777) 4464451</span></span>

            <small class="gray air">24 просмотра</small>


            <small class="gray air">13 июня</small>
    </div>
    <div class="selectItem" title="Выбрать" id="fv_sic_7184569">
        <a href="#" class="fav-button" id="fav_7184569">&nbsp;</a>           </div>
</div>

How can I do it? My code doesn't work.

Here's a regex that might extract simple div tags:

// <div[^>]*>(.+?)</div>

string tag = "div";
string pattern = string.Format(@"<{0}[^>]*>(?<tegData>.+?)</{0}>", tag.Trim());

However, using RegEx to parse HTML is almost always inappropriate and tends to break on real-world markup. Markup languages such as HTML are not regular languages, so a plain regular expression cannot, in general, match arbitrarily nested tags.
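
To see the limitation concretely, here is a minimal sketch that runs the pattern above against a nested div. The class name NestedDivDemo, the method FirstMatch, and the sample string are made up for illustration; RegexOptions.Singleline is added so that "." also matches line breaks.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedDivDemo
{
    // Returns the content captured by the simple div pattern, or null if nothing matches.
    public static string FirstMatch(string html)
    {
        Match m = Regex.Match(
            html,
            @"<div[^>]*>(?<tagData>.+?)</div>",
            RegexOptions.Singleline);
        return m.Success ? m.Groups["tagData"].Value : null;
    }

    public static void Main()
    {
        // Hypothetical input: an outer div that contains one nested div.
        string html = "<div class=\"in\">outer <div>inner</div> tail</div>";

        // The lazy quantifier stops at the first </div>, which closes the inner div.
        Console.WriteLine(FirstMatch(html)); // prints: outer <div>inner
    }
}
```

The capture ends at the inner closing tag, so the rest of the outer div (" tail") is lost. This is the failure mode the question runs into with its nested markup.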

That being said, you would be much better off parsing the document or fragment with an XML parser (provided the markup is well-formed XML) and then extracting what you need. A forward-only parser would probably even be faster than RegEx.

You should look at the XmlReader class in .NET.
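
A minimal sketch of the forward-only approach with XmlReader might look like the following. It assumes the input is well-formed XML; markup like the sample in the question (a stray </span>, the &nbsp; entity) would make the reader throw an XmlException. The class name DivExtractor, the method InnerXmlByClass, and the sample string are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class DivExtractor
{
    // Returns the inner XML of every <div> whose class attribute equals className.
    // Divs nested inside a matching div are returned as part of its content.
    public static List<string> InnerXmlByClass(string xml, string className)
    {
        var results = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element
                    && reader.Name == "div"
                    && reader.GetAttribute("class") == className)
                {
                    // ReadInnerXml consumes the element and leaves the reader after its end tag.
                    results.Add(reader.ReadInnerXml());
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return results;
    }

    public static void Main()
    {
        // Hypothetical, well-formed input.
        string xml = "<root><div class=\"in\"><a href=\"/a\">x</a><div>inner</div></div>"
                   + "<div class=\"other\">skip</div></root>";
        foreach (string content in InnerXmlByClass(xml, "in"))
        {
            Console.WriteLine(content);
        }
    }
}
```

Because the reader tracks element boundaries itself, the nested div does not cut the result short the way a lazy regex does.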

If it doesn't have to run server-side, you could use some JavaScript to do this, such as:

 <script>
     // Returns the innerHTML of the last <div class="in"> on the page,
     // or undefined if there is none.
     function getData(){
          var divs = document.getElementsByTagName('div');
          var data;
          var x;
          for(x = 0; x < divs.length; x++)
          {
            if(divs[x].className == 'in') 
            {
                data = divs[x].innerHTML;
            }
          }
          return data;
     }
 </script>

To handle nested tags, try this function:

public static MatchCollection ParseTag(string str, string tagpat, string argpat, string valpat) {
    if (null == tagpat) tagpat = @"\w+";
    if (null == argpat) argpat = @"[^>]*";
    if (null == valpat) valpat = @"(?><\k'tag'\b[^>]*>(?'nst')|</\k'tag'>(?'-nst')|.?)*?(?(nst)(?!))";
    return Regex.Matches(str, @"(?><(?'tag'" + tagpat + @"\b)\s*(?'arg'" + argpat + @")>)(?'val'" + valpat + @")</\k'tag'>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);
}

The parameters are simple regexes that filter the target tag. For example:

ParseTag(page, "div", @"id=""content""\s+class=""mw-body""", null);
ParseTag(wikipage, "span", @"class=""bday""", @"\d{4}-\d{2}-\d{2}");

This variant handles opening and closing tags and nested tags of the same type (other nested tags may be malformed and are ignored).
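
The function relies on .NET balancing groups to count nested tags. As a rough illustration of that technique in isolation, here is a simplified sketch hard-coded for div; it is not the pattern used by ParseTag, and the class name BalancedDivDemo, the method InnerOfFirstDiv, the group name 'open', and the sample string are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalancedDivDemo
{
    // Each nested <div ...> pushes onto the 'open' stack and each </div> pops it.
    // A </div> that finds the stack empty ends the loop, so it is the closing tag
    // of the outermost div. (?(open)(?!)) rejects input with unclosed nested divs.
    private const string Pattern =
        @"<div\b[^>]*>" +
        @"(?'val'(?>(?!</?div\b).|<div\b[^>]*>(?'open')|</div>(?'-open'))*(?(open)(?!)))" +
        @"</div>";

    // Returns the content of the first balanced div, or null if none is found.
    public static string InnerOfFirstDiv(string html)
    {
        Match m = Regex.Match(html, Pattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups["val"].Value : null;
    }

    public static void Main()
    {
        // Hypothetical input with one nested div.
        string html = "<div class=\"in\">outer <div>inner</div> tail</div>";
        Console.WriteLine(InnerOfFirstDiv(html)); // prints: outer <div>inner</div> tail
    }
}
```

Unlike the simple lazy pattern, the capture here runs to the closing tag of the outer div. Attribute values that contain ">" or the text "</div>" inside comments or scripts would still confuse it.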

Another variant checks nested tags more strictly and does not match if any of them is opened or closed incorrectly:

if (null == valpat) valpat = @"(?><(?'itag'\w+)\b[^>]*>(?'nst')|</\k'itag'>(?'-nst')|.?)*?(?(nst)(?!))";

It is much easier for me to use XPath. Maybe you will find it useful.

// Requires: using System.Xml;
// The style value ends with ';' so that it matches the XPath expression below.
textBox2.Text = "<div style=\"padding: 5px; width: 212px;\"><div>more text</div></div>";
string x = "//div[contains(@style,'padding: 5px; width: 212px;')]";
XmlDocument doc = new XmlDocument();
doc.LoadXml(textBox2.Text);

// Select the matching nodes with the XPath expression above.
XmlNodeList nodes = doc.SelectNodes(x);
foreach(XmlNode node in nodes)
{
    textBox3.Text = node.InnerXml;
}

The RegEx code that worked for me finds the first inner div:

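// s is the HTML string to search.
// Assumption: the capture group and closing tag are included so that Groups[1] holds the div content.
string r = @"<div style=""padding: 5px; width: 212px;[^>]*>(.+?)</div>";
Regex rg = new Regex(r);

var matches = rg.Matches(s);
foreach (Match m in matches)
{
    textBox3.Text += m.Groups[1].Value;
}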

The technical post webpages of this site follow the CC BY-SA 4.0 protocol.