c# regex parsing

Question

I am trying to parse data out of a very long HTML document. I am pasting only the part I am interested in:

Technical Details

<div class="content">

    <ul style="list-style: disc; padding-left: 25px;">

      <li>1920x1080 Full HD 60p/24p Recording w/7MP still image</li>
      <li>32GB Flash Memory for up to 13 hours (LP mode) of HD recording</li>
      <li>Project your videos on the go anywhere, anytime.</li>
      <li>Wide Angle G lens to capture everything you want.</li>
      <li>Back-illuminated "Exmor R" CMOS sensor for superb low-light video</li>

    </ul>

  <div id="technicalProductFeatures"></div>

I need to start parsing from:

<div class="content">

up to

<ul

and then until

</ul>

I have tried the following regex, but it did not work:

Regex specsRegex = new Regex ("<div class=\"content\">[\\s]*<ul.[\\s]*</ul>");

This gives me no match at all.

One other issue: sometimes there is a blank line between the opening div and ul tags, and sometimes there is not. For example:

<div class="content">
<ul style="list-style: disc; padding-left: 25px;">

or

<div class="content">

<ul style="list-style: disc; padding-left: 25px;">

Thanks for any help.

Answer 1

I wouldn't suggest using regular expressions for this. It's like trying to fix a tire with a hammer. The hammer is a good tool, but it's not for everything.

I'd use Html Agility Pack. It's not clear to me exactly what you're looking to extract, but I'll assume it's the list items. So you'd do something like this:

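var hdoc = new HtmlAgilityPack.HtmlDocument();
hdoc.LoadHtml(YourHtmlGoesHere); // YourHtmlGoesHere: the string holding the page's HTML

// Select the <li> items of the list directly inside <div class="content">, wherever that div sits in the page.
// SelectNodes returns null (not an empty collection) when nothing matches.
var MatchingNodes = hdoc.DocumentNode.SelectNodes("//div[@class='content']/ul/li");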

As you can see, the Html Agility Pack's query syntax is based on XPath, which makes this task much simpler. It's also much more robust: something as silly as nested tags or a comment is not going to throw it off. Those kinds of things can throw off even the most carefully written regular expression in this scenario.

UPDATE

If you were determined to create a quick & dirty regular expression for this, it'd be something like this ...

<div class="content">.*?</ul>

Ordinarily, . matches any character except a line feed, and .*? repeats it zero or more times, as few times as possible. So be sure to use RegexOptions.Singleline so that . will match line feeds as well. This should work for the example you've given, but a commented-out bit of markup containing </ul> could throw it off, and so could a nested <ul></ul>, because the match stops at the first </ul> it finds.
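As a minimal sketch, the quick & dirty pattern could be applied as below. The names SpecsBlockDemo and ExtractBlock are made up for illustration, and html stands for whatever string holds your page. It also helps to see why the pattern in the question found nothing: after <ul, the single . consumes only one character and [\s]* then allows only whitespace, so the rest of the tag and the list items can never be matched.

```csharp
using System.Text.RegularExpressions;

public static class SpecsBlockDemo
{
    // Hypothetical helper: returns the text from <div class="content"> through
    // the first </ul>, or an empty string when the pattern does not match.
    public static string ExtractBlock(string html)
    {
        // Singleline lets '.' match '\n', so the match can span several lines.
        var specsRegex = new Regex("<div class=\"content\">.*?</ul>",
                                   RegexOptions.Singleline);
        Match m = specsRegex.Match(html);
        return m.Success ? m.Value : string.Empty;
    }
}
```

Calling SpecsBlockDemo.ExtractBlock(html) on the markup from the question should return the block starting at the div and ending at the closing </ul>, whether or not there is a blank line between the two opening tags.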

UPDATE #2

This will grab everything between the opening <ul ...> tag and the closing </ul>:

(?<=<div class="content">\s*<ul[^>]*>).*?(?=</ul>)

Again, be sure to use RegexOptions.Singleline.
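Here is a rough sketch of how that lookaround pattern might be used to pull out the individual list items. The names SpecsListDemo and ExtractItems are hypothetical, and the second pattern for <li> elements is my own addition; it assumes flat list items with no nested lists inside them.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SpecsListDemo
{
    // Hypothetical helper: returns the text of each <li> inside the <ul>
    // that directly follows <div class="content">.
    public static List<string> ExtractItems(string html)
    {
        var items = new List<string>();

        // .NET allows variable-length lookbehind, so \s* and [^>]* are fine here.
        Match list = Regex.Match(
            html,
            "(?<=<div class=\"content\">\\s*<ul[^>]*>).*?(?=</ul>)",
            RegexOptions.Singleline);
        if (!list.Success)
        {
            return items; // no matching block: return an empty list
        }

        // Assumption: flat <li> elements, no nested lists inside them.
        foreach (Match li in Regex.Matches(list.Value, "<li[^>]*>(.*?)</li>",
                                           RegexOptions.Singleline))
        {
            items.Add(li.Groups[1].Value.Trim());
        }
        return items;
    }
}
```

Because \s* sits inside the lookbehind, this handles both layouts from the question: the <ul> directly on the next line, or separated from the div by a blank line.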

Answer 2

Regex isn't the best tool for parsing HTML (to put it mildly). Use HtmlAgilityPack.

Answer 1: 3 ACCEPTED 2011-10-03 14:31:38

Answer 2: 2 2011-10-03 14:24:01
