Extract data from an HTML snippet using Regex in .NET

Question

Here is a sample:

<tr>
  <td>
    <div class="VBChap"></div>
    <a href="/testing/1">Sample Textbook Chapter 1</a> : Introduction to VB.net
  </td>
  <td>09/24/2013</td>
</tr>

The document basically consists of these entries repeated over and over.

I would like to extract the following:

1. The partial URL after href="
2. The chapter text
3. The chapter name
4. The date

Currently I am using two separate queries to get the data.

Query 1:

(?<=^|>)[^><]+?(?=<|$)

This extracts 2, 3 and 4.

Query 2:

(?<=<a href=")[^"]+

This extracts 1.

I want a single query that can extract all four.

Regex is something I am not good at. It took me 2 hours of trial and error to get this.

Answer 1

Regex and HTML together are a pain. If you have the scope to use it, the HTML Agility Pack is what you want. I wrote a quick intro to its use a couple of years ago.
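To make the suggestion concrete, here is a minimal sketch of pulling the four values out with the HTML Agility Pack. It assumes the HtmlAgilityPack NuGet package is referenced and that every row looks like the sample in the question. The class name, the Extract method and the tuple field names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

static class HapExample
{
    // The sample row from the question.
    public const string Sample = @"<tr>
  <td>
    <div class=""VBChap""></div>
    <a href=""/testing/1"">Sample Textbook Chapter 1</a> : Introduction to VB.net
  </td>
  <td>09/24/2013</td>
</tr>";

    // Hypothetical helper: returns one tuple per <tr> that contains a link.
    public static List<(string Url, string Chapter, string Name, string Date)> Extract(string html)
    {
        var result = new List<(string Url, string Chapter, string Name, string Date)>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty list, when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//tr");
        if (rows == null) return result;

        foreach (var row in rows)
        {
            var link = row.SelectSingleNode("td/a");
            var dateCell = row.SelectSingleNode("td[2]");
            if (link == null || dateCell == null) continue;

            // The chapter name is the text node right after the link: " : Introduction ..."
            string name = (link.NextSibling?.InnerText ?? "").Trim().TrimStart(':').Trim();

            result.Add((link.GetAttributeValue("href", ""),
                        link.InnerText.Trim(),
                        name,
                        dateCell.InnerText.Trim()));
        }
        return result;
    }

    static void Main()
    {
        foreach (var r in Extract(Sample))
            Console.WriteLine($"{r.Url} | {r.Chapter} | {r.Name} | {r.Date}");
    }
}
```

Unlike a regex, this keeps working if attributes are reordered or extra whitespace shows up, since the parser deals with the markup.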

Answer 2

If the HTML in question is valid XHTML, you can parse it as XML, for which there is extensive support under System.Xml.

You could then query with XPath:

...SelectSingleNode("//tr/td/a/@href").Value

and so on!

Most HTML on the internet is not valid XHTML, however, in which case HAP is very pleasant to use (and still allows querying by XPath, should you so choose).
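As a rough sketch of the XML route, the sample row from the question happens to be well-formed, so XmlDocument can load it directly. This assumes the real document is well-formed too and has no default XML namespace (with one, the XPath queries would need an XmlNamespaceManager). The class, method and tuple field names are only illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class XPathExample
{
    // The sample row from the question; it is well-formed XML as written.
    public const string Sample = @"<tr>
  <td>
    <div class=""VBChap""></div>
    <a href=""/testing/1"">Sample Textbook Chapter 1</a> : Introduction to VB.net
  </td>
  <td>09/24/2013</td>
</tr>";

    // Hypothetical helper: returns one tuple per <tr> that contains a link.
    public static List<(string Url, string Chapter, string Name, string Date)> Extract(string xhtml)
    {
        var result = new List<(string Url, string Chapter, string Name, string Date)>();
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        foreach (XmlNode row in doc.SelectNodes("//tr"))
        {
            XmlNode link = row.SelectSingleNode("td/a");
            XmlNode dateCell = row.SelectSingleNode("td[2]");
            if (link == null || dateCell == null) continue;

            // The chapter name is the text node that follows the <a> element.
            string name = (link.NextSibling?.Value ?? "").Trim().TrimStart(':').Trim();

            result.Add((link.Attributes["href"]?.Value ?? "",
                        link.InnerText.Trim(),
                        name,
                        dateCell.InnerText.Trim()));
        }
        return result;
    }

    static void Main()
    {
        foreach (var r in Extract(Sample))
            Console.WriteLine($"{r.Url} | {r.Chapter} | {r.Name} | {r.Date}");
    }
}
```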

Answer 3

Consider the following Regex...

((?<=href\=\").*?(?=\")|(?<=href\=\".*?\"\>).*?(?=\<)|(?<=\</.*?\>)[\s\S]*?(?=\<)|(?<=\<td\>).*?(?=</td\>))

Good Luck!
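If you do stay with Regex, another option is a single pattern with named groups, so that each match is one whole row and the four values come out by name. This is not the pattern above but a different sketch, and it assumes every row has exactly the shape of the sample in the question (href as the first attribute of the link, a colon after the link, the date in the very next td). The class, method and group names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ChapterRegex
{
    // The sample row from the question.
    public const string Sample = @"<tr>
  <td>
    <div class=""VBChap""></div>
    <a href=""/testing/1"">Sample Textbook Chapter 1</a> : Introduction to VB.net
  </td>
  <td>09/24/2013</td>
</tr>";

    // One match per row; each named group captures one of the four values.
    static readonly Regex Row = new Regex(
        @"<a\s+href=""(?<url>[^""]+)""[^>]*>(?<chapter>[^<]+)</a>" + // link target and link text
        @"\s*:\s*(?<name>[^<]+?)\s*</td>" +                          // text after the colon
        @"\s*<td>(?<date>[^<]+)</td>",                               // contents of the next cell
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns one tuple per matched row.
    public static List<(string Url, string Chapter, string Name, string Date)> Extract(string html)
    {
        var result = new List<(string Url, string Chapter, string Name, string Date)>();
        foreach (Match m in Row.Matches(html))
        {
            result.Add((m.Groups["url"].Value,
                        m.Groups["chapter"].Value.Trim(),
                        m.Groups["name"].Value.Trim(),
                        m.Groups["date"].Value.Trim()));
        }
        return result;
    }

    static void Main()
    {
        foreach (var r in Extract(Sample))
            Console.WriteLine($"{r.Url} | {r.Chapter} | {r.Name} | {r.Date}");
    }
}
```

Named groups keep the four values tied to the row they came from, which an alternation that returns each value as a separate match does not do.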

3 answers

solution1
1 2013-11-21 09:57:16

solution2
0 2013-11-21 10:56:37

solution3
0 ACCEPTED 2013-11-21 23:13:45
