Extract data from an HTML snippet using Regex in .NET

Here is a sample:

<tr>
  <td>
    <div class="VBChap"></div>
    <a href="/testing/1">Sample Textbook Chapter 1</a> : Introduction to VB.net
  </td>
  <td>09/24/2013</td>
</tr>

The document consists of entries like this one, repeated over and over.

I would like to extract the following:

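  1. The partial URL after href=" (here, /testing/1).
  2. The chapter text (here, Sample Textbook Chapter 1).
  3. The chapter name (here, Introduction to VB.net).
  4. The date (here, 09/24/2013).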

Currently I am using two separate queries to get the data.

Query 1:

(?<=^|>)[^><]+?(?=<|$)

This extracts 2, 3 and 4.

Query 2:

(?<=<a href=")[^"]+

This extracts 1.

I want a single query that can extract all four.

Regex is something I am not good at. It took me 2 hours of trial and error to get this.

Regex and HTML together are a pain. If you have the scope to use it, the HTML Agility Pack (HAP) is what you want. I wrote a quick intro to its use a couple of years ago.

If the HTML in question is valid XHTML, you can parse it as XML, for which there is extensive support under System.Xml.

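You could then query with XPath. SelectNodes returns a node list, so read .Value from each node in it, or use SelectSingleNode for the first match:

...SelectSingleNode("//tr/td/a/@href").Value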

and so on!
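
As a rough illustration of the parse-it-as-XML idea, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree rather than System.Xml, so it shows the approach and not the .NET API. The surrounding <table> wrapper and the extract_rows helper are assumptions made for the example; the row itself is the sample from the question.

```python
import xml.etree.ElementTree as ET

# The sample row from the question, wrapped in a <table> root so the
# fragment forms one well-formed XML document (the wrapper is an assumption).
SAMPLE = """<table>
<tr>
  <td>
    <div class="VBChap"></div>
    <a href="/testing/1">Sample Textbook Chapter 1</a> : Introduction to VB.net
  </td>
  <td>09/24/2013</td>
</tr>
</table>"""


def extract_rows(xml_text):
    """Return one (href, chapter_text, chapter_name, date) tuple per <tr>."""
    root = ET.fromstring(xml_text)
    rows = []
    for tr in root.iter("tr"):
        cells = tr.findall("td")
        link = cells[0].find("a") if cells else None
        if link is None or len(cells) < 2:
            continue  # skip rows that do not follow the expected layout
        # The chapter name is the text that trails the </a> tag, after the colon.
        name = (link.tail or "").strip().lstrip(":").strip()
        rows.append((
            link.get("href"),
            (link.text or "").strip(),
            name,
            (cells[1].text or "").strip(),
        ))
    return rows


rows = extract_rows(SAMPLE)
print(rows)
# [('/testing/1', 'Sample Textbook Chapter 1', 'Introduction to VB.net', '09/24/2013')]
```

This only works while the markup stays well-formed; a single unclosed tag makes the XML parser raise an error, which is the case the next paragraph addresses.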

Most HTML on the internet is not valid XHTML, however. In that case HAP is very pleasant to use, and it still allows querying by XPath should you so choose.

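Consider the following regex, which uses one alternative per value and relies on the variable-length lookbehind that the .NET regex engine supports: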

((?<=href\=\").*?(?=\")|(?<=href\=\".*?\"\>).*?(?=\<)|(?<=\</.*?\>)[\s\S]*?(?=\<)|(?<=\<td\>).*?(?=</td\>))
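
A different way to get all four values from one query is a single pattern with four capture groups, so each match corresponds to one table row instead of the values arriving as a flat list of separate matches. The sketch below is in Python for the sake of a runnable example. ROW_PATTERN is a name invented here, and the pattern assumes every entry looks like the sample in the question. It uses only numbered groups and no lookbehind, so the same pattern text should also be usable with .NET's Regex class, reading Groups[1] to Groups[4] on each match.

```python
import re

# One capture group per requested value; assumes rows shaped like the sample.
ROW_PATTERN = re.compile(
    r'<a\s+href="([^"]+)"\s*>'      # group 1: partial URL
    r'([^<]*)</a>'                  # group 2: chapter text (the link text)
    r'\s*:?\s*([^<]*?)\s*</td>'     # group 3: chapter name (after the colon)
    r'\s*<td>\s*([^<]*?)\s*</td>'   # group 4: date (the next cell)
)

SAMPLE = """<tr>
  <td>
    <div class="VBChap"></div>
    <a href="/testing/1">Sample Textbook Chapter 1</a> : Introduction to VB.net
  </td>
  <td>09/24/2013</td>
</tr>"""

# findall returns one tuple of the four groups for every row it matches.
matches = ROW_PATTERN.findall(SAMPLE)
print(matches)
# [('/testing/1', 'Sample Textbook Chapter 1', 'Introduction to VB.net', '09/24/2013')]
```

The pattern is deliberately strict: extra attributes on the <a> or <td> tags, or nested markup inside the link, would make a row fail to match, which is the usual trade-off of regex against HTML.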

Good Luck!
