
Parsing HTML page in ASP.NET

I'm trying to parse the HTML of an external page and read its contents (e.g. get the "title" element from google.com). XmlDataSource does not appear to work because the page is not clean XML. Does anybody know how to do this?

Thank you.

You should use the Html Agility Pack.
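
As a rough illustration, reading the title with the Html Agility Pack might look like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced and that the page HTML has already been downloaded into a string. The `TitleReader` class and `GetTitle` method are hypothetical names used only for this example.

```csharp
using HtmlAgilityPack;

public static class TitleReader
{
    // Hypothetical helper: returns the text of the first <title> element,
    // or null when the document has none.
    public static string GetTitle(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of malformed, non-XML markup

        // SelectSingleNode returns null when the XPath matches nothing.
        HtmlNode titleNode = doc.DocumentNode.SelectSingleNode("//title");
        if (titleNode == null)
        {
            return null;
        }

        // Decode entities such as &amp; and trim surrounding whitespace.
        return HtmlEntity.DeEntitize(titleNode.InnerText).Trim();
    }
}
```

Because the parser does not require well-formed XML, this approach avoids the problem described for XmlDataSource.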

If it's something simple, can you just do some basic string parsing? It's not the most efficient approach, but it works well enough.

First get your html (in case this is part of what you needed):

// WebClient lives in System.Net; strURL is the address of the page to fetch.
WebClient client = new WebClient();
string webhtml = client.DownloadString(strURL);

If you have a repeating pattern, you can then use .Split to divide it up.

Now just use .IndexOf (or .LastIndexOf) and .Substring to parse as needed. If you need to do this a lot, or iteratively, you can create a function that takes the html and the start and end delimiters, plus a few other parameters as needed. You'll need to offset the start position by adding the length of the start delimiter to its index, but otherwise it's fairly straightforward.
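
A minimal sketch of such a function is shown here. The `HtmlText` class, the `Between` method, and its parameters are hypothetical names chosen for illustration, and the case-insensitive default is an assumption that suits HTML tags.

```csharp
using System;

public static class HtmlText
{
    // Returns the text between the first occurrence of `start` and the next
    // occurrence of `end` after it, or null when either delimiter is missing.
    public static string Between(string html, string start, string end,
        StringComparison comparison = StringComparison.OrdinalIgnoreCase)
    {
        int startIndex = html.IndexOf(start, comparison);
        if (startIndex < 0)
        {
            return null;
        }

        // Offset past the start delimiter itself.
        startIndex += start.Length;

        int endIndex = html.IndexOf(end, startIndex, comparison);
        if (endIndex < 0)
        {
            return null;
        }

        return html.Substring(startIndex, endIndex - startIndex);
    }
}
```

For example, `HtmlText.Between(webhtml, "<title>", "</title>")` would return the page title when the opening tag has no attributes. A tag written as `<title lang="en">` would not match the start delimiter, which is the main limitation of plain string parsing.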

Use Sgml Reader (http://sourceforge.net/projects/dekiwiki/files/SgmlReader/) if you are interested in treating HTML like XML for parsing. While this may be overkill for getting the title, it will be faster than other similar methods when parsing large HTML pages.
