Best way to read and parse a text file in C#

I have a text file that contains HTML code, and I want to extract only specific tags and save them using C#.

I was thinking of doing it with a few lines of Regex. Is that the best and easiest way to do it, or is there an easier function in C# for this?

Using Regex is probably not the best way to do this; in fact, I would say it's one of the many "bad" ideas you could come up with.

You might want to look into the HTML Agility Pack: it parses the HTML and builds a tree of nodes that you can navigate, so you can pick out the tags you're interested in without any "crazy" regex. You'll save yourself a lot of trouble if you avoid regex, since HTML as it is found in the wild can be poor, nasty and brutish, though quite often far from short.
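As a rough sketch of what that looks like, the snippet below loads an HTML string into an `HtmlDocument` and collects every element with a given name. It assumes the HtmlAgilityPack NuGet package is installed; the class name `TagExtractor` and the method `ExtractTags` are made up for illustration and are not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

// Hypothetical helper, named here only for illustration.
public static class TagExtractor
{
    // Returns the outer HTML of every element called tagName (e.g. "a" or "p").
    // tagName is expected in lower case, which is how the parser stores element names.
    public static List<string> ExtractTags(string html, string tagName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant parser: does not require well-formed markup

        // Descendants walks the whole node tree and yields the matching elements.
        return doc.DocumentNode
                  .Descendants(tagName)
                  .Select(node => node.OuterHtml)
                  .ToList();
    }
}
```

To work from a file, you could pass `File.ReadAllText(path)` as the `html` argument and write the result back out with `File.WriteAllLines`; the file names are up to you.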

Regex can work, but you have to be very careful. HTML is not a "regular language", so there are free-form exceptions that can throw a pattern off. You also have to be careful with matching across line breaks. It can be done, though.
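To make those caveats concrete, here is a minimal sketch of a regex that pulls the text out of `<a>` elements. The pattern and the names `RegexTagDemo` and `AnchorTexts` are illustrative assumptions, not a recommended general-purpose solution; it only behaves on simple, non-nested markup.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named here only for illustration.
public static class RegexTagDemo
{
    // Singleline lets '.' match newlines, so a tag may span several lines.
    // IgnoreCase covers <A> as well as <a>; \b keeps <abbr> from matching.
    private static readonly Regex Anchor = new Regex(
        @"<a\b[^>]*>(.*?)</a>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the trimmed inner text of each <a>...</a> match.
    public static List<string> AnchorTexts(string html)
    {
        var result = new List<string>();
        foreach (Match m in Anchor.Matches(html))
        {
            // Group 1 is the lazily matched content between the tags.
            result.Add(m.Groups[1].Value.Trim());
        }
        return result;
    }
}
```

Without `RegexOptions.Singleline`, a link whose text wraps onto a second line would be skipped. The lazy `.*?` also stops at the first closing tag it sees, so an element nested inside another element of the same name gets cut short, which is the kind of case where a real parser is safer.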

Look into: http://htmlagilitypack.codeplex.com/

If the HTML is well-formed, you could try reading it with an XML parser and using the methods it provides. Fortunately, there are tools available in the framework itself to do this. Look into using LINQ to XML to make your job as simple as possible.

Otherwise, if it is not well-formed, you could use a third-party tool such as HTML Agility Pack to parse it.
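A minimal sketch of the LINQ to XML route, assuming the input really is well-formed XML (XHTML-style markup, every tag closed). The names `XmlTagExtractor` and `ExtractTags` are hypothetical; `XDocument` and `Descendants` come from `System.Xml.Linq`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, named here only for illustration.
public static class XmlTagExtractor
{
    // Returns the serialized form of every element whose local name is tagName.
    public static List<string> ExtractTags(string wellFormedHtml, string tagName)
    {
        // XDocument.Parse throws System.Xml.XmlException if the markup is not well-formed.
        XDocument doc = XDocument.Parse(wellFormedHtml);

        // Comparing LocalName ignores any XML namespace (such as the XHTML one).
        return doc.Descendants()
                  .Where(e => e.Name.LocalName == tagName)
                  .Select(e => e.ToString(SaveOptions.DisableFormatting))
                  .ToList();
    }
}
```

Typical real-world HTML, with unclosed `<br>` tags or mismatched elements, makes `XDocument.Parse` throw, so it is worth wrapping the call in a `try`/`catch` and falling back to a tolerant HTML parser in that case.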

Using regex to parse HTML has been covered at length on SO. The consensus is that it should not be done. Give this post a read to understand why:

RegEx match open tags except XHTML self-contained tags

In the past I have used SGML reader to convert HTML to XML and then used XPath/XSLT/LINQ to XML to parse it. This might work for you as well.

Two options:

1) Write your own loop.

2) Use regex for much better matching and error handling (you'll get matched groups from your regex), and then iterate over each of the items inside them.
