
Need help parsing HTML in C#

For personal use, I am trying to parse a small HTML page that shows the results of the French soccer championship in a simple grid.

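using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

var url = "http://www.lfp.fr/mobile/ligue1/resultat.asp?code_jr_tr=J01";
WebRequest req = WebRequest.Create(url);
Encoding encode = Encoding.GetEncoding(0); // 0 selects the system default encoding

using (WebResponse result = req.GetResponse())
using (Stream receiveStream = result.GetResponseStream())
using (StreamReader sr = new StreamReader(receiveStream, encode))
{
    string line;
    // ReadLine() returns null at the end of the stream. Testing sr.Read() here
    // instead would consume one character before every ReadLine() call.
    while ((line = sr.ReadLine()) != null)
    {
        line = Regex.Replace(line, @"<(.|\n)*?>", " "); // replace each tag with a space
        line = line.Replace("&nbsp;", "");
        line = line.Trim();
        // line now holds one source line with its tags removed
    }
}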

After that I really don't have a clue: should I process the response line by line or read the whole stream at once, and how do I retrieve only each team's name together with the number that follows it, which is the score?

In the end I want to put the two teams and their scores into a list or XML so that I can use them in a phone application.

If anyone has an idea, that would be great. Thanks!

You could load the stream into an XmlDocument, which lets you query it with something like XPath. Or you could use LINQ to XML with an XDocument.

It's not perfect though, because HTML files aren't always well-formed XML (don't we know it!), but it's a simple solution using stuff already available in the framework.
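As a minimal sketch of the LINQ to XML route, assuming the page happens to be well-formed XML: the markup below is a made-up sample (the real page's structure is not shown in the question), with one table row per match and four cells for home team, home score, away score and away team. The names XmlResultParser and ParseRows are only illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XmlResultParser
{
    // Assumed (hypothetical) layout: each <tr> holds exactly four <td> cells:
    // home team, home score, away score, away team.
    public static List<string[]> ParseRows(string xhtml)
    {
        // XDocument.Parse throws XmlException if the markup is not well-formed XML.
        XDocument doc = XDocument.Parse(xhtml);
        return doc.Descendants("tr")
                  .Select(tr => tr.Elements("td").Select(td => td.Value.Trim()).ToArray())
                  .Where(cells => cells.Length == 4) // skip header or spacer rows
                  .ToList();
    }

    public static void Main()
    {
        // Made-up sample data, not real results.
        const string sample =
            "<table>" +
            "<tr><td>Team A</td><td>2</td><td>1</td><td>Team B</td></tr>" +
            "<tr><td>Team C</td><td>0</td><td>0</td><td>Team D</td></tr>" +
            "</table>";

        foreach (string[] row in ParseRows(sample))
            Console.WriteLine($"{row[0]} {row[1]} - {row[2]} {row[3]}");
    }
}
```

If the real document declares a default namespace (XHTML pages usually do), Descendants("tr") will not match anything; the element names then have to be combined with an XNamespace.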

You will need an SgmlReader, which provides an XML-like API over any SGML document (in practice, HTML documents).
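A rough sketch of how that could look, assuming the third-party SgmlReader package is referenced (it is not part of the framework) and exposes the Sgml.SgmlReader class with its DocType, CaseFolding and InputStream properties. HtmlLoader is a hypothetical helper name.

```csharp
using System.IO;
using System.Xml.Linq;
using Sgml; // third-party SgmlReader package (assumption: installed separately)

public static class HtmlLoader
{
    // Reads possibly malformed HTML and returns it as an XDocument.
    public static XDocument Load(TextReader html)
    {
        var sgml = new SgmlReader
        {
            DocType = "HTML",                  // use the built-in HTML DTD
            CaseFolding = CaseFolding.ToLower, // normalize tag names to lower case
            InputStream = html
        };
        // SgmlReader derives from XmlReader, so XDocument can load from it directly.
        return XDocument.Load(sgml);
    }
}
```

The StreamReader from the question could be passed straight to HtmlLoader.Load, and the resulting document queried with XPath or LINQ to XML.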

You could use the Regex.Match method to pull out the team name and score. Examine the HTML to see how each row is built up, then write a pattern that matches that structure. This is a common technique in screen scraping.
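For illustration, here is a sketch of the Regex.Match approach under a strong assumption: once the tags are stripped, each result line looks like "Team A 2 - 1 Team B" (or the same without the dash), and team names contain no digits. The real page may be laid out differently, so the pattern and the ScoreLine helper are hypothetical and would need adjusting.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScoreLine
{
    // home name, home score, separator (dash or whitespace), away score, away name
    private static readonly Regex Pattern = new Regex(
        @"^(?<home>\D+?)\s+(?<homeScore>\d+)(?:\s*-\s*|\s+)(?<awayScore>\d+)\s+(?<away>\D+)$");

    public static bool TryParse(string line, out string home, out int homeScore,
                                out int awayScore, out string away)
    {
        home = away = string.Empty;
        homeScore = awayScore = 0;

        Match m = Pattern.Match(line.Trim());
        if (!m.Success) return false; // not a result line

        home = m.Groups["home"].Value;
        homeScore = int.Parse(m.Groups["homeScore"].Value);
        awayScore = int.Parse(m.Groups["awayScore"].Value);
        away = m.Groups["away"].Value;
        return true;
    }

    public static void Main()
    {
        // Made-up sample line, not a real result.
        if (TryParse("Team A  2 - 1  Team B", out string home, out int hs,
                     out int aws, out string away))
        {
            Console.WriteLine($"{home} {hs} - {aws} {away}");
        }
    }
}
```

Lines that do not match are simply skipped, so the same call can be made for every line read from the stream.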
