
HTML Content Parsing

I have two pieces of code for getting the number of characters inside templates. The first one is:

string html = this.GetHTMLContent(url);

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

StringBuilder sb = new StringBuilder();
foreach (HtmlTextNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    sb.AppendLine(node.InnerText);
}
string final = sb.ToString();
int length = final.Length;

And the second one is:

var length = doc.DocumentNode.SelectNodes("//text()")
                .Where(x => x.NodeType == HtmlNodeType.Text)
                .Select(x => x.InnerText.Length)
                .Sum();

When I run them, the two snippets return different results.

I finally identified the problem: inside the loop I used AppendLine() instead of Append(), so a line terminator was appended on every iteration. Those extra newline characters were counted as characters too, which is why the first snippet reported a larger total than the second.
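To make the difference concrete, here is a minimal sketch that leaves HtmlAgilityPack out and uses a hard-coded array as a stand-in for the InnerText values of the text nodes. The class name AppendDemo, the method names, and the sample strings are all made up for illustration; only StringBuilder.Append, StringBuilder.AppendLine, and Environment.NewLine are real API.

```csharp
using System;
using System.Linq;
using System.Text;

public static class AppendDemo
{
    // Hypothetical stand-ins for the InnerText of three text nodes.
    private static readonly string[] Texts = { "Hello", " ", "world" };

    // Mirrors the first snippet: AppendLine adds Environment.NewLine
    // after every piece, so the total includes the line terminators.
    public static int LengthWithAppendLine()
    {
        var sb = new StringBuilder();
        foreach (var text in Texts)
        {
            sb.AppendLine(text);
        }
        return sb.ToString().Length;
    }

    // The fix: Append adds only the text itself.
    public static int LengthWithAppend()
    {
        var sb = new StringBuilder();
        foreach (var text in Texts)
        {
            sb.Append(text);
        }
        return sb.ToString().Length;
    }

    // Mirrors the second snippet: sum the individual lengths.
    public static int LengthWithSum()
    {
        return Texts.Sum(t => t.Length);
    }

    public static void Main()
    {
        Console.WriteLine("AppendLine: " + LengthWithAppendLine());
        Console.WriteLine("Append:     " + LengthWithAppend());
        Console.WriteLine("Sum:        " + LengthWithSum());
    }
}
```

With Append the total matches the LINQ sum. With AppendLine it is larger by the number of nodes times Environment.NewLine.Length, which is 2 characters per node on Windows ("\r\n") and 1 on Unix-like systems ("\n").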


 