
C# HTMLAgilityPack VS regular expressions for extracting links from HTML

Original 2017-04-28 10:13:56 4 1 c#/ regex/ html-parsing/ html-agility-pack

I'm writing a C# web crawler, and when I profile it I can see that HTMLAgilityPack's LoadHtml method accounts for 10% of the program's overall CPU usage. I'd like to try and lower this.

I'm sure a regular expression would be faster, but when I look at link-extraction examples on SO I see everyone saying this method should be avoided in favour of an HTML parser like HTMLAgilityPack.

As all I need to do is extract links from HTML, is using HTMLAgilityPack overkill?

Are the reasons for favouring a HTML parser applicable to my case as I'm only using it for extracting links?

Update: I downloaded the HTML with WebClient and then compared the two approaches.

Using href\s*=\s*(?:["'](?<1>[^"']*)["']|(?<1>\S+)) (then trimming each match and adding it to a list) is way faster than HTMLAgilityPack.

Consistently 43 milliseconds for HTMLAgilityPack compared to 3 for the regex.
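The regex approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from the pastebin link: the class name RegexLinkExtractor, the method name ExtractLinks, and the choice of RegexOptions are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexLinkExtractor
{
    // Pattern from the question: the href value is either quoted
    // (single or double quotes) or a bare run of non-whitespace.
    // IgnoreCase and Compiled are assumptions, not taken from the question.
    private static readonly Regex HrefPattern = new Regex(
        @"href\s*=\s*(?:[""'](?<1>[^""']*)[""']|(?<1>\S+))",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns every non-empty href value found in the input, trimmed.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            string value = match.Groups[1].Value.Trim();
            if (value.Length > 0)
            {
                links.Add(value);
            }
        }
        return links;
    }
}
```

Note that this matches any href attribute, including those on link elements, and that the unquoted branch captures everything up to the next whitespace, so an unquoted value directly followed by > will include the trailing markup.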

See my code on pastebin

1 answer

Are the reasons for favouring a HTML parser applicable to my case as I'm only using it for extracting links?

In your case the HTML parser is overkill, as your tests have shown.

People who answer on SO give that as a rote response to every regex question. A parser is the right tool when one actually needs to parse the HTML itself in a more robust fashion.

Bias against regular expressions is common among people who feel that they are too slow or too cumbersome to learn. There is some merit to that view for certain operations, in that specialized, optimized text-searching utilities do perform better. I agree with that, but dismissing regex out of hand is par for the course on StackOverflow.

Why is that? Sometimes the analysis is simply flawed because the pattern under test introduces a lot of unnecessary backtracking and is not optimized. That handicaps regex out of the gate. One has to learn the regex language and understand what the engine is doing in order to tune a pattern so that it does no wasted work.

For example, I took your same C# test code, and with an optimized version of your pattern and one of my own I was able to get it down to 1 millisecond consistently!
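Timings like these are easy to skew with one-off costs such as JIT compilation and regex construction. A minimal sketch of a timing helper that runs one warm-up call before measuring is shown here; the names TimingHarness and AverageMilliseconds are hypothetical, and this is not the test code referred to above.

```csharp
using System;
using System.Diagnostics;

public static class TimingHarness
{
    // Runs the action once as a warm-up, then returns the average
    // wall-clock time per call in milliseconds over the given iterations.
    public static double AverageMilliseconds(Action action, int iterations)
    {
        if (action == null) throw new ArgumentNullException(nameof(action));
        if (iterations <= 0) throw new ArgumentOutOfRangeException(nameof(iterations));

        action(); // warm-up call, excluded from the measurement

        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        stopwatch.Stop();

        return stopwatch.Elapsed.TotalMilliseconds / iterations;
    }
}
```

Each candidate (the parser call, the original pattern, a tuned pattern) would be passed in as the action against the same downloaded HTML string, so that only the extraction step is being compared.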

Most people learn basic pattern matching by doing searches with a * . When they first learn regex they use * with the . as in .* . That step, along with indiscriminate use of the * , will most likely doom any pattern beyond beginner level to the hell of backtracking and slow responses.
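To make the point about .* concrete, here is a small sketch comparing a greedy .* with a negated character class on the same input. The class and method names are hypothetical. The greedy version first runs to the end of the line and then backs up one character at a time until it finds a closing quote, which costs extra work and also captures too much; the negated class stops at the first closing quote without backing up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVersusNegatedClass
{
    // Greedy: .* consumes the rest of the line, then backtracks
    // to the LAST double quote it can find.
    public static string CaptureWithDotStar(string input)
    {
        return Regex.Match(input, "href=\"(.*)\"").Groups[1].Value;
    }

    // Negated class: [^"]* can never run past the first closing quote,
    // so no backtracking is needed to find the end of the value.
    public static string CaptureWithNegatedClass(string input)
    {
        return Regex.Match(input, "href=\"([^\"]*)\"").Groups[1].Value;
    }
}
```

For the input <a href="a.html" class="nav">A</a>, the first method captures a.html" class="nav while the second captures only a.html.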

Unless you know empirically that zero occurrences are possible, use + (one or more) instead of * (zero or more).
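The difference shows up as soon as matches are counted. In this small sketch (the class and method names are hypothetical), \d* is allowed to match nothing, so the engine reports empty matches at positions that contain no digits, whereas \d+ only reports real runs of digits.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StarVersusPlus
{
    // Counts matches of \d* (zero or more digits); empty matches are included.
    public static int CountWithStar(string input)
    {
        return Regex.Matches(input, @"\d*").Count;
    }

    // Counts matches of \d+ (one or more digits); every match is non-empty.
    public static int CountWithPlus(string input)
    {
        return Regex.Matches(input, @"\d+").Count;
    }
}
```

On an input such as ab12cd, the + version finds the single run 12, while the * version also returns empty matches that the caller then has to filter out.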

Back in 2009 I wrote about this subject on my blog: "Are C# .Net Regular Expressions Fast Enough for You?"




solution1 2 ACCEPTED 2017-05-09 22:49:50
