
C# HTMLAgilityPack vs. regular expressions for extracting links from HTML

Originally posted 2017-04-28 10:13:56. Tags: c#, regex, html-parsing, html-agility-pack

I'm writing a C# web crawler, and when I profile it I can see that HTMLAgilityPack's LoadHtml method accounts for 10% of the program's overall CPU usage. I'd like to try to lower this.

I'm sure a regular expression would be faster, but when I look at link-extraction examples on SO, I see everyone saying this approach should be avoided in favour of an HTML parser like HTMLAgilityPack.

Since all I need to do is extract links from HTML, is using HTMLAgilityPack overkill?

Are the reasons for favouring an HTML parser applicable to my case, given that I'm only using it for extracting links?

Update: I downloaded the HTML with WebClient and then compared the two approaches.

Using the pattern href\s*=\s*(?:["'](?<1>[^"']*)["']|(?<1>\S+)) (then trimming each match and adding it to a list) is way faster than HTMLAgilityPack.

HTMLAgilityPack consistently took 43 milliseconds, compared with 3 milliseconds for the regex.

See my code on pastebin
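For readers without access to the pastebin, the regex approach described above might look roughly like the following. This is a minimal sketch, not the asker's actual code: the class name LinkExtractor, the method name ExtractLinks, and the RegexOptions flags are assumptions made for illustration. Only the pattern itself comes from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the names and options here are illustrative assumptions.
public static class LinkExtractor
{
    // Pattern from the question: "href", optional whitespace around '=',
    // then either a quoted value or an unquoted run of non-whitespace.
    // Both alternatives capture into group 1.
    // IgnoreCase and Compiled are assumptions, not stated in the question.
    private static readonly Regex HrefPattern = new Regex(
        @"href\s*=\s*(?:[""'](?<1>[^""']*)[""']|(?<1>\S+))",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the trimmed value of every href attribute found in the input.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            links.Add(m.Groups[1].Value.Trim());
        }
        return links;
    }
}
```

Calling LinkExtractor.ExtractLinks(html) on a downloaded page returns the href values as strings. Note that the unquoted alternative (\S+) captures everything up to the next whitespace, so an unquoted attribute such as href=c.html>C would include the trailing characters. That is one of the robustness trade-offs of skipping the parser.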

1 Answer

"Are the reasons for favouring an HTML parser applicable to my case, given that I'm only using it for extracting links?"

In your case the HTML parser is overkill, as your tests have shown.

People who answer on SO use "use an HTML parser" as a rote answer to all regex questions. One should use a parser when one actually needs to parse the structure of the HTML in a more robust fashion.

Bias against regular expressions is found among people who feel that they are too slow or too cumbersome to learn. There is some merit in what they propose for certain operations, in that specific, optimized text-finding utilities do perform better. Sure, I agree, but dismissing regex out of hand, well, that is par for the course on StackOverflow.

Why is that? Sometimes the analysis is simply flawed because the pattern provided introduces a lot of unnecessary backtracking and is not optimized. That handicaps regex out of the gate. One does have to learn the regex language and understand what a pattern is doing in order to tune it so that the regex engine does not waste work.

For example, I took your same C# test code, but I used an optimized version of your pattern and one of my own, and was able to get it down to 1 millisecond consistently!

Most people learn basic pattern matching by doing searches with a *. When they first learn regex, they combine the * with the . to form .* and use it everywhere. That habit, along with indiscriminate usage of the *, will most likely doom any non-trivial pattern to the hell of backtracking and slow responses.
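To illustrate the point about .* (this is a sketch for illustration, not the optimized pattern referred to above, which is not reproduced here), compare a greedy .* with a negated character class when capturing a quoted href value. The class name BacktrackingDemo and both patterns are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class; names and patterns are illustrative assumptions.
public static class BacktrackingDemo
{
    // Greedy: .* first runs to the end of the line, then the engine
    // backtracks character by character until it finds a closing quote.
    public const string Greedy = "href=\"(.*)\"";

    // Negated class: [^"]* stops at the first closing quote,
    // so no backtracking is needed to find the end of the value.
    public const string Negated = "href=\"([^\"]*)\"";

    // Returns the contents of capture group 1 for every match.
    public static List<string> Capture(string pattern, string input)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }
}
```

On a single line containing two anchors, the Greedy pattern yields one over-long capture spanning from the first href value to the last quote on the line, while the Negated pattern yields each href value separately. The negated class is both more correct here and gives the engine less work to do.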