Finding links fast: regex vs. lxml

Question

I am trying to build a fast web crawler, so I need an efficient way to locate all the links on a page. How does a fast XML/HTML parser such as lxml compare in performance with regex matching?

Answer 1

The problem here isn't regex vs. lxml. Regex just isn't a solution. How would you restrict which elements the links come from? A more real-world example is malformed HTML. How would you extract the contents of the href attribute from this link?

<A href = /text" data-href='foo>' >Test</a>
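
To make that concrete, here is a minimal sketch that runs a naive double-quoted-href pattern over the link above and, for contrast, a tolerant parser. It uses the standard library's html.parser as a stand-in so it runs without extra installs; it is not lxml, and the AnchorHrefs and parser_hrefs names are made up for illustration.

```python
import re
from html.parser import HTMLParser

# The malformed link from the answer, copied verbatim.
SAMPLE = """<A href = /text" data-href='foo>' >Test</a>"""

# A typical quick pattern: only matches href="..." with no spaces around "=".
NAIVE_HREF = re.compile(r'href="(.*?)"')


class AnchorHrefs(HTMLParser):
    """Collect the href of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names before calling this.
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v is not None)


def parser_hrefs(html):
    collector = AnchorHrefs()
    collector.feed(html)
    collector.close()
    return collector.hrefs


regex_hrefs = NAIVE_HREF.findall(SAMPLE)
print("regex :", regex_hrefs)
print("parser:", parser_hrefs(SAMPLE))
```

The pattern finds nothing here because the href value is unquoted and there are spaces around the equals sign. The parser still reports one href on the anchor. You could keep patching the pattern for each new quirk, but that amounts to writing a parser by hand.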

lxml parses it just fine, just like Chrome, but good luck getting a regex to work. If you're curious about the actual speed differences, here's a quick test I made.

Setup:

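import re

import lxml.html
import requests  # used below to fetch the test page

def test_lxml(html):
    # Parse the whole document, then select the href of every <a> element.
    root = lxml.html.fromstring(html)
    #root.make_links_absolute('http://stackoverflow.com/')

    for href in root.xpath('//a/@href'):
        yield href

# Matches any double-quoted href attribute, on any element.
LINK_REGEX = re.compile(r'href="(.*?)"')

def test_regex(html):
    for href in LINK_REGEX.finditer(html):
        yield href.group(1)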

Test HTML:

html = requests.get('http://stackoverflow.com/questions?pagesize=50').text

Results:

In [22]: %timeit list(test_lxml(html))
100 loops, best of 3: 9.05 ms per loop

In [23]: %timeit list(test_regex(html))
1000 loops, best of 3: 582 us per loop

In [24]: len(list(test_lxml(html)))
Out[24]: 412

In [25]: len(list(test_regex(html)))
Out[25]: 416

For comparison, here's how many links Chrome picks out:

> document.querySelectorAll('a[href]').length
413
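
The three counts disagree, and one plausible reason (an assumption, not something checked against that page) is that href="..." also matches elements that aren't anchors, such as link tags, while missing single-quoted values. A small sketch on a made-up fragment shows both effects. The standard library's html.parser stands in for lxml here, and the PAGE fragment and AnchorHrefs helper are hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical page fragment, not taken from the page measured above.
PAGE = """
<head>
  <link rel="stylesheet" href="site.css">
  <link rel="icon" href="favicon.ico">
</head>
<body>
  <a href="/questions">Questions</a>
  <a href='/tags'>Tags</a>
  <a name="top">No href here</a>
</body>
"""

LINK_REGEX = re.compile(r'href="(.*?)"')


class AnchorHrefs(HTMLParser):
    """Collect href values from <a> tags only (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v is not None)


collector = AnchorHrefs()
collector.feed(PAGE)
collector.close()

regex_links = LINK_REGEX.findall(PAGE)
anchor_links = collector.hrefs
print(len(regex_links), regex_links)
print(len(anchor_links), anchor_links)
```

On this fragment the regex reports three hrefs, two of them from link tags, and skips the single-quoted /tags, while the element-aware version reports the two real anchors.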

Also, just for the record, Scrapy is one of the best web scraping frameworks out there, and it uses lxml to parse the HTML.
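
If you want to rerun a comparison like the one above outside IPython, the timeit module does roughly what %timeit does. The sketch below is only a harness: it times the regex against the pure-Python html.parser on synthetic HTML, so its numbers say nothing about lxml, which is a C-backed library. The best_ms, links_regex, and links_parser names are made up for illustration; swap in your own functions and page.

```python
import re
import timeit
from html.parser import HTMLParser

# Synthetic input: 500 well-formed links (an assumption; real pages are messier).
HTML = "<html><body>" + "".join(
    f'<p><a href="/q/{i}">Question {i}</a></p>' for i in range(500)
) + "</body></html>"

LINK_REGEX = re.compile(r'href="(.*?)"')


class AnchorHrefs(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v is not None)


def links_regex(html):
    return LINK_REGEX.findall(html)


def links_parser(html):
    collector = AnchorHrefs()
    collector.feed(html)
    collector.close()
    return collector.hrefs


def best_ms(func, repeat=3, number=20):
    """Best average time per call in milliseconds over `repeat` runs."""
    times = timeit.repeat(lambda: func(HTML), repeat=repeat, number=number)
    return min(times) / number * 1000


regex_ms = best_ms(links_regex)
parser_ms = best_ms(links_parser)
print(f"regex : {regex_ms:.3f} ms per call")
print(f"parser: {parser_ms:.3f} ms per call")
```

Checking that both functions return the same links before comparing timings is worth the extra line; a faster function that returns the wrong set isn't a win.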

Answer 2

You could use pyquery, a Python library that brings jQuery-style functionality to Python.

Answer 1: score 6, posted 2013-06-05 00:10:02

Answer 2: score -2, posted 2013-06-05 06:50:39
