繁体 English 中英

Python机械化以遍历网站的所有网页

[英]Python mechanize to iterate over all webpages of a website

原文 2013-12-19 17:28:24 1 1 python/ mechanize

我想遍历网站的所有网页。 我在这里尝试使用机械化，但是它只查看网站的主要链接。 我应该如何修改？

import mechanize
import lxml.html

br = mechanize.Browser()
response = br.open("http://www.apple.com")

for link in br.links():
    print link.url
    br.follow_link(link)  # takes EITHER Link instance OR keyword args
    print br
    br.back()

这是新代码：

import mechanize
import lxml.html

links  = set()                             
visited_links  = set()


def visit(br, url):
  response = br.open(url)
  links = br.links()
  for link in links:
    if not link.url in links:
      visited_links.add(link.url)  
      visit(br, link)
      print link.url


if __name__ == '__main__':
  br = mechanize.Browser()
  visit(br,"http://www.apple.com")

1 个解决方案

请注意，您要对每个链接执行的操作与您对初始链接所做的操作相同：获取页面并访问每个链接。 您可以像这样递归解决

def visit(br, url):
  response = br.open(url)
  links = br.links()
  for link in links:
    print link.url
    visit(br, link)

在实践中会变得更加复杂：

您需要检测循环，即如果a.html链接到b.html，而b.html链接到a.html，则您不想一直打乒乓球并一直来回走。 因此，您可能需要某种方法来判断您是否已经访问过某个页面。 由于您可能会找到很多页面，因此应该有一种有效的方法来测试您是否已经访问过页面。 一种简单的方法可能是set带有可见链接的全局Python。
您需要确定两个链接何时相等，例如http://www.apple.com/index.html和' http: //www.apple.com/ index.html# someAnchor`是否应相等或不？ 您可能需要提出某种形式的链接“标准化”。
您的程序可能会花费很长时间，并且肯定会受到“ I / O限制”，即您的程序将坐在那里等待某些页面下载。 你可以考虑参观多个并行的网页加速的东西-他们需要使用一个共享的set ，虽然看到的网页，这样两个职位不访问同一页面。

python机械化选择选项并使用lambda提交

[英]python mechanize iterate over select options and submit using lambda

Python - 迭代所有类

[英]Python - Iterate over all classes

Python机械化登录网站

[英]Python mechanize login to website

使用Python和Mechanize登录网站

[英]Login to a website using Python and Mechanize

如何使用python和mechanize登录网站

[英]How to login to a website with python and mechanize

python机械化访问Sharepoint网站

[英]python mechanize to access Sharepoint website

如何遍历字典python中的所有键？

[英]how to iterate over all keys in dictionary python?

Python：在一个月的所有天中进行迭代

[英]Python: Iterate over all days in a month

使用 python 为我的网站创建临时网页

[英]creating temporary webpages using python for my website

使用Python遍历不同的网页

[英]Loop over different webpages using Python

暂无

暂无

声明:本站的技术帖子网页，遵循CC BY-SA 4.0协议，如果您需要转载，请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

相关问题 python机械化选择选项并使用lambda提交 Python - 迭代所有类 Python机械化登录网站使用Python和Mechanize登录网站如何使用python和mechanize登录网站 python机械化访问Sharepoint网站如何遍历字典python中的所有键？ Python：在一个月的所有天中进行迭代使用 python 为我的网站创建临时网页使用Python遍历不同的网页

相关标签

粤ICP备18138465号 © 2020-2024 STACKOOM.COM