
Python mechanize to iterate over all webpages of a website

I want to iterate over all the webpages of a website. I am trying to use mechanize here, but it only looks at the links on the main page of the website. How should I modify it?

import mechanize
import lxml.html

br = mechanize.Browser()
response = br.open("http://www.apple.com")

for link in br.links():
    print(link.url)
    br.follow_link(link)  # takes EITHER Link instance OR keyword args
    print(br)
    br.back()

This is the new code:

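import mechanize
import lxml.html

visited_links = set()  # URLs seen so far, shared by all recursive calls


def visit(br, url):
  response = br.open(url)
  # Copy the links first: the recursive call below navigates away from this page.
  links = list(br.links())
  for link in links:
    # Test against the set of visited URLs, not against the page's own links.
    if link.absolute_url not in visited_links:
      visited_links.add(link.absolute_url)
      print(link.url)
      visit(br, link.absolute_url)  # br.open() expects a URL, not a Link object


if __name__ == '__main__':
  br = mechanize.Browser()
  visit(br, "http://www.apple.com")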

Notice how what you want to do for each link is the same as what you did for your initial link: fetch the page and visit each link. You could solve this recursively like this:

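def visit(br, url):
  br.open(url)
  # Copy the links first: the recursive call below navigates away from this page.
  for link in list(br.links()):
    print(link.url)
    visit(br, link.absolute_url)  # br.open() expects a URL, not a Link object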

It'll get a bit more complicated in practice:

  1. You need to detect cycles, i.e. if a.html links to b.html, and b.html links to a.html, you don't want to play ping-pong and go back and forth forever. So you probably need some way to tell whether you have already visited a page. Since you might find a lot of pages, that test should be efficient. One straightforward way is to keep a global Python set of the links you have seen.

  2. You need to make up your mind about when two links are equal, e.g. should http://www.apple.com/index.html and http://www.apple.com/index.html#someAnchor be equal or not? You might need to come up with some sort of "normalization" of links.

  3. Your program might take a long time, and it will almost certainly be "I/O bound", i.e. it will sit there waiting for some page to download. You could speed things up by visiting multiple pages in parallel. The parallel jobs would need to use a shared set of seen pages, though, so that two jobs don't visit the same page.
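
The three points above could be combined roughly as follows. This is only a sketch under a few assumptions, not a drop-in replacement for the mechanize code: it is written for Python 3 and uses only the standard library, the names normalize, crawl and get_links are made up for illustration, and get_links(url) stands for whatever function you supply to download a page and return the href strings on it. Here "normalization" simply means resolving relative links and dropping the #fragment, which is one possible policy among many. Pages are fetched in parallel one "wave" at a time, and only the main thread touches the seen set, so no lock is needed.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urldefrag, urljoin, urlsplit


def normalize(base_url, href):
    """Resolve href against base_url and drop any #fragment."""
    absolute = urljoin(base_url, href)
    return urldefrag(absolute).url


def crawl(start_url, get_links, max_pages=100, workers=4):
    """Breadth-first crawl restricted to the host of start_url.

    get_links(url) is a caller-supplied (hypothetical) function that
    downloads one page and returns the href strings found on it.
    """
    start_url = normalize(start_url, "")
    host = urlsplit(start_url).netloc
    seen = {start_url}      # every URL ever queued; this prevents cycles
    frontier = [start_url]  # pages to download in the current wave

    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            # Download the whole wave in parallel; results keep input order.
            pages = list(pool.map(get_links, frontier))
            next_frontier = []
            for page_url, hrefs in zip(frontier, pages):
                for href in hrefs:
                    url = normalize(page_url, href)
                    if urlsplit(url).netloc != host:
                        continue  # stay on the same site
                    if url in seen or len(seen) >= max_pages:
                        continue  # already queued, or page budget used up
                    seen.add(url)  # only the main thread modifies `seen`
                    next_frontier.append(url)
            frontier = next_frontier
    return seen


if __name__ == "__main__":
    # A tiny in-memory "site" stands in for real downloads.
    site = {
        "http://example.com/a.html": ["b.html", "a.html#top"],
        "http://example.com/b.html": ["a.html", "http://other.example/x"],
    }
    found = crawl("http://example.com/a.html", lambda url: site.get(url, []))
    print(sorted(found))
```

To use this with mechanize, get_links would presumably create its own mechanize.Browser() on each call, open the URL, and return [link.url for link in br.links()]. A Browser keeps per-page state such as the current page and history, so sharing one instance between threads is probably not safe. The max_pages limit is there because a large site can easily contain more pages than you want to download in one run.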
