簡體 English 中英

Python機械化以遍歷網站的所有網頁

[英]Python mechanize to iterate over all webpages of a website

原文 2013-12-19 17:28:24 5 1 python/ mechanize

我想遍歷網站的所有網頁。 我在這里嘗試使用機械化，但是它只查看網站的主要鏈接。 我應該如何修改？

import mechanize
import lxml.html

br = mechanize.Browser()
response = br.open("http://www.apple.com")

for link in br.links():
    print link.url
    br.follow_link(link)  # takes EITHER Link instance OR keyword args
    print br
    br.back()

這是新代碼：

import mechanize
import lxml.html

links  = set()                             
visited_links  = set()


def visit(br, url):
  response = br.open(url)
  links = br.links()
  for link in links:
    if not link.url in links:
      visited_links.add(link.url)  
      visit(br, link)
      print link.url


if __name__ == '__main__':
  br = mechanize.Browser()
  visit(br,"http://www.apple.com")

1 個解決方案

請注意，您要對每個鏈接執行的操作與您對初始鏈接所做的操作相同：獲取頁面並訪問每個鏈接。 您可以像這樣遞歸解決

def visit(br, url):
  response = br.open(url)
  links = br.links()
  for link in links:
    print link.url
    visit(br, link)

在實踐中會變得更加復雜：

您需要檢測循環，即如果a.html鏈接到b.html，而b.html鏈接到a.html，則您不想一直打乒乓球並一直來回走。 因此，您可能需要某種方法來判斷您是否已經訪問過某個頁面。 由於您可能會找到很多頁面，因此應該有一種有效的方法來測試您是否已經訪問過頁面。 一種簡單的方法可能是set帶有可見鏈接的全局Python。
您需要確定兩個鏈接何時相等，例如http://www.apple.com/index.html和' http: //www.apple.com/ index.html# someAnchor`是否應相等或不？ 您可能需要提出某種形式的鏈接“標准化”。
您的程序可能會花費很長時間，並且肯定會受到“ I / O限制”，即您的程序將坐在那里等待某些頁面下載。 你可以考慮參觀多個並行的網頁加速的東西-他們需要使用一個共享的set ，雖然看到的網頁，這樣兩個職位不訪問同一頁面。

python機械化選擇選項並使用lambda提交

[英]python mechanize iterate over select options and submit using lambda

Python - 迭代所有類

[英]Python - Iterate over all classes

Python機械化登錄網站

[英]Python mechanize login to website

使用Python和Mechanize登錄網站

[英]Login to a website using Python and Mechanize

如何使用python和mechanize登錄網站

[英]How to login to a website with python and mechanize

python機械化訪問Sharepoint網站

[英]python mechanize to access Sharepoint website

如何遍歷字典python中的所有鍵？

[英]how to iterate over all keys in dictionary python?

Python：在一個月的所有天中進行迭代

[英]Python: Iterate over all days in a month

使用 python 為我的網站創建臨時網頁

[英]creating temporary webpages using python for my website

使用Python遍歷不同的網頁

[英]Loop over different webpages using Python

暫無

暫無

聲明:本站的技術帖子網頁，遵循CC BY-SA 4.0協議，如果您需要轉載，請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

相關問題 python機械化選擇選項並使用lambda提交 Python - 迭代所有類 Python機械化登錄網站使用Python和Mechanize登錄網站如何使用python和mechanize登錄網站 python機械化訪問Sharepoint網站如何遍歷字典python中的所有鍵？ Python：在一個月的所有天中進行迭代使用 python 為我的網站創建臨時網頁使用Python遍歷不同的網頁

相關標簽

粵ICP備18138465號 © 2020-2024 STACKOOM.COM