
Can't propel the generated links for next page to crawl recursively

The crawler I created fetches names and URLs from a web page. Now I'm completely stuck on how to make my crawler follow the links produced by next_page so it can fetch data from the following pages. I'm new to writing crawlers with classes, which is why I can't get any further. I have already tried tweaking the code a little, but it neither produces any results nor throws any error. I hope someone will take a look at it.

import requests
from lxml import html

class wiseowl:
    def __init__(self,start_url):
        self.start_url=start_url
        self.storage=[]

    def crawl(self):
        self.get_link(self.start_url)

    def get_link(self,link):
        url="http://www.wiseowl.co.uk"
        response=requests.get(link)
        tree=html.fromstring(response.text)
        name=tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']/a/text()")
        urls=tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']/a/@href")
        docs=(name,urls)
        self.storage.append(docs)

        next_page=tree.xpath("//div[contains(concat(' ', @class, ' '), ' woPaging ')]//a[@class='woPagingItem']/@href")
        for npage in next_page:
            if npage is not None:
                self.get_link(url+npage)


    def __str__(self):
        return "{}".format(self.storage)


crawler=wiseowl("http://www.wiseowl.co.uk/videos/")
crawler.crawl()
for item in crawler.storage:
    print(item)

I modified some parts of your class, give it a try:

class wiseowl:
    def __init__(self,start_url):
        self.start_url=start_url
        self.links = [ self.start_url ]    #  a list of links to crawl # 
        self.storage=[]

    def crawl(self): 
        for link in self.links :    # call get_link for every link in self.links #
            self.get_link(link)

    def get_link(self,link):
        print('Crawling: ' + link)
        url="http://www.wiseowl.co.uk"
        response=requests.get(link)
        tree=html.fromstring(response.text)
        name=tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']/a/text()")
        urls=tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']/a/@href")
        docs=(name,urls)
        #docs=(name, [url+u for u in urls])    # use this line if you want to join the urls # 
        self.storage.append(docs)
        next_page=tree.xpath("//div[contains(concat(' ', @class, ' '), ' woPaging ')]//*[@class='woPagingItem' or @class='woPagingNext']/@href")    # get links form 'woPagingItem' or 'woPagingNext' # 
        for npage in next_page:
            if npage and url+npage not in self.links :    # don't get the same link twice # 
                self.links += [ url+npage ]

    def __str__(self):
        return "{}".format(self.storage)

crawler=wiseowl("http://www.wiseowl.co.uk/videos/")
crawler.crawl()
for item in crawler.storage:
    item = zip(item[0], item[1])
    for i in item : 
        print('{:60} {}'.format(i[0], i[1]))    # you can change 60 to the value you want # 

You should consider using some kind of data structure to hold the links you have already visited (to avoid infinite loops), as well as a container for the links that have not been visited yet. Crawling is essentially a breadth-first search of the internet, so you should Google breadth-first search to get a better understanding of the underlying algorithm.

  1. Implement a queue for the links that need to be visited. Every time you visit a link, scrape the page for all of its links and enqueue them.
  2. Implement a set or dictionary in Python to check whether each link you are about to enqueue has already been visited; if it has, don't enqueue it.
  3. Your crawler method should look something like this:

     def crawler(self):
         while len(self.queue):
             curr_link = self.queue.pop(0)
             # process curr_link here -> scrape and add more links to queue
             # mark curr_link as visited
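Putting those points together, below is a minimal sketch of that queue/visited-set approach applied to the crawler above. The XPath expressions and the http://www.wiseowl.co.uk base URL are taken from the earlier code; the queue and visited attribute names (and the use of collections.deque) are illustrative choices, not part of the original answer.

import requests
from lxml import html
from collections import deque

class wiseowl:
    def __init__(self, start_url):
        self.base = "http://www.wiseowl.co.uk"
        self.queue = deque([start_url])   # links still to be visited
        self.visited = set()              # links already processed
        self.storage = []

    def crawl(self):
        while self.queue:                 # breadth-first: keep going until the queue is empty
            link = self.queue.popleft()
            if link in self.visited:
                continue
            self.visited.add(link)
            self.get_link(link)

    def get_link(self, link):
        tree = html.fromstring(requests.get(link).text)
        names = tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']/a/text()")
        urls = tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']/a/@href")
        self.storage.append((names, urls))
        # enqueue every pagination link that has not been seen yet
        pages = tree.xpath("//div[contains(concat(' ', @class, ' '), ' woPaging ')]"
                           "//*[@class='woPagingItem' or @class='woPagingNext']/@href")
        for npage in pages:
            if self.base + npage not in self.visited:
                self.queue.append(self.base + npage)

Creating the crawler with wiseowl("http://www.wiseowl.co.uk/videos/") and calling crawl() as before would then fill crawler.storage the same way, while the visited set guarantees each page is fetched at most once.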
