
Can't propel the generated links for next page to crawl recursively

The crawler I have created fetches names and URLs from a webpage. Now, I can't figure out how to make my crawler use the links generated as next_page to fetch data from the next page. I'm very new to writing a crawler as a class, which is why I can't get any further on my own. I've already tried a slight twist in my code, but it neither produces any result nor throws any error. I hope somebody will take a look at it.

import requests
from lxml import html

class wiseowl:
    def __init__(self,start_url):
        self.start_url=start_url
        self.storage=[]

    def crawl(self):
        self.get_link(self.start_url)

    def get_link(self,link):
        url="http://www.wiseowl.co.uk"
        response=requests.get(link)
        tree=html.fromstring(response.text)
        name=tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']/a/text()")
        urls=tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']/a/@href")
        docs=(name,urls)
        self.storage.append(docs)

        next_page=tree.xpath("//div[contains(concat(' ', @class, ' '), ' woPaging ')]//a[@class='woPagingItem']/@href")
        for npage in next_page:
            if npage is not None:
                self.get_link(url+npage)


    def __str__(self):
        return "{}".format(self.storage)


crawler=wiseowl("http://www.wiseowl.co.uk/videos/")
crawler.crawl()
for item in crawler.storage:
    print(item)

I modified some parts of your class; give it a try:

class wiseowl:
    def __init__(self,start_url):
        self.start_url=start_url
        self.links = [ self.start_url ]    #  a list of links to crawl # 
        self.storage=[]

    def crawl(self): 
        for link in self.links :    # call get_link for every link in self.links #
            self.get_link(link)

    def get_link(self,link):
        print('Crawling: ' + link)
        url="http://www.wiseowl.co.uk"
        response=requests.get(link)
        tree=html.fromstring(response.text)
        name=tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']/a/text()")
        urls=tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']/a/@href")
        docs=(name,urls)
        #docs=(name, [url+u for u in urls])    # use this line if you want to join the urls # 
        self.storage.append(docs)
        next_page=tree.xpath("//div[contains(concat(' ', @class, ' '), ' woPaging ')]//*[@class='woPagingItem' or @class='woPagingNext']/@href")    # get links from 'woPagingItem' or 'woPagingNext' #
        for npage in next_page:
            if npage and url+npage not in self.links :    # don't get the same link twice # 
                self.links += [ url+npage ]

    def __str__(self):
        return "{}".format(self.storage)

crawler=wiseowl("http://www.wiseowl.co.uk/videos/")
crawler.crawl()
for item in crawler.storage:
    item = zip(item[0], item[1])
    for i in item : 
        print('{:60} {}'.format(i[0], i[1]))    # you can change 60 to the value you want # 
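Note that crawl() iterates over self.links while get_link() keeps appending newly discovered page links to the same list; because a Python for loop over a list picks up items appended during iteration, the list effectively acts as a growing queue of pages to visit, so no recursion is needed.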

You should think about using some kind of data structure to hold both the links you have already visited (to avoid infinite loops) and the links you have yet to visit. Crawling is essentially a breadth-first search of the web, so it is worth reading up on breadth-first search to understand the underlying algorithm.

  1. Implement a queue for the links you need to visit. Every time you visit a link, scrape the page for all links and enqueue each one.
  2. Implement a set (or a dictionary) to check whether each link you are about to enqueue has already been visited; if it has, do not enqueue it.
  3. Your crawler method should look something like this (a fuller sketch follows below):

     def crawler(self):
         while len(self.queue):
             curr_link = self.queue.pop(0)
             # process curr_link here -> scrape and add more links to the queue
             # mark curr_link as visited
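To put those three points together, here is a minimal sketch (my own illustration, not code from either answer) that keeps the XPath expressions from the question but replaces the recursion with a deque used as a FIFO queue plus a set of visited links; the names WiseOwlCrawler and scrape are just choices made for this example:

from collections import deque

import requests
from lxml import html

class WiseOwlCrawler:
    def __init__(self, start_url):
        self.base_url = "http://www.wiseowl.co.uk"
        self.queue = deque([start_url])    # links still to visit (FIFO)
        self.visited = set()               # links already processed
        self.storage = []

    def crawl(self):
        # breadth-first traversal: keep visiting until the queue is empty
        while self.queue:
            link = self.queue.popleft()
            if link in self.visited:
                continue
            self.visited.add(link)
            self.scrape(link)

    def scrape(self, link):
        response = requests.get(link)
        tree = html.fromstring(response.text)
        names = tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']/a/text()")
        urls = tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']/a/@href")
        self.storage.append((names, urls))
        # enqueue every pagination link that has not been visited yet
        next_pages = tree.xpath(
            "//div[contains(concat(' ', @class, ' '), ' woPaging ')]"
            "//*[@class='woPagingItem' or @class='woPagingNext']/@href")
        for npage in next_pages:
            full_url = self.base_url + npage
            if full_url not in self.visited:
                self.queue.append(full_url)

crawler = WiseOwlCrawler("http://www.wiseowl.co.uk/videos/")
crawler.crawl()
for names, urls in crawler.storage:
    for name, url in zip(names, urls):
        print(name, url)

A set gives an O(1) membership check for the duplicate test, and popleft() on the deque makes the traversal breadth-first; popping from the right end instead would turn it into a depth-first crawl.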
