
How to get all links from a website using Beautiful Soup (Python) recursively

I want to be able to recursively get all links from a website, then follow those links and get all links from those websites. The depth should be 5-10 so that it returns an array of all links that it finds. Preferably using Beautiful Soup/Python. Thanks!

I have tried this so far and it is not working... any help would be appreciated.

from BeautifulSoup import BeautifulSoup
import urllib2

def getLinks(url):
    if (len(url)==0):
        return [url]
    else:
        files = [ ]
        page=urllib2.urlopen(url)
        soup=BeautifulSoup(page.read())
        universities=soup.findAll('a',{'class':'institution'})
        for eachuniversity in universities:
           files+=getLinks(eachuniversity['href'])
        return files

print getLinks("http://www.utexas.edu/world/univ/alpha/")

The number of crawled pages will grow exponentially, and there are many issues involved that might not look complicated at first glance. Check out the Scrapy architecture overview to get a sense of how it should be done in real life.

(diagram: Scrapy architecture overview)

Among other great features, Scrapy will not repeat crawling the same pages (unless you force it to) and can be configured with a maximum DEPTH_LIMIT.

Even better, Scrapy has built-in link extraction tools: link-extractors.
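As a rough illustration (not part of the original answer), a minimal CrawlSpider sketch could look like the following; the spider name and the parse_item callback are placeholders, and DEPTH_LIMIT caps how deep the crawl goes:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class InstitutionSpider(CrawlSpider):
    # Hypothetical spider name; the start URL is taken from the question.
    name = 'institutions'
    start_urls = ['http://www.utexas.edu/world/univ/alpha/']
    custom_settings = {'DEPTH_LIMIT': 5}  # stop following links deeper than 5 levels

    # Follow every extracted link and hand each fetched page to parse_item.
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        # Scrapy de-duplicates requests by default, so each page is visited once.
        yield {'url': response.url}

Running it with something like scrapy runspider institutions_spider.py -o links.json would then collect every visited URL into one file.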

Recursive algorithms are used to reduce big problems to smaller ones which have the same structure, and then combine the results. They are often composed of a base case which doesn't lead to recursion and another case that does. For example, say you were born in 1986 and you want to calculate your age. You could write:

def myAge(currentyear):
    if currentyear == 1986: #Base case, does not lead to recursion.
        return 0
    else:                   #Leads to recursion
        return 1+myAge(currentyear-1)

I, myself, don't really see the point in using recursion for your problem. My first suggestion is that you put a limit in your code. What you gave us will just run indefinitely, because the program gets stuck in infinitely nested for loops; it never reaches an end and starts returning. So you can have a variable outside the function that updates every time you go down a level and, at a certain point, stops the function from starting a new for loop and makes it return what it has found, as sketched below.
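A minimal sketch of that "counter outside the function" idea, reusing the imports from the question (MAX_DEPTH and the counter handling are my own assumptions, not code from the answer):

from BeautifulSoup import BeautifulSoup
import urllib2

depth = 0        # variable outside the function: how deep we currently are
MAX_DEPTH = 5    # assumed limit; the question asked for a depth of 5-10

def getLinks(url):
    global depth
    files = [url]
    if depth >= MAX_DEPTH:   # past the limit, don't start a new for loop
        return files
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    for eachuniversity in soup.findAll('a', {'class': 'institution'}):
        depth += 1           # going one level down
        files += getLinks(eachuniversity['href'])
        depth -= 1           # back up once that branch is done
    return files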

But then you are getting into changing global variables, you are using recursion in a strange way, and the code gets messy.

Now, reading the comments and seeing what you really want (which, I must say, is not really clear), you can use a recursive algorithm to help in your code, but not write all of it recursively.

from bs4 import BeautifulSoup
import urllib2

def recursiveUrl(url, depth):
    if depth == 5:
        return url
    else:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        newlink = soup.find('a')  # find just the first link on the page
        if newlink is None:       # no link to follow, so stop here
            return url
        else:
            return url, recursiveUrl(newlink['href'], depth + 1)


def getLinks(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    links = soup.find_all('a', {'class': 'institution'})
    found = []
    for link in links:
        found.append(recursiveUrl(link['href'], 0))
    return found

Now there is still a problem with this: links do not always point to webpages, but also to files and images. That's why I wrote the if/else statement in the recursive part of the 'url-opening' function. The other problem is that your first website has 2166 institution links, and creating 2166*5 BeautifulSoup objects is not fast. The code above runs the recursive function 2166 times. That shouldn't be a problem in itself, but you are dealing with big HTML (or PHP, whatever) files, so making a soup 2166*5 times takes a huge amount of time.
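As a hedged illustration of the first caveat (not part of the original answer), one could check a response's Content-Type before trying to parse it, so links to images or other files are skipped; the helper name here is made up:

import urllib2

def looks_like_html(url):
    # Fetch the URL and inspect the Content-Type; anything that is not
    # text/html (images, PDFs, ...) is skipped instead of being turned into a soup.
    try:
        response = urllib2.urlopen(url)
    except (urllib2.URLError, ValueError):
        return False
    return response.info().gettype() == 'text/html'

recursiveUrl could then call looks_like_html(newlink['href']) before recursing.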
