How to get all links from website using Beautiful Soup (python) Recursively
I want to be able to recursively get all links from a website, then follow those links and get all links from those websites. The depth should be 5-10 so that it returns an array of all the links it finds. Preferably using Beautiful Soup/Python. Thanks!

I have tried this so far and it is not working... any help would be appreciated.
from BeautifulSoup import BeautifulSoup
import urllib2

def getLinks(url):
    if len(url) == 0:
        return [url]
    else:
        files = []
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        universities = soup.findAll('a', {'class': 'institution'})
        for eachuniversity in universities:
            files += getLinks(eachuniversity['href'])
        return files

print getLinks("http://www.utexas.edu/world/univ/alpha/")
The number of crawled pages will grow exponentially, and there are many issues involved that might not look complicated at first glance. Check out the scrapy architecture overview to get a sense of how it should be done in real life.

Among other great features, scrapy will not repeat crawling the same pages (unless you force it to) and can be configured for a maximum DEPTH_LIMIT.

Even better, scrapy has built-in link extraction tools: link-extractors.
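The two safeguards mentioned above (never re-crawling the same page, and a depth cap) are easy to see in a plain-Python sketch. This is not scrapy itself, just an illustration of the idea its scheduler implements; `get_links` is a hypothetical stand-in for whatever extracts links from a fetched page (scrapy's link extractors, or a BeautifulSoup call):

```python
# Plain-Python illustration of what scrapy's scheduler does for you:
# a visited set prevents re-crawling, and a depth cap bounds the search.
def crawl(start_url, get_links, depth_limit=5):
    visited = set()
    frontier = [(start_url, 0)]          # (url, depth) pairs still to process
    while frontier:
        url, depth = frontier.pop()
        if url in visited or depth > depth_limit:
            continue                     # skip repeats and too-deep pages
        visited.add(url)
        for link in get_links(url):
            frontier.append((link, depth + 1))
    return visited

# Tiny fake site with a cycle between 'a' and 'b': thanks to the visited
# set, the crawl terminates instead of looping forever.
fake_site = {'a': ['b'], 'b': ['a', 'c'], 'c': []}
print(sorted(crawl('a', fake_site.get)))  # -> ['a', 'b', 'c']
```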
Recursive algorithms are used to reduce big problems to smaller ones which have the same structure, and then combine the results. They are often composed of a base case which doesn't lead to recursion, and another case that does. For example, say you were born in 1986 and you want to calculate your age. You could write:
def myAge(currentyear):
    if currentyear == 1986:  # Base case, does not lead to recursion.
        return 0
    else:  # Leads to recursion
        return 1 + myAge(currentyear - 1)
I, myself, don't really see the point of using recursion for your problem. My first suggestion is that you put a limit in your code. What you gave us will just run infinitely, because the program gets stuck in infinitely nested for loops; it never reaches an end and never starts returning. So you could keep a variable outside the function that updates every time you go down a level and, at a certain point, stops the function from starting a new for-loop and makes it start returning what it has found.
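As a minimal sketch of that "variable outside the function" idea (and of why it gets messy), a module-level counter shared by every call could look like this; `get_links` is again a hypothetical stand-in for the fetch-and-soup step:

```python
# Sketch of the outside-variable approach described above. The global
# counter is shared by all recursive calls — exactly the kind of hidden
# state that makes this style messy.
depth = 0
MAX_DEPTH = 5

def crawl_counted(url, get_links):
    global depth
    found = [url]
    if depth >= MAX_DEPTH:       # the outside variable stops new for-loops
        return found
    depth += 1
    for link in get_links(url):
        found += crawl_counted(link, get_links)
    depth -= 1
    return found
```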
But then you are getting into changing global variables, you are using recursion in a strange way, and the code gets messy.

Now, reading the comments and seeing what you really want (which, I must say, is not really clear), you can use help from a recursive algorithm in your code, without writing all of it recursively:
def recursiveUrl(url, depth):
    if depth == 5:
        return url
    else:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        newlink = soup.find('a')  # find just the first one
        if newlink is None or not newlink.get('href'):
            return url
        else:
            return url, recursiveUrl(newlink['href'], depth + 1)

def getLinks(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    links = soup.find_all('a', {'class': 'institution'})
    found = []  # collect results separately: appending to the list being iterated never terminates
    for link in links:
        found.append(recursiveUrl(link['href'], 0))
    return found
Now there is still a problem with this: links are not always links to webpages, but can also point to files and images. That's why I wrote the if/else statement in the recursive part of the 'url-opening' function. The other problem is that your first website has 2166 institution links, and creating 2166*5 BeautifulSoups is not fast. The code above runs the recursive function 2166 times. That shouldn't be a problem in itself, but you are dealing with big HTML (or PHP, whatever) files, so making 2166*5 soups takes a huge amount of time.
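If you want to test the control flow without any network access, the same depth-limited recursion can be sketched with the fetch-and-soup step stubbed out. `first_link` is a hypothetical stand-in for "open the page, make a soup, take the first `<a>`", and returning a flat list avoids the nested tuples that `return url, recursiveUrl(...)` builds up:

```python
# The recursiveUrl idea with the network/BeautifulSoup step stubbed out.
# first_link(url) returns the first linked URL on a page, or None.
def recursive_url(url, depth, first_link, max_depth=5):
    if depth == max_depth:
        return [url]                 # depth cap reached: stop following links
    nxt = first_link(url)
    if nxt is None:                  # page has no usable link
        return [url]
    return [url] + recursive_url(nxt, depth + 1, first_link, max_depth)

# A fake chain of pages: a links to b, b links to c, c links nowhere.
chain = {'a': 'b', 'b': 'c', 'c': None}
print(recursive_url('a', 0, chain.get))  # -> ['a', 'b', 'c']
```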