Is there any way, using Python, to get all the links on a whole web site, not just on a single web page? I tried this code, but it only gives me the links on that one page:
import urllib2
import re
#connect to a URL
website = urllib2.urlopen('http://www.example.com/')
#read html code
html = website.read()
#use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)
print links
Visit the links you have gathered recursively and scrape those pages too:
import urllib2
import re

stack = ['http://www.example.com/']
results = []

while len(stack) > 0:
    url = stack.pop()
    # connect to a URL
    website = urllib2.urlopen(url)
    # read the html code
    html = website.read()
    # use re.findall to get all the links
    # you should not only gather absolute http/ftp links but also relative links
    # you could use Beautiful Soup for that (if you want <a> links)
    links = re.findall('"((?:http|ftp)s?://.*?)"', html)
    results.extend([link for link in links if is_not_relative_link(link)])  # this function has to be written
    for link in links:
        if link_is_valid(link):  # this function has to be written
            stack.append(link)
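For completeness, here is a minimal sketch of such a crawler using Beautiful Soup (assuming the `beautifulsoup4` package is installed) and `urlparse.urljoin` to resolve relative links. It keeps a `visited` set to avoid loops and only follows links on the same domain as the start URL; the domain check and error handling shown are just one reasonable choice, not the only one:

import urllib2
import urlparse
from bs4 import BeautifulSoup  # pip install beautifulsoup4

start_url = 'http://www.example.com/'
domain = urlparse.urlparse(start_url).netloc

stack = [start_url]
visited = set()
results = []

while stack:
    url = stack.pop()
    if url in visited:
        continue
    visited.add(url)
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError:
        continue  # skip pages that fail to load
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all('a', href=True):
        # urljoin turns relative hrefs into absolute URLs
        link = urlparse.urljoin(url, tag['href'])
        # stay on the same site and skip pages already seen
        if urlparse.urlparse(link).netloc == domain and link not in visited:
            results.append(link)
            stack.append(link)

print results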