Extract all links from a web page using python
Following Udacity's Intro to Computer Science course, I am trying to write a Python script that extracts the links from a page. I get the following error:

NameError: name 'page' is not defined

Here is the code I used:
def get_page(page):
    try:
        import urllib
        return urllib.urlopen(url).read()
    except:
        return ''
start_link = page.find('<a href=')
start_quote = page.find('"', start_link)
end_quote = page.find('"', start_quote + 1)
url = page[start_quote + 1:end_quote]

def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        return (None, 0)
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return (url, end_quote)
(url, end_pos) = get_next_target(page)
page = page[end_pos:]

def print_all_links(page):
    while True:
        (url, end_pos) = get_next_target(page)
        if url:
            print(url)
            page = page[:end_pos]
        else:
            break

print_all_links(get_page("http://xkcd.com/"))
page is not defined at the top level of your script, and that is what causes the error: several lines that use page sit outside any function, before page has ever been assigned.
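Roughly, those stray top-level lines need to move inside the functions. A minimal sketch of that rearrangement (this assumes Python 3, where urllib.urlopen has become urllib.request.urlopen, and assumes the page decodes as UTF-8):

import urllib.request

def get_page(url):
    # Fetch the page and return its HTML as a string, or '' on failure.
    try:
        return urllib.request.urlopen(url).read().decode('utf-8')
    except Exception:
        return ''

def get_next_target(page):
    # Return the next href target in page and the position to resume from.
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    return page[start_quote + 1:end_quote], end_quote

def print_all_links(page):
    # Keep extracting targets from the part of the page not yet scanned.
    while True:
        url, end_pos = get_next_target(page)
        if url:
            print(url)
            page = page[end_pos:]
        else:
            break

print_all_links(get_page("http://xkcd.com/"))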
For web scraping like this, you can simply use BeautifulSoup:
from bs4 import BeautifulSoup, SoupStrainer
import requests

url = "http://stackoverflow.com/"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, 'html.parser')

for link in soup.find_all('a'):
    print(link.get('href'))
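As an aside, the SoupStrainer imported above is never actually used; it can be passed to BeautifulSoup via parse_only so that only the <a> tags are parsed, which helps on large pages. A small sketch of that (same URL as above, an illustration rather than part of the original answer):

from bs4 import BeautifulSoup, SoupStrainer
import requests

url = "http://stackoverflow.com/"
data = requests.get(url).text

# Restrict parsing to <a> tags only.
soup = BeautifulSoup(data, 'html.parser', parse_only=SoupStrainer('a'))

for link in soup.find_all('a', href=True):
    print(link['href'])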
You can find all instances of tags in htmlpage whose href attribute contains http. This can be done with BeautifulSoup's find_all method, passing attrs={'href': re.compile("http")}:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(htmlpage, 'html.parser')
links = []
for link in soup.find_all(attrs={'href': re.compile("http")}):
    links.append(link.get('href'))
print(links)
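Note that htmlpage above is assumed to already hold the page's HTML as a string. For a self-contained run it could be fetched first, e.g. with requests (the URL here is just an example):

import re
import requests
from bs4 import BeautifulSoup

# Example fetch; any page could be substituted here.
htmlpage = requests.get("http://xkcd.com/").text

soup = BeautifulSoup(htmlpage, 'html.parser')
links = [link.get('href') for link in soup.find_all(attrs={'href': re.compile("http")})]
print(links)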
I'm a bit late here, but here is one way to get the links from a given page:
from html.parser import HTMLParser
import urllib.request


class LinkScrape(HTMLParser):

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for attr in attrs:
                if attr[0] == 'href':
                    link = attr[1]
                    if link.find('http') >= 0:
                        print('- ' + link)


if __name__ == '__main__':
    url = input('Enter URL > ')
    request_object = urllib.request.Request(url)
    page_object = urllib.request.urlopen(request_object)
    link_parser = LinkScrape()
    link_parser.feed(page_object.read().decode('utf-8'))
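If you want to collect the links rather than print them, one possible variation (my own sketch, not part of the answer above) is to store them on the parser instance:

from html.parser import HTMLParser
import urllib.request


class LinkCollector(HTMLParser):
    # Variant of LinkScrape that stores matching links instead of printing them.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value and 'http' in value:
                    self.links.append(value)


if __name__ == '__main__':
    html = urllib.request.urlopen('http://xkcd.com/').read().decode('utf-8')
    parser = LinkCollector()
    parser.feed(html)
    print(parser.links)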