Reading in Content From URLs in a File
I'm trying to get other subset URLs from a main URL. However, as I print to see if I get the content, I noticed that I am only getting the HTML, not the URLs within it.
import urllib.request

file = 'http://example.com'
with urllib.request.urlopen(file) as url:
    collection = url.read().decode('UTF-8')
I think this is what you are looking for. You can use the Beautiful Soup library for Python, and this code should work with Python 3:
from urllib.request import urlopen
from urllib.parse import urljoin

from bs4 import BeautifulSoup

def get_all_urls(url):
    # Parse the page and print every link found in an <a> tag
    response = urlopen(url)
    url_html = BeautifulSoup(response, 'html.parser')
    for link in url_html.find_all('a'):
        href = link.get('href')
        if href is None:
            continue
        if href.startswith('http'):
            print(href)
        else:
            # Resolve relative links against the page URL
            print(urljoin(url, href))

get_all_urls('http://example.com')
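If you would rather avoid a third-party dependency, the standard library's `html.parser` can extract the same links. This is a minimal sketch, assuming the page's links live in ordinary `<a href="...">` tags; the class name `LinkCollector` and the sample HTML string are illustrative, not from the original post:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, resolving relative links."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    # urljoin resolves relative links against the base URL
                    self.links.append(urljoin(self.base_url, value))

# Small inline example instead of a live request
html = '<a href="/about">About</a> <a href="http://example.com/x">X</a>'
collector = LinkCollector('http://example.com')
collector.feed(html)
print(collector.links)
# → ['http://example.com/about', 'http://example.com/x']
```

In a real run you would pass `urlopen(url).read().decode('UTF-8')` to `feed()` instead of the inline string.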