
Reading in Content From URLs in a File

I'm trying to get other subset URLs from a main URL. However, as I print to see if I get the content, I noticed that I am only getting the HTML, not the URLs within it.

import urllib.request

file = 'http://example.com'

# Fetch the page; read() returns the raw HTML bytes, which are decoded to a string
with urllib.request.urlopen(file) as url:
    collection = url.read().decode('UTF-8')

I think this is what you are looking for. You can use Python's BeautifulSoup library, and this code should work with Python 3:

    from urllib.request import urlopen
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    def get_all_urls(url):
        # Fetch the page and parse the HTML
        page = urlopen(url)
        url_html = BeautifulSoup(page, 'html.parser')
        # Print the href of every <a> tag on the page
        for link in url_html.find_all('a'):
            href = link.get('href')
            if href is None:
                continue
            if href.startswith('http'):
                print(href)
            else:
                # Resolve relative links against the page URL
                print(urljoin(url, href))

    get_all_urls('http://example.com')
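
If you cannot install BeautifulSoup, the same extraction can be done with only the standard library's html.parser module. This is a minimal sketch under that assumption; the class name LinkCollector and the function get_all_urls_stdlib are illustrative names, not part of the original answer:

    from html.parser import HTMLParser
    from urllib.request import urlopen
    from urllib.parse import urljoin

    class LinkCollector(HTMLParser):
        # Illustrative helper: collects the href of every <a> tag it sees
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href' and value:
                        self.links.append(value)

    def get_all_urls_stdlib(url):
        # Fetch the page, feed it to the parser, and resolve relative links
        html = urlopen(url).read().decode('utf-8')
        parser = LinkCollector()
        parser.feed(html)
        return [urljoin(url, href) for href in parser.links]

    print(get_all_urls_stdlib('http://example.com'))

Both versions resolve relative links with urljoin, which handles hrefs like /about or ../index.html correctly instead of naively concatenating strings.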
