
How to extract links from HTML (with Python)

So I've downloaded the HTML of a web page. I'm supposed to extract all of the links from the HTML and output them. Here is my code:

f = open('html.py','r')
heb = f.readlines()
arry = []
if 'href' in heb:
    arry = arry.append(href)

    print(arry)

I'm trying to make a list of the links and output it, but honestly I'm pretty lost. Can someone point me in the right direction? I was thinking regex is probably the way to go. Thanks.

You can use Beautiful Soup (which you'll need to install, e.g. with pip install beautifulsoup4):

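import bs4

# Parse the saved page; naming the parser avoids bs4's "no parser specified" warning
with open("my-file.html") as f:
    soup = bs4.BeautifulSoup(f, "html.parser")

# soup('a') is shorthand for soup.find_all('a'); skip anchors that have no href
links = [link['href'] for link in soup('a') if 'href' in link.attrs]
print(links)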
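
If installing a third-party package isn't an option, roughly the same thing can be sketched with the standard library's html.parser module. This is not part of the answer above, just an alternative under the assumption that you only want the raw href values of <a> tags. The names LinkCollector and extract_links and the sample string are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this;
        # attrs is a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return the href values of all <a> tags in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Made-up sample input: two real links and one anchor without an href
sample = (
    '<p><a href="https://example.com/">one</a> '
    '<a name="x">no link</a> '
    '<A HREF="/two">two</A></p>'
)
print(extract_links(sample))
```

For a saved page you would read the file into a string first and pass that to extract_links. A real parser like this (or Beautiful Soup) tends to cope better than a regex with things like uppercase tags, unusual quoting, or attributes in a different order.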
