How to extract href attribute in HTML source code
This is the HTML source code that I am dealing with:
<a href="/people/charles-adams" class="gridlist__link">
So what I want to do is extract the href attribute, which in this case would be "/people/charles-adams", with the beautifulsoup module. I need this because I want to get the HTML source code for that particular webpage with the soup.findAll method. But I am struggling to extract this attribute from the webpage. Could anyone help me with this problem?
PS: I am using this method to get the HTML source code with the Python module BeautifulSoup:
request = requests.get(link, headers=header)
html = request.text
soup = BeautifulSoup(html, 'html.parser')
Try something like:
refs = soup.find_all('a')
for i in refs:
    if i.has_attr('href'):
        print(i['href'])
It should output:
/people/charles-adams
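A possible shortcut, shown here as a minimal sketch: find_all accepts href=True as a filter, so the has_attr check can be folded into the search itself. The sample markup below is an assumption built around the tag from the question; the link text and the second anchor without an href are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup: the first tag comes from the question, the second is a
# made-up anchor without an href to show that it gets skipped.
html = '''
<a href="/people/charles-adams" class="gridlist__link">Charles Adams</a>
<a name="no-link">No href here</a>
'''

soup = BeautifulSoup(html, 'html.parser')

# href=True matches only <a> tags that actually carry an href attribute
hrefs = [a['href'] for a in soup.find_all('a', href=True)]
print(hrefs)
```

If an anchor might lack the attribute and you are not filtering, a.get('href') returns None instead of raising a KeyError.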
You can tell BeautifulSoup to find all anchor tags with soup.find_all('a'). Then you can filter the result with a list comprehension and get the links.
request = requests.get(link, headers=header)
html = request.text
soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all('a')
tags = [tag for tag in tags if tag.has_attr('href')]
links = [tag['href'] for tag in tags]
links will be ['/people/charles-adams']
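Since the goal in the question is to fetch the page behind the link, the relative href probably needs to be turned into an absolute URL first. Here is a rough sketch of that step, assuming a placeholder site address (base_url is hypothetical, substitute the page you actually scraped) and narrowing the search to the gridlist__link class from the question:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical address of the page the HTML came from
base_url = 'https://example.com/people/'

# Tag from the question; the link text is made up for illustration
html = '<a href="/people/charles-adams" class="gridlist__link">Charles Adams</a>'
soup = BeautifulSoup(html, 'html.parser')

# Keep only anchors with the gridlist__link class that carry an href
tags = soup.find_all('a', class_='gridlist__link', href=True)

# Resolve each relative href against the page it was found on
absolute_links = [urljoin(base_url, tag['href']) for tag in tags]
print(absolute_links)
```

Each entry in absolute_links could then be passed to requests.get in the same way as link in the original snippet.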