
How can I get the data in the href attribute of an <a> tag using BeautifulSoup in Python?

import requests
from bs4 import BeautifulSoup

url = 'https://www.maritimecourier.com/restaurant'

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '\
       'AppleWebKit/537.36 (KHTML, like Gecko) '\
       'Chrome/75.0.3770.80 Safari/537.36'}

response = requests.get(url,headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

test = soup.select(
    '.underline-body-links .sqs-block a, '
    '.underline-body-links .entry-content a, '
    '.underline-body-links .eventlist-excerpt a, '
    '.underline-body-links .playlist-description a, '
    '.underline-body-links .image-description a, '
    '.underline-body-links .sqs-block a:visited, '
    '.underline-body-links .entry-content a:visited, '
    '.underline-body-links .eventlist-excerpt a:visited, '
    '.underline-body-links .playlist-description a:visited, '
    '.underline-body-links .image-description a:visited'
)
test

With this code I get this output:

[<a href="https://www.instagram.com/breakfast_dreams/" target="_blank">Breakfast Dreams</a>,
 <a href="https://www.maritimecourier.com/breakfast-dreams" target="_blank">MARITIME</a>,
 <a href="https://www.instagram.com/latarantellalb/" target="_blank">La Tarantella</a>]

Now I am trying to get the URL and the name from each <a> tag.

I would like to know how I can do this. So far I have tried this:

results = []

for restaurant in soup.select('.underline-body-links .sqs-block a, .underline-body-links .entry-content a, .underline-body-links .eventlist-excerpt a, .underline-body-links .playlist-description a, .underline-body-links .image-description a, .underline-body-links .sqs-block a:visited, .underline-body-links .entry-content a:visited, .underline-body-links .eventlist-excerpt a:visited, .underline-body-links .playlist-description a:visited, .underline-body-links .image-description a:visited'):
    results.append({
        'title':restaurant.find('a',{'target':'_blank'}).text
    })
results

But I got this:

'NoneType' object has no attribute 'text'

Your selection is not quite clear, and neither is the expected output. The main issue is that you have already selected the <a> elements and then try to find an <a> inside an <a>.
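
To see why the lookup fails, here is a minimal sketch on a small inline HTML snippet. The markup and variable names are made up for illustration and are not taken from the actual page. `find` only searches an element's descendants, so calling it on an <a> that contains no nested <a> returns `None`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = '''
<div class="sqs-block">
  <a href="https://www.instagram.com/breakfast_dreams/" target="_blank">Breakfast Dreams</a>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
link = soup.select_one(".sqs-block a")  # this is already the <a> element

# find() looks only at descendants of `link`, and this <a> has no <a> inside it.
nested = link.find("a", {"target": "_blank"})
print(nested)  # None, so nested.text would raise AttributeError

# The text and the href are available directly on the selected element.
title = link.text
url = link.get("href")
print(title, url)
```

Reading `.text` and `.get('href')` from the selected element itself avoids the extra lookup entirely.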

So your extraction part should look more like this:

results.append({
    'title': restaurant.text,
    'url': restaurant.get('href')
})

You could also make your selection more specific:

[{'title':a.text, 'url':a.get('href')} for a in soup.select('.sqs-block-content a')]

Or without the internal links:

[{'title':a.text, 'url':a.get('href')} for a in soup.select('.sqs-block-content a') if 'maritimecourier' not in a.get('href', '')]
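
The substring check would also drop any external URL that merely contains the text "maritimecourier". A slightly stricter variant, sketched here on made-up markup, compares the hostname parsed by `urllib.parse.urlparse`. The helper name `is_external` and the `SITE_HOST` constant are assumptions introduced for illustration:

```python
from urllib.parse import urlparse
from bs4 import BeautifulSoup

SITE_HOST = "www.maritimecourier.com"  # assumed host of the scraped site

# Hypothetical markup standing in for the real page.
html = '''
<div class="sqs-block-content">
  <a href="https://www.instagram.com/breakfast_dreams/">Breakfast Dreams</a>
  <a href="https://www.maritimecourier.com/breakfast-dreams">MARITIME</a>
  <a href="https://www.instagram.com/latarantellalb/">La Tarantella</a>
  <a>no href here</a>
</div>
'''

def is_external(href):
    """Return True when href names a host other than SITE_HOST.

    Relative links have no host and are treated as internal.
    """
    host = urlparse(href).netloc
    return bool(host) and host != SITE_HOST

soup = BeautifulSoup(html, "html.parser")
# The a[href] selector skips anchors that have no href attribute at all.
results = [
    {"title": a.text, "url": a["href"]}
    for a in soup.select(".sqs-block-content a[href]")
    if is_external(a["href"])
]
print(results)
```

Selecting `a[href]` also means the comprehension never has to handle a missing href.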

