Python + BeautifulSoup: How to get ‘href’ attribute of ‘a’ element?
I have the following:
html =
'''<div class=“file-one”>
<a href=“/file-one/additional” class=“file-link">
<h3 class=“file-name”>File One</h3>
</a>
<div class=“location”>
Down
</div>
</div>'''
And would like to get just the text of href, which is /file-one/additional. So I did:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
link_text = “”
for a in soup.find_all(‘a’, href=True, text=True):
    link_text = a[‘href’]
print “Link: “ + link_text
But it just prints a blank, nothing. Just Link:. So I tested it out on another site, but with different HTML, and it worked.
What could I be doing wrong? Or is there a possibility that the site is intentionally programmed not to return the href?
Thank you in advance and will be sure to upvote/accept the answer!
The 'a' tag in your html does not have any text directly, but it contains an 'h3' tag that has text. This means that text is None, and .find_all() fails to select the tag. Generally, do not use the text parameter if a tag contains any other html elements except text content.
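This is easy to demonstrate with a straight-quoted copy of the question's markup (a minimal sketch; string=True is the modern bs4 spelling of the text=True argument):

```python
from bs4 import BeautifulSoup

# Straight-quoted copy of the question's markup so it parses cleanly.
html = '''<div class="file-one">
  <a href="/file-one/additional" class="file-link">
    <h3 class="file-name">File One</h3>
  </a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# The <a> tag's only direct string children are whitespace (its text lives
# in the nested <h3>), so .string is None and the string filter rejects it.
with_text = soup.find_all('a', href=True, string=True)
without_text = soup.find_all('a', href=True)

print(with_text)          # []
print(len(without_text))  # 1
```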
You can resolve this issue if you use only the tag's name (and the href keyword argument) to select elements. Then add a condition in the loop to check if they contain text.
soup = BeautifulSoup(html, 'html.parser')
links_with_text = []
for a in soup.find_all('a', href=True):
    if a.text:
        links_with_text.append(a['href'])
Or you could use a list comprehension, if you prefer one-liners.
links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
Or you could pass a lambda to .find_all().
tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)
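The lambda returns the matching Tag objects rather than the attribute values, so getting the hrefs takes one more step (a small sketch against a straight-quoted copy of the question's markup):

```python
from bs4 import BeautifulSoup

html = '''<div class="file-one">
  <a href="/file-one/additional" class="file-link">
    <h3 class="file-name">File One</h3>
  </a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# The lambda matches <a> tags that have an href and some text content.
tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)

# find_all() returns Tag objects, so pull the attribute out afterwards.
links = [tag['href'] for tag in tags]
print(links)  # ['/file-one/additional']
```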
If you want to collect all links whether they have text or not, just select all 'a' tags that have an 'href' attribute. Anchor tags usually have links, but that's not a requirement, so I think it's best to use the href argument.
Using .find_all().
links = [a['href'] for a in soup.find_all('a', href=True)]
Using .select() with CSS selectors.
links = [a['href'] for a in soup.select('a[href]')]
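Both calls select the same elements; a quick sanity check against a straight-quoted copy of the question's markup:

```python
from bs4 import BeautifulSoup

html = '''<div class="file-one">
  <a href="/file-one/additional" class="file-link">
    <h3 class="file-name">File One</h3>
  </a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# href=True keyword filter vs. the [href] CSS attribute selector.
via_find_all = [a['href'] for a in soup.find_all('a', href=True)]
via_select = [a['href'] for a in soup.select('a[href]')]

print(via_find_all == via_select)  # True
print(via_find_all)                # ['/file-one/additional']
```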
You can also use attrs and a regular-expression search to get the href:
import re

soup.find('a', href=re.compile(r'[/]([a-z]|[A-Z])\w+')).attrs['href']
First of all, use a different text editor that doesn't use curly quotes.
Second, remove the text=True flag from the soup.find_all call.
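Putting the two fixes together, the question's snippet would look something like this (straight quotes, no text=True, and a Python 3 print call):

```python
from bs4 import BeautifulSoup

html = '''<div class="file-one">
  <a href="/file-one/additional" class="file-link">
    <h3 class="file-name">File One</h3>
  </a>
  <div class="location">
    Down
  </div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

link_text = ""
for a in soup.find_all('a', href=True):
    link_text = a['href']

print("Link: " + link_text)  # Link: /file-one/additional
```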
You could solve this with just a couple lines of gazpacho:
from gazpacho import Soup
html = """\
<div class="file-one">
<a href="/file-one/additional" class="file-link">
<h3 class="file-name">File One</h3>
</a>
<div class="location">
Down
</div>
</div>
"""
soup = Soup(html)
soup.find("a", {"class": "file-link"}).attrs['href']
Which would output:
'/file-one/additional'
A bit late to the party, but I had the same issue recently scraping some recipes, and got mine printing clean by doing this:
from bs4 import BeautifulSoup
import requests

source = requests.get('url for website').text
soup = BeautifulSoup(source, 'lxml')
for article in soup.find_all('article'):
    link = article.find('a', href=True)['href']
    print(link)