
Python + BeautifulSoup: How to get 'href' attribute of 'a' element?

I have the following:

  html =
  '''<div class=“file-one”>
    <a href=“/file-one/additional” class=“file-link">
      <h3 class=“file-name”>File One</h3>
    </a>
    <div class=“location”>
      Down
    </div>
  </div>'''

And would like to get just the text of href, which is /file-one/additional. So I did:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

link_text = “”

for a in soup.find_all(‘a’, href=True, text=True):
    link_text = a[‘href’]

print “Link: “ + link_text

But it just prints a blank, nothing. Just Link: . So I tested it out on another site but with a different HTML, and it worked.

What could I be doing wrong? Or is there a possibility that the site is intentionally programmed to not return the href?

Thank you in advance and will be sure to upvote/accept answer!

The 'a' tag in your html does not have any text directly, but it contains an 'h3' tag that has text. This means that text is None, and .find_all() fails to select the tag. Generally, do not use the text parameter if a tag contains any other html elements besides text content.
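A minimal sketch of the failure (the whitespace inside the 'a' tag means its .string is None, so the text=True filter rejects it):

```python
from bs4 import BeautifulSoup

html = '''<a href="/file-one/additional" class="file-link">
  <h3 class="file-name">File One</h3>
</a>'''

soup = BeautifulSoup(html, 'html.parser')

# The <a> tag contains an <h3> plus surrounding whitespace, so a.string
# is None and the text=True filter excludes it from the results.
print(soup.find_all('a', href=True, text=True))  # []

# Dropping text=True finds the tag, and the href is right there.
print(soup.find_all('a', href=True)[0]['href'])  # /file-one/additional
```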

You can resolve this issue if you use only the tag's name (and the href keyword argument) to select elements. Then add a condition in the loop to check if they contain text.

soup = BeautifulSoup(html, 'html.parser')
links_with_text = []
for a in soup.find_all('a', href=True): 
    if a.text: 
        links_with_text.append(a['href'])

Or you could use a list comprehension, if you prefer one-liners.

links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]

Or you could pass a lambda to .find_all().

tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)
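Note that the lambda version returns the tags themselves rather than the hrefs, so the attribute still has to be pulled out afterwards. A quick sketch with some hypothetical links:

```python
from bs4 import BeautifulSoup

html = '<a href="/a">one</a><a href="/b">two</a><a>no href here</a>'
soup = BeautifulSoup(html, 'html.parser')

# Keep only <a> tags that have both an href attribute and some text.
tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)
links_with_text = [tag['href'] for tag in tags]
print(links_with_text)  # ['/a', '/b']
```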

If you want to collect all links whether they have text or not, just select all 'a' tags that have a 'href' attribute. Anchor tags usually have links, but that's not a requirement, so I think it's best to use the href argument.

Using .find_all().

links = [a['href'] for a in soup.find_all('a', href=True)]

Using .select() with CSS selectors.

links = [a['href'] for a in soup.select('a[href]')]

You can also use attrs with a regular-expression search to get the href:

import re

soup.find('a', href=re.compile(r'[/]([a-z]|[A-Z])\w+')).attrs['href']
  1. First of all, use a different text editor that doesn't use curly quotes.

  2. Second, remove the text=True flag from the soup.find_all call.
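Putting those two fixes together, the question's snippet could look like this (straight quotes, no text=True, and a Python 3 print):

```python
from bs4 import BeautifulSoup

html = '''<div class="file-one">
  <a href="/file-one/additional" class="file-link">
    <h3 class="file-name">File One</h3>
  </a>
  <div class="location">
    Down
  </div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

link_text = ""
for a in soup.find_all('a', href=True):  # text=True removed
    link_text = a['href']

print("Link: " + link_text)  # Link: /file-one/additional
```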

You could solve this with just a couple lines of gazpacho:


from gazpacho import Soup

html = """\
<div class="file-one">
    <a href="/file-one/additional" class="file-link">
      <h3 class="file-name">File One</h3>
    </a>
    <div class="location">
      Down
    </div>
  </div>
"""

soup = Soup(html)
soup.find("a", {"class": "file-link"}).attrs['href']

Which would output:

'/file-one/additional'

A bit late to the party, but I had the same issue recently while scraping some recipes, and got mine printing cleanly by doing this:

from bs4 import BeautifulSoup
import requests

source = requests.get('url for website')
soup = BeautifulSoup(source.text, 'lxml')

for article in soup.find_all('article'):
    link = article.find('a', href=True)['href']
    print(link)
