Python BeautifulSoup: getting the href attribute of an a tag

Question

I have a problem getting the href attribute. This is the HTML file:

<div class="list-product with-sidebar">
 <a class="frame-item" href="./produk-a.html" target="_blank" title="Produk A">

 </a>
 <a class="frame-item" href="./produk-b.html" target="_blank" title="Produk B">

 </a>
</div>

Here is my code:

    def get_category_item_list(category):
        base_url = 'https://www.website.com/'
        res = session.get(base_url+category)
        res = BeautifulSoup(res.content, 'html.parser')
        all_title = res.findAll('a', attrs={'class':'frame-item'})
        data_titles = []
        for title in all_title:
            product_link = title.get('a')['href']
            data_titles.append(product_link)
        return data_titles

What I want to get is the href links, like this:

produk-a.html
produk-b.html

When I try to run it, I cannot get the link from href; I get this error:

TypeError: 'NoneType' object is not subscriptable

Answer 1

I believe that your problem lies in this line:

product_link = title.get('a')['href']

You already have a list of "a" elements, so you probably just need:

product_link = title['href']
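
To make the cause of the error concrete, here is a minimal sketch that reproduces it on the HTML from the question (inlined as a string, which is an assumption made so the example runs without a network request). Tag.get() looks up an attribute of the tag, and an "a" tag has no attribute named "a", so it returns None; indexing None with ['href'] then raises the TypeError.

```python
from bs4 import BeautifulSoup

# HTML from the question, inlined so the sketch needs no network access.
html = """<div class="list-product with-sidebar">
 <a class="frame-item" href="./produk-a.html" target="_blank" title="Produk A"></a>
 <a class="frame-item" href="./produk-b.html" target="_blank" title="Produk B"></a>
</div>"""

soup = BeautifulSoup(html, 'html.parser')
anchors = soup.find_all('a', attrs={'class': 'frame-item'})

# Tag.get() reads an HTML attribute; <a> has no attribute called "a",
# so this is None, and None['href'] is what raised the TypeError.
missing = anchors[0].get('a')

# Indexing the tag itself reads the href attribute directly.
links = [a['href'] for a in anchors]

print(missing)
print(links)
```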

Answer 2

You didn't share the website, so one possible problem is that the site blocks User-Agents that look like a bot (such as the default User-Agent sent by requests). Debugging may help here: you can print the content of the page with resp.content or resp.text.
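
If the User-Agent does turn out to be the issue, one option is to set the header on the session before making requests. The following is only a sketch: build_session is a hypothetical helper name, and the User-Agent string is an illustrative value, not one known to work for the site in question. The actual request is left commented out because it needs network access.

```python
import requests


def build_session(user_agent):
    """Return a requests.Session that sends the given User-Agent header."""
    session = requests.Session()
    session.headers.update({'User-Agent': user_agent})
    return session


# Hypothetical browser-like value, chosen only for illustration.
session = build_session('Mozilla/5.0 (X11; Linux x86_64)')

# Debugging the response (makes a real request, so it is commented out):
# resp = session.get('https://www.website.com/')
# print(resp.status_code)
# print(resp.text[:500])  # check whether the expected <a> tags are present
```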

I created an HTML file called index.html, then read the file and scraped its content. I changed the code a little and it seems to work fine.

soup.find returns a <class 'bs4.element.Tag'>, so you can access one of its attributes by indexing the tag, for example tag['href'].

from bs4 import BeautifulSoup

with open('index.html') as f:
    html_content = f.read()

soup = BeautifulSoup(html_content, 'html.parser')
data_titles = []
for a in soup.find('div', class_='list-product with-sidebar').find_all('a'):
    data_titles.append(a['href'].split('/')[1])
print(data_titles)
# ['produk-a.html', 'produk-b.html']

index.html

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="UTF-8" />
        <meta name="viewport" content="width=device-width, initial-scale=1.0" />
        <title>Document</title>
    </head>
    <body>
        <div class="list-product with-sidebar">
            <a
                class="frame-item"
                href="./produk-a.html"
                target="_blank"
                title="Produk A"
            >
            </a>
            <a
                class="frame-item"
                href="./produk-b.html"
                target="_blank"
                title="Produk B"
            >
            </a>
        </div>
    </body>
</html>
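
As a variation on the approach above, the same anchors can be selected with a CSS selector, and Tag.get('href') can be used instead of indexing so that an anchor without an href is skipped rather than raising a KeyError. This is a sketch under assumptions: the middle anchor without an href is added here purely to illustrate that case and is not part of the original page.

```python
from bs4 import BeautifulSoup

# The second <a> has no href; it is a made-up case for illustration.
html = """<div class="list-product with-sidebar">
 <a class="frame-item" href="./produk-a.html" title="Produk A"></a>
 <a class="frame-item" title="No link"></a>
 <a class="frame-item" href="./produk-b.html" title="Produk B"></a>
</div>"""

soup = BeautifulSoup(html, 'html.parser')

hrefs = []
for a in soup.select('div.list-product a.frame-item'):
    href = a.get('href')  # None when the attribute is absent
    if href is not None:
        hrefs.append(href)

print(hrefs)
```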

Answer 3

For your exact output:

you are already iterating over anchor tags
you would need to split by "/" and choose the last element

from bs4 import BeautifulSoup


html = """<div class="list-product with-sidebar">
 <a class="frame-item" href="./produk-a.html" target="_blank" title="Produk A">

 </a>
 <a class="frame-item" href="./produk-b.html" target="_blank" title="Produk B">

 </a>
</div>"""

res = BeautifulSoup(html, 'html.parser')

for a in res.findAll('a', attrs={'class':'frame-item'}):
    print(a["href"].split("/")[-1])

Output:

produk-a.html
produk-b.html
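
Splitting on "/" works for these links. If the relative links also need to be turned into full URLs, the standard library can do both steps. The sketch below assumes the base URL from the question (https://www.website.com/) and uses the href values already extracted; urljoin resolves the leading "./", and posixpath.basename takes the last path component.

```python
import posixpath
from urllib.parse import urljoin, urlparse

base_url = 'https://www.website.com/'  # base URL taken from the question
hrefs = ['./produk-a.html', './produk-b.html']

# Resolve each relative href against the base URL.
absolute = [urljoin(base_url, h) for h in hrefs]

# Keep only the final path component of each resolved URL.
names = [posixpath.basename(urlparse(u).path) for u in absolute]

print(absolute)
print(names)
```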

Answer 1: score 4, accepted, 2020-08-17 10:40:00
Answer 2: score 2, 2020-08-17 10:44:02
Answer 3: score 2, 2020-08-17 10:47:42
