BeautifulSoup: Extracting text from an anchor tag

Question

I want to extract:

the src (link) of the image tag, and
the text of the anchor tag that is inside the div with class data

I have managed to extract the img src, but I am having trouble extracting the text from the anchor tag.

<a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1343628292&amp;sr=1-1&amp;keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>

Here is the link for the entire HTML page.

Here is my code:

for div in soup.findAll('div', attrs={'class':'image'}):
    print "\n"
    for data in div.findNextSibling('div', attrs={'class':'data'}):
        for a in data.findAll('a', attrs={'class':'title'}):
            print a.text
    for img in div.findAll('img'):
        print img['src']

What I am trying to do is extract the image src (link) and the title inside the div class=data, so for example:

 <a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1343628292&amp;sr=1-1&amp;keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>

should extract:

Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)

Answer 5

To get the href out of an anchor tag, use tag.get("href"); to get the img src, use tag.img.get("src").

Example, using this data:

data = """
            <div class="image">
            <a href="http://www.example.com/eg1">Content1<img src="http://image.example.com/img1.jpg" /></a>
            </div>
            <div class="image">
            <a href="http://www.example.com/eg2">Content2<img src="http://image.example.com/img2.jpg" /> </a>
            </div>
        """

Get the links and texts:

import requests
from bs4 import BeautifulSoup

def get_soup(url):
    response = requests.get(url)
    if response.ok:
        return BeautifulSoup(response.text, features="html.parser")

def get_links(soup):
    links = []
    for tag in soup.findAll("a", href=True):
        if img := tag.img:
            img = img.get("src")
        links.append(dict(url=tag.get("href"), text=tag.text, img=img))
    return links

# soup = get_soup("http://www.example.com")  # requests needs a URL with a scheme
soup = BeautifulSoup(data, features="html.parser")
links = get_links(soup)

Outputs:

[{'url': 'http://www.example.com/eg1', 'text': 'Content1', 'img': 'http://image.example.com/img1.jpg'},
{'url': 'http://www.example.com/eg2', 'text': 'Content2 ', 'img': 'http://image.example.com/img2.jpg'}]
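
The trailing space in 'Content2 ' comes from the whitespace between the img tag and the closing a tag in the sample data. As a small sketch that is not part of the original answer, get_text(strip=True) can be used instead of .text to trim it. The markup below is a reduced copy of the second anchor from the data above, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Reduced copy of the second anchor from the sample data above.
html = (
    '<a href="http://www.example.com/eg2">Content2'
    '<img src="http://image.example.com/img2.jpg" /> </a>'
)
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("a", href=True)

raw_text = tag.text                    # keeps the whitespace after the img tag
clean_text = tag.get_text(strip=True)  # strips each text fragment, drops empty ones

print(repr(raw_text), repr(clean_text))
```

Whether to strip depends on the page: for titles and labels it is usually what you want, but it also removes whitespace that separates adjacent text fragments.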

Answer 8

soup.find('a', attrs={'class':'class_name'}).text
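
Note that find returns None when nothing matches, so calling .text on the result directly raises AttributeError on a page without such an anchor. The following is a minimal sketch of how the one-liner might be applied to the layout described in the question, assuming each div with class image is followed by a sibling div with class data that contains the a.title anchor. The HTML, the extract_items helper, and the example.com URLs are hypothetical stand-ins, since the real page is not shown here. It uses the bs4 spellings find_all and find_next_sibling; find_next_sibling returns a single tag (or None), so it is looked up once rather than looped over.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question; the real page may differ.
html = """
<div class="image"><a href="#"><img src="http://image.example.com/nikon.jpg" /></a></div>
<div class="data"><h3><a class="title" href="http://www.example.com/nikon">Nikon COOLPIX L26</a></h3></div>
"""

def extract_items(soup):
    """Pair each image src with the title text from the following data div."""
    items = []
    for div in soup.find_all("div", class_="image"):
        img = div.find("img")
        # A single Tag or None, not a list of matches.
        data = div.find_next_sibling("div", class_="data")
        title = data.find("a", class_="title") if data else None
        items.append({
            "src": img.get("src") if img else None,
            "title": title.get_text(strip=True) if title else None,
        })
    return items

soup = BeautifulSoup(html, "html.parser")
items = extract_items(soup)
print(items)
```

Missing pieces come back as None instead of raising, which makes it easier to spot entries where the page structure differs from the assumption.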

8 answers

Answer 1
68 2012-07-30 12:00:42

Answer 2
27 2015-12-01 11:37:00

Answer 3
7 2012-07-30 13:19:15

Answer 4
3 accepted 2012-07-30 21:40:39

Answer 5
2 2021-05-07 11:07:52

Answer 6
1 2012-07-30 17:57:38

Answer 7
0 2020-12-16 12:21:15

Answer 8
0 2022-09-24 16:56:42
