BeautifulSoup: extract text from anchor tag

I want to extract:

  • the src of the image tag, and
  • the text of the anchor tag inside the div with class data

I successfully managed to extract the img src, but I am having trouble extracting the text from the anchor tag.

<a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1343628292&amp;sr=1-1&amp;keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> 

Here is the link for the entire HTML page.

Here is my code:

for div in soup.findAll('div', attrs={'class':'image'}):
    print "\n"
    for data in div.findNextSibling('div', attrs={'class':'data'}):
        for a in data.findAll('a', attrs={'class':'title'}):
            print a.text
    for img in div.findAll('img'):
        print img['src']

What I am trying to do is extract the image src (link) and the title inside the div class=data. For example:

 <a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1343628292&amp;sr=1-1&amp;keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> 

should extract:

Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)

To get the href out of an anchor tag, use tag.get("href"); to get the img src, use tag.img.get("src").

Example, using this data:

data = """
            <div class="image">
            <a href="http://www.example.com/eg1">Content1<img src="http://image.example.com/img1.jpg" /></a>
            </div>
            <div class="image">
            <a href="http://www.example.com/eg2">Content2<img src="http://image.example.com/img2.jpg" /> </a>
            </div>
        """

Get the links and texts:

import requests
from bs4 import BeautifulSoup

def get_soup(url):
    response = requests.get(url)
    if response.ok:
        return BeautifulSoup(response.text, features="html.parser")

def get_links(soup):
    links = []
    for tag in soup.findAll("a", href=True):
        if img := tag.img:
            img = img.get("src")
        links.append(dict(url=tag.get("href"), text=tag.text, img=img))
    return links

# soup = get_soup('http://www.example.com')  # requests needs the URL scheme
soup = BeautifulSoup(data, features="html.parser")
links = get_links(soup)

Outputs:

[{'url': 'http://www.example.com/eg1', 'text': 'Content1', 'img': 'http://image.example.com/img1.jpg'},
{'url': 'http://www.example.com/eg2', 'text': 'Content2 ', 'img': 'http://image.example.com/img2.jpg'}]
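
Applied to the layout described in the question, the same idea might look like the sketch below. The markup is an assumption, since the original page is not reproduced here: each div class="image" is taken to be followed by a sibling div class="data" that contains the a class="title". The sample HTML, its URLs, and the function name extract_items are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the structure described in the question.
html = """
<div class="result">
  <div class="image"><img src="http://image.example.com/camera.jpg" /></div>
  <div class="data"><h3><a class="title" href="http://www.example.com/camera">Example Camera (Red)</a></h3></div>
</div>
"""

def extract_items(soup):
    """Pair each image div with the title found in its sibling data div."""
    items = []
    for image_div in soup.find_all("div", class_="image"):
        img = image_div.find("img")
        # find_next_sibling returns a single Tag (or None), so it is not looped over.
        data_div = image_div.find_next_sibling("div", class_="data")
        title = data_div.find("a", class_="title") if data_div else None
        items.append({
            "src": img.get("src") if img else None,
            "title": title.get_text(strip=True) if title else None,
        })
    return items

soup = BeautifulSoup(html, "html.parser")
items = extract_items(soup)
print(items)
```

Looking up the sibling once per image div, rather than iterating over it, keeps each src paired with its own title and avoids calling tag methods on the text nodes inside the data div.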

soup.find('a', attrs={'class':'class_name'}).text
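
A slightly more defensive variant of the one-liner above, as a sketch: find returns None when nothing matches, so accessing .text directly would raise AttributeError. The sample HTML and the helper name anchor_text are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; the real page is assumed to have a similar anchor.
html = '<div class="data"><a class="title" href="http://www.example.com/item">Example title</a></div>'

def anchor_text(soup, class_name):
    """Return the text of the first <a> with the given class, or None if absent."""
    a = soup.find("a", class_=class_name)
    return a.get_text(strip=True) if a is not None else None

soup = BeautifulSoup(html, "html.parser")
print(anchor_text(soup, "title"))    # Example title
print(anchor_text(soup, "missing"))  # None
```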

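Notice: The technical posts on this site are licensed under CC BY-SA 4.0. If you republish them, please cite this site's URL or the original source. For any questions, contact: yoyou2525@163.com.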

 