
Scraping html for img src in Python bs4

I have the following HTML, which I am trying to parse in Python with BeautifulSoup (bs4):

  <div class="product w-100" data-pid="BBOMNLV1-36183" data-sid="BBOMNLWB">
        <div class="product-tile w-100">
            <!-- dwMarker="product" dwContentID="c4e921241579720afa4287dbf5" -->
            <div class="image-container">
                <a href="/pd/omn1s-low/BBOMNLV1-36183.html?dwvar_BBOMNLV1-36183_style=BBOMNLWB">
                    <picture>
                        <source type="image/jpeg" data-srcset="https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2$&amp;wid=440&amp;hei=440 1x, https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2$&amp;wid=880&amp;hei=880 2x" srcset="https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2$&amp;wid=440&amp;hei=440 1x, https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2$&amp;wid=880&amp;hei=880 2x"> <img class="tile-image ls-is-cached lazyloaded" src="https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2$&amp;wid=440&amp;hei=440" data-src="https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2$&amp;wid=440&amp;hei=440" data-srcset="https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2SM$&amp;wid=440&amp;hei=440 1x, https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2SM$&amp;wid=880&amp;hei=880 2x" alt="OMN1S Low" title="OMN1S Low, BBOMNLWB" itemprop="image" srcset="https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2SM$&amp;wid=440&amp;hei=440 1x, https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2SM$&amp;wid=880&amp;hei=880 2x"> </picture>
                </a>
                <div class="product-id d-none">BBOMNLV1-36183</div>
                <div class="wishlist-url d-none">/on/demandware.store/Sites-NBUS-Site/en_US/Wishlist-WishlistItemExists</div> <span class="wishListToggle">
                <a class="wishlistTile add-to-wish-list" href="/on/demandware.store/Sites-NBUS-Site/en_US/Wishlist-AddProduct" title="Wish list">
                    <span class="wishlist-inactive active">
    <svg role="img" class="icon svg-icon " width="24" height="24" aria-label="title">
    <title> </title>
    <desc> </desc>
    <use xlink:href="#wishlist-inactive"></use>
    </svg></span> </a>
                <a class="wishlistTile remove-from-wishlist" href="/on/demandware.store/Sites-NBUS-Site/en_US/Wishlist-RemoveProduct" title="Wish list"> <span class="wishlist-active ">
    <svg role="img" class="icon svg-icon " width="24" height="24" aria-label="title">
    <title> </title>
    <desc> </desc>
    <use xlink:href="#wishlist-active"></use>
    </svg></span> </a>
                </span>
            </div>
            <div class="tile-body">
                <div class="row pgp-grid pb-2 pr-2">
                    <div class="col-12 col-lg-7 pl-2 fw-search">
                        <div class="pdp-link"> <a class="link font-weight-bold pname text-underline no-underline-lg" href="/pd/omn1s-low/BBOMNLV1-36183.html?dwvar_BBOMNLV1-36183_style=BBOMNLWB">OMN1S Low</a> <span class="category-name font-body w-100 d-block pt-2">
            
                Men's Basketball
            
        </span> </div>
                    </div>
                    <div class="col-12 col-lg-5 pl-2 fw-search justify-content-lg-end text-right d-flex p-0 search-tile">
                        <div class="price"> <span class="price-value">
        
    
        
        <span class="sales font-body-large ">
            
            
            
            $139.99
    
    
        </span> </span>
                        </div>
                    </div>
                </div>
                <div class="pgp-reviews-wrapper" data-pageid="BBOMNLV1-36183" data-url="https://www.newbalance.com/on/demandware.store/Sites-NBUS-Site/en_US/ProductReviews-WriteReview?pid=BBOMNLV1-36183" id="BBOMNLV1-36183-pgp-reviews-wrapper-3">
                    <div class="p-w-r">
                        <section id="pr-category-snippets-BBOMNLV1-36183" class="pr-no-reviews" aria-labelledby="pr-UbCtutN-xQJECAE6zEJSy" data-testid="category-snippet">
                            <div class="pr-snippet pr-category-snippet">
                                <div class="pr-category-snippet__rating pr-category-snippet__item">
                                    <div class="pr-snippet-stars pr-snippet-stars-png ">
                                        <div aria-hidden="true" class="pr-rating-stars">
                                            <div class="pr-star-v4 pr-star-v4-0-filled"></div>
                                            <div class="pr-star-v4 pr-star-v4-0-filled"></div>
                                            <div class="pr-star-v4 pr-star-v4-0-filled"></div>
                                            <div class="pr-star-v4 pr-star-v4-0-filled"></div>
                                            <div class="pr-star-v4 pr-star-v4-0-filled"></div>
                                        </div>
                                        <div aria-hidden="true" class="pr-snippet-rating-decimal">0.0</div>
                                    </div><span id="pr-UbCtutN-xQJECAE6zEJSy" class="pr-accessible-text">Rated 0 out of 5 stars</span></div>
                                <div class="pr-category-snippet__total pr-category-snippet__item">No Reviews</div>
                            </div>
                        </section>
                    </div>
                </div>
            </div>
            <div class="badges"> <span class="sub-badges p-1 text-uppercase font-weight-bold">NEW</span> </div>
            <!-- END_dwmarker -->
        </div>
    </div>

I am trying to retrieve the shoe's picture by finding the img tag with the class "tile-image ls-is-cached lazyloaded" and then reading its data-src attribute to get the link to the photograph.

Here is my bs4 code, which does not seem to work:

import requests
from bs4 import BeautifulSoup
def queryNewBalance(uri):
    r = requests.get('https://www.newbalance.com/men/shoes/basketball/?prefn1=color&prefv1=Black%7CBlue&srule=null')
    soup = BeautifulSoup(r.content, 'html.parser')
    result = soup.find_all('div', class_='product w-100')
    for res in result:
        print("*******************************")
        print(res.find('img', class_='tile-image ls-is-cached lazyloaded')['href']) #Picture
        print("*******************************")
    print(f"\nFound total shoes: {len(result)}")

How do I fix my code to retrieve the image link?

It seems the attribute you are reading is href, but the <img> tag you are trying to scrape does not have that attribute. It has a src attribute, and that is where the link is. By the way, pass the long HTML you provided as the html parameter.

def queryNewBalance(html):
    #r = requests.get('https://www.newbalance.com/men/shoes/basketball/?prefn1=color&prefv1=Black%7CBlue&srule=null')
    soup = BeautifulSoup(html, 'html.parser')
    result = soup.find_all('div', class_='product w-100')
    for res in result:
        print("*******************************")
        print(res.find('img', class_='tile-image ls-is-cached lazyloaded')['src']) #Picture
        print("*******************************")
    print(f"\nFound total shoes: {len(result)}")



queryNewBalance(html)

Output:

*******************************
https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2$&wid=440&hei=440
*******************************

Found total shoes: 1
[Finished in 0.7s]
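
To see why ['href'] fails while ['src'] works, it can help to list what is actually on the tag. The snippet below is a minimal sketch on a trimmed-down stand-in for the product tile; the example.com URLs are placeholders, not the real image links.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the product tile in the question.
# The example.com URLs are placeholders (assumption), not real image links.
html = '''
<div class="product w-100">
  <img class="tile-image ls-is-cached lazyloaded"
       src="https://example.com/shoe.jpg"
       data-src="https://example.com/shoe.jpg"
       alt="OMN1S Low">
</div>
'''

soup = BeautifulSoup(html, "html.parser")
img = soup.find("img")

# .attrs is a plain dict holding every attribute on the tag.
attr_names = sorted(img.attrs)
print(attr_names)       # ['alt', 'class', 'data-src', 'src']

# img["href"] would raise KeyError because the tag has no href;
# .get() returns None for a missing attribute instead.
print(img.get("href"))  # None
print(img["src"])       # https://example.com/shoe.jpg
```

Using .get() instead of subscripting is a reasonable default when some tiles may lack the attribute, since the loop then keeps going instead of stopping on the first KeyError.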

--- URL ---

from bs4 import BeautifulSoup
import requests

def queryNewBalance():
    r = requests.get('https://www.newbalance.com/men/shoes/basketball/?prefn1=color&prefv1=Black%7CBlue&srule=null')
    soup = BeautifulSoup(r.content, 'html.parser')
    result = soup.find_all('div', class_='product w-100')
    for res in result:
        print("*******************************")
        print(res.find('img', class_='tile-image')["data-src"]) #Picture
        print("*******************************")
    print(f"\nFound total shoes: {len(result)}")



queryNewBalance()

Output:

*******************************
https://nb.scene7.com/is/image/NB/bbomnxbb_nb_02_i?$pdpflexf2$&wid=440&hei=440
*******************************
*******************************
https://nb.scene7.com/is/image/NB/bbomnlpl_nb_02_i?$pdpflexf2$&wid=440&hei=440
*******************************
*******************************
https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2$&wid=440&hei=440
*******************************
*******************************
https://nb.scene7.com/is/image/NB/bbomnlbr_nb_02_i_5a34b3da900d437a9a88?$pdpflexf2$&wid=440&hei=440
*******************************
*******************************
https://nb.scene7.com/is/image/NB/bbomnlfc_nb_02_i?$pdpflexf2$&wid=440&hei=440
*******************************
*******************************
https://nb.scene7.com/is/image/NB/bbomnlwt_nb_02_i?$pdpflexf2$&wid=440&hei=440
*******************************

Found total shoes: 6
[Finished in 2.9s]

PS: If you get more involved in web scraping and end up scraping lots of websites, especially the big ones, I suggest switching your parser to html5lib (pip install html5lib). It is more lenient and parses pages the way a browser does, at the cost of speed. I have had problems with html.parser where part of a page was missing from the soup object even though it was in the page. Anyway, your call. Good luck!
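
If you want to try html5lib without making it a hard dependency, one option is to fall back to another parser when it is not installed. This is only a sketch: make_soup is a hypothetical helper name, and the order of parsers is an assumption you can change.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, parsers=("html5lib", "lxml", "html.parser")):
    """Build a soup with the first parser in `parsers` that is installed.

    make_soup is a hypothetical helper, not part of bs4.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing.
            continue
    raise RuntimeError("none of the requested parsers is installed")

# html.parser ships with Python, so the default tuple always finds a parser.
soup = make_soup("<p>Unclosed paragraph")
print(soup.p.get_text())  # Unclosed paragraph
```

Different parsers can build slightly different trees from broken markup, so it is worth pinning one parser once you know which works for the site you are scraping.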

There is no class tile-image ls-is-cached lazyloaded in the page that requests downloads. To get the links of the images, you can use the CSS selector img[itemprop='image']:

import requests
from bs4 import BeautifulSoup

def queryNewBalance(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    result = soup.find_all("div", class_="product w-100")
    for res in result:
        print("*******************************")
        print(res.select_one("img[itemprop='image']")["data-src"])
    print(f"\nFound total shoes: {len(result)}")


queryNewBalance(
    "https://www.newbalance.com/men/shoes/basketball/?prefn1=color&prefv1=Black%7CBlue&srule=null"
)

Output:

*******************************
https://nb.scene7.com/is/image/NB/bbomnxbb_nb_02_i?$pdpflexf2$&wid=440&hei=440
*******************************
https://nb.scene7.com/is/image/NB/bbomnlpl_nb_02_i?$pdpflexf2$&wid=440&hei=440
*******************************
https://nb.scene7.com/is/image/NB/bbomnlwb_nb_02_i?$pdpflexf2$&wid=440&hei=440
*******************************
https://nb.scene7.com/is/image/NB/bbomnlbr_nb_02_i_5a34b3da900d437a9a88?$pdpflexf2$&wid=440&hei=440
*******************************
https://nb.scene7.com/is/image/NB/bbomnlfc_nb_02_i?$pdpflexf2$&wid=440&hei=440
*******************************
https://nb.scene7.com/is/image/NB/bbomnlwt_nb_02_i?$pdpflexf2$&wid=440&hei=440
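
A note on why the original class search finds nothing: when class_ is given a string with spaces, bs4 compares it against the exact class attribute value, so it only matches if the downloaded HTML carries those three classes in that order. Presumably ls-is-cached and lazyloaded are added by JavaScript in the browser, so they are absent from what requests receives. The sketch below uses made-up markup (the lazyload class and example.com URLs are assumptions) to show the difference, plus a fallback from data-src to src.

```python
from bs4 import BeautifulSoup

# Hypothetical server-side markup: class names and URLs are assumptions
# chosen to mimic a page before any lazy-loading script has run.
html = '''
<div class="product w-100">
  <img class="tile-image lazyload" itemprop="image"
       data-src="https://example.com/a.jpg">
</div>
<div class="product w-100">
  <img class="tile-image lazyload" itemprop="image"
       src="https://example.com/b.jpg">
</div>
<div class="product w-100"></div>
'''

def image_urls(markup):
    """Collect one image URL per product tile, skipping tiles without one."""
    soup = BeautifulSoup(markup, "html.parser")
    urls = []
    for tile in soup.select("div.product"):
        # img.tile-image matches on the single class, whatever else is present.
        img = tile.select_one("img.tile-image")
        if img is None:
            continue
        # Prefer data-src, fall back to src if it is missing.
        url = img.get("data-src") or img.get("src")
        if url:
            urls.append(url)
    return urls

soup = BeautifulSoup(html, "html.parser")
# The multi-class string must equal the whole class attribute, so this is None.
exact = soup.find("img", class_="tile-image ls-is-cached lazyloaded")
print(exact)  # None

urls = image_urls(html)
print(urls)   # ['https://example.com/a.jpg', 'https://example.com/b.jpg']
```

Checking for None before subscripting also avoids the TypeError you get when find() or select_one() matches nothing.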
