
How can I get all images when using Beautiful Soup?

How can I get an image when the HTML looks like this:

<div class="galery-images">
<div class="galery-images-slide" style="width: 760px;">
<div class="galery-item galery-item-selected" style="background-image: url(/images/photo/1/20130206/30323/136666697057736800.jpg);"></div>

I want to get 136666697057736800.jpg. I wrote:

 images = soup.select("div.galery-item")

And I get a list:

[<div class="galery-item galery-item-selected" style="background-image: url(/images/photo/1/20130206/30323/136666697057736800.jpg);"></div>, 
<div class="galery-item" style="background-image: url(/images/photo/1/20130206/30323/136013892671126300.jpg);" ></div>, 
<div class="galery-item" style="background-image: url(/images/photo/1/20130206/30323/136666699218876700.jpg);"></div>]

I don't understand: how can I get all the images?

Use a regex or a CSS parser to extract the URL, prepend the host to it, and then download the image like this:

import urllib.request

# Python 3; in Python 2 the same function is urllib.urlretrieve
urllib.request.urlretrieve("https://www.google.com/images/srpr/logo11w.png", "google.png")

To make your life easier, you should use a regex:

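import re

urls = []
# Compile once: capture whatever sits inside url(...) in the style attribute
pattern = re.compile(r'.*background-image:\s*url\((.*)\);')

for slide in soup.find_all('div', attrs={'class': 'galery-images-slide'}):
    # Check every child div, not just the first one
    for item in slide.find_all('div'):
        match = pattern.match(item.get('style', ''))
        if match:
            urls.append(match.group(1))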

This works by finding the divs inside the parent div (the one with the class 'galery-images-slide'), then using a regex to pull the URL out of the background-image declaration in each child div's style attribute.

So, for the snippet in your question, this will output (the u prefix is how Python 2 displays unicode strings):

[u'/images/photo/1/20130206/30323/136666697057736800.jpg']

Now, to download an image, prepend the site address to the URL and fetch the result.
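
A minimal sketch of that last step, assuming Python 3. The question does not name the site, so http://example.com is only a placeholder host, and the helper names absolute_url, local_filename and download are made up for illustration. urljoin, urlparse and urlretrieve come from the standard library.

```python
import posixpath
from urllib.parse import urljoin, urlparse
from urllib.request import urlretrieve

# Assumption: placeholder host; replace with the site the page came from.
BASE_URL = "http://example.com"


def absolute_url(relative_url, base_url=BASE_URL):
    """Join the site address and the path taken from the style attribute."""
    return urljoin(base_url, relative_url)


def local_filename(url):
    """Return the last path component, e.g. '136666697057736800.jpg'."""
    return posixpath.basename(urlparse(url).path)


def download(relative_url, base_url=BASE_URL):
    """Fetch one image into the current directory and return its file name."""
    url = absolute_url(relative_url, base_url)
    filename = local_filename(url)
    urlretrieve(url, filename)  # performs a network request
    return filename
```

Calling download(u) for each u in urls should then save every image under its original file name.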

NOTE:

This requires Python's regex module (re) in addition to BeautifulSoup. The regex I used is quite naive, but you can adjust it to suit your needs.
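
As one possible adjustment, the sketch below uses a non-greedy pattern with re.search, tolerates optional quotes and spaces inside url(...), and does not require the trailing semicolon. This is only a guess at what a less naive version might look like, not a full CSS parser, and the name url_from_style is made up for illustration.

```python
import re

# Matches background-image: url(...) with optional quotes around the URL.
BG_URL = re.compile(
    r"""background-image\s*:\s*url\(\s*['"]?(.*?)['"]?\s*\)""",
    re.IGNORECASE,
)


def url_from_style(style):
    """Return the URL inside background-image: url(...), or None if absent."""
    match = BG_URL.search(style or "")
    return match.group(1) if match else None
```

With the selection from the question, something like [url_from_style(div.get("style")) for div in soup.select("div.galery-item")] should give one URL per image div.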
