
Converting relative paths to absolute paths for images scraped from websites in Python

I've scraped a website for images that I then want to download. To download them I need the absolute path of each image, but this is all I've managed to scrape:

2001.JPG big.jpg pics.gif gchq.jpg

All of these images are stored in the variable images. I'm looking for a single function that can find all of the absolute paths at once and store them in a variable.

This is the code I use to scrape the images:

images = re.findall(r'src=[\"|\']([^\"|\']+)[\"|\']', webpage.decode())

(I've had a look at various similar questions on here, but none seem to handle multiple images at once.)

If anyone could point me in the right direction, that would be great. Any suggestions for downloading the images are welcome as well.

With BeautifulSoup and urllib you should be able to collect the images in a web page, resolve each src against the page URL with urljoin, and then iterate over them and download each one. Python 2 code:

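from urllib import urlretrieve
import urlparse
from bs4 import BeautifulSoup
import urllib2

url = "<your_url>"
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(urllib2.urlopen(url), "html.parser")
# Only match <img> tags that actually have a src attribute
for img in soup.select('img[src]'):
    # Resolve the (possibly relative) src against the page URL
    img_url = urlparse.urljoin(url, img['src'])
    # Save under the last path segment of the src
    file_name = img['src'].split('/')[-1]
    urlretrieve(img_url, file_name)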

Python 3 compatible code:

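from bs4 import BeautifulSoup
from urllib.request import urlopen, urlretrieve
from urllib.parse import urljoin

url = "<url>"
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(urlopen(url), "html.parser")

# src=True skips <img> tags that have no src attribute
for img in soup.find_all('img', src=True):
    # Resolve the (possibly relative) src against the page URL
    img_url = urljoin(url, img['src'])
    # Save under the last path segment of the src
    file_name = img['src'].split('/')[-1]
    urlretrieve(img_url, file_name)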
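
If you already have the relative paths in a list (as in the question) and only need to turn all of them into absolute URLs in one go, a small helper around urljoin may be enough. This is a minimal sketch, not the only way to do it: the function names to_absolute_urls and file_name_for and the page URL http://example.com/gallery/index.html are made up for illustration, and only the four file names come from the question.

```python
from posixpath import basename
from urllib.parse import urljoin, urlsplit


def to_absolute_urls(page_url, srcs):
    """Resolve every scraped src value against the URL of the page it came from."""
    return [urljoin(page_url, src) for src in srcs]


def file_name_for(image_url):
    """Take the last path segment, ignoring any query string or fragment."""
    return basename(urlsplit(image_url).path)


# Hypothetical page URL; the file names are the ones from the question.
page_url = "http://example.com/gallery/index.html"
images = ["2001.JPG", "big.jpg", "pics.gif", "gchq.jpg"]

absolute_urls = to_absolute_urls(page_url, images)
for image_url in absolute_urls:
    print(image_url, "->", file_name_for(image_url))
```

urljoin leaves values that are already absolute untouched, so the same call handles a mix of relative and absolute src values. Note that page_url has to be the address the HTML was actually fetched from, otherwise the relative paths resolve against the wrong directory.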
