Python script to download all images from a website to a specified folder with BeautifulSoup

I found this post and wanted to modify the script slightly to download the images to a specific folder. My edited file looks like this:

import re
import requests
from bs4 import BeautifulSoup
import os

site = 'http://pixabay.com'
directory = "pixabay/" #Relative to script location

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]

for url in urls:
    #print(url)
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)

    with open(os.path.join(directory, filename.group(1)), 'wb') as f:
        if 'http' not in url:
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)

This seems to work fine for pixabay, but if I try a different site like imgur or heroimages, it doesn't seem to work. If I replace the site declaration with

site = 'http://heroimages.com/portfolio'

nothing is downloaded. The print statement (when uncommented) doesn't print anything, so I'm guessing it's not finding any image tags? I'm not sure.
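One quick way to check that guess is to inspect what requests actually received before parsing. A small diagnostic, reusing the response and soup objects from the script above:

print(response.status_code)         # HTTP status of the initial request
print(len(response.text))           # size of the raw HTML that requests received
print(len(soup.find_all('img')))    # how many <img> tags that HTML contains

If the last number is 0, the gallery is most likely injected by JavaScript after the page loads, which plain requests never executes.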

On the other hand, if I replace site with

site = 'http://imgur.com'

I sometimes get a

AttributeError: 'NoneType' object has no attribute 'group'

or, if the images do download, I can't even open them because I get an "Unsupported file format" error.
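A guard against both symptoms might look like the sketch below (only a sketch, reusing the loop above): some src values on a page like imgur may be data: URIs, protocol-relative URLs, or paths without a plain .jpg/.gif/.png suffix, in which case re.search returns None; and a response whose Content-Type is not an image means an HTML error page would get written into a .jpg file, which can produce exactly this kind of "unsupported format" result.

for url in urls:
    match = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    if match is None:
        continue  # src is a data: URI or has no plain image suffix -- skip it
    if 'http' not in url:
        url = '{}{}'.format(site, url)
    response = requests.get(url)
    if not response.headers.get('Content-Type', '').startswith('image/'):
        continue  # the server returned something other than image bytes
    with open(os.path.join(directory, match.group(1)), 'wb') as f:
        f.write(response.content)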

Also worth noting, right now the script requires the folder specified by directory to exist. I plan on changing it in the future so that the script creates the directory if it does not already exist.

You need to wait for JavaScript to load the page; I think that is the problem. If you want, you can use Selenium:

# your imports
...
from selenium import webdriver

site = 'http://heroimages.com/portfolio'
directory = "pixabay/" #Relative to script location

driver = webdriver.Chrome('/usr/local/bin/chromedriver')

driver.get(site)

soup = BeautifulSoup(driver.page_source, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]

for url in urls:
    print(url)
    # your code
    ...

Output

# from `http://heroimages.com/portfolio`
https://ssl.c.photoshelter.com/img-get2/I00004gQScPHUm5I/sec=wdtsdfoeflwefms1440ed201806304risXP3bS2xDXil/fill=350x233/361-03112.jpg
https://ssl.c.photoshelter.com/img-get2/I0000h9YWTlnCxXY/sec=wdtsdfoeflwefms1440ed20180630Nq90zU4qg6ukT5K/fill=350x233/378-01449.jpg
https://ssl.c.photoshelter.com/img-get2/I0000HNg_JtT_QrQ/sec=wdtsdfoeflwefms1440ed201806304CZwwO1L641maB9/fill=350x233/238-1027-hro-3552.jpg
https://ssl.c.photoshelter.com/img-get2/I00000LWwYspqXuk/sec=wdtsdfoeflwefms1440ed201806302BP_NaDsGb7udq0/fill=350x233/258-02351.jpg
# and many other images

Also, here is the part of the script that checks whether the directory exists and creates it if it does not:

...
directory = os.path.dirname(os.path.realpath(__file__)) + '/pixabay/'    
if not os.path.exists(directory):
    os.makedirs(directory)
...                  
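Putting both pieces together, a rough end-to-end sketch (the chromedriver path, the site URL and the pixabay/ folder name are just the values already used above; adjust them to your setup):

import re
import os
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

site = 'http://heroimages.com/portfolio'
directory = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'pixabay')

# create the target folder if it does not exist yet
if not os.path.exists(directory):
    os.makedirs(directory)

# let the browser execute the page's JavaScript, then hand the rendered HTML to BeautifulSoup
driver = webdriver.Chrome('/usr/local/bin/chromedriver')
driver.get(site)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

for img in soup.find_all('img'):
    url = img.get('src', '')
    match = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    if match is None:
        continue  # skip sources that are not plain image files
    if 'http' not in url:
        url = '{}{}'.format(site, url)
    response = requests.get(url)
    with open(os.path.join(directory, match.group(1)), 'wb') as f:
        f.write(response.content)

driver.quit() closes the browser as soon as the rendered HTML has been captured, so the downloads themselves still go through requests.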
