Python script to download all images from a website to a specified folder with BeautifulSoup

I found this post and wanted to modify the script slightly to download the images to a specific folder. My edited file looks like this:

import re
import requests
from bs4 import BeautifulSoup
import os

site = 'http://pixabay.com'
directory = "pixabay/" #Relative to script location

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]

for url in urls:
    #print(url)
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)

    with open(os.path.join(directory, filename.group(1)), 'wb') as f:
        if 'http' not in url:
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)

This seems to work fine for pixabay, but if I try a different site like imgur or heroimages, it doesn't seem to work. If I replace the site declaration with

site = 'http://heroimages.com/portfolio'

nothing is downloaded. The print statement (when uncommented) doesn't print anything, so I'm guessing it's not finding any image tags? I'm not sure.
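One quick way to check that guess is to inspect what requests actually received before parsing. A small diagnostic, reusing the response and soup objects from the script above:

print(response.status_code)         # HTTP status of the initial request
print(len(response.text))           # size of the raw HTML that requests received
print(len(soup.find_all('img')))    # how many <img> tags that HTML contains

If the last number is 0, the gallery is most likely injected by JavaScript after the page loads, which plain requests never executes.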

On the other hand, if I replace site with

site = 'http://imgur.com'

I sometimes get a

AttributeError: 'NoneType' object has no attribute 'group'

or, if the images do download, I can't even open them because I get an "Unsupported file format" error.
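A guard against both symptoms might look like the sketch below (only a sketch, reusing the loop above): some src values on a page like imgur may be data: URIs, protocol-relative URLs, or paths without a plain .jpg/.gif/.png suffix, in which case re.search returns None; and a response whose Content-Type is not an image means an HTML error page would get written into a .jpg file, which can produce exactly this kind of "unsupported format" result.

for url in urls:
    match = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    if match is None:
        continue  # src is a data: URI or has no plain image suffix -- skip it
    if 'http' not in url:
        url = '{}{}'.format(site, url)
    response = requests.get(url)
    if not response.headers.get('Content-Type', '').startswith('image/'):
        continue  # the server returned something other than image bytes
    with open(os.path.join(directory, match.group(1)), 'wb') as f:
        f.write(response.content)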

Also worth noting, right now the script requires the folder specified by directory to exist. I plan on changing it in the future so that the script creates the directory if it does not already exist.

You need to wait for JavaScript to load the page; I think that is the problem. If you want, you can use Selenium:

# your imports
...
from selenium import webdriver

site = 'http://heroimages.com/portfolio'
directory = "pixabay/" #Relative to script location

driver = webdriver.Chrome('/usr/local/bin/chromedriver')

driver.get(site)

soup = BeautifulSoup(driver.page_source, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]

for url in urls:
    print(url)
    # your code
    ...

Output

# from `http://heroimages.com/portfolio`
https://ssl.c.photoshelter.com/img-get2/I00004gQScPHUm5I/sec=wdtsdfoeflwefms1440ed201806304risXP3bS2xDXil/fill=350x233/361-03112.jpg
https://ssl.c.photoshelter.com/img-get2/I0000h9YWTlnCxXY/sec=wdtsdfoeflwefms1440ed20180630Nq90zU4qg6ukT5K/fill=350x233/378-01449.jpg
https://ssl.c.photoshelter.com/img-get2/I0000HNg_JtT_QrQ/sec=wdtsdfoeflwefms1440ed201806304CZwwO1L641maB9/fill=350x233/238-1027-hro-3552.jpg
https://ssl.c.photoshelter.com/img-get2/I00000LWwYspqXuk/sec=wdtsdfoeflwefms1440ed201806302BP_NaDsGb7udq0/fill=350x233/258-02351.jpg
# and many other images

Also, here is the part of the script that checks whether the directory exists and creates it if it does not:

...
directory = os.path.dirname(os.path.realpath(__file__)) + '/pixabay/'    
if not os.path.exists(directory):
    os.makedirs(directory)
...                  
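Putting both pieces together, a rough end-to-end sketch (the chromedriver path, the site URL and the pixabay/ folder name are just the values already used above; adjust them to your setup):

import re
import os
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

site = 'http://heroimages.com/portfolio'
directory = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'pixabay')

# create the target folder if it does not exist yet
if not os.path.exists(directory):
    os.makedirs(directory)

# let the browser execute the page's JavaScript, then hand the rendered HTML to BeautifulSoup
driver = webdriver.Chrome('/usr/local/bin/chromedriver')
driver.get(site)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

for img in soup.find_all('img'):
    url = img.get('src', '')
    match = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    if match is None:
        continue  # skip sources that are not plain image files
    if 'http' not in url:
        url = '{}{}'.format(site, url)
    response = requests.get(url)
    with open(os.path.join(directory, match.group(1)), 'wb') as f:
        f.write(response.content)

driver.quit() closes the browser as soon as the rendered HTML has been captured, so the downloads themselves still go through requests.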
