Python Selenium download images (JPEG, PNG) or PDF using ChromeDriver

I have a Selenium script in Python (using ChromeDriver on Windows) that fetches the download links of various attachments (of different file types) from a page and then opens these links to download the attachments. This works fine for file types that Chrome cannot preview, since they are downloaded by default. But images (JPEG, PNG) and PDFs are previewed by default and hence are not downloaded automatically.

The ChromeDriver options I am currently using (they work for non-previewable files):

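from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
prefs = {'download.default_directory' : 'custom_download_dir'}
chrome_options.add_experimental_option('prefs', prefs)
# Selenium 3 style call; Selenium 4 expects options=... and a Service object for the driver path
driver = webdriver.Chrome("./chromedriver.exe", chrome_options=chrome_options)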

This downloads the files to 'custom_download_dir' without issues. But the previewable files are only previewed in the Chrome window and not downloaded.

Are there any ChromeDriver settings that can disable this preview behavior and download all files directly, irrespective of their extensions?

If not, can this be done using Firefox, for instance?

Instead of relying on specific browser/driver options, I would implement a more generic solution that uses the image URL to perform the download.

You can get the image URL using code like this:

driver.find_element_by_id("your-image-id").get_attribute("src")

And then I would download the image using, for example, urllib.

Here's some pseudo-code for Python 2:

import urllib

url = driver.find_element_by_id("your-image-id").get_attribute("src")
urllib.urlretrieve(url, "local-filename.jpg")

Here's the same for Python 3:

import urllib.request

url = driver.find_element_by_id("your-image-id").get_attribute("src")
urllib.request.urlretrieve(url, "local-filename.jpg")
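
If the attachment links are only reachable from a logged-in session, a plain urllib call will not carry the browser's cookies and may receive a login page instead of the file. The following is a minimal sketch of one way to forward them, assuming the site authenticates via cookies and that `driver` is the Selenium session used above. The helper names `cookie_header` and `download_with_session` are hypothetical, not part of Selenium or urllib.

```python
import urllib.request


def cookie_header(cookies):
    """Build a Cookie header value from the list returned by driver.get_cookies()."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


def download_with_session(url, filename, cookies, user_agent=None):
    """Fetch url with the browser's cookies and write the raw bytes to filename."""
    headers = {}
    if cookies:
        headers["Cookie"] = cookie_header(cookies)
    if user_agent:
        headers["User-Agent"] = user_agent
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response, open(filename, "wb") as out:
        out.write(response.read())


# Usage (assumes a live Selenium session named `driver`):
# url = driver.find_element_by_id("your-image-id").get_attribute("src")
# download_with_session(
#     url,
#     "local-filename.jpg",
#     driver.get_cookies(),
#     user_agent=driver.execute_script("return navigator.userAgent"),
# )
```

Note that `driver.get_cookies()` only returns cookies visible to the current page, so this sketch is only appropriate when the file is served from the same site the browser is currently on.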

Edit after the comment: here is another example of how to download a file once you know its URL:

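import requests
from PIL import Image
from io import BytesIO

image_name = 'image.jpg'
url = 'http://example.com/image.jpg'

r = requests.get(url)
r.raise_for_status()  # stop here if the server returned an error status

# r.content is bytes, so it has to be wrapped in BytesIO (StringIO only accepts str)
i = Image.open(BytesIO(r.content))
i.save(image_name)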
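
The PIL round trip above only works for images. For PDFs and other attachments, the downloaded bytes can be written to disk unchanged. Below is a minimal sketch, assuming the bytes and the Content-Type header have already been fetched (for example with `requests.get`). The helper name `save_bytes` and the fallback file name `download` are hypothetical choices for illustration.

```python
import os
from mimetypes import guess_extension
from urllib.parse import urlsplit


def save_bytes(url, content, content_type=None, dest_dir="."):
    """Write already-downloaded bytes to dest_dir and return the file path."""
    # Take the last path segment of the URL, ignoring query string and fragment
    name = os.path.basename(urlsplit(url).path) or "download"
    # If the URL has no extension, try to derive one from the Content-Type header
    if not os.path.splitext(name)[1] and content_type:
        ext = guess_extension(content_type.split(";")[0].strip())
        if ext:
            name += ext
    os.makedirs(dest_dir, exist_ok=True)
    path = os.path.join(dest_dir, name)
    with open(path, "wb") as f:
        f.write(content)
    return path
```

With requests, this could be called as `save_bytes(url, r.content, r.headers.get("Content-Type"), "custom_download_dir")` after `r = requests.get(url)`.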

With the selenium-wire library, it is possible to download images via ChromeDriver.

I have defined the following function to go through the captured requests and save each response body to a file when appropriate.

import os
from mimetypes import guess_extension
from seleniumwire import webdriver

def download_assets(requests, asset_dir="temp", default_fname="untitled", exts=[".png", ".jpeg", ".jpg", ".svg", ".gif", ".pdf", ".ico"]):
    asset_list = {}
    for req_idx, request in enumerate(requests):
        # Requests that never received a response have nothing to save
        if request.response is None:
            continue
        # request.response.body is the raw response body in bytes
        content_type = request.response.headers.get('Content-Type') or ""
        ext = guess_extension(content_type.split(';')[0].strip())
        if ext is None or ext not in exts:
            # Unknown file extension, or one that is not in the whitelist
            continue

        # Construct a filename
        fname = os.path.basename(request.url.split('?')[0])
        fname = "".join(x for x in fname if (x.isalnum() or x in "._- "))
        if fname == "":
            fname = f"{default_fname}_{req_idx}"
        if not fname.endswith(ext):
            fname = f"{fname}{ext}"
        fpath = os.path.join(asset_dir, fname)

        # Save the file
        print(f"{request.url} -> {fpath}")
        asset_list[fpath] = request.url
        with open(fpath, "wb") as file:
            file.write(request.response.body)
    return asset_list

Let's download some images from the Google homepage to the temp folder.

# Create a new instance of the Chrome/Firefox driver
driver = webdriver.Chrome()

# Go to the Google home page
driver.get('https://www.google.com')

# Download content to temp folder
asset_dir = "temp"
os.makedirs(asset_dir, exist_ok=True)
download_assets(driver.requests, asset_dir=asset_dir)

driver.close()

Note that the function could be improved so that the directory structure is preserved as well.
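
One possible way to keep the directory structure is to derive the local path from the host and path of each URL instead of only its base name. The sketch below is an illustration under that assumption, not part of the original answer. The helper name `url_to_local_path` is hypothetical, and it reuses the same character whitelist as `download_assets`.

```python
import os
from urllib.parse import urlsplit, unquote


def url_to_local_path(url, asset_dir="temp", default_fname="untitled"):
    """Map a URL to a path under asset_dir that mirrors the host and URL path."""
    parts = urlsplit(url)
    # Sanitize each path segment with the same whitelist used in download_assets
    segments = [
        "".join(ch for ch in unquote(seg) if ch.isalnum() or ch in "._- ")
        for seg in parts.path.split("/")
    ]
    # Dropping empty, "." and ".." segments keeps the result inside asset_dir
    segments = [seg for seg in segments if seg and seg not in (".", "..")]
    # A URL ending in "/" (or with an empty path) has no file name of its own
    if not segments or parts.path.endswith("/"):
        segments.append(default_fname)
    return os.path.join(asset_dir, parts.netloc.replace(":", "_"), *segments)
```

Inside `download_assets`, the file path could then be computed as `fpath = url_to_local_path(request.url, asset_dir)`, followed by `os.makedirs(os.path.dirname(fpath), exist_ok=True)` before the file is opened. The extension handling from the original function would still need to be applied to the final segment.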

Here is another simple way, but @Pitto's answer above is slightly more succinct.

import requests
from selenium.webdriver.common.by import By

# `ff` is assumed to be an existing WebDriver instance (e.g. webdriver.Firefox())
webelement_img = ff.find_element(By.XPATH, '//img')
url = webelement_img.get_attribute('src') or 'https://someimages.com/path-to-image.jpg'
data = requests.get(url).content
local_filename = 'filename_on_your_computer.jpg'

with open(local_filename, 'wb') as f:
    f.write(data)
