
Download Bing image search results using Python (custom URL)

I want to download Bing image search results using Python code.

Example URL: https://www.bing.com/images/search?q=sketch%2520using%20iphone%2520students

My Python code generates a Bing search URL like the one in the example. The next step is to download all the images shown at that link to my local desktop.

In my project I generate some words in Python, and my code builds a Bing image search URL from them. All I need is to download the images shown on that search page using Python.
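For reference, here is a minimal sketch of how such a search URL could be built from a list of generated words. The helper name build_search_url and the BING_IMAGES constant are made up for illustration. It also shows that %2520 in the example URL is a space that was percent-encoded twice (%25 is an encoded percent sign), which is probably not what was intended.

```python
from urllib.parse import unquote, urlencode

# Base URL taken from the example above.
BING_IMAGES = "https://www.bing.com/images/search"


def build_search_url(words):
    """Join the generated words and encode them exactly once as the q parameter."""
    query = " ".join(words)
    return BING_IMAGES + "?" + urlencode({"q": query})


url = build_search_url(["sketch", "using", "iphone", "students"])
print(url)  # https://www.bing.com/images/search?q=sketch+using+iphone+students

# Decoding a doubly encoded space once still leaves an encoded space behind.
print(unquote("sketch%2520using"))  # sketch%20using
```

urlencode writes spaces as "+", which is equivalent to %20 inside a query string.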

To download an image, you need to make a request to the image URL, which typically ends with .png, .jpg, etc.

Bing stores the needed data in an "m" attribute on each <a> element with the class iusc. The attribute holds JSON, so you can parse it, read the image URL from the "murl" key, and download the image afterward.
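To make the "m" attribute idea concrete, here is a small sketch that runs without any network access. SAMPLE_HTML is a made-up stand-in for a fragment of a results page, and the MurlCollector class is a hypothetical helper. It uses the standard library's html.parser so it works without extra packages; the code in this answer does the same thing with BeautifulSoup.

```python
import json
from html.parser import HTMLParser

# Hypothetical fragment, shaped like the markup described above.
SAMPLE_HTML = """
<a class="iusc" m='{"murl": "https://example.com/cat.jpg", "t": "Cat"}'></a>
<a class="other" href="/ignored"></a>
<a class="iusc" m='{"murl": "https://example.com/dog.png", "t": "Dog"}'></a>
"""


class MurlCollector(HTMLParser):
    """Collect the "murl" value from every <a class="iusc" m="..."> tag."""

    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "a" and "iusc" in classes and attributes.get("m"):
            # The attribute value is a JSON document; "murl" is the image URL.
            self.image_urls.append(json.loads(attributes["m"])["murl"])


collector = MurlCollector()
collector.feed(SAMPLE_HTML)
print(collector.image_urls)
# ['https://example.com/cat.jpg', 'https://example.com/dog.png']
```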


To download all images to your computer, you can use one of two methods:

# requests (with bs4 for parsing)

for index, url in enumerate(soup.select(".iusc"), start=1):
    img_url = json.loads(url["m"])["murl"]
    image = requests.get(img_url, headers=headers, timeout=30)
    query = query.lower().replace(" ", "_")
    
    if image.status_code == 200:
        with open(f"images/{query}_image_{index}.jpg", 'wb') as file:
            file.write(image.content)

# urllib

for index, url in enumerate(soup.select(".iusc"), start=1):
    img_url = json.loads(url["m"])["murl"]
    query = query.lower().replace(" ", "_")

    opener = req.build_opener()
    opener.addheaders=[("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36")]
    req.install_opener(opener)
    req.urlretrieve(img_url, f"images/{query}_image_{index}.jpg")

In the first case, a with open() context manager writes the downloaded image bytes to a local file. In the second case, the urlretrieve function from the urllib.request module downloads the URL straight to a file.
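As a sketch of the first case, the file-writing step can be pulled into a small function. The name save_image and its parameters are assumptions for illustration, not part of any library. It also creates the target folder, because open() does not create missing directories.

```python
import os
import tempfile


def save_image(content, query, index, out_dir="images"):
    """Write raw image bytes to out_dir and return the path that was written."""
    os.makedirs(out_dir, exist_ok=True)  # open() will not create the folder
    safe_query = query.lower().replace(" ", "_")
    path = os.path.join(out_dir, f"{safe_query}_image_{index}.jpg")
    with open(path, "wb") as file:  # "wb" because image data is bytes, not text
        file.write(content)
    return path


# Demo with placeholder bytes in a temporary folder instead of a real download.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_image(b"fake image bytes", "Sketch using iPhone", 1, out_dir=tmp)
    print(os.path.basename(saved))  # sketch_using_iphone_image_1.jpg
```

Inside the download loop this would be called as save_image(image.content, query, index).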

Also, make sure you send a user-agent request header so that the request looks like a "real" user visit. The default requests user-agent is python-requests, so websites can tell that the request most likely comes from a script. You can check what your user-agent is.
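The same concern applies to urllib. The sketch below, which makes no network request, prints the default user-agent a urllib opener would send and then swaps in a browser-like value. BROWSER_UA is just the string used elsewhere in this answer; any current browser string should work.

```python
import urllib.request

BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36")

# A fresh opener identifies itself as "Python-urllib/<version>".
default_opener = urllib.request.build_opener()
print(default_opener.addheaders)

# Replacing addheaders makes every request sent through this opener
# carry the browser-like user-agent instead.
custom_opener = urllib.request.build_opener()
custom_opener.addheaders = [("User-Agent", BROWSER_UA)]
print(custom_opener.addheaders)
```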

Note: An error might occur with the urllib.request.urlretrieve method when a request hits a CAPTCHA or something else that returns an unsuccessful status code. The biggest problem is that the response code is hard to test for with urlretrieve, while a requests response provides a status_code attribute for that.
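One way to cope with this, sketched under the assumption that skipping a failed image is acceptable: urlretrieve raises urllib.error.HTTPError for unsuccessful status codes, so the call can be wrapped in try/except. The helper name try_retrieve is made up. The demo uses file:// URLs in a temporary folder so it runs offline.

```python
import os
import pathlib
import tempfile
import urllib.error
import urllib.request


def try_retrieve(img_url, dest_path):
    """Return (True, None) on success, else (False, reason)."""
    try:
        urllib.request.urlretrieve(img_url, dest_path)
        return True, None
    except urllib.error.HTTPError as err:
        # Raised for unsuccessful HTTP status codes such as 403 or 404.
        return False, f"HTTP {err.code}"
    except OSError as err:
        # URLError (a subclass of OSError) covers unreachable or missing targets.
        return False, str(err)


with tempfile.TemporaryDirectory() as tmp:
    source = pathlib.Path(tmp, "source.jpg")
    source.write_bytes(b"fake image bytes")

    # An existing local file stands in for a reachable image URL.
    ok, reason = try_retrieve(source.as_uri(), os.path.join(tmp, "copy.jpg"))

    # A missing file stands in for a URL that cannot be fetched.
    missing_ok, missing_reason = try_retrieve(
        pathlib.Path(tmp, "missing.jpg").as_uri(), os.path.join(tmp, "none.jpg"))

print(ok, missing_ok)
```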


Code and full example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml, json, os

query = "sketch using iphone students"

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": query,
    "first": 1
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36"
}

response = requests.get("https://www.bing.com/images/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(response.text, "lxml")

# open() does not create missing folders, so make sure images/ exists
os.makedirs("images", exist_ok=True)

# normalize the query once for use in the file names
file_prefix = query.lower().replace(" ", "_")

# each .iusc element has a JSON "m" attribute; its "murl" key is the image URL
for index, url in enumerate(soup.select(".iusc"), start=1):
    img_url = json.loads(url["m"])["murl"]
    image = requests.get(img_url, headers=headers, timeout=30)

    # only save images that were downloaded successfully
    if image.status_code == 200:
        with open(f"images/{file_prefix}_image_{index}.jpg", "wb") as file:
            file.write(image.content)

Using urllib.request.urlretrieve:

from bs4 import BeautifulSoup
import requests, lxml, json, os
import urllib.request as req

query = "sketch using iphone students"

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": query,
    "first": 1
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36"
}

response = requests.get("https://www.bing.com/images/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(response.text, "lxml")

# urlretrieve does not create missing folders, so make sure images/ exists
os.makedirs("images", exist_ok=True)

# normalize the query once for use in the file names
file_prefix = query.lower().replace(" ", "_")

# install an opener with the browser-like user-agent once;
# urlretrieve then uses it for every download
opener = req.build_opener()
opener.addheaders = [("User-Agent", headers["User-Agent"])]
req.install_opener(opener)

for index, url in enumerate(soup.select(".iusc"), start=1):
    img_url = json.loads(url["m"])["murl"]
    req.urlretrieve(img_url, f"images/{file_prefix}_image_{index}.jpg")

Output:

(Image: demo of files being uploaded to the folder)

Edit your code to find the desired image URL, and then download it with urllib.request:

import urllib.request as req

imgurl = "https://i.ytimg.com/vi/Ks-_Mh1QhMc/hqdefault.jpg"

req.urlretrieve(imgurl, "image_name.jpg")
