
Beautiful Soup can't access all images in a webpage

I am trying to download all images of a product on Trendyol. When I use the find_all method, the soup does not contain the div "gallery-container"; the result is empty. After that, I tried to find all images on the page, but I can't access all of them and only get low-resolution versions. However, when I inspect the page in Chrome, I can see the high-resolution images. How can I get them?

Screenshots

Code - GetProductInfo.py

from bs4 import BeautifulSoup
from requests_html import HTMLSession
from os.path  import basename
import requests

class GetProductInfo:
    def __init__(self, link):
        # Render the page with requests-html so JavaScript-inserted
        # content ends up in the parsed HTML.
        session = HTMLSession()
        html_text = session.get(link)
        html_text.html.render()
        self.soup = BeautifulSoup(html_text.html.html, 'lxml')

    def saveImages(self):
        # Download every image whose src is an absolute http(s) URL.
        for link in self.soup.select("img[src^=http]"):
            lnk = link["src"]
            with open("images/" + basename(lnk), "wb") as f:
                f.write(requests.get(lnk).content)

Code - test.py

from GetProductInfo import *
product = GetProductInfo("https://www.trendyol.com/aqua-di-polo-1987/unisex-kol-saati-apl12c195h02-p-3929108")
product.saveImages()

You could simply regex out the JavaScript object holding the image links, then parse it with json. There is no need to render the page at all. The links are for the high-resolution images viewable in that gallery.

import requests, json, re

r = requests.get('https://www.trendyol.com/aqua-di-polo-1987/unisex-kol-saati-apl12c195h02-p-3929108')
data = json.loads(re.search(r'PRODUCT_DETAIL_APP_INITIAL_STATE__=(.*?);', r.text).group(1))
images = ['https://www.trendyol.com' + img for img in data['product']['images']]
print(images)
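The same extraction pattern can be demonstrated offline. The snippet below runs the regex-plus-json step against a minimal, hypothetical HTML fragment that mimics the shape of the embedded state object (the fragment and its image paths are made up for illustration, not Trendyol's real payload):

```python
import json
import re

# Hypothetical page fragment mimicking the embedded JS state object.
html = ('<script>window.__PRODUCT_DETAIL_APP_INITIAL_STATE__='
        '{"product":{"images":["/img/1.jpg","/img/2.jpg"]}};</script>')

# Non-greedy match up to the first ';' captures the JSON object,
# which json.loads can then parse directly.
data = json.loads(
    re.search(r'PRODUCT_DETAIL_APP_INITIAL_STATE__=(.*?);', html).group(1))
images = ['https://www.trendyol.com' + img for img in data['product']['images']]
print(images)
```

The non-greedy `(.*?);` works here because the JSON payload itself contains no semicolons; if the site ever embeds one, a brace-balancing parse would be needed instead.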

Hey, I'm testing out how GitHub Copilot solves Stack Overflow questions.

# Problem: Beautiful Soup can't access all images in a webpage
# Solution:
# 1. Get all images from the webpage
# 2. Save all images to a folder

import requests
from bs4 import BeautifulSoup
import os


# Get all images from a webpage
def get_all_images(url):
    # get the webpage
    r = requests.get(url)

    # create a BeautifulSoup object
    soup = BeautifulSoup(r.text, "html.parser")

    # find all images
    images = soup.find_all('img')

    # report and return all images
    print(str(len(images)) + " images found.")
    return images


# Save all images to a folder
def save_all_images(images, folder):
    # create a directory to store images
    if not os.path.exists(folder):
        os.makedirs(folder)
    print("Saving images to " + folder)

    # save all images to the folder
    for i, img in enumerate(images):
        try:
            # get the image source
            img_data = requests.get(img['src']).content

            # get the image name
            with open(folder + '/' + str(i) + '.' + img['src'].split('.')[-1], 'wb') as handler:
                # save the image to the folder
                handler.write(img_data)
        except Exception as e:
            # skip the image if any error
            print(e)
    print("All images saved.")


# Main
def main():
    url = "https://www.trendyol.com/aqua-di-polo-1987/unisex-kol-saati-apl12c195h02-p-3929108"
    images = get_all_images(url)
    save_all_images(images, 'images')


main()
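One fragile spot in the script above is deriving the file extension with `img['src'].split('.')[-1]`, which breaks when the URL carries a query string (e.g. `...watch.jpg?size=1200` would yield `jpg?size=1200`). A small sketch of a more robust helper, parsing the path with `urllib.parse` first (the URLs below are hypothetical examples):

```python
import os
from urllib.parse import urlparse

def extension_of(url):
    # Take the extension from the URL path only, ignoring any query string;
    # fall back to '.jpg' when the path has no extension at all.
    path = urlparse(url).path
    return os.path.splitext(path)[1] or '.jpg'

print(extension_of('https://cdn.example.com/img/watch.jpg?size=1200'))
print(extension_of('https://cdn.example.com/img/photo'))
```

With this helper, the save loop could write `folder + '/' + str(i) + extension_of(img['src'])` instead of splitting on `'.'`.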
