Web scraping images in Python with Selenium and BeautifulSoup from an AJAX website

I've spent a long time going through the HTML, JavaScript, and network traffic, and I've learned a lot about JavaScript, blobs, and base64 encoding/decoding of images, but I still can't figure out how to extract the images behind the videos on this website: https://www.jamesallen.com/loose-diamonds/all-diamonds/

Here's what I know: each video is actually a set of up to 512 images, which are retrieved from a server in files named setX.bin (where X is a number). They are then parsed via an int array into a Blob object (there's also some base64 involved, but I forget where), which is somehow converted into an image.

Following the source code is very difficult as it is purposely written as spaghetti code.

How can I extract each diamond's images and do so efficiently?

My first idea is:

I can get the setX.bin files very easily, so if I could somehow pass them into the site's JavaScript functions, I should be good.

My second idea is:

Rotate each diamond manually and extract the images from the cache or something like that.

I'd like to use Python to do this.

EDIT: I found some JavaScript here on SO for exporting a canvas as a PNG, but it gives the error 'SecurityError: The operation is not secure'. Here it is:

function exportCanvasAsPNG(id, fileName) {

    var canvasElement = document.getElementById(id);
    canvasElement.crossOrigin = "anonymous";
    var MIME_TYPE = "image/png";

    var imgURL = canvasElement.toDataURL(MIME_TYPE);
    window.console.log(canvasElement);
    var dlLink = document.createElement('a');
    dlLink.download = fileName;
    dlLink.href = imgURL;
    dlLink.dataset.downloadurl = [MIME_TYPE, dlLink.download, dlLink.href].join(':');

    document.body.appendChild(dlLink);
    dlLink.click();
    document.body.removeChild(dlLink);
}

exportCanvasAsPNG("canvas-key-_w5qzvdqpl",'asdf.png');

I ran it from the Firefox console. When I ran a similar script through execute_script in Python, I got the same error.
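For reference, when toDataURL does succeed (it did not here), it returns a string of the form data:image/png;base64,... that still has to be decoded before it can be saved. The following is only a minimal sketch of that decoding step in Python; the helper name decode_data_url is hypothetical and not part of Selenium or of the site's code.

```python
import base64
from typing import Tuple


def decode_data_url(data_url: str) -> Tuple[str, bytes]:
    """Split a base64 data: URL into (mime_type, raw_bytes).

    Hypothetical helper: it only handles the base64 form that
    canvas.toDataURL() produces, not percent-encoded data URLs.
    """
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64-encoded data: URL")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)
```

With Selenium, the input string would come from something like driver.execute_script("return document.getElementById(arguments[0]).toDataURL('image/png');", canvas_id), assuming the call is not blocked by the SecurityError described above.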

I want to be able to scrape all of the 360-degree images for each canvas.

Edit2: To make this question simpler: I know how to get the setX.bin files, but I don't know how to convert the collection of images in each one from bin to jpg. Each bin file contains multiple jpg files.

The .bin files appear to just contain the JPEGs concatenated together, with some leading metadata. You can simply iterate through the bytes of the file looking for JPEG file signatures (0xFFD8) and slice out each image:

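JPEG_MAGIC = b"\xff\xd8"  # JPEG start-of-image marker

with open("set0.bin", "rb") as f:
    s = f.read()

i = 0
# Skip the leading metadata by jumping to the first JPEG signature.
start_index = s.find(JPEG_MAGIC)

# If no signature is found at all, nothing is written.
while start_index != -1:
    # The current image ends where the next signature begins, or at end of file.
    end_index = s.find(JPEG_MAGIC, start_index + 1)
    if end_index == -1:
        end_index = len(s)

    with open(f"out{i:03}.jpg", "wb") as out:
        out.write(s[start_index:end_index])

    if end_index == len(s):
        break

    start_index = end_index
    i += 1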

Result: [image]
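The same splitting can also be sketched as a reusable function that works on bytes already in memory, which may be convenient when the .bin files are downloaded rather than read from disk. This is only a sketch under assumptions: the name split_jpegs is hypothetical, and it matches the three-byte prefix FF D8 FF (the start-of-image marker followed by the first byte of the next marker) instead of the two-byte signature, which is a stricter choice of mine and not something confirmed for these files. A JPEG with an embedded thumbnail would still be split in the wrong place.

```python
from typing import List

# Start-of-image marker plus the 0xFF that begins the following marker.
# Assumption: stricter than the two-byte 0xFFD8 signature.
JPEG_SOI = b"\xff\xd8\xff"


def split_jpegs(data: bytes) -> List[bytes]:
    """Return the chunks of `data` that begin at each JPEG signature."""
    starts = []
    pos = data.find(JPEG_SOI)
    while pos != -1:
        starts.append(pos)
        pos = data.find(JPEG_SOI, pos + 1)
    # Each image runs from its own signature up to the next one,
    # or to the end of the data for the last image.
    ends = starts[1:] + [len(data)]
    return [data[a:b] for a, b in zip(starts, ends)]
```

Each returned chunk can then be written out with open(f"out{i:03}.jpg", "wb"), looping over enumerate(split_jpegs(s)). Data with no signature yields an empty list, so nothing is written.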
