Webscraping images in python with selenium and beautifulsoup from an AJAX website

Question

I've spent a long time going through the HTML, JavaScript, network traffic, etc., and have learned a lot about JavaScript, blobs, and base64 encoding/decoding of images, but I still can't figure out how to extract the images in the videos on this website: https://www.jamesallen.com/loose-diamonds/all-diamonds/

Here's what I know: each video is actually a set of up to 512 images, which are retrieved from a server as files named setX.bin (where X is a number). They are then parsed via an int array into a Blob object (there's also some base64 involved, but I forget where), which is somehow converted into an image.

Following the source code is very difficult as it is purposely written as spaghetti code.

How can I extract each diamond's images and do so efficiently?

My first idea is:

I can get the setX.bin files very easily, so if I can somehow pass them into the site's JavaScript functions, I should be good.

My second idea is:

To rotate each diamond manually and extract the images from the cache or something like that.

I'd like to use python to do this.

EDIT: I found some JavaScript here on SO for exporting a canvas, but it gives 'SecurityError: The operation is not secure'. Here it is:

function exportCanvasAsPNG(id, fileName) {

    var canvasElement = document.getElementById(id);
    canvasElement.crossOrigin = "anonymous";
    var MIME_TYPE = "image/png";

    var imgURL = canvasElement.toDataURL(MIME_TYPE);
    window.console.log(canvasElement);
    var dlLink = document.createElement('a');
    dlLink.download = fileName;
    dlLink.href = imgURL;
    dlLink.dataset.downloadurl = [MIME_TYPE, dlLink.download, dlLink.href].join(':');

    document.body.appendChild(dlLink);
    dlLink.click();
    document.body.removeChild(dlLink);
}

exportCanvasAsPNG("canvas-key-_w5qzvdqpl",'asdf.png');

I ran it from the Firefox console. When I ran a similar script through execute_script in Python, I got the same error.

I want to be able to scrape all of the 360-degree images for each canvas.

Edit2: To make this question simpler: I know how to get the setX.bin files, but I don't know how to convert the collection of images in each one from bin to jpg. Each bin file contains multiple jpg files.

Answer 1

The .bin files appear to just contain the JPEGs concatenated together, with some leading metadata. You can simply search the bytes of the file for JPEG file signatures (0xFFD8) and slice out each image:

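JPEG_MAGIC = b"\xff\xd8"  # JPEG start-of-image (SOI) marker

with open("set0.bin", "rb") as f:
    s = f.read()

i = 0
# Skip the leading metadata by jumping to the first JPEG signature.
start_index = s.find(JPEG_MAGIC)

# find() returns -1 when there is no signature, so nothing is written in that case.
while start_index != -1:
    # The next signature marks the end of the current image.
    end_index = s.find(JPEG_MAGIC, start_index + 1)

    if end_index == -1:
        # Last image: it runs to the end of the file.
        end_index = len(s)

    with open(f"out{i:03}.jpg", "wb") as out:
        out.write(s[start_index:end_index])

    if end_index == len(s):
        break

    start_index = end_index
    i += 1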
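
If you would rather keep the images in memory (for example, to process several setX.bin files in a loop), the same idea can be wrapped in a function. This is only a sketch: the name split_jpegs is made up for illustration, and it searches for the three-byte prefix 0xFFD8FF instead of the two-byte signature, on the assumption that this gives fewer accidental matches. It also assumes the JPEGs carry no embedded thumbnails, since those would contain the same signature and be split off as separate images.

```python
# Start-of-image marker followed by the 0xFF that begins the next JPEG marker.
JPEG_SOI = b"\xff\xd8\xff"


def split_jpegs(data: bytes) -> list:
    """Return every JPEG found in data, skipping any leading non-JPEG bytes."""
    starts = []
    pos = data.find(JPEG_SOI)
    while pos != -1:
        starts.append(pos)
        pos = data.find(JPEG_SOI, pos + 1)

    # Each image runs from its own signature to the next one (or to the end).
    ends = starts[1:] + [len(data)]
    return [data[a:b] for a, b in zip(starts, ends)]


# Tiny synthetic stand-in for a .bin file: some metadata plus two fake "JPEGs".
fake_a = b"\xff\xd8\xff\xe0AAAA\xff\xd9"
fake_b = b"\xff\xd8\xff\xe0BBBB\xff\xd9"
images = split_jpegs(b"HEADER" + fake_a + fake_b)
print(len(images))
```

For a real file you would call split_jpegs on the bytes read from set0.bin and write each element of the returned list to its own .jpg file.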
