
Decode base64-encoded URLs using Selenium

I would like to scrape some images from a website. I checked the site and everything seemed straightforward, so I started with plain BeautifulSoup. Then I noticed that the image URLs are in a strange format, probably base64-related, so I tried to decode them, but nothing useful came out. I did some research and found suggestions to use Selenium, because the image URLs may be rendered via JavaScript. I tried Selenium as well, with no success.

I am trying to get the image url this way:

img = self.browser.execute_script(f"return document.querySelectorAll('picture > img')[{num}]").get_attribute('src')

There are 24 images on the page, so I iterate through them (via num). If I debug line by line, several URLs render correctly; however, if I let the code run with no breakpoints, all the URLs look like this:

data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7

I tried to base64-decode it, but the result makes no sense to me, and it is too short to be an actual image. The correctly rendered URLs show that the images are actually JPGs, not GIFs.

I also tried to find the element by CSS selector (using both pure BeautifulSoup and Selenium), but the result was the same.

I found this discussion: 'How to extract img src from web page via lxml in beautifulsoup using python?', but it did not help me either. I have not found any dynamic key (although there are similarities: the pictures come in multiple sizes), and, as mentioned above, the base64 code is too short to be an actual image preview.

If I inspect the element in the browser, I see the correct URL. Is there a way to get the same result using BeautifulSoup or Selenium (or another Python scraping framework)? What is the actual data encoded in base64?
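
On the last point, the string can be inspected with the standard library alone. The sketch below (variable names are illustrative, not from the original post) decodes the data URI quoted above and reads the GIF header. It indicates a 1x1 GIF of 42 bytes, which is most likely a placeholder that the page's script replaces with the real JPG URL once the image is loaded; that reading is an inference from the decoded bytes, not something the site documents.

```python
import base64
import struct

# The data URI quoted in the question.
data_uri = "data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"

# Everything after the first comma is the base64 payload.
header, payload = data_uri.split(",", 1)
raw = base64.b64decode(payload)

# A GIF file starts with a 6-byte signature, followed by the logical screen
# width and height as little-endian unsigned 16-bit integers.
signature = raw[:6]
width, height = struct.unpack("<HH", raw[6:10])

print(header)         # data:image/gif;base64
print(len(raw))       # 42
print(signature)      # b'GIF89a'
print(width, height)  # 1 1
```

So the decoded bytes are a valid image, just a one-pixel one, which is why the payload looks too short to be a product photo.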

If you look at the source code of the website, the image links you are trying to scrape exist in another tag, noscript.
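
To illustrate the pattern, here is a minimal offline sketch. The HTML fragment and the class name NoscriptImageParser are hypothetical (the real page's markup may differ), and it uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages or network access.

```python
from html.parser import HTMLParser

# Hypothetical markup modelled on a common lazy-loading pattern: the visible
# <img> carries a placeholder, and a <noscript> fallback carries the real URL.
SAMPLE_HTML = """
<picture>
  <img src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
       alt="Example product">
</picture>
<noscript>
  <img src="https://example.com/a.jpg" alt="Example product">
</noscript>
"""


class NoscriptImageParser(HTMLParser):
    """Collect (alt, src) for every <img> nested inside a <noscript>."""

    def __init__(self):
        super().__init__()
        self._depth = 0  # number of <noscript> elements currently open
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self._depth += 1
        elif tag == "img" and self._depth > 0:
            attr_map = dict(attrs)
            self.images.append((attr_map.get("alt"), attr_map.get("src")))

    def handle_endtag(self, tag):
        if tag == "noscript" and self._depth > 0:
            self._depth -= 1


parser = NoscriptImageParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.images)  # [('Example product', 'https://example.com/a.jpg')]
```

The placeholder inside picture is skipped, and only the noscript copy with the real URL is collected.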

You can get them using requests and BeautifulSoup as follows:

import requests
from bs4 import BeautifulSoup as bs

url = 'https://eshop.nobilis.cz/aromaterapie/'
# Send a browser-like User-Agent header with the request.
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

soup = bs(res.content, 'html.parser')

# The real image URLs sit on the <img> tags inside <noscript>.
images = soup.select('noscript img')
for img in images:
    img_link = img.get('src')
    img_alt = img.get('alt')
    print(img_alt, '==>', img_link)

Output:

Obrázek kategorie Aromaterapie ==> https://cdn.nobilis.cz/image/custom-w1920-h480-crop/content/aromaterapie_3840x960-bb98d24ff24a2c55.jpg
Keramický difuzér ==> https://cdn.nobilis.cz/image/custom-w225-h250/data/persistent/products/4/6/33/keramicky-difuzer__S8Ru.jpg
Keramická destička ==> https://cdn.nobilis.cz/image/custom-w225-h250/data/persistent/products/4/5/31/n1700-kopie__nQwF.jpg
Aroma difuzér ==> https://cdn.nobilis.cz/image/custom-w225-h250/data/persistent/products/4/6/57/t0328-aroma-difuzer__JYKy.jpg
MINI difuzér ==> https://cdn.nobilis.cz/image/custom-w225-h250/data/persistent/products/5/6/86/01-t0330-mini-difuzer__9RjF.jpg
Zen difuzér ==> https://cdn.nobilis.cz/image/custom-w225-h250/data/persistent/products/5/3/20/t0329-zen-difuzer__IBcR.jpg
Náplně do MINI difuzéru 10 ks ==> https://cdn.nobilis.cz/image/custom-w225-h250/data/persistent/products/4/5/7/t0331s__IqbM.jpg
Aromaterapie na cesty ==> https://cdn.nobilis.cz/image/custom-w225-h250/data/persistent/products/4/5/55/s0103-aromaterapie-na-cesty__0hat.jpg
Keramická amforka ==> https://cdn.nobilis.cz/image/custom-w225-h250/data/persistent/products/5/7/13/keramicka-amforka-kopie__bpFN.jpg
Prostorový difuzér éterických olejů ==> https://cdn.nobilis.cz/image/custom-w225-h250/data/persistent/products/4/7/59/t0320__egh5.jpg
Směs éterických olejů Inspirace ==> https://cdn.nobilis.cz/image/custom-w225-h250/data/persistent/products/7/0/57/e1081b-smes-eterickych-oleju-inspirace__YAb1.jpg
Směs éterických olejů Tantra ==> https://cdn.nobilis.cz/image/custom-w225-h250/data/persistent/products/6/4/63/e2006b-smes-eterickych-oleju-tantra__KeIG.jpg
Éterický olej bio Citron ==> https://cdn.nobilis.cz/image/custom-w225-h250/data/persistent/products/7/4/59/b0015b-bio-citron__KvPJ.jpg
Éterický olej Meduňka ==> https://cdn.nobilis.cz/image/custom-w225-h250/data/persistent/products/6/6/94/e1027-medunka-1-ml__svsg.jpg
Éterický olej Bergamot ==> https://cdn.nobilis.cz/image/custom-w225-h250/data/persistent/products/7/0/27/e0008b-etericky-olej-bergamot__gab2.jpg
Éterický olej Grapefruit ==> https://cdn.nobilis.cz/image/custom-w225-h250/data/persistent/products/6/8/71/e0024b-etericky-olej-grapefruit__J85r.jpg
Éterický olej bio Rozmarýn ==> https://cdn.nobilis.cz/image/custom-w225-h250/data/persistent/products/5/8/12/b0016b-bio-rozmaryn__POvK.jpg
Směs éterických olejů Druhý dech ==> https://cdn.nobilis.cz/image/custom-w225-h250/data/persistent/products/7/3/27/e2002b-smes-eterickych-oleju-druhy-dech__dPzL.jpg
Éterický olej Šalvěj muškátová ==> https://cdn.nobilis.cz/image/custom-w225-h250/data/persistent/products/5/8/93/e0045b-etericky-olej-salvej-muskatova__wAFx.jpg
Éterický olej Cypřiš ==> https://cdn.nobilis.cz/image/custom-w225-h250/data/persistent/products/6/3/55/e0017b-etericky-olej-cypris__RxDS.jpg
Éterický olej Skořice, kůra ==> https://cdn.nobilis.cz/image/custom-w225-h250/data/persistent/products/7/0/60/e0074b-etericky-olej-skorice-kura__tK0h.jpg
Éterický olej Geranium ==> https://cdn.nobilis.cz/image/custom-w225-h250/data/persistent/products/6/2/71/e1057b-etericky-olej-geranium__dCRQ.jpg
Éterický olej Konopí ==> https://cdn.nobilis.cz/image/custom-w225-h250/data/persistent/products/5/7/67/e0154h-konopi-1-ml__b2oW.jpg
Růže v jojobovém oleji ==> https://cdn.nobilis.cz/image/custom-w225-h250/data/persistent/products/6/7/24/n1010c-ruze-v-jojobe-20-ml__jzLM.jpg
Éterický olej bio Tymián linalol ==> https://cdn.nobilis.cz/image/custom-w225-h250/data/persistent/products/6/3/82/b0005a-bio-tymian-linalol__8IFa.jpg
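
If Selenium is still preferred, the behaviour described in the question (correct URLs under a debugger, placeholders otherwise) is consistent with lazy loading, so one option is to scroll each image into view and poll until its src stops being a data: URI. The following is only a sketch under that assumption: is_placeholder and wait_for_real_src are hypothetical helper names, and whether scrolling is what triggers loading on this particular site has not been verified.

```python
import time


def is_placeholder(src):
    """Return True for missing or empty values and for inline data: URIs."""
    return not src or src.startswith("data:")


def wait_for_real_src(get_src, timeout=5.0, poll=0.1):
    """Call get_src() until it returns a non-placeholder URL or time runs out.

    Returns the URL, or None if only placeholders were seen before the timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        src = get_src()
        if not is_placeholder(src):
            return src
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)


# Usage with Selenium (not executed here; assumes an existing WebDriver
# instance named `driver` with the page already loaded):
#
#   from selenium.webdriver.common.by import By
#
#   for el in driver.find_elements(By.CSS_SELECTOR, "picture > img"):
#       driver.execute_script("arguments[0].scrollIntoView();", el)
#       print(wait_for_real_src(lambda: el.get_attribute("src")))
```

The polling helper takes a plain callable rather than a WebElement, so it can be tried without a browser. For a static page like this one, the requests approach above is simpler and avoids driving a browser at all.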


 