
Webscraping image URLs returning as ''

I think my issue is that JavaScript running on the page doesn't load the images until I scroll down. Could anyone help me out with this? The script works fine until it hits "Zendikar Rising (ZNR)", where there are a lot more images on the page. At that point I get `Failed to save imageMakindi Ox (ZNR).png from url`; it should print a URL, but the URL is returning as ''. I have incorporated some DEBUG code to bypass missing card URLs, but I'm still missing tons of them.

I've tried removing empty fields, but if you run it you can see I have the same number of card names as URLs (some of which are blank), so removing the empty URLs would throw off the totals and result in me missing cards from the set.

Here is the code in question:

import requests
import os
from os.path import basename
from bs4 import BeautifulSoup
 
path = os.getcwd()
print ("The current working directory is %s" % path)
 
url = 'https://scryfall.com/sets'
r=requests.get(url).text
soup = BeautifulSoup(r, 'html.parser')
 
####################GATHERS ALL URLS FROM SET DIRECTORY#####################
links = []
Urls = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
 
for link in links:
    if link != None:
        if 'https://scryfall.com/sets/' in link:
            if link not in Urls:
                Urls.append(link)
 
#################START OF ALL URL LOOPS################################
for Url in Urls: ## goes through all the URLs gathered from the set links
    r=requests.get(Url).text
    soup = BeautifulSoup(r, 'html.parser')
 
    temp = soup.find('h1', {'class': 'set-header-title-h1'}).contents
    temp = ''.join(temp)
    temp = temp.strip()
    temp = temp.replace(':', '')
    temp = temp.replace(' ', '')
 
    test2 = (f"{path}\\{temp}")
#############################################MAKE DIRECTORY FOR SET FOLDERS##################
    try:
        os.mkdir(test2)
    except OSError:
        print ("Creation of the directory %s failed" % test2)
    else:
        print ("Successfully created the directory %s " % test2)
 
############################################GATHER ALL IMAGES####################
    images = soup.find_all('img')
 
    pictures = [] ##stores all the picture URLS
    names = [] ##stores all the name
 
    for image in images[:-1]:
        names.append(image.get('alt'))
        pictures.append(image.get('src'))
####################SAVES ALL IMAGES AS FILES#################
 
    x=0
    for i in pictures:
        fn = names[x] + '.png'
        try:
            with open(f'{test2}\\'+basename(fn),"wb") as f:
 
                f.write(requests.get(i).content)
                ##print(i)
                ##print(f'saved {fn} to {path}')
                x+=1
        except OSError:
            print(f"Failed to save image{fn} from url{i}")
            print(len(pictures))
            print(len(names))
            exit()
##################RESETS IMAGES AND NAMES FOR NEXT SET FOLDER#############
 
    pictures.clear()
    names.clear()
print("Completed With No Errors")

Indeed, the images are lazy-loaded by a JS script, which is why the <img> tags further down the page have empty src attributes.

However, the solution is pretty simple. If you look at the <img> tags that are not yet loaded, you will see that the image link is not present in the src attribute, but rather in the data-src attribute.

For example:

<img alt="Wayward Guide-Beast (ZNR)" class="card znr border-black" data-component="lazy-image" data-src="https://c1.scryfall.com/file/scryfall-cards/normal/front/e/b/ebfe94fc-7a98-4f53-8fd0-f5fd016b1873.jpg?1599472001" src="" title="Wayward Guide-Beast (ZNR)"/>

So all you have to do is check whether src is empty and, if so, scrape the data-src attribute instead.
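A minimal sketch of that fallback, run against sample markup mirroring the tag above (the first `<img>` and its example.com URL are made up for illustration):

```python
from bs4 import BeautifulSoup

# Sample markup: one normally loaded image, and one lazy-loaded image
# whose src is empty and whose real link sits in data-src.
html = '''
<img alt="Loaded Card (ZNR)" src="https://example.com/loaded.jpg"/>
<img alt="Wayward Guide-Beast (ZNR)" class="card znr border-black"
     data-component="lazy-image"
     data-src="https://c1.scryfall.com/file/scryfall-cards/normal/front/e/b/ebfe94fc-7a98-4f53-8fd0-f5fd016b1873.jpg?1599472001"
     src=""/>
'''
soup = BeautifulSoup(html, 'html.parser')

pictures = []
names = []
for image in soup.find_all('img'):
    # image.get('src') returns '' for lazy-loaded tags, which is falsy,
    # so `or` falls back to the data-src attribute.
    link = image.get('src') or image.get('data-src')
    pictures.append(link)
    names.append(image.get('alt'))

print(pictures)
```

Applied to your script, the change is a one-liner: replace `pictures.append(image.get('src'))` with `pictures.append(image.get('src') or image.get('data-src'))`, and the name/URL counts stay in step.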

