
Error while scraping image with beautifulsoup

The original code is here: https://github.com/amitabhadey/Web-Scraping-Images-using-Python-via-BeautifulSoup-/blob/master/code.py

So I am trying to adapt a Python script that collects pictures from a website, to get better at web scraping.

I tried to get images from "https://500px.com/editors".

The first error was:

The code that caused this warning is on line 12 of the file /Bureau/scrapper.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.

So I did:

soup = BeautifulSoup(plain_text, features="lxml")

I also adapted the class name to match the tag on 500px.

But now the script stops running and nothing happens.

In the end it looks like this:

import requests 
from bs4 import BeautifulSoup 
import urllib.request
import random 

url = "https://500px.com/editors"

source_code = requests.get(url)

plain_text = source_code.text

soup = BeautifulSoup(plain_text, features="lxml")

for link in soup.find_all("a",{"class":"photo_link "}):
    href = link.get('href')
    print(href)

    img_name = random.randrange(1,500)

    full_name = str(img_name) + ".jpg"

    urllib.request.urlretrieve(href, full_name)

    print("loop break")

What did I do wrong?

Actually the website is loaded via JavaScript, using an XHR request to the following API.

So you can reach it directly via the API.

Note that you can increase the parameter rpp=50 to any number you want in order to get more than 50 results.
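Rather than editing the long query string by hand, the adjustable parameters can be built from a dict; here is a minimal sketch (the helper name and the trimmed-down parameter set are my own, not from the original answer):

```python
from urllib.parse import urlencode

def build_photos_url(rpp=50, page=1):
    """Build the 500px photos API URL with adjustable rpp and page.

    Only a subset of the parameters from the full URL is shown here.
    """
    params = {
        "feature": "editors",
        "image_size[]": [1, 2, 2048],  # a few of the sizes from the full URL
        "formats": "jpeg,lytro",
        "page": page,
        "rpp": rpp,
    }
    # doseq=True repeats image_size[] once per value, as in the original URL
    return "https://api.500px.com/v1/photos?" + urlencode(params, doseq=True)

print(build_photos_url(rpp=100, page=2))
```

The resulting URL can then be passed to `requests.get` exactly as in the snippets below.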

import requests

r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()

for item in r['photos']:
    print(item['url'])

You can also access the image url itself in order to save it directly!

import requests

r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()

for item in r['photos']:
    print(item['image_url'][-1])

Note that the image_url key holds different image sizes, so you can choose your preferred one and save it. Here I've taken the big one.

Saving directly:

import requests

with requests.Session() as req:
    r = req.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
    for item in r['photos']:
        print(f"Downloading {item['name']}")
        save = req.get(item['image_url'][-1])
        # Strip the leading 'filename=' from the Content-Disposition header
        name = save.headers.get("Content-Disposition")[9:]
        with open(name, 'wb') as f:
            f.write(save.content)
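One caveat: `save.headers.get("Content-Disposition")` returns `None` when the server omits that header, and the `[9:]` slice would then raise a `TypeError`. A hedged fallback (this helper is my own, not part of the answer) derives a filename from the URL path instead:

```python
import os
from urllib.parse import urlparse

def filename_from_url(url, default="photo.jpg"):
    """Use the last path segment of the image URL as a filename,
    falling back to a default when the path has no basename."""
    name = os.path.basename(urlparse(url).path)
    return name or default

print(filename_from_url("https://drscdn.500px.org/photo/12345/abc.jpg"))
```

You could call this whenever the Content-Disposition header is missing.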

Looking at the page you're trying to scrape, I noticed something. The data doesn't appear to load until a few moments after the page finishes loading. This tells me that they're using a JS framework to load the images after page load.

Your scraper will not work with this page because it does not run JS on the pages it pulls. Running your script and printing out what plain_text contains proves this:

<a class='photo_link {{#if hasDetailsTooltip}}px_tooltip{{/if}}' href='{{photoUrl}}'>

If you look at the href attribute on that tag, you'll see it's actually a templating tag used by JS UI frameworks.
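If you wanted to guard against such links client-side, a quick check (a hypothetical helper, not from the original answer) can detect unrendered Handlebars-style placeholders like `{{photoUrl}}`:

```python
def is_template_href(href):
    """Return True for hrefs that are missing or still contain
    unrendered {{...}} template placeholders."""
    return href is None or "{{" in href

print(is_template_href("{{photoUrl}}"))
print(is_template_href("/photo/12345/some-slug"))
```

On this page every match is a placeholder, which is why the loop never downloads anything.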

Your options now are either to see what APIs they're calling to get this data (check the network tab in your browser's inspector; if you're lucky, they may not require authentication) or to use a tool that runs JS on pages. One tool I've seen recommended for this is selenium, though I've never used it, so I'm not fully aware of its capabilities; I imagine the tooling around this would drastically increase the complexity of what you're trying to do.
