
How to compare variables if not HTTP 200 status

I have written a web scraper that compares two values to see whether the new request contains any values that were not present in the previous request.

import json
import re
import time
from dataclasses import dataclass
from typing import Optional, List

import requests
from bs4 import BeautifulSoup


@dataclass
class Product:
    name: Optional[str]
    price: Optional[str]
    image: Optional[str]
    sizes: List[str]

    @staticmethod
    def get_sizes(doc: BeautifulSoup) -> List[str]:
        pat = re.compile(
            r'^<script>var JetshopData='
            r'(\{.*\})'
            r';</script>$',
        )
        for script in doc.find_all('script'):
            match = pat.match(str(script))
            if match is not None:
                break
        else:
            return []

        data = json.loads(match[1])
        return [
            variation
            for get_value in data['ProductInfo']['Attributes']['Variations']
            if get_value.get('IsBuyable')
            for variation in get_value['Variation']
        ]

    @classmethod
    def from_page(cls, url: str) -> Optional['Product']:
        with requests.get(url) as response:
            response.raise_for_status()
            doc = BeautifulSoup(response.text, 'html.parser')

        name = doc.select_one('h1.product-page-header')
        price = doc.select_one('span.price')
        image = doc.select_one('meta[property="og:image"]')

        return cls(
            name=name and name.text.strip(),
            price=price and price.text.strip(),
            image=image and image['content'],
            sizes=cls.get_sizes(doc),
        )


def main():
    product = Product.from_page("https://shelta.se/sneakers/nike-air-zoom-type-whiteblack-cj2033-103")

    previous_request = product.sizes

    while True:
        product = Product.from_page("https://shelta.se/sneakers/nike-air-zoom-type-whiteblack-cj2033-103")

        if set(product.sizes) - set(previous_request):
            print("new changes on the webpage")
            previous_request = product.sizes

        else:
            print("No changes made")

        time.sleep(500)


if __name__ == '__main__':
    main()

The problem I am facing is that the product can be taken down. For example, suppose I have found the sizes ['US 9,5/EUR 43', 'US 10,5/EUR 44,5'] and the admin then takes the webpage down, so it returns 404. After a few hours they restore the webpage with the same values ['US 9,5/EUR 43', 'US 10,5/EUR 44,5']. Nothing would be printed, because we already had those values from our previous valid request.

I wonder what the best way would be to print the values when a webpage goes from 404 back to 200, even if the same values are added again.

The use of response.raise_for_status() is incorrect in this case. It simply raises an exception if the website returns a 404, 500 or similar, which exits your program. Replace response.raise_for_status() with:

# Compare integers with != rather than `is not`, which tests object identity.
if response.status_code != 200:
    return cls(None, None, None, None)

EDIT, as I misinterpreted the question:

An empty product is now returned if an error occurred. The only check required now is whether the sizes have changed.

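def main():
    url = "https://shelta.se/sneakers/nike-air-zoom-type-whiteblack-cj2033-103"

    previous_product = Product.from_page(url)
    while True:
        product = Product.from_page(url)

        # An empty product (sizes is None) differs from any list of sizes,
        # so a page going from 404 back to 200 counts as a change.
        if product.sizes != previous_product.sizes:
            print("new changes on the webpage")
        else:
            print("No changes made")

        # Always remember the latest result, changed or not.
        previous_product = product
        time.sleep(500)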

The update of previous_product has been moved outside the if block. In this exact case it does not matter, but it improves readability.
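To make the 404-to-200 transition explicit, the comparison can be pulled into a small function that treats None as "page unavailable". This is only a sketch: report and replay are hypothetical helper names, the "page unavailable" and "page is back" messages are made up for illustration, and the list of observations stands in for successive results of Product.from_page(url).sizes so that it runs without any network access.

```python
from typing import List, Optional

Sizes = Optional[List[str]]  # None is assumed to mean "page returned a non-200 status"


def report(previous: Sizes, current: Sizes) -> str:
    """Describe what happened between two consecutive observations."""
    if current is None:
        return "page unavailable"
    if previous is None:
        # The page came back; print the sizes even if they are the same as before.
        return f"page is back: {current}"
    if current != previous:
        return f"new changes on the webpage: {current}"
    return "No changes made"


def replay(observations: List[Sizes]) -> List[str]:
    """Run the comparison over a recorded sequence instead of live requests."""
    messages = []
    previous = observations[0]
    for current in observations[1:]:
        messages.append(report(previous, current))
        previous = current  # always remember the latest observation
    return messages


sizes = ['US 9,5/EUR 43', 'US 10,5/EUR 44,5']
messages = replay([sizes, None, sizes, sizes])
for message in messages:
    print(message)
```

In the real loop, report(previous_product.sizes, product.sizes) could replace the if/else that prints the two messages.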

The use of set(...) - set(...) has been removed, as it does not catch when something has been removed from the website, only when something is added. If something is first removed and then re-added, it would not have been caught by your program either.
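The difference between the two checks can be seen on plain lists. The sketch below uses hypothetical helper names (added_sizes, has_changed) and an invented extra size 'US 11/EUR 45' purely for illustration.

```python
def added_sizes(old, new):
    """Set difference: only reports sizes present in new but not in old."""
    return set(new) - set(old)


def has_changed(old, new):
    """Plain inequality: reports additions, removals, and None versus a list."""
    return old != new


previous = ['US 9,5/EUR 43', 'US 10,5/EUR 44,5']
removed_one = ['US 9,5/EUR 43']            # one size was taken off the page
added_one = previous + ['US 11/EUR 45']    # one (invented) size was added

# A removal is invisible to the set difference but visible to the inequality.
print(added_sizes(previous, removed_one), has_changed(previous, removed_one))

# An addition is seen by both checks.
print(added_sizes(previous, added_one), has_changed(previous, added_one))

# An empty product (sizes is None) compared with the restored sizes.
print(has_changed(None, previous))
```

Note that set(None) raises a TypeError, so the set-based check would also need a guard once an empty product with sizes set to None can be returned.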
