Extract data from <script> BeautifulSoup Python

I have this code:

product_url = 'https://www.burton.com/us/en/p/burton-elite-long-sleeve-tshirt/W21-203921.html?cgid=womens-tees'
res = requests.get(product_url, headers=headers)
soup = BeautifulSoup(res.text, 'html.parser')
product = soup.find('main', {'id': 'main-content'})
details = product.find('script')
data = json.loads(details.string)

This gives this output:

<script>
        __metadata.product = {
            id: "W21-203921",
            sku: "W21-203921",
            ph1: 'SOFTGOODS',
            ph2: 'BASIC FLEECE AND TEE',
            ph3: 'LS TEES',
            ph4: '',
            upc: '190450612509',
            ean: '9009521451408',
            brand: "Burton",
            category: "womens-tees",
            primaryCategory: "womens-sale-sweaters-shirts",
            currency: "USD",
            gender: "Unisex",
            label: "Burton Elite Long Sleeve T-Shirt",
            name: "Burton Elite Long Sleeve T-Shirt"
        };
        __metadata.criteo = {
            pageType: 'ProductPage'
        };
    </script>

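Now I would like to extract some of the data in there, such as the id, brand, category and name.

I have looked at nearly every similar question on this forum and tried their solutions, but nothing has worked. Most of them do something along the lines of data = json.loads(details) in one way or another, but none of them seem to work. The most common errors I get are: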

json.decoder.JSONDecodeError: Expecting value: line 2 column 9 (char 9)

TypeError: the JSON object must be str, bytes or bytearray, not Tag

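I'm leaving my answer below for posterity, but the format=ajax approach in the other answer is better. Moral of the story: check the XHR requests and see whether you can bypass string parsing entirely by using the site's API.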


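As I wrote in the comments, there are many different assumptions you could make about this data, and just as many strategies you could use to extract it.

Which one you use depends on a number of factors: Is this data format likely to change? Is this a one-off scrape, or something that needs to be as resilient to future modifications as possible? If the latter, which future modifications are most likely, given what you know about the site?

Since these questions haven't been addressed, I'll assume you just want to parse this into a dict as simply as possible, without making various future-proofing assumptions.

You can use: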

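import json
import re

# `html` is the text of the <script> tag, e.g. details.string in the question.
# Take everything from the first "{" up to the first "}", normalising quotes.
chunk = re.search(r"\{[^}]+", html).group().replace("'", '"')
# Quote the bare JS keys and restore the closing brace so json can parse it.
data = json.loads(re.sub(r"(\w+):", r'"\1":', chunk) + "}")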

This assumes there are no braces inside the JS object, no colons inside strings, and so on.
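
As a minimal sketch of how the snippet above fits together, the following runs it against a trimmed copy of the script text from the question and pulls out the four fields the question asks for. The `script_text` literal and the `parse_js_object` helper name are illustrative assumptions; in the question's code the text would come from `details.string`.

```python
import json
import re

# Trimmed copy of the <script> body from the question (illustrative input).
script_text = """
        __metadata.product = {
            id: "W21-203921",
            ph1: 'SOFTGOODS',
            ph4: '',
            brand: "Burton",
            category: "womens-tees",
            name: "Burton Elite Long Sleeve T-Shirt"
        };
        __metadata.criteo = {
            pageType: 'ProductPage'
        };
"""


def parse_js_object(text):
    """Parse the first flat JS object literal in `text` into a dict.

    Hypothetical helper with the same assumptions as above: no nested
    braces, no colons or single quotes inside string values.
    """
    # Everything from the first "{" up to (not including) the first "}".
    chunk = re.search(r"\{[^}]+", text).group().replace("'", '"')
    # Quote the bare keys, then restore the closing brace.
    return json.loads(re.sub(r"(\w+):", r'"\1":', chunk) + "}")


product = parse_js_object(script_text)
wanted = {key: product[key] for key in ("id", "brand", "category", "name")}
print(wanted)
```

Only the first object (`__metadata.product`) is parsed, because the search stops at the first closing brace.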


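As an example of the caveats above, OP replied that the key label: "Girls' Burton Chicklet Flat Top Snowboard", breaks the regex, because it contains a ' that gets replaced with an unescaped ".

For this case, you can work around the problem by only replacing a ' that is not followed by a " later on the same line: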

chunk = re.sub(r'\'(?![^\n"]*")', '"', re.search(r"\{[^}]+", html).group())
data = json.loads(re.sub(r"(\w+):", r'"\1":', chunk) + "}")

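...but this just swaps one set of assumptions for another, and it's easy to concoct a scenario that breaks this pattern. If the use case is scraping millions of products, it's almost inevitable that something unexpected will turn up and the patterns shown here will need further adjustment. This post is a proof of concept and makes no claim to parse arbitrary formats, so further adjustments are left as an exercise for the reader.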
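
To make the difference between the two patterns concrete, here is a small sketch that runs both against a script body containing the apostrophe label. The `script_text` literal is a hypothetical trimmed input (only the label is taken from OP's report), and `quote_keys_and_load` is an illustrative helper name.

```python
import json
import re

# Hypothetical trimmed script body; the label is the one OP reported.
script_text = """
        __metadata.product = {
            id: "W21-107341",
            ph4: '',
            label: "Girls' Burton Chicklet Flat Top Snowboard"
        };
"""


def quote_keys_and_load(chunk):
    """Quote bare JS keys, restore the closing brace, and parse as JSON."""
    return json.loads(re.sub(r"(\w+):", r'"\1":', chunk) + "}")


raw = re.search(r"\{[^}]+", script_text).group()

# First pattern: replacing every ' turns the apostrophe into a stray quote.
try:
    quote_keys_and_load(raw.replace("'", '"'))
    naive_ok = True
except json.JSONDecodeError:
    naive_ok = False

# Second pattern: only replace a ' that has no " later on the same line.
chunk = re.sub(r'\'(?![^\n"]*")', '"', raw)
data = quote_keys_and_load(chunk)

print(naive_ok)
print(data["label"])
```

A single-quoted value that itself contains a double quote would still break the second pattern, since its opening ' is followed by a " on the same line and is left unconverted.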

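It would be easier and more robust to just get the data in the ajax format. Simply add it to the params argument. Then you can pull whatever you want out of the JSON/dictionary. This also works for the other URL you provided in the comments.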

import requests

url = 'https://www.burton.com/us/en/p/burton-elite-long-sleeve-tshirt/W21-203921.html?cgid=womens-tees'
payload = {'format':'ajax'}

jsonData = requests.get(url, params=payload).json()

Output:

print(jsonData['data']['products'][0])
{'id': 'W21-203921', 'hideOutOfStockVariants': True, 'brand': 'Burton', 'name': 'Burton Elite Long Sleeve T-Shirt', 'subtitle': '100% Organic Cotton Long Sleeve Graphic T Shirt', 'shortDescription': "A comfortable long sleeve T-shirt that's an unsung favorite for social hour and Sunday in the park.", 'gender': 'Unisex', 'season': 'W21', 'isBoard': False, 'hasSizeChart': True, 'hasSizeFinder': False, 'selectedVariations': {'variationColor': '', 'variationSize': ''}, 'links': {'master': 'https://www.burton.com/us/en/p/burton-elite-long-sleeve-tshirt/W21-203921.html', 'variations': '/on/demandware.store/Sites-Burton_NA-Site/en_US/Product-GetVariationJSON?pid=W21-203921', 'manual': '/us/en/help/manuals.html', 'yotpoAPI': 'https://api.yotpo.com/v1/widget/AbBl1exDWS4rzXsg73rzUKlzUOo10aeMXRkIGHVG/products/W21-203921/reviews?per_page=0', 'tech': '/on/demandware.store/Sites-Burton_NA-Site/en_US/Product-GetTechFeaturesJSON?pids=W21-203921', 'recommendations': '/on/demandware.store/Sites-Burton_NA-Site/en_US/Product-GetRecommendationsJSON?pids=W21-203921', 'ultimateSetup': '/on/demandware.store/Sites-Burton_NA-Site/en_US/Product-GetRecommendationsJSON?pids=W21-203921', 'dynamicslots': '/on/demandware.store/Sites-Burton_NA-Site/en_US/Slot-GetDynamicSlots?pid=W21-203921'}, 'variationValueCount': {'variationColor': 4, 'variationSize': 7}, 'finePrint': [], 'images': {'type': 'PRODUCT_LEVEL', 'views': [{'id': '_4U', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_4U.png'}}, {'id': '_3W', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_3W.png'}}, {'id': '_4M', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_4M.png'}}, {'id': '_5M', 'active': True, 
'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_5M.png'}}, {'id': '_6W', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_6W.png'}}, {'id': '_1', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_1.png'}}], 'variationImageData': [{'variationColorID': '20392102001', 'display': {'category': {'primary': '_4U', 'focus': '_1'}}, 'views': [{'id': '_4U', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_4U.png'}}, {'id': '_3W', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_3W.png'}}, {'id': '_4M', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_4M.png'}}, {'id': '_5M', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_5M.png'}}, {'id': '_6W', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_6W.png'}}, {'id': '_1', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_1.png'}}]}, {'variationColorID': '20392102300', 'display': {'category': {'primary': '_4U', 'focus': '_1'}}, 'views': [{'id': '_4U', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 
'https://www.burton.com/static/product/W21/20392102300_4U.png'}}, {'id': '_3M', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102300_3M.png'}}, {'id': '_4W', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102300_4W.png'}}, {'id': '_5M', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102300_5M.png'}}, {'id': '_6W', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102300_6W.png'}}, {'id': '_1', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102300_1.png'}}]}, {'variationColorID': '20392103200', 'display': {'category': {'primary': '_4'}}, 'views': [{'id': '_4', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392103200_4.png'}}, {'id': '_3', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392103200_3.png'}}, {'id': '_5', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392103200_5.png'}}, {'id': '_6', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392103200_6.png'}}]}, {'variationColorID': '20392103400', 'display': {'category': {'primary': '_3'}}, 'views': [{'id': '_3', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 
'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392103400_3.png'}}, {'id': '_4', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392103400_4.png'}}, {'id': '_5', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392103400_5.png'}}, {'id': '_6', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392103400_6.png'}}]}]}, 'ean': '9009521451408', 'upc': '190450612509', 'badges': '', 'category': 'womens-tees', 'primaryCategory': 'womens-sale-sweaters-shirts', 'hasUltimateSetup': False, 'ph1': 'SOFTGOODS', 'ph2': 'BASIC FLEECE AND TEE', 'ph3': 'LS TEES', 'ph4': '', 'videoID': '', 'videoPoster': '', 'videoVertical': '', 'spectrumObjects': False, 'scrollingText': False, 'cartSpecialCalloutMessage': False, 'disableEcommerce': False}
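
Since the response is already a dict, getting the fields from the question is plain key lookup. A small sketch, using a trimmed copy of the output above as stand-in data so it runs without a network request; the `pick` helper is a hypothetical name, and `.get` is used on the assumption that some products may lack a key.

```python
# Trimmed stand-in for jsonData['data']['products'][0], copied from the output above.
product = {
    "id": "W21-203921",
    "brand": "Burton",
    "name": "Burton Elite Long Sleeve T-Shirt",
    "gender": "Unisex",
    "category": "womens-tees",
    "primaryCategory": "womens-sale-sweaters-shirts",
}


def pick(record, keys, default=None):
    """Return only the requested keys, filling missing ones with `default`."""
    return {key: record.get(key, default) for key in keys}


summary = pick(product, ["id", "brand", "category", "name"])
print(summary)
```

With the live response, the same call would be `pick(jsonData['data']['products'][0], [...])`.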

Update:

To get prices and stock, you need to extract the product ID from the first response and then make a new request:

import requests
import pandas as pd

urls = ['https://www.burton.com/us/en/p/burton-elite-long-sleeve-tshirt/W21-203921.html?cgid=womens-tees','https://www.burton.com/us/en/p/girls-burton-chicklet-flat-top-snowboard/W21-107341.html']
payload = {'format':'ajax'}

productID_list = []
for url in urls:
    jsonData = requests.get(url, params=payload).json()
    productID = jsonData['data']['masterID']
    productID_list.append(productID)


stock = []
for productID in productID_list:
    prod_url = 'https://www.burton.com/on/demandware.store/Sites-Burton_NA-Site/en_US/Product-GetVariationJSON'
    payload = {'pid':productID,
               'pricing':''}
    productData = requests.get(prod_url, params=payload).json()
    
    
    for each in productData['data']['variations']['variationValues']:
        row = {}
        row['name'] = each['name']
        row['color'] = each['variationColor']['displayName']
        row['size'] = each['variationSize']['displayName']
        row['standard_price'] = each['price']['standardPriceUnformatted']
        row['sale_price'] = each['price']['salePriceUnformatted']
        row['isOnSale'] = each['price']['isOnSale']
        row['available'] = each['status']['available']
        row['inStock'] = each['status']['meta']['type']
        
        stock.append(row)
    
df = pd.DataFrame(stock)    

Output:

print (df.to_string())
                                         name          color size standard_price sale_price  isOnSale  available        inStock
0            Burton Elite Long Sleeve T-Shirt     True Black    L          39.95                False       True       IN_STOCK
1            Burton Elite Long Sleeve T-Shirt     True Black    M          39.95                False       True       IN_STOCK
2            Burton Elite Long Sleeve T-Shirt     True Black    S          39.95                False       True       IN_STOCK
3            Burton Elite Long Sleeve T-Shirt     True Black   XL          39.95                False       True       IN_STOCK
4            Burton Elite Long Sleeve T-Shirt     True Black   XS          39.95                False       True       IN_STOCK
5            Burton Elite Long Sleeve T-Shirt     True Black  XXL          39.95                False       True       IN_STOCK
6            Burton Elite Long Sleeve T-Shirt     True Black  XXS          39.95                False       True       IN_STOCK
7            Burton Elite Long Sleeve T-Shirt  Martini Olive    L          39.95                False       True       IN_STOCK
8            Burton Elite Long Sleeve T-Shirt  Martini Olive    M          39.95                False       True       IN_STOCK
9            Burton Elite Long Sleeve T-Shirt  Martini Olive    S          39.95                False       True       IN_STOCK
10           Burton Elite Long Sleeve T-Shirt  Martini Olive   XL          39.95                False       True       IN_STOCK
11           Burton Elite Long Sleeve T-Shirt  Martini Olive   XS          39.95                False       True       IN_STOCK
12           Burton Elite Long Sleeve T-Shirt  Martini Olive  XXL          39.95                False       True       IN_STOCK
13           Burton Elite Long Sleeve T-Shirt  Martini Olive  XXS          39.95                False       True       IN_STOCK
14           Burton Elite Long Sleeve T-Shirt     True Penny    L          39.95      27.96      True       True       IN_STOCK
15           Burton Elite Long Sleeve T-Shirt     True Penny    M          39.95      27.96      True       True       IN_STOCK
16           Burton Elite Long Sleeve T-Shirt     True Penny    S          39.95      27.96      True      False      BACKORDER
17           Burton Elite Long Sleeve T-Shirt     True Penny   XL          39.95      27.96      True       True       IN_STOCK
18           Burton Elite Long Sleeve T-Shirt     True Penny   XS          39.95      27.96      True      False  NOT_AVAILABLE
19           Burton Elite Long Sleeve T-Shirt     True Penny  XXL          39.95      27.96      True       True       IN_STOCK
20           Burton Elite Long Sleeve T-Shirt     True Penny  XXS          39.95      27.96      True       True       IN_STOCK
21           Burton Elite Long Sleeve T-Shirt     Lapis Blue    L          39.95      27.96      True      False  NOT_AVAILABLE
22           Burton Elite Long Sleeve T-Shirt     Lapis Blue    M          39.95      27.96      True      False      BACKORDER
23           Burton Elite Long Sleeve T-Shirt     Lapis Blue    S          39.95      27.96      True      False      BACKORDER
24           Burton Elite Long Sleeve T-Shirt     Lapis Blue   XL          39.95      27.96      True      False  NOT_AVAILABLE
25           Burton Elite Long Sleeve T-Shirt     Lapis Blue   XS          39.95      27.96      True       True       IN_STOCK
26           Burton Elite Long Sleeve T-Shirt     Lapis Blue  XXL          39.95      27.96      True      False      BACKORDER
27           Burton Elite Long Sleeve T-Shirt     Lapis Blue  XXS          39.95      27.96      True       True       IN_STOCK
28  Girls' Burton Chicklet Flat Top Snowboard             80   80         199.95                False      False      BACKORDER
29  Girls' Burton Chicklet Flat Top Snowboard             90   90         199.95                False      False      BACKORDER
30  Girls' Burton Chicklet Flat Top Snowboard            100  100         199.95                False      False      BACKORDER
31  Girls' Burton Chicklet Flat Top Snowboard            110  110         199.95                False      False      BACKORDER
32  Girls' Burton Chicklet Flat Top Snowboard            115  115         199.95                False      False      BACKORDER
33  Girls' Burton Chicklet Flat Top Snowboard            120  120         199.95                False       True       IN_STOCK
34  Girls' Burton Chicklet Flat Top Snowboard            125  125         199.95                False       True       IN_STOCK
35  Girls' Burton Chicklet Flat Top Snowboard            130  130         199.95                False       True       IN_STOCK
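
Once the rows are collected, they can be filtered or summarised before (or instead of) building the DataFrame. The sketch below works on a few rows shaped like the entries appended to `stock`, with values copied from the table above; treating the prices as floats is an assumption, since the table does not show their type.

```python
from collections import Counter

# Sample rows in the same shape as the dicts appended to `stock`.
# Values come from the table above; float prices are an assumption.
stock = [
    {"name": "Burton Elite Long Sleeve T-Shirt", "color": "True Penny", "size": "L",
     "standard_price": 39.95, "sale_price": 27.96, "isOnSale": True,
     "available": True, "inStock": "IN_STOCK"},
    {"name": "Burton Elite Long Sleeve T-Shirt", "color": "True Penny", "size": "S",
     "standard_price": 39.95, "sale_price": 27.96, "isOnSale": True,
     "available": False, "inStock": "BACKORDER"},
    {"name": "Burton Elite Long Sleeve T-Shirt", "color": "True Penny", "size": "XS",
     "standard_price": 39.95, "sale_price": 27.96, "isOnSale": True,
     "available": False, "inStock": "NOT_AVAILABLE"},
    {"name": "Burton Elite Long Sleeve T-Shirt", "color": "Lapis Blue", "size": "XS",
     "standard_price": 39.95, "sale_price": 27.96, "isOnSale": True,
     "available": True, "inStock": "IN_STOCK"},
]

# Colour/size pairs that can be bought right now.
buyable = [(row["color"], row["size"]) for row in stock if row["available"]]

# How many variations fall into each stock status.
status_counts = Counter(row["inStock"] for row in stock)

print(buyable)
print(dict(status_counts))
```

The same filter on the DataFrame would be `df[df['available']]`.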

