
Extract data from <script> BeautifulSoup Python

I have this code:

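import json
import requests
from bs4 import BeautifulSoup

# headers was not shown in the original post; this value is a placeholder.
headers = {'User-Agent': 'Mozilla/5.0'}
product_url = 'https://www.burton.com/us/en/p/burton-elite-long-sleeve-tshirt/W21-203921.html?cgid=womens-tees'
res = requests.get(product_url, headers=headers)
soup = BeautifulSoup(res.text, 'html.parser')
product = soup.find('main', {'id': 'main-content'})
details = product.find('script')
data = json.loads(details.string)  # fails: the script holds JavaScript, not JSON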

Printing details gives this output:

<script>
        __metadata.product = {
            id: "W21-203921",
            sku: "W21-203921",
            ph1: 'SOFTGOODS',
            ph2: 'BASIC FLEECE AND TEE',
            ph3: 'LS TEES',
            ph4: '',
            upc: '190450612509',
            ean: '9009521451408',
            brand: "Burton",
            category: "womens-tees",
            primaryCategory: "womens-sale-sweaters-shirts",
            currency: "USD",
            gender: "Unisex",
            label: "Burton Elite Long Sleeve T-Shirt",
            name: "Burton Elite Long Sleeve T-Shirt"
        };
        __metadata.criteo = {
            pageType: 'ProductPage'
        };
    </script>

Now I want to extract some of this data like id, brand, category, and name.

I have looked at pretty much every thread on this forum with very similar questions and tried their solution, and nothing ever works. Most of them do something along the lines of data = json.loads(details) in various ways and none of them seems to work. The most common errors I get are:

json.decoder.JSONDecodeError: Expecting value: line 2 column 9 (char 9)

or

TypeError: the JSON object must be str, bytes or bytearray, not Tag

I'm leaving the answer below for posterity, but the approach in the other answer (requesting the page with format=ajax) is better. Moral of the story: check the XHR requests and see if you can circumvent the string parsing entirely by working with their API.


As I wrote in a comment, there are so many different assumptions you could make about this data and equally many strategies you could use to extract it.

Which one you use depends on several factors: is this data format likely to change? Is it a one-off scrape, or something that needs to stay resilient to as many future modifications as possible? If the latter, which modifications seem most likely based on your knowledge of the site?

Given that these questions weren't addressed, I assume you just want to parse it into a dict as simply as possible without making all sorts of futureproofing assumptions.

You can use:

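import json
import re

# html is assumed to hold the text of the <script> tag shown in the question.
# Take everything from the first "{" up to the first "}", then swap ' for ".
chunk = re.search(r"\{[^}]+", html).group().replace("'", '"')
# Quote the bare keys and restore the closing brace so the text is valid JSON.
data = json.loads(re.sub(r"(\w+):", r'"\1":', chunk) + "}")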

This assumes that no braces appear inside the JS object, no colons appear inside the strings, and so on.
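As a self-contained check, the sketch below applies the same two steps to the script text from the question (shortened) and pulls out the four fields that were asked for. The names script_text, js_object_to_dict and wanted are illustrative and not part of the original code, and the text is pasted in as a literal so the example runs without a network request.

```python
import json
import re

# Script contents copied from the question, shortened to a few fields.
script_text = """
        __metadata.product = {
            id: "W21-203921",
            sku: "W21-203921",
            ph1: 'SOFTGOODS',
            ph4: '',
            brand: "Burton",
            category: "womens-tees",
            name: "Burton Elite Long Sleeve T-Shirt"
        };
        __metadata.criteo = {
            pageType: 'ProductPage'
        };
"""


def js_object_to_dict(text):
    """Convert the first flat JS object literal in text to a dict.

    Assumes no nested braces, no colons inside strings, and no
    apostrophes inside double-quoted strings.
    """
    # Everything from the first "{" up to (not including) the first "}".
    chunk = re.search(r"\{[^}]+", text).group().replace("'", '"')
    # Quote the bare keys, then restore the closing brace.
    return json.loads(re.sub(r"(\w+):", r'"\1":', chunk) + "}")


data = js_object_to_dict(script_text)
wanted = {key: data[key] for key in ("id", "brand", "category", "name")}
print(wanted)
```

Only the first object (__metadata.product) is parsed; the second one is ignored because the search stops at the first closing brace.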


As an example of the above warning, OP has replied that the entry label: "Girls' Burton Chicklet Flat Top Snowboard" breaks the regex, because the ' inside the string is replaced with an unescaped ".

This can be fixed for this case by replacing a ' only when no " follows it on the same line:

chunk = re.sub(r'\'(?![^\n"]*")', '"', re.search(r"\{[^}]+", html).group())
data = json.loads(re.sub(r"(\w+):", r'"\1":', chunk) + "}")

...but this merely replaces one set of assumptions with another, and it's easy to concoct a scenario that breaks this pattern as well. If the use case is scraping millions of products, it's almost inevitable that something unanticipated will arise and the patterns shown here will need further adaptation. This post is a proof-of-concept and can't purport to parse arbitrary formats, so it's an exercise for the reader to make further adjustments.
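One possible further adjustment, shown here only as a sketch, is to skip the conversion to JSON and read the object one line at a time. This is a different technique from the regexes above, and it trades their assumptions for new ones: one key/value pair per line, every value a quoted string, and no "};" inside a value. The names PAIR, parse_flat_js_object and sample are made up for illustration, and the note field is a hypothetical value added to show a colon inside a string.

```python
import re

# One pair per line: a bare key, a colon, a quoted value, an optional comma.
# The backreference \2 makes the closing quote match the opening one.
PAIR = re.compile(r"""^\s*(\w+):\s*(["'])(.*?)\2,?\s*$""", re.MULTILINE)


def parse_flat_js_object(text):
    """Collect key/value pairs from a flat JS object written one pair per line."""
    # Keep only the text between the first "{" and the first "};" after it.
    body = text.split("{", 1)[1].split("};", 1)[0]
    return {key: value for key, _quote, value in PAIR.findall(body)}


sample = """
        __metadata.product = {
            id: "W21-107341",
            ph4: '',
            label: "Girls' Burton Chicklet Flat Top Snowboard",
            note: "Size: 80"
        };
"""

parsed = parse_flat_js_object(sample)
print(parsed["label"])
print(parsed["note"])
```

Because the value is taken up to the last matching quote on the line, an apostrophe or colon inside a string no longer matters, but escape sequences such as \" are left undecoded and non-string values (numbers, booleans, nested objects) are skipped.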

It'll be far easier and more robust to request the page in its ajax format: pass format=ajax through the params argument of requests.get. The response is JSON, so you can pull whatever you want out of the resulting dictionary. This works for the other URL you provided in the comments too.

import requests

url = 'https://www.burton.com/us/en/p/burton-elite-long-sleeve-tshirt/W21-203921.html?cgid=womens-tees'
payload = {'format':'ajax'}

jsonData = requests.get(url, params=payload).json()

Output:

print(jsonData['data']['products'][0])
{'id': 'W21-203921', 'hideOutOfStockVariants': True, 'brand': 'Burton', 'name': 'Burton Elite Long Sleeve T-Shirt', 'subtitle': '100% Organic Cotton Long Sleeve Graphic T Shirt', 'shortDescription': "A comfortable long sleeve T-shirt that's an unsung favorite for social hour and Sunday in the park.", 'gender': 'Unisex', 'season': 'W21', 'isBoard': False, 'hasSizeChart': True, 'hasSizeFinder': False, 'selectedVariations': {'variationColor': '', 'variationSize': ''}, 'links': {'master': 'https://www.burton.com/us/en/p/burton-elite-long-sleeve-tshirt/W21-203921.html', 'variations': '/on/demandware.store/Sites-Burton_NA-Site/en_US/Product-GetVariationJSON?pid=W21-203921', 'manual': '/us/en/help/manuals.html', 'yotpoAPI': 'https://api.yotpo.com/v1/widget/AbBl1exDWS4rzXsg73rzUKlzUOo10aeMXRkIGHVG/products/W21-203921/reviews?per_page=0', 'tech': '/on/demandware.store/Sites-Burton_NA-Site/en_US/Product-GetTechFeaturesJSON?pids=W21-203921', 'recommendations': '/on/demandware.store/Sites-Burton_NA-Site/en_US/Product-GetRecommendationsJSON?pids=W21-203921', 'ultimateSetup': '/on/demandware.store/Sites-Burton_NA-Site/en_US/Product-GetRecommendationsJSON?pids=W21-203921', 'dynamicslots': '/on/demandware.store/Sites-Burton_NA-Site/en_US/Slot-GetDynamicSlots?pid=W21-203921'}, 'variationValueCount': {'variationColor': 4, 'variationSize': 7}, 'finePrint': [], 'images': {'type': 'PRODUCT_LEVEL', 'views': [{'id': '_4U', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_4U.png'}}, {'id': '_3W', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_3W.png'}}, {'id': '_4M', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_4M.png'}}, {'id': '_5M', 'active': True, 
'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_5M.png'}}, {'id': '_6W', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_6W.png'}}, {'id': '_1', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_1.png'}}], 'variationImageData': [{'variationColorID': '20392102001', 'display': {'category': {'primary': '_4U', 'focus': '_1'}}, 'views': [{'id': '_4U', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_4U.png'}}, {'id': '_3W', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_3W.png'}}, {'id': '_4M', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_4M.png'}}, {'id': '_5M', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_5M.png'}}, {'id': '_6W', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_6W.png'}}, {'id': '_1', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_1.png'}}]}, {'variationColorID': '20392102300', 'display': {'category': {'primary': '_4U', 'focus': '_1'}}, 'views': [{'id': '_4U', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 
'https://www.burton.com/static/product/W21/20392102300_4U.png'}}, {'id': '_3M', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102300_3M.png'}}, {'id': '_4W', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102300_4W.png'}}, {'id': '_5M', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102300_5M.png'}}, {'id': '_6W', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102300_6W.png'}}, {'id': '_1', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102300_1.png'}}]}, {'variationColorID': '20392103200', 'display': {'category': {'primary': '_4'}}, 'views': [{'id': '_4', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392103200_4.png'}}, {'id': '_3', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392103200_3.png'}}, {'id': '_5', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392103200_5.png'}}, {'id': '_6', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392103200_6.png'}}]}, {'variationColorID': '20392103400', 'display': {'category': {'primary': '_3'}}, 'views': [{'id': '_3', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 
'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392103400_3.png'}}, {'id': '_4', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392103400_4.png'}}, {'id': '_5', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392103400_5.png'}}, {'id': '_6', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392103400_6.png'}}]}]}, 'ean': '9009521451408', 'upc': '190450612509', 'badges': '', 'category': 'womens-tees', 'primaryCategory': 'womens-sale-sweaters-shirts', 'hasUltimateSetup': False, 'ph1': 'SOFTGOODS', 'ph2': 'BASIC FLEECE AND TEE', 'ph3': 'LS TEES', 'ph4': '', 'videoID': '', 'videoPoster': '', 'videoVertical': '', 'spectrumObjects': False, 'scrollingText': False, 'cartSpecialCalloutMessage': False, 'disableEcommerce': False}
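To narrow that dictionary down to the fields asked about in the question, something like the following sketch could be used. The product dict here is a hand-trimmed copy of the output above, so the example runs without a request, and pick_fields and summary are illustrative names rather than part of any library.

```python
# Trimmed copy of jsonData['data']['products'][0] from the output above.
product = {
    'id': 'W21-203921',
    'brand': 'Burton',
    'name': 'Burton Elite Long Sleeve T-Shirt',
    'gender': 'Unisex',
    'category': 'womens-tees',
    'primaryCategory': 'womens-sale-sweaters-shirts',
}


def pick_fields(record, keys):
    """Return only the requested keys; keys missing from record map to None."""
    return {key: record.get(key) for key in keys}


summary = pick_fields(product, ['id', 'brand', 'category', 'name'])
print(summary)
```

Using record.get instead of record[key] is a design choice for scraping many pages: a product that lacks one of the fields yields None rather than raising KeyError.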

Update:

To get price and stock, you need to pull out the product ID from that first response, then make a new request:

import requests
import pandas as pd

urls = ['https://www.burton.com/us/en/p/burton-elite-long-sleeve-tshirt/W21-203921.html?cgid=womens-tees','https://www.burton.com/us/en/p/girls-burton-chicklet-flat-top-snowboard/W21-107341.html']
payload = {'format':'ajax'}

# Step 1: collect the master product ID from each page's ajax response.
productID_list = []
for url in urls:
    jsonData = requests.get(url, params=payload).json()
    productID_list.append(jsonData['data']['masterID'])

# Step 2: request the variation data (price and stock) for each product ID.
prod_url = 'https://www.burton.com/on/demandware.store/Sites-Burton_NA-Site/en_US/Product-GetVariationJSON'
stock = []
for productID in productID_list:
    variation_params = {'pid': productID, 'pricing': ''}
    productData = requests.get(prod_url, params=variation_params).json()

    # One row per color/size variation.
    for each in productData['data']['variations']['variationValues']:
        row = {
            'name': each['name'],
            'color': each['variationColor']['displayName'],
            'size': each['variationSize']['displayName'],
            'standard_price': each['price']['standardPriceUnformatted'],
            'sale_price': each['price']['salePriceUnformatted'],
            'isOnSale': each['price']['isOnSale'],
            'available': each['status']['available'],
            'inStock': each['status']['meta']['type'],
        }
        stock.append(row)

df = pd.DataFrame(stock)

Output:

print(df.to_string())
                                         name          color size standard_price sale_price  isOnSale  available        inStock
0            Burton Elite Long Sleeve T-Shirt     True Black    L          39.95                False       True       IN_STOCK
1            Burton Elite Long Sleeve T-Shirt     True Black    M          39.95                False       True       IN_STOCK
2            Burton Elite Long Sleeve T-Shirt     True Black    S          39.95                False       True       IN_STOCK
3            Burton Elite Long Sleeve T-Shirt     True Black   XL          39.95                False       True       IN_STOCK
4            Burton Elite Long Sleeve T-Shirt     True Black   XS          39.95                False       True       IN_STOCK
5            Burton Elite Long Sleeve T-Shirt     True Black  XXL          39.95                False       True       IN_STOCK
6            Burton Elite Long Sleeve T-Shirt     True Black  XXS          39.95                False       True       IN_STOCK
7            Burton Elite Long Sleeve T-Shirt  Martini Olive    L          39.95                False       True       IN_STOCK
8            Burton Elite Long Sleeve T-Shirt  Martini Olive    M          39.95                False       True       IN_STOCK
9            Burton Elite Long Sleeve T-Shirt  Martini Olive    S          39.95                False       True       IN_STOCK
10           Burton Elite Long Sleeve T-Shirt  Martini Olive   XL          39.95                False       True       IN_STOCK
11           Burton Elite Long Sleeve T-Shirt  Martini Olive   XS          39.95                False       True       IN_STOCK
12           Burton Elite Long Sleeve T-Shirt  Martini Olive  XXL          39.95                False       True       IN_STOCK
13           Burton Elite Long Sleeve T-Shirt  Martini Olive  XXS          39.95                False       True       IN_STOCK
14           Burton Elite Long Sleeve T-Shirt     True Penny    L          39.95      27.96      True       True       IN_STOCK
15           Burton Elite Long Sleeve T-Shirt     True Penny    M          39.95      27.96      True       True       IN_STOCK
16           Burton Elite Long Sleeve T-Shirt     True Penny    S          39.95      27.96      True      False      BACKORDER
17           Burton Elite Long Sleeve T-Shirt     True Penny   XL          39.95      27.96      True       True       IN_STOCK
18           Burton Elite Long Sleeve T-Shirt     True Penny   XS          39.95      27.96      True      False  NOT_AVAILABLE
19           Burton Elite Long Sleeve T-Shirt     True Penny  XXL          39.95      27.96      True       True       IN_STOCK
20           Burton Elite Long Sleeve T-Shirt     True Penny  XXS          39.95      27.96      True       True       IN_STOCK
21           Burton Elite Long Sleeve T-Shirt     Lapis Blue    L          39.95      27.96      True      False  NOT_AVAILABLE
22           Burton Elite Long Sleeve T-Shirt     Lapis Blue    M          39.95      27.96      True      False      BACKORDER
23           Burton Elite Long Sleeve T-Shirt     Lapis Blue    S          39.95      27.96      True      False      BACKORDER
24           Burton Elite Long Sleeve T-Shirt     Lapis Blue   XL          39.95      27.96      True      False  NOT_AVAILABLE
25           Burton Elite Long Sleeve T-Shirt     Lapis Blue   XS          39.95      27.96      True       True       IN_STOCK
26           Burton Elite Long Sleeve T-Shirt     Lapis Blue  XXL          39.95      27.96      True      False      BACKORDER
27           Burton Elite Long Sleeve T-Shirt     Lapis Blue  XXS          39.95      27.96      True       True       IN_STOCK
28  Girls' Burton Chicklet Flat Top Snowboard             80   80         199.95                False      False      BACKORDER
29  Girls' Burton Chicklet Flat Top Snowboard             90   90         199.95                False      False      BACKORDER
30  Girls' Burton Chicklet Flat Top Snowboard            100  100         199.95                False      False      BACKORDER
31  Girls' Burton Chicklet Flat Top Snowboard            110  110         199.95                False      False      BACKORDER
32  Girls' Burton Chicklet Flat Top Snowboard            115  115         199.95                False      False      BACKORDER
33  Girls' Burton Chicklet Flat Top Snowboard            120  120         199.95                False       True       IN_STOCK
34  Girls' Burton Chicklet Flat Top Snowboard            125  125         199.95                False       True       IN_STOCK
35  Girls' Burton Chicklet Flat Top Snowboard            130  130         199.95                False       True       IN_STOCK
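If only the sizes that can be ordered right now are of interest, the stock list can be filtered before (or instead of) building the DataFrame. The sketch below uses plain Python on a few rows copied from the table above; in_stock_sizes is a hypothetical helper name, and the rows are trimmed to the columns the filter needs.

```python
# A few rows in the shape built by the loop above (values from the printed table).
stock = [
    {'name': 'Burton Elite Long Sleeve T-Shirt', 'color': 'True Penny',
     'size': 'L', 'available': True, 'inStock': 'IN_STOCK'},
    {'name': 'Burton Elite Long Sleeve T-Shirt', 'color': 'True Penny',
     'size': 'S', 'available': False, 'inStock': 'BACKORDER'},
    {'name': 'Burton Elite Long Sleeve T-Shirt', 'color': 'True Penny',
     'size': 'XL', 'available': True, 'inStock': 'IN_STOCK'},
    {'name': "Girls' Burton Chicklet Flat Top Snowboard", 'color': '80',
     'size': '80', 'available': False, 'inStock': 'BACKORDER'},
    {'name': "Girls' Burton Chicklet Flat Top Snowboard", 'color': '120',
     'size': '120', 'available': True, 'inStock': 'IN_STOCK'},
]


def in_stock_sizes(rows):
    """Group the sizes whose status is IN_STOCK by (name, color)."""
    grouped = {}
    for row in rows:
        if row['inStock'] == 'IN_STOCK':
            grouped.setdefault((row['name'], row['color']), []).append(row['size'])
    return grouped


for (name, color), sizes in in_stock_sizes(stock).items():
    print(name, color, sizes)
```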

