
Extracting JavaScript object from a <script> tag in Python and parsing the json

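I am using requests_html to scrape a page with product information from a website, and a small part of the HTML I need is inside a <script> tag.

This is the code that returns the JavaScript: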

from requests_html import HTMLSession

link = 'https://www.rimi.lv/e-veikals/en/products/vegan-and-vegetarian-/plant-based-beverages/auzu-dzeriens-barista-kafijai-bezglut-uht-1l/p/957905'

s = HTMLSession()
r = s.get(link)
script_html = r.html.find('div.cart-layout__main', first=True).find('script')[1].html

print(script_html)

Is there a way to parse the html part of it so that it returns all the text? I mean the one in tabs[0].html.

<script>
    Config.product_details_page = {
        texts: {
            tab_loading_title: 'Loading',
            tab_loading_text: 'Loading data',
        },
        tabs: [
            {
                index: 0,
                identifier: 'details',
                name: "About the product",
                icon: '&lt;svg class="" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 48 48"&gt;&lt;g fill="none" stroke="currentColor" stroke-width="2" stroke-miterlimit="10"&gt;&lt;circle cx="24" cy="24" r="23"/&gt;&lt;path d="M24 30v-1.6c0-2.1 1.1-4.1 3-5.2 2.9-1.7 3.9-5.3 2.2-8.2-1.7-2.9-5.3-3.9-8.2-2.2-1.8 1.1-3 3-3 5.2"/&gt;&lt;circle cx="24" cy="35" r="2"/&gt;&lt;/g&gt;&lt;/svg&gt;',
                html: "&lt;div class=\u0022product__details\u0022&gt;\n    &lt;div class=\u0022container\u0022&gt;\n        &lt;div class=\u0022product-details\u0022&gt;\n    &lt;div class=\u0022product__list-wrapper\u0022&gt;\n        &lt;ul class=\u0022list\u0022&gt;\n                    &lt;li class=\u0022item\u0022&gt;\n                                    &lt;span&gt;Country of origin&lt;\/span&gt;\n                                                    &lt;p&gt;Finland&lt;\/p&gt;\n                            &lt;\/li&gt;\n                    &lt;li class=\u0022item\u0022&gt;\n                                    &lt;span&gt;Brand&lt;\/span&gt;\n                                                    &lt;p&gt;Valio&lt;\/p&gt;\n                            &lt;\/li&gt;\n                    &lt;li class=\u0022item\u0022&gt;\n                                    &lt;span&gt;Producer&lt;\/span&gt;\n                                                    &lt;p&gt;VALIO OY&lt;\/p&gt;\n                            &lt;\/li&gt;\n                    &lt;li class=\u0022item\u0022&gt;\n                                    &lt;span&gt;Amount&lt;\/span&gt;\n                                                    &lt;p&gt;1 kg&lt;\/p&gt;\n                            &lt;\/li&gt;\n            &lt;\/ul&gt;\n&lt;\/div&gt;\n        &lt;div class=\u0022product__list-wrapper\u0022&gt;\n            &lt;p class=\u0022heading\u0022&gt;Ingredients&lt;\/p&gt;\n        &lt;ul class=\u0022list\u0022&gt;\n                    &lt;li class=\u0022item\u0022&gt;\n                                                    &lt;p&gt;AUZU b\u0101ze ( \u016bdens, bezglut\u0113na AUZU milti, kalcijs, s\u0101ls ), \u016bdens, rap\u0161a e\u013c\u013ca, sk\u0101buma regul\u0113t\u0101ji ( k\u0101lija fosf\u0101ti ), jods, vitam\u012bni ( riboflav\u012bns ( B2 ), B12 un D2 ) \n\n&lt;\/p&gt;\n                            &lt;\/li&gt;\n            &lt;\/ul&gt;\n&lt;\/div&gt;\n        &lt;div class=\u0022product__list-wrapper 
-simple\u0022&gt;\n            &lt;p class=\u0022heading\u0022&gt;Additional information&lt;\/p&gt;\n        &lt;ul class=\u0022list\u0022&gt;\n                    &lt;li class=\u0022item\u0022&gt;\n                                                    &lt;p&gt;Auzu saturs 10%&lt;\/p&gt;\n                            &lt;\/li&gt;\n                    &lt;li class=\u0022item\u0022&gt;\n                                                    &lt;p&gt;Min storage temp.: 2\u00b0 C&lt;\/p&gt;\n                            &lt;\/li&gt;\n                    &lt;li class=\u0022item\u0022&gt;\n                                                    &lt;p&gt;Max storage temp.: 25\u00b0 C&lt;\/p&gt;\n                            &lt;\/li&gt;\n            &lt;\/ul&gt;\n&lt;\/div&gt;\n            &lt;div class=\u0022product__list-wrapper\u0022&gt;\n            &lt;p class=\u0022heading\u0022&gt;Nutrition Facts&lt;\/p&gt;\n        &lt;ul class=\u0022list\u0022&gt;\n                    &lt;li class=\u0022item\u0022&gt;\n                                                    &lt;p&gt;Amount per 100g&lt;\/p&gt;\n                            &lt;\/li&gt;\n            &lt;\/ul&gt;\n&lt;\/div&gt;\n        &lt;div class=\u0022product__table\u0022&gt;\n    &lt;div&gt;\n        &lt;table&gt;\n            &lt;thead&gt;\n            &lt;tr&gt;\n                                    &lt;th&gt;Nutrition&lt;\/th&gt;\n                                    &lt;th&gt;Amount per 100g\/ml&lt;\/th&gt;\n                            &lt;\/tr&gt;\n            &lt;\/thead&gt;\n            &lt;tbody&gt;\n                            &lt;tr&gt;\n                    &lt;td &gt;\n                        energy\n                    &lt;\/td&gt;\n                    &lt;td&gt;\n                        243 kJ\/ 58 kcal\n                    &lt;\/td&gt;\n                &lt;\/tr&gt;\n                            &lt;tr&gt;\n                    &lt;td &gt;\n                        fat\n                    &lt;\/td&gt;\n                
    &lt;td&gt;\n                        3 g\n                    &lt;\/td&gt;\n                &lt;\/tr&gt;\n                            &lt;tr&gt;\n                    &lt;td  class=\u0022indent\u0022&gt;\n                        of which saturates\n                    &lt;\/td&gt;\n                    &lt;td&gt;\n                        0.3 g\n                    &lt;\/td&gt;\n                &lt;\/tr&gt;\n                            &lt;tr&gt;\n                    &lt;td &gt;\n                        carbohydrate\n                    &lt;\/td&gt;\n                    &lt;td&gt;\n                        6.6 g\n                    &lt;\/td&gt;\n                &lt;\/tr&gt;\n                            &lt;tr&gt;\n                    &lt;td  class=\u0022indent\u0022&gt;\n                        of which sugars\n                    &lt;\/td&gt;\n                    &lt;td&gt;\n                        3.5 g\n                    &lt;\/td&gt;\n                &lt;\/tr&gt;\n                            &lt;tr&gt;\n                    &lt;td &gt;\n                        protein\n                    &lt;\/td&gt;\n                    &lt;td&gt;\n                        1.2 g\n                    &lt;\/td&gt;\n                &lt;\/tr&gt;\n                            &lt;tr&gt;\n                    &lt;td &gt;\n                        salt\n                    &lt;\/td&gt;\n                    &lt;td&gt;\n                        0.1 g\n                    &lt;\/td&gt;\n                &lt;\/tr&gt;\n                        &lt;\/tbody&gt;\n        &lt;\/table&gt;\n    &lt;\/div&gt;\n&lt;\/div&gt;                            &lt;div class=\u0022product__list-wrapper\u0022&gt;\n            &lt;p class=\u0022heading\u0022&gt;Allergens&lt;\/p&gt;\n        &lt;ul class=\u0022list\u0022&gt;\n                    &lt;li class=\u0022item\u0022&gt;\n                                                    &lt;p&gt;Cereals&lt;\/p&gt;\n                            &lt;\/li&gt;\n            
&lt;\/ul&gt;\n&lt;\/div&gt;\n        &lt;p class=\u0022product__disclaimer\u0022&gt;While every care has been taken to ensure product information is correct, food products are constantly being reformulated, so ingredients, nutrition content, dietary and allergens may change. You should always read the product label and not rely solely on the information provided on the website. Base price and offer may be different in other Rimi stores.&lt;\/p&gt;&lt;\/div&gt;\n\n        &lt;div class=\u0022product__card\u0022&gt;\n            &lt;div data-product-code=\u0022957905\u0022\n     class=\u0022js-product-container card\n                -horizontal-for-mobile\u0022\n     data-gtms-banner-title=\u0022Auzu dz\u0113riens Barista kafijai bezglut. UHT 1l\u0022\n     data-gtms-click-name=\u0022Auzu dz\u0113riens Barista kafijai bezglut. UHT 1l\u0022\n     data-gtms-product-id=\u0022957905\u0022\n     data-gtm-eec-product='{\u0022id\u0022:\u0022957905\u0022,\u0022name\u0022:\u0022Auzu dz\\u0113riens Barista kafijai bezglut. UHT 1l\u0022,\u0022category\u0022:\u0022SH-11-10-2\\\/SH-16\\\/SH\u0022,\u0022brand\u0022:\u0022Valio\u0022,\u0022price\u0022:2.69,\u0022currency\u0022:\u0022EUR\u0022}'\n     &gt;\n    &lt;a class=\u0022card__url js-gtm-eec-product-click\u0022 href=\u0022\/e-veikals\/en\/products\/vegan-and-vegetarian-\/plant-based-beverages\/auzu-dzeriens-barista-kafijai-bezglut-uht-1l\/p\/957905\u0022\n       aria-label=\u0022Go to product page\u0022&gt;&lt;\/a&gt;\n            &lt;div class=\u0022card__image-wrapper\u0022&gt;\n        &lt;div&gt;\n                            &lt;img src=\u0022https:\/\/rimibaltic-res.cloudinary.com\/image\/upload\/b_white,c_fit,f_auto,h_480,q_auto,w_480\/d_ecommerce:backend-fallback.png\/MAT_957905_PCE_LV\u0022 alt=\u0022Auzu dz\u0113riens Barista kafijai bezglut. 
UHT 1l\u0022&gt;\n                                                                &lt;span class=\u0022type-badge\u0022&gt;\n            &lt;img src=\u0022https:\/\/rimibaltic-web-res.cloudinary.com\/image\/upload\/f_png,h_32,q_auto\/v1\/ecom-cms\/b821da9405a9fe157949ca40850238c81d90542f\u0022  title=\u0022Suitable for Vegans\u0022 &gt;\n            &lt;img src=\u0022https:\/\/rimibaltic-web-res.cloudinary.com\/image\/upload\/f_png,h_32,q_auto\/v1\/ecom-cms\/91c5d4f7982c687e299aaf2e8c985d63f66631dd\u0022  title=\u0022Gluten Free\u0022 &gt;\n            &lt;img src=\u0022https:\/\/rimibaltic-web-res.cloudinary.com\/image\/upload\/f_png,h_32,q_auto\/v1\/ecom-cms\/2e1c205f284be9cb954d044ffcfc33afe873ea08\u0022  title=\u0022Lactose Free\u0022 &gt;\n            &lt;img src=\u0022https:\/\/rimibaltic-web-res.cloudinary.com\/image\/upload\/f_png,h_32,q_auto\/v1\/ecom-cms\/e94c4a7ccc9aabb3b6ce9382a536f514acf72616\u0022  title=\u0022Dairy Free\u0022 &gt;\n    &lt;\/span&gt;        &lt;\/div&gt;\n    &lt;\/div&gt;\n    &lt;div class=\u0022card__details\u0022&gt;\n                                    &lt;p class=\u0022card__name\u0022&gt;Auzu dz\u0113riens Barista kafijai bezglut. 
UHT 1l&lt;\/p&gt;\n        &lt;div class=\u0022card__details-inner\u0022&gt;\n\n            &lt;div class=\u0022card__price-wrapper\u0022&gt;\n    \n        &lt;div class=\u0022price-tag card__price\u0022&gt;\n    &lt;span&gt;2&lt;\/span&gt;\n    &lt;div&gt;\n        &lt;sup&gt;69&lt;\/sup&gt;\n        &lt;sub&gt;\u20ac\/pcs.&lt;\/sub&gt;\n    &lt;\/div&gt;\n&lt;\/div&gt;\n        &lt;div&gt;\n\n            \n                            &lt;p class=\u0022card__price-per\u0022&gt;\n                    2,69\n                    \u20ac\n                    \/kg\n                &lt;\/p&gt;\n            \n        &lt;\/div&gt;\n    &lt;\/div&gt;\n\n\n            &lt;form class=\u0022favorite  card__favorite  js-login-prompt\u0022\n      action=\u0022\/e-veikals\/account\/login\/prompt\u0022&gt;\n    &lt;input type=\u0022hidden\u0022 name=\u0022_token\u0022 value=\u002267RNG9eJsKaHhthRxGbeoL97AiwFKSkcCd6RUaoR\u0022&gt;    &lt;input type=\u0022checkbox\u0022 name=\u0022favorite\u0022 value=\u0022\u0022 &gt;\n    &lt;button class=\u0022js-tooltip\u0022 type=\u0022submit\u0022\n            aria-label=\u0022Add to favorites\u0022\n            data-title=\u0022Add to favorites\u0022\n            data-add-name=\u0022Add to favorites\u0022\n            data-remove-name=\u0022Add to favorites\u0022\n            data-gtm-click-name=\u0022Add to favorites\u0022&gt;\n        &lt;span&gt;&lt;svg class=\u0022\u0022 xmlns=\u0022http:\/\/www.w3.org\/2000\/svg\u0022 viewBox=\u00220 0 48 48\u0022&gt;&lt;path d=\u0022M24 4l5.05 16L45 19.98l-12.83 8.79L36.98 44 24 34.71 11.02 44l4.81-15.23L3 19.98l15.95.02L24 4z\u0022 fill=\u0022none\u0022 stroke=\u0022currentColor\u0022 stroke-miterlimit=\u002210\u0022 stroke-width=\u00222\u0022\/&gt;&lt;\/svg&gt;&lt;\/span&gt;\n    &lt;\/button&gt;\n&lt;\/form&gt;\n\n                            \n                \n                &lt;form method=\u0022post\u0022 action=\u0022\/e-veikals\/cart\/change\u0022\n      class=\u0022js-add-to-cart 
card__cart-btn\u0022&gt;\n    &lt;input type=\u0022hidden\u0022 name=\u0022_token\u0022 value=\u002267RNG9eJsKaHhthRxGbeoL97AiwFKSkcCd6RUaoR\u0022&gt;    &lt;input type=\u0022hidden\u0022 name=\u0022_method\u0022 value=\u0022put\u0022&gt;    &lt;input type=\u0022hidden\u0022 name=\u0022product\u0022 value=\u0022957905\u0022&gt;\n    &lt;input type=\u0022hidden\u0022 name=\u0022amount\u0022 value=\u00221\u0022&gt;\n    &lt;button class=\u0022button -with-right-icon -cart gtm -small\u0022\n            type=\u0022submit\u0022\n            data-gtm-product-id=\u0022957905\u0022\n            data-gtm-event-category=\u0022addToBasket\u0022\n    &gt;\n        Add to cart\n        &lt;svg class=\u0022\u0022 xmlns=\u0022http:\/\/www.w3.org\/2000\/svg\u0022 viewBox=\u00220 0 48 48\u0022&gt;&lt;g fill=\u0022none\u0022 stroke=\u0022currentColor\u0022 stroke-miterlimit=\u002210\u0022 stroke-width=\u00222\u0022&gt;&lt;path d=\u0022M44 36H19.2c-3.9 0-7.2-2.8-7.9-6.6L6.5 1H0\u0022\/&gt;&lt;path d=\u0022M8 9h39l-2.4 11.6c-.9 4.4-4.7 7.6-9.1 7.9l-24 1.5\u0022\/&gt;&lt;circle cx=\u002215.5\u0022 cy=\u002243.5\u0022 r=\u00223.5\u0022\/&gt;&lt;circle cx=\u002239.5\u0022 cy=\u002243.5\u0022 r=\u00223.5\u0022\/&gt;&lt;\/g&gt;&lt;\/svg&gt;    &lt;\/button&gt;\n&lt;\/form&gt;\n\n                &lt;form class=\u0022counter  js-counter\u0022\n      method=\u0022post\u0022\n      action=\u0022\/e-veikals\/cart\/change\u0022\n&gt;\n    &lt;input type=\u0022hidden\u0022 name=\u0022_method\u0022 value=\u0022put\u0022&gt;    &lt;input type=\u0022hidden\u0022 name=\u0022_token\u0022 value=\u002267RNG9eJsKaHhthRxGbeoL97AiwFKSkcCd6RUaoR\u0022&gt;    &lt;input type=\u0022hidden\u0022 name=\u0022amount\u0022\n           value=\u00221\u0022\n           min=\u00221\u0022\n           max=\u002210\u0022\n           data-unit=\u0022Piece\u0022\n    &gt;\n    &lt;input type=\u0022hidden\u0022 name=\u0022step\u0022 value=\u00221\u0022&gt;\n    &lt;input type=\u0022hidden\u0022 name=\u0022product\u0022 
value=\u0022957905\u0022&gt;\n    &lt;button name=\u0022decrease\u0022\n            class=\u0022counter__subtract js-subtract\u0022\n            type=\u0022submit\u0022\n            aria-label=\u0022Decrease\u0022\n            data-gtm-ignore&gt;\n        &lt;svg class=\u0022\u0022 xmlns=\u0022http:\/\/www.w3.org\/2000\/svg\u0022 viewBox=\u00220 0 48 48\u0022&gt;&lt;path d=\u0022M8 24h32\u0022 fill=\u0022none\u0022 stroke=\u0022currentColor\u0022 stroke-width=\u00222\u0022 stroke-miterlimit=\u002210\u0022\/&gt;&lt;\/svg&gt;    &lt;\/button&gt;\n    &lt;span class=\u0022counter__number\u0022&gt;\n        1    &lt;\/span&gt;\n    &lt;button name=\u0022increase\u0022\n            class=\u0022counter__add js-add\u0022\n            type=\u0022submit\u0022\n            aria-label=\u0022Increase\u0022\n            data-gtm-ignore\n                &gt;\n        &lt;svg class=\u0022\u0022 xmlns=\u0022http:\/\/www.w3.org\/2000\/svg\u0022 viewBox=\u00220 0 48 48\u0022&gt;&lt;path d=\u0022M6 24h36M24 42V5.9\u0022 fill=\u0022none\u0022 stroke=\u0022currentColor\u0022 stroke-width=\u00222\u0022 stroke-miterlimit=\u002210\u0022\/&gt;&lt;\/svg&gt;    &lt;\/button&gt;\n\n&lt;\/form&gt;\n\n                &lt;form class=\u0022js-delete-from-cart delete-form\u0022 method=\u0022post\u0022 action=\u0022\/e-veikals\/cart\/change\u0022&gt;\n    &lt;input type=\u0022hidden\u0022 name=\u0022_method\u0022 value=\u0022put\u0022&gt;    &lt;input type=\u0022hidden\u0022 name=\u0022_token\u0022 value=\u002267RNG9eJsKaHhthRxGbeoL97AiwFKSkcCd6RUaoR\u0022&gt;    &lt;input type=\u0022hidden\u0022 value=\u0022957905\u0022 name=\u0022product\u0022&gt;\n    &lt;button class=\u0022cart-card__delete js-delete js-remove-from-cart \u0022\n            aria-label=\u0022Remove\u0022&gt;\n        &lt;svg class=\u0022\u0022 xmlns=\u0022http:\/\/www.w3.org\/2000\/svg\u0022 viewBox=\u00220 0 48 48\u0022&gt;&lt;path d=\u0022M10 10l28 28m-28 0l28-28\u0022 fill=\u0022none\u0022 stroke=\u0022currentColor\u0022 
stroke-width=\u00222\u0022 stroke-miterlimit=\u002210\u0022\/&gt;&lt;\/svg&gt;    &lt;\/button&gt;\n&lt;\/form&gt;\n            \n\n        &lt;\/div&gt;\n\n        &lt;p class=\u0022card__error\u0022&gt;\n            Maximum amount is reached\n        &lt;\/p&gt;\n\n    &lt;\/div&gt;\n&lt;\/div&gt;\n        &lt;\/div&gt;\n    &lt;\/div&gt;\n&lt;\/div&gt;\n",
            },
            {
                index: 1,
                identifier: 'recommendations',
                name: "Others have also bought",
                api_url: "/e-veikals/en/products/957905/recommendations",
                icon: '&lt;svg class="" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 48 48"&gt;&lt;path fill="none" stroke="currentColor" stroke-miterlimit="10" stroke-width="2" d="M8 1h32v40c0 3.3-2.7 6-6 6H14c-3.3 0-6-2.7-6-6V1zm0 26h32m-5-3v-6m0 18v-6"/&gt;&lt;/svg&gt;',
                html: null,
            },
        ]
    };
        Config.product_details_page.tabs.push({
        index: 2,
        identifier: 'recipes',
        name: "Recipes",
        api_url: "/e-veikals/en/products/957905/recipes",
        icon: '&lt;svg class="" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 48 48"&gt;&lt;path fill="none" stroke="currentColor" stroke-miterlimit="10" stroke-width="2" d="M38 47c-1.7 0-3-1.3-3-3V25.5l-1.7-1.7c-1.5-1.5-2.3-3.5-2.3-5.6V11c0-5.5 4.5-10 10-10v43c0 1.7-1.3 3-3 3zM24 1l1 13.1c0 1.9-1.2 3.7-2.4 5.1L19 23v21c0 1.7-1.3 3-3 3s-3-1.3-3-3V23l-3.6-3.8C8 17.8 7.2 16 7 14L8 1m5 0v14m6-14v14"/&gt;&lt;/svg&gt;',
        html: null,
    });
    </script>

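I tried loading it as text (text[30:-2] to get only the JavaScript object) and then passing it to demjson.decode(), but it seems the string has to be loaded in a specific way (as a literal), and I don't know how to do that.

Thanks!

Essentially, you only need the value associated with the html key in the specified block. You can pull that out with a regex. You then need some string cleaning to turn the escaped Unicode code points into the characters they stand for, so that the HTML can be parsed correctly (for the code point cleaning I use the answer by @Mark Tolonen linked below).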

import re

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.rimi.lv/e-veikals/en/products/vegan-and-vegetarian-/plant-based-beverages/auzu-dzeriens-barista-kafijai-bezglut-uht-1l/p/957905')

# Grab the double-quoted value of the html key that follows tabs
s = re.search(r'tabs.*html: "(.*?)"', r.text, re.S).group(1)

# Replace \uXXXX escapes with the characters they encode
# (https://stackoverflow.com/a/64071813 by @Mark Tolonen)
s = re.sub(r'\\u([0-9a-fA-F]{4})', lambda m: chr(int(m.group(1), 16)), s)

# Name the parser explicitly so bs4 does not warn about guessing one
soup = bs(s, 'html.parser')

print(soup)
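
As a possible variation on the steps above, the captured value can be treated as a string literal and decoded with json.loads, which also resolves the \/ and \n escapes that a \uXXXX-only substitution leaves in place. This is only a sketch under assumptions: SAMPLE_SCRIPT is a made-up, heavily shortened stand-in for the printed script, tab_html_text and TextCollector are illustrative names, and it assumes the value contains only escapes that JSON also accepts. html.unescape is there because the output shown in the question contains &lt; and &gt; entities.

```python
import html
import json
import re
from html.parser import HTMLParser

# Hypothetical, shortened stand-in for the <script> text printed in the question.
SAMPLE_SCRIPT = r'''
Config.product_details_page = {
    tabs: [
        {
            index: 0,
            identifier: 'details',
            html: "&lt;ul class=\u0022list\u0022&gt;\n &lt;li&gt;&lt;span&gt;Brand&lt;\/span&gt;&lt;p&gt;Valio&lt;\/p&gt;&lt;\/li&gt;\n&lt;\/ul&gt;",
        },
        { index: 1, identifier: 'recommendations', html: null },
    ]
};
'''


class TextCollector(HTMLParser):
    """Collects the non-empty text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def tab_html_text(script_text):
    """Return the text nodes of the first double-quoted html value."""
    # Match a complete double-quoted literal, allowing backslash escapes inside.
    match = re.search(r'html:\s*("(?:[^"\\]|\\.)*")', script_text, re.S)
    if match is None:
        return []
    # json.loads decodes \uXXXX, \/ and \n; strict=False tolerates raw control characters.
    decoded = json.loads(match.group(1), strict=False)
    # Turn &lt; / &gt; entities back into real markup before parsing.
    markup = html.unescape(decoded)
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return parser.chunks


print(tab_html_text(SAMPLE_SCRIPT))
```

With the real page, the same function could be given script_html from the question's code. Whether every escape in the live value is valid JSON is an assumption that would need checking.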

