Extracting JavaScript object from a <script> tag in Python and parsing the json

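I'm using requests_html to scrape a page with product information from a website, and a small part of the HTML I need sits inside a <script> tag.

Here is the code that returns the JavaScript: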

from requests_html import HTMLSession

link = 'https://www.rimi.lv/e-veikals/en/products/vegan-and-vegetarian-/plant-based-beverages/auzu-dzeriens-barista-kafijai-bezglut-uht-1l/p/957905'

s = HTMLSession()
r = s.get(link)
script_html = r.html.find('div.cart-layout__main', first=True).find('script')[1].html

print(script_html)

Is there a way to parse the html part of it so that it returns all the text? I mean the one in tabs[0].html.

<script>
    Config.product_details_page = {
        texts: {
            tab_loading_title: 'Loading',
            tab_loading_text: 'Loading data',
        },
        tabs: [
            {
                index: 0,
                identifier: 'details',
                name: "About the product",
                icon: '&lt;svg class="" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 48 48"&gt;&lt;g fill="none" stroke="currentColor" stroke-width="2" stroke-miterlimit="10"&gt;&lt;circle cx="24" cy="24" r="23"/&gt;&lt;path d="M24 30v-1.6c0-2.1 1.1-4.1 3-5.2 2.9-1.7 3.9-5.3 2.2-8.2-1.7-2.9-5.3-3.9-8.2-2.2-1.8 1.1-3 3-3 5.2"/&gt;&lt;circle cx="24" cy="35" r="2"/&gt;&lt;/g&gt;&lt;/svg&gt;',
                html: "&lt;div class=\u0022product__details\u0022&gt;\n    &lt;div class=\u0022container\u0022&gt;\n        &lt;div class=\u0022product-details\u0022&gt;\n    &lt;div class=\u0022product__list-wrapper\u0022&gt;\n        &lt;ul class=\u0022list\u0022&gt;\n                    &lt;li class=\u0022item\u0022&gt;\n                                    &lt;span&gt;Country of origin&lt;\/span&gt;\n                                                    &lt;p&gt;Finland&lt;\/p&gt;\n                            &lt;\/li&gt;\n                    &lt;li class=\u0022item\u0022&gt;\n                                    &lt;span&gt;Brand&lt;\/span&gt;\n                                                    &lt;p&gt;Valio&lt;\/p&gt;\n                            &lt;\/li&gt;\n                    &lt;li class=\u0022item\u0022&gt;\n                                    &lt;span&gt;Producer&lt;\/span&gt;\n                                                    &lt;p&gt;VALIO OY&lt;\/p&gt;\n                            &lt;\/li&gt;\n                    &lt;li class=\u0022item\u0022&gt;\n                                    &lt;span&gt;Amount&lt;\/span&gt;\n                                                    &lt;p&gt;1 kg&lt;\/p&gt;\n                            &lt;\/li&gt;\n            &lt;\/ul&gt;\n&lt;\/div&gt;\n        &lt;div class=\u0022product__list-wrapper\u0022&gt;\n            &lt;p class=\u0022heading\u0022&gt;Ingredients&lt;\/p&gt;\n        &lt;ul class=\u0022list\u0022&gt;\n                    &lt;li class=\u0022item\u0022&gt;\n                                                    &lt;p&gt;AUZU b\u0101ze ( \u016bdens, bezglut\u0113na AUZU milti, kalcijs, s\u0101ls ), \u016bdens, rap\u0161a e\u013c\u013ca, sk\u0101buma regul\u0113t\u0101ji ( k\u0101lija fosf\u0101ti ), jods, vitam\u012bni ( riboflav\u012bns ( B2 ), B12 un D2 ) \n\n&lt;\/p&gt;\n                            &lt;\/li&gt;\n            &lt;\/ul&gt;\n&lt;\/div&gt;\n        &lt;div class=\u0022product__list-wrapper 
-simple\u0022&gt;\n            &lt;p class=\u0022heading\u0022&gt;Additional information&lt;\/p&gt;\n        &lt;ul class=\u0022list\u0022&gt;\n                    &lt;li class=\u0022item\u0022&gt;\n                                                    &lt;p&gt;Auzu saturs 10%&lt;\/p&gt;\n                            &lt;\/li&gt;\n                    &lt;li class=\u0022item\u0022&gt;\n                                                    &lt;p&gt;Min storage temp.: 2\u00b0 C&lt;\/p&gt;\n                            &lt;\/li&gt;\n                    &lt;li class=\u0022item\u0022&gt;\n                                                    &lt;p&gt;Max storage temp.: 25\u00b0 C&lt;\/p&gt;\n                            &lt;\/li&gt;\n            &lt;\/ul&gt;\n&lt;\/div&gt;\n            &lt;div class=\u0022product__list-wrapper\u0022&gt;\n            &lt;p class=\u0022heading\u0022&gt;Nutrition Facts&lt;\/p&gt;\n        &lt;ul class=\u0022list\u0022&gt;\n                    &lt;li class=\u0022item\u0022&gt;\n                                                    &lt;p&gt;Amount per 100g&lt;\/p&gt;\n                            &lt;\/li&gt;\n            &lt;\/ul&gt;\n&lt;\/div&gt;\n        &lt;div class=\u0022product__table\u0022&gt;\n    &lt;div&gt;\n        &lt;table&gt;\n            &lt;thead&gt;\n            &lt;tr&gt;\n                                    &lt;th&gt;Nutrition&lt;\/th&gt;\n                                    &lt;th&gt;Amount per 100g\/ml&lt;\/th&gt;\n                            &lt;\/tr&gt;\n            &lt;\/thead&gt;\n            &lt;tbody&gt;\n                            &lt;tr&gt;\n                    &lt;td &gt;\n                        energy\n                    &lt;\/td&gt;\n                    &lt;td&gt;\n                        243 kJ\/ 58 kcal\n                    &lt;\/td&gt;\n                &lt;\/tr&gt;\n                            &lt;tr&gt;\n                    &lt;td &gt;\n                        fat\n                    &lt;\/td&gt;\n                
    &lt;td&gt;\n                        3 g\n                    &lt;\/td&gt;\n                &lt;\/tr&gt;\n                            &lt;tr&gt;\n                    &lt;td  class=\u0022indent\u0022&gt;\n                        of which saturates\n                    &lt;\/td&gt;\n                    &lt;td&gt;\n                        0.3 g\n                    &lt;\/td&gt;\n                &lt;\/tr&gt;\n                            &lt;tr&gt;\n                    &lt;td &gt;\n                        carbohydrate\n                    &lt;\/td&gt;\n                    &lt;td&gt;\n                        6.6 g\n                    &lt;\/td&gt;\n                &lt;\/tr&gt;\n                            &lt;tr&gt;\n                    &lt;td  class=\u0022indent\u0022&gt;\n                        of which sugars\n                    &lt;\/td&gt;\n                    &lt;td&gt;\n                        3.5 g\n                    &lt;\/td&gt;\n                &lt;\/tr&gt;\n                            &lt;tr&gt;\n                    &lt;td &gt;\n                        protein\n                    &lt;\/td&gt;\n                    &lt;td&gt;\n                        1.2 g\n                    &lt;\/td&gt;\n                &lt;\/tr&gt;\n                            &lt;tr&gt;\n                    &lt;td &gt;\n                        salt\n                    &lt;\/td&gt;\n                    &lt;td&gt;\n                        0.1 g\n                    &lt;\/td&gt;\n                &lt;\/tr&gt;\n                        &lt;\/tbody&gt;\n        &lt;\/table&gt;\n    &lt;\/div&gt;\n&lt;\/div&gt;                            &lt;div class=\u0022product__list-wrapper\u0022&gt;\n            &lt;p class=\u0022heading\u0022&gt;Allergens&lt;\/p&gt;\n        &lt;ul class=\u0022list\u0022&gt;\n                    &lt;li class=\u0022item\u0022&gt;\n                                                    &lt;p&gt;Cereals&lt;\/p&gt;\n                            &lt;\/li&gt;\n            
&lt;\/ul&gt;\n&lt;\/div&gt;\n        &lt;p class=\u0022product__disclaimer\u0022&gt;While every care has been taken to ensure product information is correct, food products are constantly being reformulated, so ingredients, nutrition content, dietary and allergens may change. You should always read the product label and not rely solely on the information provided on the website. Base price and offer may be different in other Rimi stores.&lt;\/p&gt;&lt;\/div&gt;\n\n        &lt;div class=\u0022product__card\u0022&gt;\n            &lt;div data-product-code=\u0022957905\u0022\n     class=\u0022js-product-container card\n                -horizontal-for-mobile\u0022\n     data-gtms-banner-title=\u0022Auzu dz\u0113riens Barista kafijai bezglut. UHT 1l\u0022\n     data-gtms-click-name=\u0022Auzu dz\u0113riens Barista kafijai bezglut. UHT 1l\u0022\n     data-gtms-product-id=\u0022957905\u0022\n     data-gtm-eec-product='{\u0022id\u0022:\u0022957905\u0022,\u0022name\u0022:\u0022Auzu dz\\u0113riens Barista kafijai bezglut. UHT 1l\u0022,\u0022category\u0022:\u0022SH-11-10-2\\\/SH-16\\\/SH\u0022,\u0022brand\u0022:\u0022Valio\u0022,\u0022price\u0022:2.69,\u0022currency\u0022:\u0022EUR\u0022}'\n     &gt;\n    &lt;a class=\u0022card__url js-gtm-eec-product-click\u0022 href=\u0022\/e-veikals\/en\/products\/vegan-and-vegetarian-\/plant-based-beverages\/auzu-dzeriens-barista-kafijai-bezglut-uht-1l\/p\/957905\u0022\n       aria-label=\u0022Go to product page\u0022&gt;&lt;\/a&gt;\n            &lt;div class=\u0022card__image-wrapper\u0022&gt;\n        &lt;div&gt;\n                            &lt;img src=\u0022https:\/\/rimibaltic-res.cloudinary.com\/image\/upload\/b_white,c_fit,f_auto,h_480,q_auto,w_480\/d_ecommerce:backend-fallback.png\/MAT_957905_PCE_LV\u0022 alt=\u0022Auzu dz\u0113riens Barista kafijai bezglut. 
UHT 1l\u0022&gt;\n                                                                &lt;span class=\u0022type-badge\u0022&gt;\n            &lt;img src=\u0022https:\/\/rimibaltic-web-res.cloudinary.com\/image\/upload\/f_png,h_32,q_auto\/v1\/ecom-cms\/b821da9405a9fe157949ca40850238c81d90542f\u0022  title=\u0022Suitable for Vegans\u0022 &gt;\n            &lt;img src=\u0022https:\/\/rimibaltic-web-res.cloudinary.com\/image\/upload\/f_png,h_32,q_auto\/v1\/ecom-cms\/91c5d4f7982c687e299aaf2e8c985d63f66631dd\u0022  title=\u0022Gluten Free\u0022 &gt;\n            &lt;img src=\u0022https:\/\/rimibaltic-web-res.cloudinary.com\/image\/upload\/f_png,h_32,q_auto\/v1\/ecom-cms\/2e1c205f284be9cb954d044ffcfc33afe873ea08\u0022  title=\u0022Lactose Free\u0022 &gt;\n            &lt;img src=\u0022https:\/\/rimibaltic-web-res.cloudinary.com\/image\/upload\/f_png,h_32,q_auto\/v1\/ecom-cms\/e94c4a7ccc9aabb3b6ce9382a536f514acf72616\u0022  title=\u0022Dairy Free\u0022 &gt;\n    &lt;\/span&gt;        &lt;\/div&gt;\n    &lt;\/div&gt;\n    &lt;div class=\u0022card__details\u0022&gt;\n                                    &lt;p class=\u0022card__name\u0022&gt;Auzu dz\u0113riens Barista kafijai bezglut. 
UHT 1l&lt;\/p&gt;\n        &lt;div class=\u0022card__details-inner\u0022&gt;\n\n            &lt;div class=\u0022card__price-wrapper\u0022&gt;\n    \n        &lt;div class=\u0022price-tag card__price\u0022&gt;\n    &lt;span&gt;2&lt;\/span&gt;\n    &lt;div&gt;\n        &lt;sup&gt;69&lt;\/sup&gt;\n        &lt;sub&gt;\u20ac\/pcs.&lt;\/sub&gt;\n    &lt;\/div&gt;\n&lt;\/div&gt;\n        &lt;div&gt;\n\n            \n                            &lt;p class=\u0022card__price-per\u0022&gt;\n                    2,69\n                    \u20ac\n                    \/kg\n                &lt;\/p&gt;\n            \n        &lt;\/div&gt;\n    &lt;\/div&gt;\n\n\n            &lt;form class=\u0022favorite  card__favorite  js-login-prompt\u0022\n      action=\u0022\/e-veikals\/account\/login\/prompt\u0022&gt;\n    &lt;input type=\u0022hidden\u0022 name=\u0022_token\u0022 value=\u002267RNG9eJsKaHhthRxGbeoL97AiwFKSkcCd6RUaoR\u0022&gt;    &lt;input type=\u0022checkbox\u0022 name=\u0022favorite\u0022 value=\u0022\u0022 &gt;\n    &lt;button class=\u0022js-tooltip\u0022 type=\u0022submit\u0022\n            aria-label=\u0022Add to favorites\u0022\n            data-title=\u0022Add to favorites\u0022\n            data-add-name=\u0022Add to favorites\u0022\n            data-remove-name=\u0022Add to favorites\u0022\n            data-gtm-click-name=\u0022Add to favorites\u0022&gt;\n        &lt;span&gt;&lt;svg class=\u0022\u0022 xmlns=\u0022http:\/\/www.w3.org\/2000\/svg\u0022 viewBox=\u00220 0 48 48\u0022&gt;&lt;path d=\u0022M24 4l5.05 16L45 19.98l-12.83 8.79L36.98 44 24 34.71 11.02 44l4.81-15.23L3 19.98l15.95.02L24 4z\u0022 fill=\u0022none\u0022 stroke=\u0022currentColor\u0022 stroke-miterlimit=\u002210\u0022 stroke-width=\u00222\u0022\/&gt;&lt;\/svg&gt;&lt;\/span&gt;\n    &lt;\/button&gt;\n&lt;\/form&gt;\n\n                            \n                \n                &lt;form method=\u0022post\u0022 action=\u0022\/e-veikals\/cart\/change\u0022\n      class=\u0022js-add-to-cart 
card__cart-btn\u0022&gt;\n    &lt;input type=\u0022hidden\u0022 name=\u0022_token\u0022 value=\u002267RNG9eJsKaHhthRxGbeoL97AiwFKSkcCd6RUaoR\u0022&gt;    &lt;input type=\u0022hidden\u0022 name=\u0022_method\u0022 value=\u0022put\u0022&gt;    &lt;input type=\u0022hidden\u0022 name=\u0022product\u0022 value=\u0022957905\u0022&gt;\n    &lt;input type=\u0022hidden\u0022 name=\u0022amount\u0022 value=\u00221\u0022&gt;\n    &lt;button class=\u0022button -with-right-icon -cart gtm -small\u0022\n            type=\u0022submit\u0022\n            data-gtm-product-id=\u0022957905\u0022\n            data-gtm-event-category=\u0022addToBasket\u0022\n    &gt;\n        Add to cart\n        &lt;svg class=\u0022\u0022 xmlns=\u0022http:\/\/www.w3.org\/2000\/svg\u0022 viewBox=\u00220 0 48 48\u0022&gt;&lt;g fill=\u0022none\u0022 stroke=\u0022currentColor\u0022 stroke-miterlimit=\u002210\u0022 stroke-width=\u00222\u0022&gt;&lt;path d=\u0022M44 36H19.2c-3.9 0-7.2-2.8-7.9-6.6L6.5 1H0\u0022\/&gt;&lt;path d=\u0022M8 9h39l-2.4 11.6c-.9 4.4-4.7 7.6-9.1 7.9l-24 1.5\u0022\/&gt;&lt;circle cx=\u002215.5\u0022 cy=\u002243.5\u0022 r=\u00223.5\u0022\/&gt;&lt;circle cx=\u002239.5\u0022 cy=\u002243.5\u0022 r=\u00223.5\u0022\/&gt;&lt;\/g&gt;&lt;\/svg&gt;    &lt;\/button&gt;\n&lt;\/form&gt;\n\n                &lt;form class=\u0022counter  js-counter\u0022\n      method=\u0022post\u0022\n      action=\u0022\/e-veikals\/cart\/change\u0022\n&gt;\n    &lt;input type=\u0022hidden\u0022 name=\u0022_method\u0022 value=\u0022put\u0022&gt;    &lt;input type=\u0022hidden\u0022 name=\u0022_token\u0022 value=\u002267RNG9eJsKaHhthRxGbeoL97AiwFKSkcCd6RUaoR\u0022&gt;    &lt;input type=\u0022hidden\u0022 name=\u0022amount\u0022\n           value=\u00221\u0022\n           min=\u00221\u0022\n           max=\u002210\u0022\n           data-unit=\u0022Piece\u0022\n    &gt;\n    &lt;input type=\u0022hidden\u0022 name=\u0022step\u0022 value=\u00221\u0022&gt;\n    &lt;input type=\u0022hidden\u0022 name=\u0022product\u0022 
value=\u0022957905\u0022&gt;\n    &lt;button name=\u0022decrease\u0022\n            class=\u0022counter__subtract js-subtract\u0022\n            type=\u0022submit\u0022\n            aria-label=\u0022Decrease\u0022\n            data-gtm-ignore&gt;\n        &lt;svg class=\u0022\u0022 xmlns=\u0022http:\/\/www.w3.org\/2000\/svg\u0022 viewBox=\u00220 0 48 48\u0022&gt;&lt;path d=\u0022M8 24h32\u0022 fill=\u0022none\u0022 stroke=\u0022currentColor\u0022 stroke-width=\u00222\u0022 stroke-miterlimit=\u002210\u0022\/&gt;&lt;\/svg&gt;    &lt;\/button&gt;\n    &lt;span class=\u0022counter__number\u0022&gt;\n        1    &lt;\/span&gt;\n    &lt;button name=\u0022increase\u0022\n            class=\u0022counter__add js-add\u0022\n            type=\u0022submit\u0022\n            aria-label=\u0022Increase\u0022\n            data-gtm-ignore\n                &gt;\n        &lt;svg class=\u0022\u0022 xmlns=\u0022http:\/\/www.w3.org\/2000\/svg\u0022 viewBox=\u00220 0 48 48\u0022&gt;&lt;path d=\u0022M6 24h36M24 42V5.9\u0022 fill=\u0022none\u0022 stroke=\u0022currentColor\u0022 stroke-width=\u00222\u0022 stroke-miterlimit=\u002210\u0022\/&gt;&lt;\/svg&gt;    &lt;\/button&gt;\n\n&lt;\/form&gt;\n\n                &lt;form class=\u0022js-delete-from-cart delete-form\u0022 method=\u0022post\u0022 action=\u0022\/e-veikals\/cart\/change\u0022&gt;\n    &lt;input type=\u0022hidden\u0022 name=\u0022_method\u0022 value=\u0022put\u0022&gt;    &lt;input type=\u0022hidden\u0022 name=\u0022_token\u0022 value=\u002267RNG9eJsKaHhthRxGbeoL97AiwFKSkcCd6RUaoR\u0022&gt;    &lt;input type=\u0022hidden\u0022 value=\u0022957905\u0022 name=\u0022product\u0022&gt;\n    &lt;button class=\u0022cart-card__delete js-delete js-remove-from-cart \u0022\n            aria-label=\u0022Remove\u0022&gt;\n        &lt;svg class=\u0022\u0022 xmlns=\u0022http:\/\/www.w3.org\/2000\/svg\u0022 viewBox=\u00220 0 48 48\u0022&gt;&lt;path d=\u0022M10 10l28 28m-28 0l28-28\u0022 fill=\u0022none\u0022 stroke=\u0022currentColor\u0022 
stroke-width=\u00222\u0022 stroke-miterlimit=\u002210\u0022\/&gt;&lt;\/svg&gt;    &lt;\/button&gt;\n&lt;\/form&gt;\n            \n\n        &lt;\/div&gt;\n\n        &lt;p class=\u0022card__error\u0022&gt;\n            Maximum amount is reached\n        &lt;\/p&gt;\n\n    &lt;\/div&gt;\n&lt;\/div&gt;\n        &lt;\/div&gt;\n    &lt;\/div&gt;\n&lt;\/div&gt;\n",
            },
            {
                index: 1,
                identifier: 'recommendations',
                name: "Others have also bought",
                api_url: "/e-veikals/en/products/957905/recommendations",
                icon: '&lt;svg class="" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 48 48"&gt;&lt;path fill="none" stroke="currentColor" stroke-miterlimit="10" stroke-width="2" d="M8 1h32v40c0 3.3-2.7 6-6 6H14c-3.3 0-6-2.7-6-6V1zm0 26h32m-5-3v-6m0 18v-6"/&gt;&lt;/svg&gt;',
                html: null,
            },
        ]
    };
        Config.product_details_page.tabs.push({
        index: 2,
        identifier: 'recipes',
        name: "Recipes",
        api_url: "/e-veikals/en/products/957905/recipes",
        icon: '&lt;svg class="" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 48 48"&gt;&lt;path fill="none" stroke="currentColor" stroke-miterlimit="10" stroke-width="2" d="M38 47c-1.7 0-3-1.3-3-3V25.5l-1.7-1.7c-1.5-1.5-2.3-3.5-2.3-5.6V11c0-5.5 4.5-10 10-10v43c0 1.7-1.3 3-3 3zM24 1l1 13.1c0 1.9-1.2 3.7-2.4 5.1L19 23v21c0 1.7-1.3 3-3 3s-3-1.3-3-3V23l-3.6-3.8C8 17.8 7.2 16 7 14L8 1m5 0v14m6-14v14"/&gt;&lt;/svg&gt;',
        html: null,
    });
    </script>

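I tried loading it as text (text[30:-2] to get only the JavaScript object) and then decoding it with demjson.decode(), but it seems the string has to be loaded in a specific way (as a literal), and I don't know how to do that.

Thanks!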

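Essentially, you only need the value associated with the html key in the block you pointed to. You can pull that out with a regex. After that, some string cleaning is needed: the \uXXXX escape sequences have to be turned into the actual Unicode characters before the HTML can be parsed correctly (for that step I use the answer by @Mark Tolonen linked in the code).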

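import re

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.rimi.lv/e-veikals/en/products/vegan-and-vegetarian-/plant-based-beverages/auzu-dzeriens-barista-kafijai-bezglut-uht-1l/p/957905')

# Greedy `.*` skips ahead to the last `html: "` in the page; the lazy group
# then captures up to the next double quote (quotes inside the value are
# written as \u0022, so they do not end the match early).
s = re.search(r'tabs.*html: "(.*?)"', r.text, re.S).group(1)

# https://stackoverflow.com/a/64071813 to clean unicode @Mark Tolonen
# Replace every \uXXXX escape with the character it stands for.
cleaned = re.sub(r'\\u([0-9a-fA-F]{4})', lambda m: chr(int(m.group(1), 16)), s)

# Name the parser explicitly so bs4 does not warn or pick a different one per machine.
soup = bs(cleaned, 'html.parser')

print(soup)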
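
As a variation on the approach above, here is a minimal sketch that treats the html value as a string literal, which is what the question was reaching for. It assumes the literal uses only JSON-compatible escapes (\u0022, \n, \/), as the posted script does, so json.loads can decode it in one step. A literal with JavaScript-only escapes or single quotes would not decode this way. SCRIPT_TEXT is a shortened, hypothetical stand-in for the real script, and extract_tab_text and TextCollector are names made up for this illustration. It uses only the standard library so it runs without a network call. The html.unescape step matters only if the captured text really contains &lt; and &gt; entities as in the printed output.

```python
import html
import json
import re
from html.parser import HTMLParser

# Hypothetical, shortened stand-in for the <script> text shown in the question.
SCRIPT_TEXT = r'''
    Config.product_details_page = {
        tabs: [
            {
                index: 0,
                identifier: 'details',
                html: "&lt;ul class=\u0022list\u0022&gt;\n &lt;li&gt;&lt;span&gt;Brand&lt;\/span&gt;&lt;p&gt;Valio&lt;\/p&gt;&lt;\/li&gt;\n &lt;li&gt;&lt;p&gt;Auzu dz\u0113riens&lt;\/p&gt;&lt;\/li&gt;\n&lt;\/ul&gt;\n",
            },
            {
                index: 1,
                html: null,
            },
        ]
    };
'''


class TextCollector(HTMLParser):
    """Collects the non-empty text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_tab_text(script_text):
    # Capture the first double-quoted literal after `html:`, quotes included,
    # allowing backslash escapes inside it.
    match = re.search(r'html:\s*("(?:[^"\\]|\\.)*")', script_text)
    if match is None:
        return []
    # json.loads resolves \uXXXX, \n and \/ in a single pass.
    decoded = json.loads(match.group(1))
    # Turn &lt; / &gt; entities back into real tags before parsing.
    markup = html.unescape(decoded)
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.chunks


print(extract_tab_text(SCRIPT_TEXT))
```

With the real page, the decoded markup could be passed to BeautifulSoup instead of TextCollector, and soup.get_text() would return the text.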

