
Extracting a var from <script> tag in html

I am trying to scrape product reviews from a page, but I'm not sure how to extract a var defined inside the <script> tags.

Here's my Python code:

import requests
from bs4 import BeautifulSoup
import csv

a_file = open("ProductReviews.csv", "a")
writer = csv.writer(a_file)

# Write the titles of the columns to the CSV file
writer.writerow(["created_at", "reviewer_name", "rating", "content", "source"])

url = 'https://www.lazada.com.my/products/iron-gym-total-upper-body-workout-bar-i467342383.html'

# Connect to the URL
response = requests.get(url)

# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.content, "html.parser")

data = soup.findAll('script')[123]

if 'var __moduleData__' in data.string:
    print("Yes")

Here's the page source (I removed the unnecessary code):

<html>
<head>
    <title></title>
</head>
<body>

    <script>
        var __moduleData__ = {
        "data": {
            "root": {
                "fields": {
                    "review": {
                        "reviews": [{
                            "rating": 5,
                            "reviewContent": "tq barang dah sampai",
                            "reviewTime": "24 May 2021",
                            "reviewer": "Jaharinbaharin",

                        }, {
                            "rating": 5,
                            "reviewContent": "Beautiful quality👌👌👌",
                            "reviewTime": "08 Sep 2021",
                            "reviewer": "M***.",

                        }, {
                            "rating": 5,
                            "reviewContent": "the box was badly dented but the item was intact...just that my door frame is shallow and slippery....I can't pull up without worrying of falling down",
                            "reviewTime": "25 Aug 2021",
                            "reviewer": "David S.",

                        }, {
                            "rating": 5,
                            "reviewContent": "Haven’t really opened it yet but please put some effort on the packaging for future improvement thanks it was really fast",
                            "reviewTime": "14 Dec 2020",
                            "reviewer": "Yasir A.",
                        
                        }, {
                            "rating": 5,
                            "reviewContent": "Seems to be ok, good quality.. No weight restriction mentioned on the box.. I'm about 90kg, it could handle my weight so far..",
                            "reviewTime": "22 May 2020",
                            "reviewer": "Kevin",
                        }]
                    },
                }
            },
        },
    };
  </script>

</body>
</html>

I only want the review data, so I'd like to know how to extract the value of var __moduleData__.

You can use a regex to select your variable:

json.loads(re.search(r'var __moduleData__ = ({.*})', response.text).group(1))
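One caveat with the pattern above: `.` does not match newlines unless re.DOTALL is passed, and the greedy `.*` runs to the last `}` on the line, so the one-liner relies on the page emitting the object on a single line as valid JSON. A possible alternative, sketched here under the assumption that the assigned value is valid JSON, is to find the assignment with a regex and let json.JSONDecoder.raw_decode parse exactly one value starting there. The inline html string and the helper name extract_module_data are made up for illustration and are not part of the original answer.

```python
import json
import re

# Hypothetical, trimmed stand-in for the page source (valid JSON, spread over lines).
html = '''
<script>
    var __moduleData__ = {
        "data": {"root": {"fields": {"review": {"reviews": [
            {"rating": 5, "reviewContent": "Seems to be ok, good quality", "reviewer": "Kevin"}
        ]}}}}
    };
    var other = {"x": 1};
</script>
'''


def extract_module_data(page_text, var_name="__moduleData__"):
    """Return the object assigned to `var_name`, or None if the assignment is absent."""
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*", page_text)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (the trailing ';', later statements, ...).
    obj, _end = json.JSONDecoder().raw_decode(page_text, match.end())
    return obj


module_data = extract_module_data(html)
reviews = module_data["data"]["root"]["fields"]["review"]["reviews"]
print(reviews[0]["reviewer"])
```

With a live page you would pass response.text instead of the inline string. If the value turns out not to be valid JSON, raw_decode raises json.JSONDecodeError, which is worth catching in a real scraper.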

Example

import json
import re

import requests

url = 'https://www.lazada.com.my/products/iron-gym-total-upper-body-workout-bar-i467342383.html'
response = requests.get(url)

# Capture the object literal assigned to __moduleData__ and parse it as JSON
d = json.loads(re.search(r'var __moduleData__ = ({.*})', response.text).group(1))

# Inspect one branch of the parsed data, here the seller details
d['data']['root']['fields']['seller']

Output

{'chatResponsiveRate': {'labelText': 'Chat Response', 'value': '100%'},
 'chatUrl': 'https://pages.lazada.com.my/wow/i/my/im/chat?brandId=21411',
 'hideAllMetrics': False,
 'imEnable': True,
 'imUserId': '100285367',
 'name': 'MR SIX PACK',
 'newSeller': False,
 'percentRate': '96%',
 'positiveSellerRating': {'labelText': 'Seller Ratings', 'value': '96%'},
 'rate': 0.96,
 'rateLevel': 3,
 'sellerId': '1000052649',
 'shipOnTime': {'labelText': 'Ship On Time', 'value': '97%'},
 'shopId': 255007,
 'size': 5,
 'time': 2,
 'type': '4',
 'unit': 'years',
 'url': '//www.lazada.com.my/shop/mr-six-pack/?itemId=467342383&channelSource=pdp'}
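Since the question asks for the reviews rather than the seller block, here is a rough sketch of turning the parsed data into the CSV rows the question sets up. It assumes the live data follows the same path and key names as the trimmed page source in the question (data, root, fields, review, reviews, with reviewTime, reviewer, rating and reviewContent). The helper name reviews_to_rows and the "lazada" value for the source column are placeholders, not something the original post specifies.

```python
import csv
import io

# Sample data shaped like the trimmed page source in the question.
module_data = {"data": {"root": {"fields": {"review": {"reviews": [
    {"rating": 5, "reviewContent": "tq barang dah sampai",
     "reviewTime": "24 May 2021", "reviewer": "Jaharinbaharin"},
    {"rating": 5, "reviewContent": "Beautiful quality",
     "reviewTime": "08 Sep 2021", "reviewer": "M***."},
]}}}}}


def reviews_to_rows(data, source):
    """Map each review dict to one CSV row; missing keys become empty strings."""
    reviews = data["data"]["root"]["fields"]["review"]["reviews"]
    return [
        [r.get("reviewTime", ""), r.get("reviewer", ""), r.get("rating", ""),
         r.get("reviewContent", ""), source]
        for r in reviews
    ]


rows = reviews_to_rows(module_data, source="lazada")  # placeholder label

# Written to an in-memory buffer here. For a real file, use
# open("ProductReviews.csv", "a", newline="", encoding="utf-8") in a `with` block.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["created_at", "reviewer_name", "rating", "content", "source"])
writer.writerows(rows)
print(buffer.getvalue())
```

Opening the file with newline="" avoids blank lines between rows on Windows, and the `with` block makes sure the file is closed, which the question's code does not do.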
