How to scrape <script type="text/javascript">

Question

I am trying to figure out how to scrape the contents of a JavaScript <script> tag. I believe a regex might be the easiest way.

The tag looks like this:

<script type="text/javascript">

var spConfig=newApex.Config({
  "attributes": {
    "199": {
      "id": "199",
      "code": "legend",
      "label": "Weapons",
      "options": [
        {
          "label": "10",
          "priceInGame": "0",          
          "id": [

          ]
        },
        {
          "label": "10.5",
          "priceInGame": "0",          
          "id": [

          ]
        },
        {
          "label": "11",
          "priceInGame": "0",          
          "id": [
            "66659"
          ]
        },
        {
          "label": "11.5",
          "priceInGame": "0",          
          "id": [            
          ]
        },
        {
          "label": "12",
          "priceInGame": "0",          
          "id": [

          ]
        },
        {
          "label": "12.5",
          "priceInGame": "0",          
          "id": [           
          ]
        },
        {
          "label": "13",
          "priceInGame": "0",         
          "id": [

          ]
        },
        {
          "label": "4",
          "priceInGame": "0",          
          "id": [

          ]
        },
        {
          "label": "4.5",
          "priceInGame": "0",          
          "id": [

          ]
        },
        {
          "label": "5",
          "priceInGame": "0",         
          "id": [

          ]
        },
        {
          "label": "5.5",
          "priceInGame": "0",        
          "id": [

          ]
        },
        {
          "label": "6",
          "priceInGame": "0",         
          "id": [

          ]
        },
        {
          "label": "6.5",
          "priceInGame": "0",         
          "id": [

          ]
        },
        {
          "label": "7",
          "priceInGame": "0",         
          "id": [

          ]
        },
        {
          "label": "7.5",
          "priceInGame": "0",         
          "id": [

          ]
        },
        {
          "label": "8",
          "priceInGame": "0",          
          "id": [
            "66672"
          ]
        },
        {
          "label": "8.5",
          "priceInGame": "0",          
          "id": [
            "66673"
          ]
        },
        {
          "label": "9",
          "priceInGame": "0",          
          "id": [

          ]
        },
        {
          "label": "9.5",
          "priceInGame": "0",        
          "id": [
            "66675"
          ]
        }
      ]
    }
  },
  "weaponID": "66733",
  "chooseText": "Apex Legends",
  "Config": {
    "includeCoins": false,
  }
});

</script>

I want to scrape all of the "label" values.

What I tried:

        for nosto_sku_tag in bs4.find_all('script', {'type': 'text/javascript'}):
            try:
                test = re.findall('var spConfig = (\{.*}?);', nosto_sku_tag.text.strip())
                print(test)
            except:  # noqa
                continue

But it only returned an empty list, [].

What can I do to scrape the labels?

Answer 1

You need to specify the attribute using the attr=value or attrs={'attr': 'value'} syntax.

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments

import json
import re
from ast import literal_eval

from bs4 import BeautifulSoup

if __name__ == '__main__':
    html = '''
<script type="text/javascript">

var spConfig=newApex.Config({
  "attributes": {
    "199": {
      "id": "199",
      "code": "legend",
      "label": "Weapons",
      "options": [
        { "label": "10", "priceInGame": "0", "id": [] },
        { "label": "10.5", "priceInGame": "0", "id": [] },
        { "label": "11", "priceInGame": "0", "id": [ "66659" ] },
        { "label": "7.5", "priceInGame": "0", "id": [] },
        { "label": "8", "priceInGame": "0", "id": ["66672"] }
      ]
    }
  },
  "weaponID": "66733",
  "chooseText": "Apex Legends",
  "taxConfig": {
    "includeCoins": False,
  }
});

</script>    
    '''

    soup = BeautifulSoup(html, 'html.parser')
    # this one works too
    # script = soup.find('script', attrs={'type':'text/javascript'})
    script = soup.find('script', type='text/javascript')
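    # Collapse the script to a single line so that `.` in the regex can span it.
    js: str = script.text.replace('\n', '')
    # Capture the object literal passed to Apex.Config(...).
    raw_json = re.search(r'var spConfig=newApex\.Config\((\{.*\})\);', js).group(1)
    # json.loads() rejects this payload because of the Python-style `False`
    # and the trailing comma after it; ast.literal_eval accepts both.
    # If the page serves JavaScript `false` instead, literal_eval raises too.
    data = literal_eval(raw_json)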
    labels = [opt['label'] for opt in data['attributes']['199']['options']]
    print(labels)

Output:

['10', '10.5', '11', '7.5', '8'] ... some removed for brevity
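
The script in the question contains JavaScript `false` rather than Python `False`, so `ast.literal_eval` would not parse it unchanged. One possible workaround, offered as a rough sketch rather than a definitive fix, is to strip the trailing commas and parse with `json.loads()`. The helper name `extract_labels` and the comma-stripping regex are assumptions made for illustration, and `JS` stands in for the script text you would get from BeautifulSoup. The cleanup is naive: it would also alter a string value that happens to contain a comma directly before a closing brace or bracket.

```python
import json
import re

# Shortened copy of the script text from the question, keeping the
# JavaScript `false` and the trailing comma that follows it.
JS = '''
var spConfig=newApex.Config({
  "attributes": {
    "199": {
      "id": "199",
      "code": "legend",
      "label": "Weapons",
      "options": [
        { "label": "10", "priceInGame": "0", "id": [] },
        { "label": "11", "priceInGame": "0", "id": ["66659"] }
      ]
    }
  },
  "weaponID": "66733",
  "Config": {
    "includeCoins": false,
  }
});
'''


def extract_labels(js):
    """Hypothetical helper: return the option labels found in the spConfig call."""
    # DOTALL lets `.` match newlines, so the text does not need to be flattened.
    match = re.search(
        r'var\s+spConfig\s*=\s*new\s*Apex\.Config\((\{.*\})\);',
        js,
        flags=re.DOTALL,
    )
    if match is None:
        return []
    # Remove commas that directly precede a closing brace or bracket.
    cleaned = re.sub(r',(\s*[}\]])', r'\1', match.group(1))
    data = json.loads(cleaned)
    return [
        opt['label']
        for attr in data['attributes'].values()
        for opt in attr['options']
    ]


print(extract_labels(JS))  # ['10', '11']
```

Iterating over `data['attributes'].values()` avoids hard-coding the "199" key, which may differ from page to page.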

Answer 2

If you just want to match each whole label field in the JSON object, use the following pattern:

("label":) "([^"]+)",

Then, if you want to return the actual value, use

\2

to pull back the second capture group.
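
In Python, this pattern could be applied with `re.findall()`, which returns one tuple per match when the pattern has several groups, so the value is the second tuple element. The sketch below uses a shortened sample of the script text. Note that the pattern also matches the attribute-level `"label": "Weapons"`. The numeric-only variant is an assumption based on the option labels in the question all being numbers.

```python
import re

# Shortened sample of the script text from the question.
js = '''
      "label": "Weapons",
      "options": [
        { "label": "10", "priceInGame": "0", "id": [] },
        { "label": "10.5", "priceInGame": "0", "id": [] }
      ]
'''

# The pattern from this answer: group 1 is the key, group 2 is the value.
pattern = re.compile(r'("label":) "([^"]+)",')
labels = [value for _key, value in pattern.findall(js)]
print(labels)  # ['Weapons', '10', '10.5']

# Assumed variant: keep only labels that look like numbers, which skips "Weapons".
numeric_labels = re.findall(r'"label":\s*"(\d+(?:\.\d+)?)"', js)
print(numeric_labels)  # ['10', '10.5']
```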

Answer 1
1 Accepted 2019-07-17 08:33:45

Answer 2
0 2019-07-17 08:41:30