Python - How can I scrape JavaScript code with bs4?

I have been trying to extract a value from an HTML page where the value sits inside a JavaScript block. There is a lot of JavaScript on the page, but I only want to print this one:

var spConfig = new Product.Config({
  "attributes": {
    "531": {
      "id": "531",
      "options": [
        {
          "id": "18",
          "hunter": "0",
          "products": [
            "128709"
          ]
        },
        {
          "label": "40 1\/2",
          "hunter": "0",
          "products": [
            "120151"
          ]
        },
        {
          "id": "33",
          "hunter": "0",
          "products": [
            "120152"
          ]
        },
        {
          "id": "36",
          "hunter": "0",
          "products": [
            "128710"
          ]
        },
        {
          "id": "42",
          "hunter": "0",
          "products": [
            "125490"
          ]
        }
      ]
    }
  },

  "Id": "120153",

});

So I started with code that looks like this:

test = bs4.find_all('script', {'type': 'text/javascript'})
print(test)

The output is too large to post in full, but one of the results is the JavaScript shown at the top, and I want to print only var spConfig = new Product.Config.

How can I print just the var spConfig = new Product.Config... block, so that I can later pass it to json.loads and work with the data more easily?

If anything is unclear, let me know in the comments. I appreciate any feedback on how to improve my questions here on Stack Overflow! :)

EDIT:

More examples of what bs4 prints for the script tags:

<script type="text/javascript">var optionsPrice = new Product.Options({
  "priceFormat": {
    "pattern": "%s\u00a0\u20ac",
    "precision": 2,
    "requiredPrecision": 2,
    "decimalSymbol": ",",
    "groupSymbol": "\u00a0",
    "groupLength": 3,
    "integerRequired": 1
  },
  "showBoths": false,
  "idSuffix": "_clone",
  "skipCalculate": 1,
  "defaultTax": 20,
  "currentTax": 20,
  "tierPrices": [

  ],
  "tierPricesInclTax": [

  ],
  "swatchPrices": null
});</script>,
<script type="text/javascript">var spConfig = new Product.Config({
  "attributes": {
    "531": {
      "id": "531",
      "options": [
        {
          "id": "18",
          "hunter": "0",
          "products": [
            "128709"
          ]
        },
        {
          "label": "40 1\/2",
          "hunter": "0",
          "products": [
            "120151"
          ]
        },
        {
          "id": "33",
          "hunter": "0",
          "products": [
            "120152"
          ]
        },
        {
          "id": "36",
          "hunter": "0",
          "products": [
            "128710"
          ]
        },
        {
          "id": "42",
          "hunter": "0",
          "products": [
            "125490"
          ]
        }
      ]
    }
  },

  "Id": "120153"
});</script>,
<script type="text/javascript">document.observe('dom:loaded',
function(){
  var swatchesConfig = new Product.ConfigurableSwatches(spConfig);
});</script>

EDIT update 2:

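import re
import sys

try:
    product_li_tags = bs4.find_all('script', {'type': 'text/javascript'})
except Exception:
    product_li_tags = []


for product_li_tag in product_li_tags:
    try:
        # Raw string, escaped dot, and the capitalised name used on the page
        pat = r"Product\.Config\((.+)\);"
        # re.search needs a string, so pass the tag's text, not the Tag itself
        json_str = re.search(pat, product_li_tag.text, flags=re.DOTALL).group(1)
        print(json_str)
    except AttributeError:
        # re.search returned None: this script has no Product.Config call
        pass

#json.loads(json_str)
print("Nothing")
sys.exit()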

I can think of three possible options. Which one you use might depend on the size of the project and how flexible you need it to be:

  • Use a regex to extract the objects from the script (fastest, least flexible)

  • Use ANTLR or similar (e.g. pyjsparser) to parse the JS grammar

  • Use Selenium or another headless browser that can interpret the JS for you. With this option, you can have Selenium execute a script call that returns the value of the variable

Regex Example (#1)

>>> import json
>>> import re
>>> script_body = """
... var x=product.Config({
...     "key": {"a":1}
... });
... """
>>> pat = r"product\.Config\((.+)\);"
>>> json_str = re.search(pat, script_body, flags=re.DOTALL).group(1)
>>> json.loads(json_str)
{'key': {'a': 1}}
>>> json.loads(json_str)['key']['a']
1
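
Applied to the markup in the question, the same idea might look like the sketch below. It assumes the script text has already been pulled out of the tag as a plain string, and that the call is spelled new Product.Config(...) as in the question's output. The helper name extract_config and the shortened script_text sample are made up for illustration.

```python
import json
import re

# Shortened stand-in for the text of the <script> block from the question.
script_text = r"""
var spConfig = new Product.Config({
  "attributes": {
    "531": {
      "id": "531",
      "options": [
        {"id": "18", "hunter": "0", "products": ["128709"]},
        {"label": "40 1\/2", "hunter": "0", "products": ["120151"]}
      ]
    }
  },
  "Id": "120153"
});
"""

# Greedy (.+) with DOTALL runs to the last ");" in the string, so this
# expects the text of one script block at a time, not the whole page.
CONFIG_RE = re.compile(r"Product\.Config\((.+)\);", flags=re.DOTALL)


def extract_config(text):
    """Return the Product.Config argument as a dict, or None if absent."""
    match = CONFIG_RE.search(text)
    if match is None:
        return None
    return json.loads(match.group(1))


config = extract_config(script_text)
print(config["Id"])
for option in config["attributes"]["531"]["options"]:
    print(option["products"])
```

Returning None when the call is missing lets the caller loop over every script block and skip the ones that do not match, instead of wrapping the search in a bare except.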

You can use the .text attribute to get the content of each tag. Then, if you know you want the block that starts with "var optionsPrice", you can filter for that:

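from bs4 import BeautifulSoup

# myhtml is the page source as a string
soup = BeautifulSoup(myhtml, 'lxml')

script_blocks = soup.find_all('script', {'type': 'text/javascript'})
special_code = ''
for s in script_blocks:
    # Keep the first script whose text starts with the declaration we want
    if s.text.strip().startswith('var optionsPrice'):
        special_code = s.text
        break

print(special_code)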
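
Because startswith depends on the exact spacing of the script text, a slightly more tolerant filter may be useful. The sketch below works on plain strings (for example s.text for each tag) and matches the declaration with a regular expression. The helper find_script_declaring and the sample strings are hypothetical.

```python
import re


def find_script_declaring(script_texts, var_name):
    """Return the first script text that declares `var <var_name> =`, else ''."""
    # \s+ and \s* tolerate any amount of whitespace around the name and '='.
    pattern = re.compile(r"\bvar\s+" + re.escape(var_name) + r"\s*=")
    for text in script_texts:
        if pattern.search(text):
            return text
    return ""


# Stand-ins for [s.text for s in script_blocks].
script_texts = [
    'var optionsPrice = new Product.Options({"defaultTax": 20});',
    # Mentions spConfig but does not declare it, so it is skipped.
    "var swatchesConfig = new Product.ConfigurableSwatches(spConfig);",
    '\n  var   spConfig=new Product.Config({"Id": "120153"});',
]

special_code = find_script_declaring(script_texts, "spConfig")
print(special_code.strip())
```

Matching on the declaration rather than on any occurrence of the name avoids picking up scripts that only use the variable, such as the ConfigurableSwatches call in the question.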

EDIT: To answer your question in the comments, there are a couple of ways to extract the part of the text that holds the JSON. You could use a regexp to grab everything between the first left parenthesis and the ); at the end. If you want to avoid regexps completely, you could do something like:

json_stuff = special_code[special_code.find('(')+1:special_code.rfind(')')]

Then to make a usable dictionary out of it:

import json
j = json.loads(json_stuff)
print(j['defaultTax'])  # This should return a value of 20
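
If the closing ); is hard to pin down (for instance when more code follows the call in the same script), one alternative is to let the JSON decoder find the end of the object itself. This is a sketch using the standard library's json.JSONDecoder.raw_decode, which parses one JSON value starting at a given index and reports where it stopped. The helper parse_first_object and the sample text are assumptions for illustration.

```python
import json


def parse_first_object(text, marker):
    """Parse the JSON object that follows `marker` in `text`.

    Returns (obj, end_index). Raises ValueError if the marker or an
    opening brace is missing, or if the object is not valid JSON.
    """
    start = text.index(marker) + len(marker)
    brace = text.index("{", start)
    # raw_decode stops at the end of the first complete JSON value,
    # so any JavaScript after the object is ignored.
    return json.JSONDecoder().raw_decode(text, brace)


# Hypothetical script text with extra code after the call.
sample = (
    'var optionsPrice = new Product.Options({"defaultTax": 20, '
    '"tierPrices": []}); var other = foo(1);'
)
j, end = parse_first_object(sample, "Product.Options(")
print(j["defaultTax"])
```

This only works when the argument is strict JSON, as it is in the question. A JavaScript object literal with unquoted keys or trailing commas would still need one of the heavier options.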
