
How would I parse JavaScript to JSON in Python?

I have an HTML page which contains the entire product list for a set of items. Due to the size of the page I cannot upload it. Unfortunately, the products are located within the script section, more specifically in a single variable.

At first I thought this was plain JSON, however, after multiple attempts to decode the response with json.loads and pyjson5.loads I figured it was more or less the syntax of the language.

Here is a snippet of the code:

window.INIT_STATE = 'configuration': {'navigationData': {'catalog': {'id': 1, 'active': 1, 'tenant': 'pyStore', 'type': 'catalog', 'name': 'Initial catalog', 'version': '2021-06-02T16:26:56.446Z', 'nav': 

I'm still not completely sure if this is JavaScript or JSON, and I have no clue how to parse this data, as there always seems to be an issue with either the delimiters or the quotation marks.

Are there any valid functions that could at least help me identify this code/parse it?

What you have there is an object that can be serialized into JSON data, but it's not JSON itself. Let me explain the difference.

The following snippets are in JavaScript.

This is an example of a JavaScript object that can be serialized into JSON:

{ x: 2 }

The following string is an example of JSON-formatted data. (Note that the JSON data itself is just a string that's formatted in a very specific way. JSON is always just a string, just like XML)

'{"x":2}'

The following is an example of serializing a JavaScript object into JSON format (i.e., we're turning an object into a JSON string).

> JSON.stringify({ x: 2 })
'{"x":2}'
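Since the question is about Python, the same distinction can be sketched there. This is only an illustration: a dict stands in for the JavaScript object, and the `separators` argument is passed just so the output matches the compact string shown above.

```python
import json

# A Python dict plays the role of the JavaScript object:
# it is data in memory, not JSON.
obj = {"x": 2}

# Serializing the object produces JSON text, which is a plain string.
json_text = json.dumps(obj, separators=(",", ":"))
print(json_text)

# Parsing goes the other way: JSON text back into an object.
round_tripped = json.loads(json_text)
print(round_tripped == obj)
```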

See the difference? You'll find many people online calling JSON-serializable data "JSON" (which is fine, sometimes people get lazy with their speech, or don't fully understand), but it's technically not JSON, it's just data that can be turned into JSON if wanted (an object with functions, for example, is not JSON serializable - you really can't encode functions into a string).

With that said, what you've got there is just a snippet of JavaScript which, if executed, will put a JSON-serializable object into a variable. However, this source code itself does not contain properly formatted JSON data (for example, the quotes would have to be double quotes; single quotes aren't allowed in JSON). Therefore, no JSON parsing utility will be able to operate on it.
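A small Python sketch of the quoting point, using a made-up one-key fragment rather than the real page data. The helper name `try_parse` is hypothetical; it only wraps the standard `json.loads` so the failure is visible without a traceback.

```python
import json

# Single-quoted, as in the page source: valid JavaScript, not valid JSON.
js_like = "{'tenant': 'pyStore'}"
# The same data written as real JSON.
valid_json = '{"tenant": "pyStore"}'

def try_parse(text):
    """Return the parsed value, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

print(try_parse(js_like))     # rejected because of the single quotes
print(try_parse(valid_json))  # parsed into a dict
```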

Unfortunately, you're going to have to parse this data by hand, which can take a bit of work. How much work depends on your needs. If all you're trying to do is extract one specific property from that data, then it may be possible to just do a regex search for the appropriate key and extract the value (though you have to know that the key you're looking for doesn't appear anywhere else in the object).
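The regex idea could look roughly like this in Python. It is a minimal sketch run against the fragment from the question, not the full page; the helper `extract_string_value` is a hypothetical name, and it assumes the value is a single-quoted string with no escaped quotes inside it. It returns only the first match, so the caveat about duplicate keys still applies.

```python
import re

# Fragment copied from the question (the real page is much larger).
source = (
    "window.INIT_STATE = 'configuration': {'navigationData': {'catalog': "
    "{'id': 1, 'active': 1, 'tenant': 'pyStore', 'type': 'catalog', "
    "'name': 'Initial catalog', 'version': '2021-06-02T16:26:56.446Z', 'nav': "
)

def extract_string_value(text, key):
    """Return the first single-quoted string value stored under `key`, or None."""
    # Matches 'key': 'value' and captures the value between the quotes.
    pattern = r"'" + re.escape(key) + r"'\s*:\s*'([^']*)'"
    match = re.search(pattern, text)
    return match.group(1) if match else None

print(extract_string_value(source, "tenant"))
print(extract_string_value(source, "name"))
```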

Update:

If all you want to do is extract the JSON data and save it somewhere else, then it may be better to do it in JavaScript instead of Python, since what you're dealing with is JavaScript source code. Here's what you can do.

Edit the HTML file and take out everything but this JSON-serializable structure, change window.INIT_STATE to just const INIT_STATE, and add the writeFileSync line at the end of the file, so that it looks like this:

const INIT_STATE = ...your giant JSON-serializable structure...

require('fs').writeFileSync('./output.json', JSON.stringify(INIT_STATE), 'utf-8')

Rename your html file to have a ".js" file extension instead.

You'll need Node.js installed to run this. Once you've installed it, run your file using node yourFile.js. It should create a file called "output.json" in the same directory.
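Once "output.json" exists, it is ordinary JSON and can be read back in Python with the standard library. A minimal sketch, assuming the file sits in the current directory; `load_state` is a hypothetical helper name.

```python
import json

def load_state(path="output.json"):
    """Load the JSON file written by the Node script into Python objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Calling load_state() should return a dict. Judging by the snippet in the question, it probably has a top-level 'configuration' key, but that depends on the real page.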
