
Python: parsing JSON-like Javascript data structures (w/ consecutive commas)

I would like to parse JSON-like strings. The only difference from normal JSON is the presence of contiguous commas in arrays. Two consecutive commas implicitly mean that null should be inserted between them. Example:

       JSON-like:  ["foo",,,"bar",[1,,3,4]]
      Javascript:  ["foo",null,null,"bar",[1,null,3,4]]
Decoded (Python):  ["foo", None, None, "bar", [1, None, 3, 4]]

The native json.JSONDecoder class doesn't allow me to change the behavior of the array parsing. I can only modify the parsing of objects (dicts), ints, floats and constants, by passing hook functions as keyword arguments to JSONDecoder() (please see the doc).

So, does it mean I have to write a JSON parser from scratch? The Python code of json is available but it's quite a mess. I would prefer to use its internals instead of duplicating its code!

Since what you're trying to parse isn't JSON per se, but rather a different language that's very much like JSON, you may need your own parser.

Fortunately, this isn't as hard as it sounds. You can use a Python parser generator like pyparsing. JSON can be fully specified with a fairly simple context-free grammar (I found one here), so you should be able to modify it to fit your needs.
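If a parser generator feels like too much machinery, a hand-written recursive-descent parser for this grammar is also short. The following is only a minimal sketch, not a drop-in replacement for json. It assumes JavaScript's elision rules (a comma with no value in front of it yields null, and one trailing comma adds no element), and it hands strings, numbers and literals to the standard library's json.JSONDecoder.raw_decode(). The names loads_with_holes, parse_value, parse_array and parse_object are made up for this example.

```python
import json

_decoder = json.JSONDecoder()


def _skip_ws(s, i):
    """Return the index of the first non-whitespace character at or after i."""
    while i < len(s) and s[i] in ' \t\r\n':
        i += 1
    return i


def parse_value(s, i=0):
    """Parse one value starting at s[i]; return (value, index just past it)."""
    i = _skip_ws(s, i)
    if s[i:i + 1] == '[':
        return parse_array(s, i + 1)
    if s[i:i + 1] == '{':
        return parse_object(s, i + 1)
    # Scalars (strings, numbers, true/false/null) are left to the json module.
    return _decoder.raw_decode(s, i)


def parse_array(s, i):
    """Parse array items after '['; a missing item becomes None."""
    items = []
    while True:
        i = _skip_ws(s, i)
        ch = s[i:i + 1]
        if ch == ']':
            return items, i + 1
        if ch == ',':
            items.append(None)  # hole: a comma with no value in front of it
            i += 1
            continue
        value, i = parse_value(s, i)
        items.append(value)
        i = _skip_ws(s, i)
        ch = s[i:i + 1]
        if ch == ',':
            i += 1  # separator, or a trailing comma if ']' comes next
        elif ch != ']':
            raise ValueError(f"expected ',' or ']' at index {i}")


def parse_object(s, i):
    """Parse object members after '{'; object syntax itself stays strict JSON."""
    obj = {}
    i = _skip_ws(s, i)
    if s[i:i + 1] == '}':
        return obj, i + 1
    while True:
        key, i = _decoder.raw_decode(s, _skip_ws(s, i))
        i = _skip_ws(s, i)
        if not isinstance(key, str) or s[i:i + 1] != ':':
            raise ValueError(f"expected a string key and ':' near index {i}")
        obj[key], i = parse_value(s, i + 1)
        i = _skip_ws(s, i)
        if s[i:i + 1] == '}':
            return obj, i + 1
        if s[i:i + 1] != ',':
            raise ValueError(f"expected ',' or '}}' at index {i}")
        i += 1


def loads_with_holes(s):
    """Decode a JSON-like document whose arrays may contain holes."""
    value, i = parse_value(s)
    if _skip_ws(s, i) != len(s):
        raise ValueError(f"unexpected trailing data at index {i}")
    return value


print(loads_with_holes('["foo",,,"bar",[1,,3,4]]'))
# ['foo', None, None, 'bar', [1, None, 3, 4]]
```

Because commas inside string literals are consumed by raw_decode() as part of the string, they are never mistaken for holes.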

Small & simple workaround to try out:

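  1. Convert the JSON-like data to a string.
  2. Replace ",," with ",null,", repeating until no ",," is left (a single pass leaves a longer run such as ",,," only partly filled).
  3. Convert it back to whatever representation you use.
  4. Let JSONDecoder() do the heavy lifting.

    Steps 1 and 3 can be omitted if you already deal with strings.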

(And if converting to string is impractical, update your question with this info!)

You can do the comma replacement of Lattyware's / przemo_li's answers in one pass by using a lookbehind expression, i.e. "replace every comma that is directly preceded by a comma":

>>> import re
>>> s = '["foo",,,"bar",[1,,3,4]]'
>>> re.sub(r'(?<=,)\s*,', ' null,', s)
'["foo", null, null,"bar",[1, null,3,4]]'

Note that this only works for simple inputs where you can assume, for example, that no string literal contains consecutive commas. In general, regular expressions aren't enough to handle this problem, and Taymon's approach of using a real parser is the only fully correct solution.
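One middle ground between a bare regex and a full parser is a small character scanner that keeps track of whether it is inside a string literal. The sketch below is an illustration under stated assumptions, not a complete solution: fill_array_holes is a made-up name, only double-quoted strings are recognized, and a trailing comma before "]" is dropped the way JavaScript does. The result is then handed to json.loads().

```python
import json


def fill_array_holes(text):
    """Insert null into empty array slots, skipping over string literals."""
    out = []           # pieces of the rewritten text
    in_string = False  # True while inside a "..." literal
    escaped = False    # True right after a backslash inside a string
    last = ''          # last non-whitespace character seen outside strings
    for ch in text:
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == '\\':
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch.isspace():
            out.append(ch)
            continue
        if ch == '"':
            in_string = True
        elif ch == ',' and last in ('[', ','):
            out.append('null')  # nothing between the previous delimiter and this comma
        elif ch == ']' and last == ',':
            # Trailing comma: drop it (and any whitespace after it).
            while out[-1].isspace():
                out.pop()
            out.pop()
        out.append(ch)
        last = ch
    return ''.join(out)


s = '["a,,b",,,"bar",[1,,3,]]'
print(fill_array_holes(s))
# ["a,,b",null,null,"bar",[1,null,3]]
print(json.loads(fill_array_holes(s)))
# ['a,,b', None, None, 'bar', [1, None, 3]]
```

The commas inside "a,,b" are left alone because the scanner copies string contents through unchanged.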

It's a hackish way of doing it, but one solution is to simply do some string modification on the JSON-ish data to get it in line before parsing it.

import re
import json

not_quite_json = '["foo",,,"bar",[1,,3,4]]'
not_json = True
while not_json:
    not_quite_json, not_json = re.subn(r',\s*,', ', null, ', not_quite_json)

Which leaves us with:

'["foo", null, null, "bar",[1, null, 3,4]]'

We can then do:

json.loads(not_quite_json)

Giving us:

['foo', None, None, 'bar', [1, None, 3, 4]]

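Note that it's not as simple as a single replace, because matches cannot overlap: in a run such as ",,," the first match consumes the first two commas, so the third one only gets paired up with the comma inserted by the replacement on the next pass. Given this, you have to loop until no more replacements can be made. Here I have used a simple regex to do the job.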
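To make the need for the loop concrete, here is a small sketch that compares a single pass with the repeated one on a run of three commas. The helper name fill_commas and the constant PAIR are made up for this example, and the same caveat about commas inside string literals applies.

```python
import json
import re

PAIR = re.compile(r',\s*,')


def fill_commas(text):
    """Apply the substitution until a pass makes no replacement."""
    count = 1
    while count:
        text, count = PAIR.subn(', null, ', text)
    return text


s = '[1,,,2]'
print(PAIR.sub(', null, ', s))      # single pass: [1, null, ,2]
print(fill_commas(s))               # [1, null, null, 2]
print(json.loads(fill_commas(s)))   # [1, None, None, 2]
```

The single pass still leaves ", ," behind, which json.loads() would reject. The loop stops as soon as subn() reports zero replacements.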

I've had a look at Taymon's recommendation, pyparsing, and I successfully hacked the example provided here to suit my needs. It works well at simulating JavaScript's eval() but fails in one situation: trailing commas. A trailing comma should be optional (see the tests below), but I can't find a proper way to implement this.

from pyparsing import *

TRUE = Keyword("true").setParseAction(replaceWith(True))
FALSE = Keyword("false").setParseAction(replaceWith(False))
NULL = Keyword("null").setParseAction(replaceWith(None))

jsonString = dblQuotedString.setParseAction(removeQuotes)
jsonNumber = Combine(Optional('-') + ('0' | Word('123456789', nums)) +
                    Optional('.' + Word(nums)) +
                    Optional(Word('eE', exact=1) + Word(nums + '+-', nums)))

jsonObject = Forward()
jsonValue = Forward()
# black magic begins
# Each leading comma, and each comma that is not followed by a value, becomes None.
commaToNull = Literal(',').setParseAction(replaceWith(None))
jsonElements = ZeroOrMore(commaToNull) + Optional(jsonValue) + ZeroOrMore((Suppress(',') + jsonValue) | commaToNull)
# black magic ends
jsonArray = Group(Suppress('[') + Optional(jsonElements) + Suppress(']'))
jsonValue << (jsonString | jsonNumber | Group(jsonObject) | jsonArray | TRUE | FALSE | NULL)
memberDef = Group(jsonString + Suppress(':') + jsonValue)
jsonMembers = delimitedList(memberDef)
jsonObject << Dict(Suppress('{') + Optional(jsonMembers) + Suppress('}'))

jsonComment = cppStyleComment
jsonObject.ignore(jsonComment)

def convertNumbers(s, l, toks):
    n = toks[0]
    try:
        return int(n)
    except ValueError:
        return float(n)

jsonNumber.setParseAction(convertNumbers)

def test():
    tests = (
        '[1,2]',       # ok
        '[,]',         # ok
        '[,,]',        # ok
        '[  , ,  , ]', # ok
        '[,1]',        # ok
        '[,,1]',       # ok
        '[1,,2]',      # ok
        '[1,]',        # failure, I got [1, None], I should have [1]
        '[1,,]',       # failure, I got [1, None, None], I should have [1, None]
    )
    for test in tests:
        results = jsonArray.parseString(test)
        print(results.asList())

test()

For those looking for something quick and dirty to convert general JS objects to dicts: part of a page on a real site gave me an object I wanted to parse. It contained 'new' constructs for dates, and it was all on one line with no spaces in between, so two substitutions sufficed:

from re import sub

data = sub(r'new Date\(([^)]*)\)', r'\1', data)  # keep only the Date argument
data = sub(r'([,{])(\w*):', r'\1"\2":', data)    # quote bare keys

Then json.loads() worked fine. Your mileage may vary:)
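As an illustration, here is how the two substitutions behave on a made-up input of the shape described (one line, unquoted keys, a single-argument Date call). The sample string is an assumption, not data from the original site, and the capture group is written as ([^)]*) so that the whole argument is kept.

```python
import json
import re

# Made-up input: one line, unquoted keys, one Date constructor.
data = '{id:7,name:"foo",created:new Date(1325376000000)}'

# Keep only the constructor argument: new Date(1325376000000) -> 1325376000000
data = re.sub(r'new Date\(([^)]*)\)', r'\1', data)
# Quote bare keys that directly follow "{" or ",".
data = re.sub(r'([,{])(\w*):', r'\1"\2":', data)

print(data)
# {"id":7,"name":"foo","created":1325376000000}
print(json.loads(data))
# {'id': 7, 'name': 'foo', 'created': 1325376000000}
```

This stays quick and dirty: a string value that happens to contain text like ",word:" would be rewritten too, and a Date call with several arguments would not leave valid JSON behind.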

