
How to parse this JSON array of arrays (I think)

I am trying to parse a JSON item of the following format:

DataStore.prime('ws-stage-stat', 
{ against: 0, field: 2, stageId: 9155, teamId: 26, type: 8 }, 
[[['goal','fastbreak','leftfoot',[1]],['goal','openplay','leftfoot',[2]], 
['goal','openplay','rightfoot',[1]],['goal','owngoal','leftfoot',[1]],
['goal','penalty','rightfoot',[1]],['miss','corner','header',[6]],
['miss','corner','leftfoot',[2]],['miss','corner','rightfoot',[2]],
['miss','crossedfreekick','header',[1]],['miss','openplay','header',[4]],
['miss','openplay','leftfoot',[11]],['miss','openplay','rightfoot',[27]]]]

The items in quotes describe the types of goals scored or chances missed that are listed on a website, and the numbers represent the volumes. I'm assuming that this is a JSON array of arrays with mixed text and numerical data. What I would like to do is break this down into Python variables in the format of

var1 = "'goal','fastbreak','leftfoot'"
var2 = 1

...and repeat for all elements of the above pattern.

The code that retrieves this data structure is this:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
import re
import json
import requests


class ExampleSpider(CrawlSpider):
    name = "goal2"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Teams/32/Statistics/England-Manchester-United"]
    download_delay = 5

    rules = [Rule(SgmlLinkExtractor(allow=('http://www.whoscored.com/Teams/32/Statistics/England-Manchester-United'),deny=('/News', '/Graphics', '/Articles', '/Live', '/Matches', '/Explanations', '/Glossary', 'ContactUs', 'TermsOfUse', 'Jobs', 'AboutUs', 'RSS'),), follow=False, callback='parse_item')]

    def parse_item(self, response):

        url = 'http://www.whoscored.com/stagestatfeed'
        params = {
            'against': '0',
            'field': '2',
            'stageId': '9155',
            'teamId': '32',
            'type': '8'
            }
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
           'X-Requested-With': 'XMLHttpRequest',
           'Host': 'www.whoscored.com',
           'Referer': 'http://www.whoscored.com/'}

        responser = requests.get(url, params=params, headers=headers)

        print responser.text

I've checked the type of responser.text using print type(responser.text), which returns 'unicode'. Does this mean that the object is now a set of nested Python lists? If so, how can I parse it so that it returns the data in the format I am after?

Thanks

That's not JSON. JSON doesn't allow single-quoted strings. It also doesn't have constructor calls like that. See the official grammar.

You really want to figure out what format you actually have, and parse it appropriately. Or, better, if you have any control over the output code, fix it to be something that's easy (and safe and efficient) to parse.

At any rate, this looks like a repr of a Python object (in particular, a DataStore.prime object being constructed with a string, a dict, and a list of lists of … as arguments). So you could probably parse it with eval. Whether that's a good idea or not (possibly with some kind of sanitizing) depends on where you're getting the data from and what your security requirements are.
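
To make that concrete, here is a minimal (and deliberately unsafe) sketch of the eval route. The _DataStore stub and the trick of binding each bare key to its own string are assumptions added purely so that the expression evaluates:

class _DataStore(object):
    # stand-in for the real DataStore so the captured call has a target
    @staticmethod
    def prime(name, options, rows):
        return name, options, rows

js_text = ("DataStore.prime('ws-stage-stat', "
           "{ against: 0, field: 2, stageId: 9155, teamId: 26, type: 8 }, "
           "[[['goal','fastbreak','leftfoot',[1]],"
           "['goal','openplay','leftfoot',[2]]]])")

namespace = {'DataStore': _DataStore}
# the JS object literal uses bare keys, which Python evaluates as names,
# so each key has to exist as a variable bound to its own string
for key in ('against', 'field', 'stageId', 'teamId', 'type'):
    namespace[key] = key

name, options, rows = eval(js_text, namespace)
print(rows[0])  # [['goal', 'fastbreak', 'leftfoot', [1]], ...]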

Or it could just as easily be JavaScript code. Or various other scripting languages. (Most of them have similar structures with similar syntax—which is exactly why they all map between JSON and native data so easily; JSON is basically a subset of the literals for most scripting languages.)

A slightly safer and saner solution would be to explicitly parse out the top level, then use ast.literal_eval to parse out the string, dict, and list components.
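
For instance, reusing js_text from the sketch above, one could slice out just the list-of-lists payload and hand it to ast.literal_eval (the object literal's bare keys are not a valid Python literal, so only the list component is parsed here):

import ast

# slice out the [[...]] payload; the DataStore.prime(...) wrapper and the
# bare-key object literal are not valid Python literals, but the list is
start = js_text.index('[[[')
end = js_text.rindex(']]]') + 3
rows = ast.literal_eval(js_text[start:end])

for sub in rows[0]:
    labels, volume = sub[:-1], sub[-1][0]
    print(labels, volume)  # e.g. ['goal', 'fastbreak', 'leftfoot'] 1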

A possibly overly complicated solution would be to write a real custom parser.

But the best solution, again, would be to change the source to give you something more useful. Even if you really want to pass a Python object unsafely, pickle is a better idea than repr and eval. But most likely, that isn't what you actually want to do in the first place.
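
If changing the source were an option, the pickle suggestion would look roughly like this round-trip (purely illustrative; both sides would have to agree on this format, and pickle is still unsafe against untrusted input):

import pickle

payload = ('ws-stage-stat',
           {'against': 0, 'field': 2, 'stageId': 9155,
            'teamId': 26, 'type': 8},
           [[['goal', 'fastbreak', 'leftfoot', [1]]]])

blob = pickle.dumps(payload)              # serialized on the source side
name, options, rows = pickle.loads(blob)  # deserialized on the client side
print(name, options['teamId'], rows[0])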

One option would be to utilize a regular expression here:

import re

data = """
DataStore.prime('ws-stage-stat',
{ against: 0, field: 2, stageId: 9155, teamId: 26, type: 8 },
[[['goal','fastbreak','leftfoot',[1]],['goal','openplay','leftfoot',[2]],
['goal','openplay','rightfoot',[1]],['goal','owngoal','leftfoot',[1]],
['goal','penalty','rightfoot',[1]],['miss','corner','header',[6]],
['miss','corner','leftfoot',[2]],['miss','corner','rightfoot',[2]],
['miss','crossedfreekick','header',[1]],['miss','openplay','header',[4]],
['miss','openplay','leftfoot',[11]],['miss','openplay','rightfoot',[27]]]]
"""

# match each ['goal','fastbreak','leftfoot',[1]]-style group
pattern = re.compile(r"\[([^\[]+?),\[(\d+)\]\]")

print pattern.findall(data)

Prints:

[
    ("'goal','fastbreak','leftfoot'", '1'), 
    ("'goal','openplay','leftfoot'", '2'),
    ...
    ("'miss','openplay','rightfoot'", '27')
]

\[([^\[]+?),\[(\d+)\]\] basically matches the groups in square brackets. The parentheses capture certain parts of the matched string; the backslashes escape characters that have a special meaning in regex, like [ and ].
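
If you then want the var1/var2 layout from the question, the pairs returned by findall() can be unpacked directly; this short sketch builds on the pattern and data defined above:

# build var1, var2, ... from the (labels, count) pairs
variables = {}
for i, (labels, count) in enumerate(pattern.findall(data)):
    variables['var{}'.format(2 * i + 1)] = labels
    variables['var{}'.format(2 * i + 2)] = int(count)

print(variables['var1'])  # 'goal','fastbreak','leftfoot'
print(variables['var2'])  # 1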


Another option, since this looks suspiciously like a part of JavaScript code, would be to use a JavaScript parser. I've successfully used the slimit module for this kind of task.
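
As a rough sketch of how slimit could be applied here (assuming the module is installed; treating every four-item array as three labels plus a volume is an assumption about this particular feed):

from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor

parser = Parser()
tree = parser.parse(data)  # data is the captured text, as in the regex example

for node in nodevisitor.visit(tree):
    # the interesting nodes are arrays of three string labels plus a
    # one-number array holding the volume
    if isinstance(node, ast.Array) and len(node.items) == 4:
        strings = [item for item in node.items[:3]
                   if isinstance(item, ast.String)]
        if len(strings) == 3 and isinstance(node.items[3], ast.Array):
            labels = [s.value.strip("'\"") for s in strings]
            volume = node.items[3].items[0].value
            print(labels, volume)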

Running your code, you can split responser.text to get the list data, then use an OrderedDict to hold the required values.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from ast import literal_eval
from collections import OrderedDict
import requests


class ExampleSpider(CrawlSpider):
    name = "goal2"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Teams/32/Statistics/England-Manchester-United"]
    download_delay = 5

    rules = [Rule(SgmlLinkExtractor(allow=('http://www.whoscored.com/Teams/32/Statistics/England-Manchester-United'),deny=('/News', '/Graphics', '/Articles', '/Live', '/Matches', '/Explanations', '/Glossary', 'ContactUs', 'TermsOfUse', 'Jobs', 'AboutUs', 'RSS'),), follow=False, callback='parse_item')]

    def parse_item(self, response):

        url = 'http://www.whoscored.com/stagestatfeed'
        params = {
            'against': '0',
            'field': '2',
            'stageId': '9155',
            'teamId': '32',
            'type': '8'
            }
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
           'X-Requested-With': 'XMLHttpRequest',
           'Host': 'www.whoscored.com',
           'Referer': 'http://www.whoscored.com/'}

        responser = requests.get(url, params=params, headers=headers)

        resp = responser.text
        d = OrderedDict()
        # the feed is whitespace-separated, and the nested list payload is
        # the single token that starts with "[[["
        for line in resp.split():
            if line.startswith("[[["):
                break
        l = literal_eval(line)
        count = 1
        for sub_ele in l[0]:
            # the first three elements are the text labels
            d["var{}".format(count)] = ", ".join(sub_ele[:-1])
            count += 1
            # the last element is a one-item list holding the volume
            if sub_ele[-1][0]:
                d["var{}".format(count)] = sub_ele[-1][0]
                count += 1
        print d

This prints the following (the feed here was fetched for teamId 32, so the figures differ from the sample data in the question):

OrderedDict([('var1', 'goal, corner, rightfoot'), ('var2', 1),
             ('var3', 'goal, directfreekick, leftfoot'), ('var4', 1),
             ('var5', 'goal, openplay, leftfoot'), ('var6', 2),
             ('var7', 'goal, openplay, rightfoot'), ('var8', 2),
             ('var9', 'miss, corner, header'), ('var10', 5),
             ('var11', 'miss, corner, rightfoot'), ('var12', 1),
             ('var13', 'miss, directfreekick, leftfoot'), ('var14', 1),
             ('var15', 'miss, directfreekick, rightfoot'), ('var16', 2),
             ('var17', 'miss, openplay, header'), ('var18', 4),
             ('var19', 'miss, openplay, leftfoot'), ('var20', 14),
             ('var21', 'miss, openplay, rightfoot'), ('var22', 16)])
