
Accessing JSON column after reading multiple objects

The Problem

I want to read a file that contains multiple JSON objects and access a given value across all of the objects at once. So far I read the JSON objects in like this:

with open(infile) as file:
    allcontent = []
    for line in file:
        allcontent.append(json.loads(line))

The contents of the list are simply a json object per item:

[{"price": 241, "owner": "brian"}]

[{"price": 243, "owner": "bob"}]

This works and simply appends each JSON object to the list. However, when I want to calculate, for example, the highest price across all items in the list, I cannot see a simple way to do it without a complex loop and variables that track each column's value.

I tried looping over each JSON object and accessing its keys and values, but I don't want to use this method: it seems there should be a simpler way to access a single column from a list of JSON objects:

for item in allcontent:
    for key, value in item.items():
        print(key, value)

Question


This prints each row's keys and values, but I need to access all of the rows' prices at once to find the highest and lowest. Is there a simpler way than a loop, something like allcontent['prices']?

Dictionaries

I attempted to use a dictionary. However, because the keys are identical ("prices", for example), each update overwrites the previously stored content, and it would take a number of conditions to test whether the new value is higher or lower than the previous one.

From what I gathered from the question (and I might be wrong), your problem seems to reduce to finding the JSON object (which actually gets loaded into a Python dictionary) with the maximum price (for instance), right?

You could just load the whole file into memory (put all its items into the allcontent list of dictionaries) the way you're already doing it, then use the built-in max function.

import json

with open("data.json", 'r') as f:
    allcontent = []
    for line in f:
        allcontent.append(json.loads(line))

print(max(allcontent, key=lambda x: x['price']))

... which outputs the whole JSON object (i.e. the dictionary):

{u'owner': u'bob', u'price': 243}

However, since the file itself is an iterable, you don't even need to preload it into allcontent. You could just do:

with open("data.json", 'r') as f:
    print(max(f, key=lambda x: json.loads(x)['price']))
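Note that the one-liner above returns the raw line of text with the highest price, not a parsed dictionary. A minimal sketch of a variant that still reads lazily but returns the dictionary itself is shown here. It is written for Python 3, and io.StringIO stands in for the open file so the example is self-contained; the two sample lines are the ones from the question.

```python
import io
import json

# Stand-in for open("data.json"): one JSON object per line (assumed layout).
sample = io.StringIO(
    '{"price": 241, "owner": "brian"}\n'
    '{"price": 243, "owner": "bob"}\n'
)

# The generator parses one line at a time, so the whole file is never
# held in memory; blank lines are skipped.
records = (json.loads(line) for line in sample if line.strip())
highest = max(records, key=lambda record: record["price"])

print(highest)
```

With a real file, replacing sample with the file object from open() should behave the same way.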

All this assumes that your file looks exactly like this:

{"price":241,"owner":"brian"}
{"price":243,"owner":"bob"}

... which, taken as a whole, is not a valid JSON document (although each line is valid JSON on its own).

PS 01: I would strongly suggest you don't name your infile's file object "file", since that shadows the built-in file type (in Python 2).

PS 02: As per your comment in the question:

.load did work however as the input file im provided contains a list of objects there were errors when using .load as it is essentially just a string im reading from the file

If you wanted to use json.load, your file would need to be valid JSON. From what you have provided in the example, the closest valid JSON I can think of would be:

[
 {"price":241,"owner":"brian"},
 {"price":243,"owner":"bob"}
]

Notice that it creates a list (starting with [ and ending with ]) and that every item in the list is separated by a comma (except the last). I personally check JSON validity using JSONLint.com (but I'm sure there are many other tools).
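As a small illustration of that valid form, the sketch below parses the array in one call and picks out the cheapest and most expensive items. The JSON text is inlined as a string so the example runs on its own; with a file you would call json.load(f) on the open file object instead of json.loads.

```python
import json

# The valid-JSON form: a single array of objects.
text = '[{"price": 241, "owner": "brian"}, {"price": 243, "owner": "bob"}]'

# json.loads parses a string; json.load(f) does the same for an open file.
allcontent = json.loads(text)

cheapest = min(allcontent, key=lambda item: item["price"])
priciest = max(allcontent, key=lambda item: item["price"])

print(cheapest["owner"], priciest["owner"])
```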

I did some benchmarks. The fastest I could get was Method 1, with 1 million lines (I've commented out the code that generates the data, but it takes maybe 30 seconds to uncomment it and make your own). Method 2 and Method 3 are my representations of the answer by BorrajaX (the former of which lets you keep all of the read-in data for further use). Method 4 is your original, with some hope of keeping the value of your print. I removed all print statements.

This is in Python 2.7. The gains here are pretty small, though, even with 1,000,000 lines of text.

import time
import json
import string
import numpy as np

############################# GENERATE RANDOM DATA #############################

#letters = list(string.ascii_lowercase)
#random_data = ["""{"price": %d, "owner": "%s"}""" % (np.random.randint(1, 1000), 
#                ''.join(np.random.choice(letters, 6, replace=False))) for x 
#                in xrange(1000000)]
#
#with open('pseudo_json.txt', 'w') as outfile:
#    for line in random_data:
#        outfile.write(str(line) +'\n')


time1 = time.time()
#################################### METHOD 1 ##################################

running_max = 0
with open('pseudo_json.txt', 'r') as infile:
    for line in infile:
        price = json.loads(line)['price']
        if price > running_max:
            running_max = price

time2 = time.time()

#################################### METHOD 2 ##################################

with open("pseudo_json.txt", 'r') as f:
    allcontent = []
    for line in f:
        allcontent.append(json.loads(line))

the_max = (max(allcontent, key=lambda x: x['price']))

time3 = time.time()

##################################### METHOD 3 ##############################

the_max = 0
with open("pseudo_json.txt", 'r') as f:
    the_max = (max(f, key=lambda x: json.loads(x)['price']))

time4 = time.time()

#################################### ORIGINAL ##################################

with open("pseudo_json.txt", 'r') as infile:
    allcontent = []
    for line in infile:
        allcontent.append(json.loads(line))

values = []

for line in allcontent:
    for key,value in line.items():
        values.append(value)

the_max = max(values)

time5 = time.time()

################################# READING FILE #################################

with open("pseudo_json.txt", 'r') as infile:
    for line in infile:
        pass

time6 = time.time()

################################### RESULTS ####################################

print "Without storage and max took: {}".format(time2 - time1)
print "With storage and using max took: {}".format(time3-time2)
print "Without storage but using max took: {}".format(time4 - time3)
print "Original took: {}".format(time5 - time4)
print "Reading file took: {}".format(time6 - time5)
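Since the question asks for both the highest and the lowest price, the Method 1 approach could be extended to track both in a single pass. The following is only a sketch under a few assumptions: it uses Python 3 syntax (unlike the Python 2.7 benchmark above), price_range is a hypothetical helper name, and io.StringIO stands in for the data file.

```python
import io
import json


def price_range(lines):
    """Return (lowest, highest) price from an iterable of JSON lines."""
    lowest = highest = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        price = json.loads(line)["price"]
        if lowest is None or price < lowest:
            lowest = price
        if highest is None or price > highest:
            highest = price
    return lowest, highest


# Stand-in for the open data file: one JSON object per line.
sample = io.StringIO(
    '{"price": 241, "owner": "brian"}\n'
    '{"price": 243, "owner": "bob"}\n'
)
print(price_range(sample))
```

Nothing is stored besides the two running values, so memory use stays constant regardless of file size.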

The result of json.loads is usually* a regular Python dictionary. That means that, in your example, the allcontent variable is just a list of dictionaries.

You can therefore use Python's min and max functions, combined with a comprehension:

>>> allcontent = [{'price': 1}, {'price': 2}]
>>> min((thing['price'] for thing in allcontent))
1

*: Of course, if you do json.loads("0") you just get an integer.
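To get something close to the "column" access the question asks for, one option is to extract the prices into their own list once and reuse it. A minimal sketch using the sample data from the question (Python 3 syntax); operator.itemgetter is a standard-library alternative to the lambda used in the other answers.

```python
from operator import itemgetter

allcontent = [
    {"price": 241, "owner": "brian"},
    {"price": 243, "owner": "bob"},
]

# Pull the "price" column out once, then reuse it for several aggregates.
prices = [item["price"] for item in allcontent]
print(min(prices), max(prices))

# itemgetter("price") is equivalent to lambda item: item["price"].
top = max(allcontent, key=itemgetter("price"))
print(top["owner"])
```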

