
How can I count and keep track of values in a huge JSON file?

I have a huge JSON file. It has keys for the call type (the type of crime committed), the date and time (when the crime was committed), and the location (address or latitude and longitude), among other keys. I'm mostly interested in counting which days had the most crimes, which call types show up the most, and which locations show up the most. Location can be measured by the street address or by pairing the latitude and longitude. Python would probably be best. There are over 350 call types in a JSON file with over 350K rows, so every time a new call type appears, the code should start keeping a count for it.

I tried iterating through it like a list but ran into issues. How can I attach my data when the file is 62 MB? Should I link to a file?

This is an example of the data:

[{"A": "incident_num", "B": "date_time", "C": "day", "D": "stno", "E": "stdir1", "F": "StreetName", "G": "streettype", "H": "FullAddress", "I": "call_type", "J": "disposition", "K": "beat", "L": "priority", "M": "lat", "N": "long"},
{"A": "P17060024503", "B": "6/14/2017 21:54", "C": "4", "D": "10", "E": "", "F": "14TH", "G": "ST", "H": "10 14TH ST, San Diego, CA", "I": "1151", "J": "O", "K": "521", "L": "2", "M": "32.7054489", "N": "-117.1518696"},
{"A": "P17030051227", "B": "3/29/2017 22:24", "C": "4", "D": "10", "E": "", "F": "14TH", "G": "ST", "H": "10 14TH ST, San Diego, CA", "I": "1016", "J": "A", "K": "521", "L": "2", "M": "32.7054489", "N": "-117.1518696"},
{"A": "P17060004814", "B": "6/3/2017 18:04", "C": "7", "D": "10", "E": "", "F": "14TH", "G": "ST", "H": "10 14TH ST, San Diego, CA", "I": "1016", "J": "A", "K": "521", "L": "2", "M": "32.7054489", "N": "-117.1518696"},
{"A": "P17030029336", "B": "3/17/2017 10:57", "C": "6", "D": "10", "E": "", "F": "14TH", "G": "ST", "H": "10 14TH ST, San Diego, CA", "I": "1151", "J": "OT", "K": "521", "L": "2", "M": "32.7054489", "N": "-117.1518696"},
{"A": "P17030005412", "B": "3/3/2017 23:45", "C": "6", "D": "10", "E": "", "F": "15TH", "G": "ST", "H": "10 15TH ST, San Diego, CA", "I": "911P", "J": "CAN", "K": "521", "L": "2", "M": "32.7057215", "N": "-117.1503498"},
{"A": "P17020016091", "B": "2/10/2017 8:23", "C": "6", "D": "10", "E": "", "F": "15TH", "G": "ST", "H": "10 15TH ST, San Diego, CA", "I": "AU2", "J": "W", "K": "521", "L": "2", "M": "32.7057215", "N": "-117.1503498"},
{"A": "P17040017368", "B": "4/11/2017 4:57", "C": "3", "D": "10", "E": "", "F": "15TH", "G": "ST", "H": "10 15TH ST, San Diego, CA", "I": "5150", "J": "CAN", "K": "521", "L": "2", "M": "32.7057215", "N": "-117.1503498"},
{"A": "P17030048050", "B": "3/28/2017 6:30", "C": "3", "D": "10", "E": "", "F": "15TH", "G": "ST", "H": "10 15TH ST, San Diego, CA", "I": "1146", "J": "K", "K": "521", "L": "", "M": "32.7057215", "N": "-117.1503498"},
{"A": "P17060037341", "B": "6/22/2017 10:19", "C": "5", "D": "10", "E": "", "F": "15TH", "G": "ST", "H": "10 15TH ST, San Diego, CA", "I": "242", "J": "K", "K": "521", "L": "1", "M": "32.7057215", "N": "-117.1503498"},
{"A": "P17060008467", "B": "6/5/2017 19:27", "C": "2", "D": "10", "E": "", "F": "15TH", "G": "ST", "H": "10 15TH ST, San Diego, CA", "I": "5150", "J": "K", "K": "521", "L": "2", "M": "32.7057215", "N": "-117.1503498"},

I just want stats such as how many times each call type was made, which location has the most crimes, which date had the most crimes, etc.

Use pandas:

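import json
import pandas as pd

# "calls.json" is a placeholder; use the path to your own file
with open("calls.json") as f:
    data = json.load(f)

raw_df = pd.DataFrame(data)
# The first record holds the real column names; promote it to the header
df = raw_df.rename(columns=raw_df.iloc[0]).drop(0)
df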

Output:

    incident_num        date_time day stno stdir1 StreetName      ...      call_type disposition beat priority         lat          long
1   P17060024503  6/14/2017 21:54   4   10              14TH      ...           1151           O  521        2  32.7054489  -117.1518696
2   P17030051227  3/29/2017 22:24   4   10              14TH      ...           1016           A  521        2  32.7054489  -117.1518696
3   P17060004814   6/3/2017 18:04   7   10              14TH      ...           1016           A  521        2  32.7054489  -117.1518696
4   P17030029336  3/17/2017 10:57   6   10              14TH      ...           1151          OT  521        2  32.7054489  -117.1518696
5   P17030005412   3/3/2017 23:45   6   10              15TH      ...           911P         CAN  521        2  32.7057215  -117.1503498
6   P17020016091   2/10/2017 8:23   6   10              15TH      ...            AU2           W  521        2  32.7057215  -117.1503498
7   P17040017368   4/11/2017 4:57   3   10              15TH      ...           5150         CAN  521        2  32.7057215  -117.1503498
8   P17030048050   3/28/2017 6:30   3   10              15TH      ...           1146           K  521           32.7057215  -117.1503498
9   P17060037341  6/22/2017 10:19   5   10              15TH      ...            242           K  521        1  32.7057215  -117.1503498
10  P17060008467   6/5/2017 19:27   2   10              15TH      ...           5150           K  521        2  32.7057215  -117.1503498

Example of queries you can run:

>>> df['call_type'].value_counts()
5150    2
1016    2
1151    2
242     1
911P    1
AU2     1
1146    1
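The same idea extends to the other counts asked about in the question. The following is a minimal sketch, assuming the records look like the sample above; it uses five of the sample rows trimmed to the fields needed here, and the variable names (`calls_per_date`, `calls_per_address`, `calls_per_coord`) are made up for illustration.

```python
import pandas as pd

# Five rows from the question's sample, trimmed to the fields used below.
# The first record carries the real column names.
data = [
    {"B": "date_time", "H": "FullAddress", "I": "call_type", "M": "lat", "N": "long"},
    {"B": "6/14/2017 21:54", "H": "10 14TH ST, San Diego, CA", "I": "1151", "M": "32.7054489", "N": "-117.1518696"},
    {"B": "3/29/2017 22:24", "H": "10 14TH ST, San Diego, CA", "I": "1016", "M": "32.7054489", "N": "-117.1518696"},
    {"B": "6/3/2017 18:04", "H": "10 14TH ST, San Diego, CA", "I": "1016", "M": "32.7054489", "N": "-117.1518696"},
    {"B": "3/17/2017 10:57", "H": "10 14TH ST, San Diego, CA", "I": "1151", "M": "32.7054489", "N": "-117.1518696"},
    {"B": "3/3/2017 23:45", "H": "10 15TH ST, San Diego, CA", "I": "911P", "M": "32.7057215", "N": "-117.1503498"},
]

raw_df = pd.DataFrame(data)
df = raw_df.rename(columns=raw_df.iloc[0]).drop(0)

# Parse the timestamps so calls can be grouped by calendar date
# (assumes every value follows month/day/year hour:minute)
df["date_time"] = pd.to_datetime(df["date_time"], format="%m/%d/%Y %H:%M")

# Calls per calendar date, most frequent first
calls_per_date = df["date_time"].dt.date.value_counts()

# Calls per street address
calls_per_address = df["FullAddress"].value_counts()

# Calls per (lat, long) pair, most frequent first
calls_per_coord = df.groupby(["lat", "long"]).size().sort_values(ascending=False)

print(calls_per_address.head())
print(calls_per_coord.head())
```

`value_counts()` already sorts from most to least frequent, so `.head(n)` gives the top n entries on the full dataset.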

Iterate over the JSON file and store the required fields in an associative array (a dict in Python). You can then perform your operations on it.
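One way to sketch this in Python is with `collections.Counter`, a dict subclass that creates an entry the first time a key is seen, so there is no need for a separate variable per call type. The function name `count_calls` and the embedded sample are illustrative assumptions; the single-letter keys follow the sample data in the question.

```python
import json
from collections import Counter
from datetime import datetime

def count_calls(records):
    """Tally call types, calendar dates, addresses, and coordinates in one pass."""
    by_type, by_date = Counter(), Counter()
    by_address, by_coord = Counter(), Counter()
    for rec in records:
        by_type[rec["I"]] += 1                      # "I" holds call_type
        day = datetime.strptime(rec["B"], "%m/%d/%Y %H:%M").date()
        by_date[day] += 1                           # "B" holds date_time
        by_address[rec["H"]] += 1                   # "H" holds FullAddress
        by_coord[(rec["M"], rec["N"])] += 1         # ("M", "N") hold lat, long
    return by_type, by_date, by_address, by_coord

# For a real file you would load it instead, for example:
#     with open("calls.json") as f:   # placeholder path
#         sample = json.load(f)
sample = json.loads("""[
 {"B": "date_time", "H": "FullAddress", "I": "call_type", "M": "lat", "N": "long"},
 {"B": "6/14/2017 21:54", "H": "10 14TH ST, San Diego, CA", "I": "1151", "M": "32.7054489", "N": "-117.1518696"},
 {"B": "3/29/2017 22:24", "H": "10 14TH ST, San Diego, CA", "I": "1016", "M": "32.7054489", "N": "-117.1518696"},
 {"B": "6/3/2017 18:04", "H": "10 14TH ST, San Diego, CA", "I": "1016", "M": "32.7054489", "N": "-117.1518696"},
 {"B": "3/17/2017 10:57", "H": "10 14TH ST, San Diego, CA", "I": "1151", "M": "32.7054489", "N": "-117.1518696"},
 {"B": "3/3/2017 23:45", "H": "10 15TH ST, San Diego, CA", "I": "911P", "M": "32.7057215", "N": "-117.1503498"}
]""")

records = sample[1:]  # skip the first record, which only holds column names
by_type, by_date, by_address, by_coord = count_calls(records)

print(by_type.most_common(3))
print(by_address.most_common(1))
```

`most_common(n)` returns the n highest counts, which covers "what shows up the most" for each of the four tallies.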

If the data has fixed columns and structure, you can store it in a database such as MySQL and perform the required operations with simple queries.
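As a rough illustration of the database approach, the sketch below uses SQLite through Python's built-in `sqlite3` module instead of MySQL, only because it needs no server; the `GROUP BY` queries would look much the same in MySQL. The table name `calls` and its column names are assumptions chosen for this example.

```python
import sqlite3

# Five rows from the question's sample: (incident_num, date_time, address, call_type)
rows = [
    ("P17060024503", "6/14/2017 21:54", "10 14TH ST, San Diego, CA", "1151"),
    ("P17030051227", "3/29/2017 22:24", "10 14TH ST, San Diego, CA", "1016"),
    ("P17060004814", "6/3/2017 18:04", "10 14TH ST, San Diego, CA", "1016"),
    ("P17030029336", "3/17/2017 10:57", "10 14TH ST, San Diego, CA", "1151"),
    ("P17030005412", "3/3/2017 23:45", "10 15TH ST, San Diego, CA", "911P"),
]

conn = sqlite3.connect(":memory:")  # in-memory database, discarded on exit
conn.execute(
    "CREATE TABLE calls ("
    "incident_num TEXT, date_time TEXT, full_address TEXT, call_type TEXT)"
)
conn.executemany("INSERT INTO calls VALUES (?, ?, ?, ?)", rows)

# Count calls per call type, most frequent first (ties broken by call type)
top_types = conn.execute(
    "SELECT call_type, COUNT(*) AS n FROM calls "
    "GROUP BY call_type ORDER BY n DESC, call_type"
).fetchall()

# Count calls per address, most frequent first
top_addresses = conn.execute(
    "SELECT full_address, COUNT(*) AS n FROM calls "
    "GROUP BY full_address ORDER BY n DESC"
).fetchall()

print(top_types)
print(top_addresses)
conn.close()
```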
