简体   繁体   中英

parsing a JSON file with three types of formatted data into three separate dataframes

I have JSON data that contains three different types of records that I need to read, parse into each type, and store as three separate data frames. I used dummy data here since the actual records are quite long. The data have different numbers of columns for each record type with some column names in common. The schema are basically:

schema 1: one record per item ID with basic characteristics in wide format that looks like:

[id,year,object_name,base_price]

schema 2 and 3: each are a set of different features in long format for a single id that look like:

[id, feature1, price1]

[id, feature2, price2]

[id, feature1, size1]

[id, feature2, size2]

Here's what the JSON data looks like

[

{"id": ["500068"], "year": ["2018"], "object_name": "barn"},

{"id": ["500069"], "year": ["2018"], "object_name": "well"},

{"id": ["500068"], "year": ["2018"], "feature": "extension","price": "65"},

{"id": ["500068"], "year": ["2018"], "feature": "stalls","price": "72"},

{"id": ["500070"], "year": ["2018"], "feature": "depth_extension","price": "55"},

{"id": ["500070"], "year": ["2018"], "feature": "cover","price": "55"},

{"id": ["500070"], "year": ["2018"], "feature": "depth_extension","size": "1050"},

{"id": ["500070"], "year": ["2018"], "feature": "cover","size": "1160"},

]

I tried getting the JSON into a format that i could try to do this in, but I get stuck here.

import pandas as pd

with open(file, 'r') as f:
    data = f.readlines()

# remove the trailing "\n" from each line
data = map(lambda x: x.rstrip(), data)

# each element of 'data' is an individual JSON object.
# i want to convert it into an *array* of JSON objects
# which, in and of itself, is one large JSON object
# basically... add square brackets to the beginning
# and end, and have all the individual business JSON objects
# separated by a comma
data_json_str = "[" + ','.join(data) + "]"

# now, load it into pandas 
# (I get stuck here since they are different record types
data_df = pd.read_json(data_json_str)

I would like the data to be read into three different pandas data frames with the appropriate column names. Thanks for the help!

To create 3 different data frames from various type of column names of json data, you can try the following code.

import pandas as pd

json_data='[{"id": ["500068"], "year": ["2018"], "object_name": "barn"},{"id": ["500069"], "year": ["2018"], "object_name": "well"},{"id": ["500068"], "year": ["2018"], "feature": "extension","price": "65"},{"id": ["500068"], "year": ["2018"], "feature": "stalls","price": "72"},{"id": ["500070"], "year": ["2018"], "feature": "depth_extension","price": "55"},{"id": ["500070"], "year": ["2018"], "feature": "cover","price": "55"},{"id": ["500070"], "year": ["2018"], "feature": "depth_extension","size": "1050"},{"id": ["500070"], "year": ["2018"], "feature": "cover","size": "1160"},]'
list_data=eval(json_data)

df=pd.DataFrame(list_data)

df1=df[df['size'].notnull()]
df1=df1.drop(['object_name','price'], 1)
print(df1)
print('\n')

df2=df[df['price'].notnull()]
df2=df2.drop(['object_name','size'], 1)
print(df2)
print('\n')

df3=df[df['object_name'].notnull()]
df3=df3.drop(['feature', 'price', 'size'], 1)
print(df3)
print('\n')

The output:

           feature        id  size    year
6  depth_extension  [500070]  1050  [2018]
7            cover  [500070]  1160  [2018]


           feature        id price    year
2        extension  [500068]    65  [2018]
3           stalls  [500068]    72  [2018]
4  depth_extension  [500070]    55  [2018]
5            cover  [500070]    55  [2018]


         id object_name    year
0  [500068]        barn  [2018]
1  [500069]        well  [2018]

Hope this can help you, have a nice day. :)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM