
How to import normalised json data from several files into a pandas dataframe?

I have JSON data files in several directories that I want to import into pandas for data analysis. The format of the JSON depends on the type given in the directory name. For example,

dir1_typeA/
  file1
  file2
  ...
dir1_typeB/
  file1
  file2
  ...
dir2_typeB/
  file1
  ...
dir2_typeA/
  file1
  file2

Each file contains a complex nested JSON string that will become one row of a DataFrame. I will have two DataFrames, one for typeA and one for typeB. Later on I will append them if needed.

So far I've got all the file paths I need with os.walk, and I am trying to go through them:

    import json
    import os
    from glob import glob

    from pandas import json_normalize

    PATH = 'dir/filepath'
    files = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], 'file*'))]

    for file in files:
        with open(file, 'r') as f:
            data = f.read()

        data_json = json_normalize(json.loads(data))
        # the parent directory name (e.g. dir1_typeA) carries the type
        file_type = os.path.basename(os.path.dirname(file))
        data_json['type'] = file_type
        # append to data frame for typeA and typeB
        if 'typeA' in file_type:
            pass  # append to typeA dataframe
        else:
            pass  # append to typeB dataframe

There is one added issue: files inside a directory may have slightly different fields. For example, file1 may have a few more fields than file2 in dir1_typeA. So the DataFrame for each type needs to accommodate fields that vary from file to file.

How do I create these two dataframes?

I think you should concatenate the files together before you read them into pandas. Here is how you'd do it in bash (you could also do it in Python):

cat `find *_typeA -type f` > typeA
cat `find *_typeB -type f` > typeB
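
The same concatenation could be sketched in Python roughly as follows. The function name `concat_json_files`, its parameters, and the output file names are hypothetical, and the sketch assumes each input file holds exactly one JSON document. Re-serialising with `json.dumps` puts every record on a single line, even if the source file was pretty-printed.

```python
import json
import os


def concat_json_files(root, type_suffix, out_path):
    """Write each JSON file found in directories whose name ends with
    type_suffix to out_path, one record per line. Returns the record count."""
    count = 0
    with open(out_path, "w") as out:
        # sorted() keeps the directory order deterministic
        for dirpath, _dirnames, filenames in sorted(os.walk(root)):
            if not os.path.basename(dirpath).endswith(type_suffix):
                continue
            for name in sorted(filenames):
                with open(os.path.join(dirpath, name)) as f:
                    record = json.load(f)
                # json.dumps emits no newlines by default
                out.write(json.dumps(record) + "\n")
                count += 1
    return count
```

Calling it once with `"typeA"` and once with `"typeB"` would produce the two combined files.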

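Then you can import it into pandas using json_normalize, reading one JSON record per line (this assumes each source file holds its JSON on a single line that ends with a newline):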

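import json
import pandas as pd

with open('typeA') as f:
    # one JSON record per line; skip blank lines
    data = [json.loads(l) for l in f if l.strip()]

# pd.json_normalize is the top-level name in pandas 1.0+
# (older versions: pd.io.json.json_normalize)
dfA = pd.json_normalize(data)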

dfA

#          that this.first this.second
# 0  otherthing      thing       thing
# 1  otherthing      thing       thing
# 2  otherthing      thing       thing
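
On the question of files with slightly different fields: as far as I can tell, `json_normalize` takes the union of the flattened keys across all records and leaves missing values as NaN, so no special handling should be needed. A minimal sketch with made-up records (the field names simply mirror the example output above; `pd.json_normalize` assumes pandas 1.0 or later):

```python
import pandas as pd

# Two hypothetical records; the second lacks some fields of the first.
records = [
    {"this": {"first": "thing", "second": "thing"}, "that": "otherthing"},
    {"this": {"first": "thing"}},
]

# Nested keys are flattened to dotted column names; gaps become NaN.
df = pd.json_normalize(records)

# Frames with different columns can be combined later the same way:
# pd.concat aligns on column names and fills the gaps with NaN.
extra = pd.json_normalize([{"another": 1}])
combined = pd.concat([df, extra], ignore_index=True, sort=False)
```

Here `df` has one row per record, and `combined` simply gains the extra column.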

