I have JSON data files in several directories that I want to import into pandas for some data analysis. The format of the JSON depends on the type encoded in the directory name. For example,
dir1_typeA/
file1
file2
...
dir1_typeB/
file1
file2
...
dir2_typeB/
file1
...
dir2_typeA/
file1
file2
Each file contains a complex nested JSON string that will become one row of a DataFrame. I will have one DataFrame for each of typeA and typeB; later on I will append them if needed.
So far, I've got all the file paths I need with os.walk and am trying to loop through them:
import json
import os
from glob import glob

from pandas import json_normalize

PATH = 'dir/filepath'
files = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], 'file*'))]

for file in files:
    with open(file, 'r') as f:
        data = f.read()
    data_json = json_normalize(json.loads(data))
    # directory name encodes the type, e.g. 'dir1_typeA'
    file_type = os.path.basename(os.path.dirname(file))
    data_json['type'] = file_type
    # append to data frame for typeA or typeB
    if 'typeA' in file_type:
        pass  # append to typeA dataframe
    else:
        pass  # append to typeB dataframe
There is one added issue: files inside a directory may have slightly different fields. For example, file1 may have a few more fields than file2 in dir1_typeA. So, I need to accommodate that dynamic nature in the data frame for each type as well.
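To illustrate with a minimal made-up example (hypothetical field names): what I want is for records with different fields to end up in one frame, with the absent values padded:

```python
import pandas as pd

# Two hypothetical typeA records; the second one is missing some fields.
records = [
    {'id': 1, 'meta': {'source': 'file1'}, 'extra': 'only in file1'},
    {'id': 2, 'meta': {'source': 'file2'}},
]

# json_normalize flattens the nesting and pads absent fields with NaN,
# so rows with different shapes can live in a single DataFrame.
df = pd.json_normalize(records)
```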
How do I create these two dataframes?
I think you should concatenate the files together first before you read them into pandas, here is how you'd do it in bash (you could also do it in Python):
cat `find *typeA` > typeA
cat `find *typeB` > typeB
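A pure-Python equivalent of the bash step might look like the sketch below. It builds a tiny throwaway copy of the directory tree in a temp dir (with made-up one-line JSON payloads) so it is self-contained, then appends every file from a `*_typeA` / `*_typeB` directory into one combined file per type:

```python
import os
import tempfile

# Hypothetical fixture: recreate a tiny version of the directory tree.
root = tempfile.mkdtemp()
for d, payload in [('dir1_typeA', '{"a": 1}'), ('dir2_typeA', '{"a": 2}'),
                   ('dir1_typeB', '{"b": 3}')]:
    os.makedirs(os.path.join(root, d))
    with open(os.path.join(root, d, 'file1'), 'w') as f:
        f.write(payload + '\n')

# Equivalent of `cat `find *typeA` > typeA`: concatenate every file found
# in a matching directory into one file, one JSON object per line.
for suffix in ('typeA', 'typeB'):
    with open(os.path.join(root, suffix), 'w') as out:
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # deterministic traversal order
            if dirpath.endswith(suffix):
                for name in sorted(filenames):
                    with open(os.path.join(dirpath, name)) as f:
                        out.write(f.read())
```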
Then you can import it into pandas using pd.io.json.json_normalize:
import json
import pandas as pd

with open('typeA') as f:
    data = [json.loads(line) for line in f]

dfA = pd.io.json.json_normalize(data)
dfA
#          that this.first this.second
# 0  otherthing      thing       thing
# 1  otherthing      thing       thing
# 2  otherthing      thing       thing
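Alternatively, you can skip the intermediate files and build the two frames directly. The sketch below (with a hypothetical fixture tree and made-up field names) collects one list of parsed records per type and calls json_normalize once per list, which is much faster than appending to a DataFrame inside the loop; missing fields simply become NaN:

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical fixture: a tiny directory tree like the one in the question.
root = tempfile.mkdtemp()
samples = {
    'dir1_typeA/file1': {'a': 1, 'nested': {'x': 'one'}},
    'dir1_typeA/file2': {'a': 2},          # fewer fields than file1
    'dir1_typeB/file1': {'b': 'three'},
}
for rel, obj in samples.items():
    path = os.path.join(root, rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, 'w') as f:
        json.dump(obj, f)

# One list of raw records per type; build each frame in a single call.
rows = {'typeA': [], 'typeB': []}
for dirpath, _, filenames in os.walk(root):
    dir_type = os.path.basename(dirpath).split('_')[-1]  # e.g. 'typeA'
    if dir_type not in rows:
        continue
    for name in sorted(filenames):
        with open(os.path.join(dirpath, name)) as f:
            record = json.load(f)
        record['type'] = dir_type
        rows[dir_type].append(record)

dfA = pd.json_normalize(rows['typeA'])  # missing fields become NaN
dfB = pd.json_normalize(rows['typeB'])
```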