I have a list of dictionaries in event_records, and a subset of the list is below. Each dictionary contains 2 or 3 key-value pairs. The first key is item, and the corresponding value is event#status. The second key is count, and the corresponding value is a dictionary containing 8 key-value pairs plus 1 key-value pair whose value is a list of 9 dictionaries, each containing 3 key-value pairs. The third key (only present some of the time) is errors, and the corresponding value is a list containing a dictionary with 3 key-value pairs.
What is the most efficient way to convert the below list of dictionaries in event_records into a pandas dataframe? I tried the following code, but the speed and performance are very slow.
from pandas.io.json import json_normalize
import pandas as pd

df1 = json_normalize(event_records)
df2 = df1['customEvents']
custom_events_list = []
for element in df2:
    df3 = json_normalize(element)
    df4 = df3[['type', 'value']]
    df5 = df4.T
    df5.columns = df5.iloc[0]
    df5 = df5[1:]
    custom_events_list.append(df5)
df6 = pd.concat(custom_events_list)
df6 = df6.reset_index(drop=True)
df7 = df1.join(df6)
df8 = df1['errors']
event_error_list = []
for element in df8:
    df9 = json_normalize(element)
    df10 = df9[['response', 'feedback']]
    event_error_list.append(df10)
df11 = pd.concat(event_error_list)
df11 = df11.reset_index(drop=True)
df12 = df7.join(df11)
df13 = df12[['old_id', 'new_id', 'event_id', 'event_time', 'value', 'quantity', 'unique_id', 'A3', 'A4', 'A6', 'A9', 'A10', 'A11', 'A12', 'A13', 'A14', 'response', 'feedback']]
event_records = [{'item': 'event#status',
'count': {'item': 'event#count',
'old_id': '123',
'new_id': '456',
'event_id': '111',
'event_time': '1200',
'value': 1.0,
'quantity': '1',
'unique_id': '222',
'customEvents': [{'item': 'event#custom', 'type': 'A3', 'value': ''},
{'item': 'event#custom', 'type': 'A4', 'value': '11AA'},
{'item': 'event#custom', 'type': 'A6', 'value': 'AAB1'},
{'item': 'event#custom', 'type': 'A9', 'value': ''},
{'item': 'event#custom', 'type': 'A10', 'value': '10.5'},
{'item': 'event#custom', 'type': 'A11', 'value': 'ABC'},
{'item': 'event#custom', 'type': 'A12', 'value': 'NYR'},
{'item': 'event#custom', 'type': 'A13', 'value': 'NYR'},
{'item': 'event#custom', 'type': 'A14', 'value': 'NYR'}]},
'errors': [{'item': 'event#Error',
'response': 'NONE',
'feedback': 'Event not found'}]},
{'item': 'event#status',
'count': {'item': 'event#count',
'old_id': '567',
'new_id': '789',
'event_id': '333',
'event_time': '1400',
'value': 1.0,
'quantity': '1',
'unique_id': '444',
'customEvents': [{'item': 'event#custom', 'type': 'A3', 'value': ''},
{'item': 'event#custom', 'type': 'A4', 'value': '22BB'},
{'item': 'event#custom', 'type': 'A6', 'value': 'CCD1'},
{'item': 'event#custom', 'type': 'A9', 'value': ''},
{'item': 'event#custom', 'type': 'A10', 'value': '20.5'},
{'item': 'event#custom', 'type': 'A11', 'value': 'ABC'},
{'item': 'event#custom', 'type': 'A12', 'value': 'NYR'},
{'item': 'event#custom', 'type': 'A13', 'value': 'NYR'},
{'item': 'event#custom', 'type': 'A14', 'value': 'NYR'}]}}]
The desired Pandas dataframe output is as follows:
old_id new_id event_id event_time value quantity unique_id A3 A4 A6 A9 A10 A11 A12 A13 A14 response feedback
123 456 111 1200 1.0 1 222 11AA AAB1 10.5 ABC NYR NYR NYR NONE Event not found
567 789 333 1400 1.0 1 444 22BB CCD1 20.5 ABC NYR NYR NYR
Adding to data-frames is a slow process because each addition recreates the entire object, and in your code you create 13 data-frames. I recommend you do all of the formatting outside of the data-frame object, and then create the data-frame in one fell swoop. There are multiple ways to create a data-frame (see this GeeksforGeeks page for a few examples), and you can choose whichever is easiest for you.
The way that seems the fastest to me would be to iterate through the event records list as follows:
processed_records = []
for event_record in event_records:
    processed_records.append(process_record(event_record))
df = pd.DataFrame(processed_records)
Then you need to write a function called process_record that extracts all the relevant data out of an event record and returns it as a dictionary (e.g. {"old_id": 123, "new_id": 345, "event_id": 567, ..., "feedback": None}). There are a few quirks you'll have to pay attention to. Because some records don't have errors, you need to make sure you add None or -1 or some other value to indicate a null value in that column; otherwise you will get a NaN in your column in pandas. This will take a bit of tedious code, but it will be much faster than the version where you create 12 unnecessary data-frames.
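The steps above can be sketched as follows; this is a minimal, hypothetical process_record, with field names taken from the sample data in the question and a reduced two-record sample inlined so the snippet stands on its own:

```python
import pandas as pd

# reduced sample in the same shape as the question's event_records
event_records = [
    {'item': 'event#status',
     'count': {'item': 'event#count', 'old_id': '123', 'new_id': '456',
               'unique_id': '222',
               'customEvents': [{'item': 'event#custom', 'type': 'A4',
                                 'value': '11AA'}]},
     'errors': [{'item': 'event#Error', 'response': 'NONE',
                 'feedback': 'Event not found'}]},
    {'item': 'event#status',
     'count': {'item': 'event#count', 'old_id': '567', 'new_id': '789',
               'unique_id': '444',
               'customEvents': [{'item': 'event#custom', 'type': 'A4',
                                 'value': '22BB'}]}},
]

def process_record(record):
    """Flatten one event record into a single flat dict."""
    count = record['count']
    # scalar fields from 'count', minus the nested/constant keys
    row = {k: v for k, v in count.items()
           if k not in ('item', 'customEvents')}
    # pivot each custom event's 'type' into its own column
    for custom in count.get('customEvents', []):
        row[custom['type']] = custom['value']
    # records without errors get explicit None values
    error = record['errors'][0] if record.get('errors') else {}
    row['response'] = error.get('response')
    row['feedback'] = error.get('feedback')
    return row

df = pd.DataFrame([process_record(r) for r in event_records])
```

With the full records this yields one row per event, with None (displayed as NaN by pandas) wherever a record had no errors.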
EDIT: Clarified code
The data processing here is quite elegant thanks to pandas json_normalize & list comprehension.
First, extract the custom events:
parent_fields = ['old_id', 'new_id', 'event_id', 'event_time', 'value', 'quantity', 'unique_id']
custom_events = json_normalize(
    [r['count'] for r in event_records],
    'customEvents',
    parent_fields,
    record_prefix='#'
)
Then extract the errors. Here I use a lesser-known feature of list comprehensions that allows both filtering and iteration over nested elements to generate the records that are fed into the DataFrame constructor:
errors = pd.DataFrame(
    [(e['response'], e['feedback'], r['count']['unique_id'])
     for r in event_records if 'errors' in r
     for e in r['errors']],
    columns=['response', 'feedback', 'unique_id'])
Merge the two dataframes, then pivot the custom-event types into columns:
df = custom_events.merge(
    errors,
    on='unique_id',
    how='left'
)
shaped = df.set_index(
    [c for c in df.columns if c != '#value']
).unstack('#type')
At this point, shaped is a dataframe with the desired shape; however, the columns are still a MultiIndex instead of a flat list.
#shaped outputs:
#value
#type A10 A11 A12 A13 A14 A3 A4 A6 A9
old_id new_id event_id event_time value quantity unique_id response feedback
123 456 111 1200 1.0 1 222 NONE Event not found 10.5 ABC NYR NYR NYR 11AA AAB1
567 789 333 1400 1.0 1 444 NaN NaN 20.5 ABC NYR NYR NYR 22BB CCD1
Set the columns to the 2nd level of the MultiIndex and reset the dataframe's index; if you want to, you can then re-order the columns:
shaped.columns = shaped.columns.levels[1]
shaped.reset_index()
# outputs:
#type old_id new_id event_id event_time value quantity unique_id response feedback A10 A11 A12 A13 A14 A3 A4 A6 A9
0 123 456 111 1200 1.0 1 222 NONE Event not found 10.5 ABC NYR NYR NYR 11AA AAB1
1 567 789 333 1400 1.0 1 444 NaN NaN 20.5 ABC NYR NYR NYR 22BB CCD1
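To make the unstack/flatten/re-order sequence concrete, here is a small self-contained sketch on toy data; the '#type' and '#value' names mirror the record_prefix='#' used above, and the final column order is only an example:

```python
import pandas as pd

# toy long-format frame mirroring the '#type'/'#value' layout above
df = pd.DataFrame({
    'unique_id': ['222', '222', '444', '444'],
    '#type':     ['A4', 'A6', 'A4', 'A6'],
    '#value':    ['11AA', 'AAB1', '22BB', 'CCD1'],
})

# pivot '#type' into columns, as in the answer
shaped = df.set_index(['unique_id', '#type']).unstack('#type')

# the columns are now a MultiIndex ('#value', <type>); keep the 2nd level
shaped.columns = shaped.columns.levels[1]
shaped = shaped.reset_index()

# optional: re-order the columns to taste
shaped = shaped[['unique_id', 'A6', 'A4']]
```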
I'll suggest we create the three dataframes and concatenate them afterwards. Also, some of the data is nested in lists, nested in dicts, nested in lists again; some journey. Personally, I use a library (jmespath) to make the journey easier and, IMHO, simpler. As I said, personally. Here goes:
import pandas as pd
import jmespath
from collections import defaultdict
First we create the dataframe for the ids; the nesting isn't so deep here, so a comprehension (well, a nested comprehension) should do the trick, and it is a more sensible approach here:
df1 = pd.DataFrame({key: value
                    for key, value in entry['count'].items()
                    if key not in ('customEvents', 'item')}
                   for entry in event_records)
df1
old_id new_id event_id event_time value quantity unique_id
0 123 456 111 1200 1.0 1 222
1 567 789 333 1400 1.0 1 444
The second dataframe to pull out is the 'As'; this is where jmespath comes into play, as it allows easy traversal of nested lists/dicts. You could write a nested list comprehension here, but jmespath lets us avoid that nesting.
The path to customEvents is: list -> dict -> count -> customEvents -> list. Keys are accessed in jmespath via the dot (.) symbol, while lists are accessed via the bracket ([]) symbol:
As = jmespath.compile('[].count.customEvents[]')
out = As.search(event_records)
print(out)
[{'item': 'event#custom', 'type': 'A3', 'value': ''},
{'item': 'event#custom', 'type': 'A4', 'value': '11AA'},
{'item': 'event#custom', 'type': 'A6', 'value': 'AAB1'},
{'item': 'event#custom', 'type': 'A9', 'value': ''},
{'item': 'event#custom', 'type': 'A10', 'value': '10.5'},
{'item': 'event#custom', 'type': 'A11', 'value': 'ABC'},
{'item': 'event#custom', 'type': 'A12', 'value': 'NYR'},
{'item': 'event#custom', 'type': 'A13', 'value': 'NYR'},
{'item': 'event#custom', 'type': 'A14', 'value': 'NYR'},
{'item': 'event#custom', 'type': 'A3', 'value': ''},
{'item': 'event#custom', 'type': 'A4', 'value': '22BB'},
{'item': 'event#custom', 'type': 'A6', 'value': 'CCD1'},
{'item': 'event#custom', 'type': 'A9', 'value': ''},
{'item': 'event#custom', 'type': 'A10', 'value': '20.5'},
{'item': 'event#custom', 'type': 'A11', 'value': 'ABC'},
{'item': 'event#custom', 'type': 'A12', 'value': 'NYR'},
{'item': 'event#custom', 'type': 'A13', 'value': 'NYR'},
{'item': 'event#custom', 'type': 'A14', 'value': 'NYR'}]
Next, we use a defaultdict to collect the type and value keys:
d = defaultdict(list)
for i in out:
    d[i['type']].append(i['value'])
print(d)
defaultdict(list,
{'A3': ['', ''],
'A4': ['11AA', '22BB'],
'A6': ['AAB1', 'CCD1'],
'A9': ['', ''],
'A10': ['10.5', '20.5'],
'A11': ['ABC', 'ABC'],
'A12': ['NYR', 'NYR'],
'A13': ['NYR', 'NYR'],
'A14': ['NYR', 'NYR']})
read it into a dataframe:
df2 = pd.DataFrame(d)
df2
A3 A4 A6 A9 A10 A11 A12 A13 A14
0 11AA AAB1 10.5 ABC NYR NYR NYR
1 22BB CCD1 20.5 ABC NYR NYR NYR
The third part is to extract the errors data. The same concepts of [] for lists and . for keys apply here as well; however, here we can get our data back in key:value pairs, like a dict:
errors = jmespath.compile('[].errors[].{response:response,feedback:feedback}')
err = errors.search(event_records)
print(err)
[{'response': 'NONE', 'feedback': 'Event not found'}]
read into a dataframe:
df3 = pd.DataFrame(err)
df3
response feedback
0 NONE Event not found
And we are at the end: concatenate the dataframes along the columns:
result = pd.concat([df1, df2, df3], axis=1)
old_id new_id event_id event_time value quantity unique_id A3 A4 A6 A9 A10 A11 A12 A13 A14 response feedback
0 123 456 111 1200 1.0 1 222 11AA AAB1 10.5 ABC NYR NYR NYR NONE Event not found
1 567 789 333 1400 1.0 1 444 22BB CCD1 20.5 ABC NYR NYR NYR NaN NaN
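One caveat about the final pd.concat: axis=1 aligns rows by position, and df3 only has rows for records that actually contain errors, so the result lines up here only because the sole error belongs to the first record. A sketch of a more defensive variant that carries unique_id along and merges on it (a plain comprehension is used instead of jmespath, with a reduced sample inlined):

```python
import pandas as pd

# reduced sample: only the second record lacks an 'errors' key
event_records = [
    {'count': {'unique_id': '222'},
     'errors': [{'response': 'NONE', 'feedback': 'Event not found'}]},
    {'count': {'unique_id': '444'}},
]

df1 = pd.DataFrame([{'unique_id': r['count']['unique_id']}
                    for r in event_records])

# keep the owning record's unique_id next to each error row
df3 = pd.DataFrame(
    [{'unique_id': r['count']['unique_id'],
      'response': e['response'],
      'feedback': e['feedback']}
     for r in event_records if 'errors' in r
     for e in r['errors']])

# a key-based merge cannot mis-assign error rows, unlike positional concat
result = df1.merge(df3, on='unique_id', how='left')
```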