Normalize nested json columns within DataFrame

I have a DataFrame in which some columns are fine as they are, while others contain deeply nested JSON. I would like to flatten those columns without losing the data in the others. I have tried many things with the json_normalize function, but with no success so far. Here is the original dictionary from which I created the DataFrame:

{'event_no': 1, 
'post_time': {'$date': 1501780860000}, 
'conditions': 'FOR MAIDENS, FILLIES TWO YEARS OLD. Weight, 118 lbs. Claiming Price $35,000, For Each $5,000 To $25,000 1 lb.', 
'odds_updated': {'$date': 1501780944000}, 
'status': 'Unknown', 
'distance': '4 1/2F', 
'purse': 2600000, 
'runners': [{'number': 1, 'scratched': False, 'name': 'Indian Myth', 'morning_odds': 4, 'current_odds': 3, 'win_pool': 11293, 'place_pool': 3560, 'show_pool': 1087}, 
            {'number': 4, 'scratched': False, 'name': 'Awesome Omi', 'morning_odds': '7/2', 'current_odds': 10, 'win_pool': 4297, 'place_pool': 1752, 'show_pool': 658}, 
            {'number': 3, 'scratched': False, 'name': 'La Reyna Bella', 'morning_odds': 12, 'current_odds': '7/2', 'win_pool': 9897, 'place_pool': 4047, 'show_pool': 874}, 
            {'number': 6, 'scratched': False, 'name': 'Spirited Tale', 'morning_odds': 6, 'current_odds': 35, 'win_pool': 1347, 'place_pool': 563, 'show_pool': 431}, 
            {'number': 5, 'scratched': False, 'name': 'Lucky Stiff', 'morning_odds': 3, 'current_odds': '5/2', 'win_pool': 12611, 'place_pool': 4506, 'show_pool': 1190}, 
            {'number': 2, 'scratched': False, 'name': 'Helen Rose', 'morning_odds': '5/2', 'current_odds': '3/2', 'win_pool': 19236, 'place_pool': 6709, 'show_pool': 2481}], 
'results': {'finisher': 
              [{'runner_name': 'Helen Rose', 'finish_position': 1, 'win_amount': 5.0, 'place_amount': 2.8, 'show_amount': 2.1, 'dead_heat': False, 'jockey': 'Aby Medina', 'program_number': 2}, 
              {'runner_name': 'Lucky Stiff', 'finish_position': 2, 'win_amount': None, 'place_amount': 3.4, 'show_amount': 2.4, 'dead_heat': False, 'jockey': 'Nicky Figueroa', 'program_number': 5}, 
              {'runner_name': 'Indian Myth', 'finish_position': 3, 'win_amount': None, 'place_amount': None, 'show_amount': 2.6, 'dead_heat': False, 'jockey': 'Gerardo Corrales', 'program_number': 1}], 
           'dividends':
              [{'bet_type': 'EXA', 'finishers': '2-5', 'base_amount': 200, 'amount': 16.6}, 
              {'bet_type': 'TRI', 'finishers': '2-5-1', 'base_amount': 100, 'amount': 18.0},
              {'bet_type': 'SFC', 'finishers': '2-5-1-6', 'base_amount': 100, 'amount': 90.7}]}}

Ideally, each key within the nested JSON would become its own column holding the corresponding value. After turning the dictionary into a DataFrame, the four JSON columns I need to normalize are 'post_time', 'odds_updated', 'runners' and 'results', the last two being the ones I am having trouble with.

I was able to normalize the 'runners' column with the following code, but in the process I lose all the other columns along with the ids, so I cannot concat afterwards either:

import json
import pandas as pd

df = pd.DataFrame.from_dict(my_dict)
json_df = json.loads(df.to_json(orient="records"))
flat_df = pd.json_normalize(json_df, record_path=['runners'])

When trying this, but with record_path set to 'results', I get the following error:

TypeError: {'event_no': 1, 'post_time': {'$date': 1501780860000}, 'conditions': 'FOR MAIDENS, FILLIES TWO YEARS OLD. Weight, 118 lbs. Claiming Price $35,000, For Each $5,000 To $25,000 1 lb.', 'odds_updated': {'$date': 1501780944000}, 'status': 'Unknown', 'distance': '4 1/2F', 'purse': 2600000, 'runners': [{'number': 1, 'scratched': False, 'name': 'Indian Myth', 'morning_odds': 4, 'current_odds': 3, 'win_pool': 11293, 'place_pool': 3560, 'show_pool': 1087}, {'number': 4, 'scratched': False, 'name': 'Awesome Omi', 'morning_odds': '7/2', 'current_odds': 10, 'win_pool': 4297, 'place_pool': 1752, 'show_pool': 658}, {'number': 3, 'scratched': False, 'name': 'La Reyna Bella', 'morning_odds': 12, 'current_odds': '7/2', 'win_pool': 9897, 'place_pool': 4047, 'show_pool': 874}, {'number': 6, 'scratched': False, 'name': 'Spirited Tale', 'morning_odds': 6, 'current_odds': 35, 'win_pool': 1347, 'place_pool': 563, 'show_pool': 431}, {'number': 5, 'scratched': False, 'name': 'Lucky Stiff', 'morning_odds': 3, 'current_odds': '5/2', 'win_pool': 12611, 'place_pool': 4506, 'show_pool': 1190}, {'number': 2, 'scratched': False, 'name': 'Helen Rose', 'morning_odds': '5/2', 'current_odds': '3/2', 'win_pool': 19236, 'place_pool': 6709, 'show_pool': 2481}], 'results': {'finisher': [{'runner_name': 'Helen Rose', 'finish_position': 1, 'win_amount': 5.0, 'place_amount': 2.8, 'show_amount': 2.1, 'dead_heat': False, 'jockey': 'Aby Medina', 'program_number': 2}, {'runner_name': 'Lucky Stiff', 'finish_position': 2, 'win_amount': None, 'place_amount': 3.4, 'show_amount': 2.4, 'dead_heat': False, 'jockey': 'Nicky Figueroa', 'program_number': 5}, {'runner_name': 'Indian Myth', 'finish_position': 3, 'win_amount': None, 'place_amount': None, 'show_amount': 2.6, 'dead_heat': False, 'jockey': 'Gerardo Corrales', 'program_number': 1}], 'dividends': [{'bet_type': 'EXA', 'finishers': '2-5', 'base_amount': 200, 'amount': 16.6}, {'bet_type': 'TRI', 'finishers': '2-5-1', 'base_amount': 100, 'amount': 18.0}, 
{'bet_type': 'SFC', 'finishers': '2-5-1-6', 'base_amount': 100, 'amount': 90.7}]}} has non list value {'finisher': [{'runner_name': 'Helen Rose', 'finish_position': 1, 'win_amount': 5.0, 'place_amount': 2.8, 'show_amount': 2.1, 'dead_heat': False, 'jockey': 'Aby Medina', 'program_number': 2}, {'runner_name': 'Lucky Stiff', 'finish_position': 2, 'win_amount': None, 'place_amount': 3.4, 'show_amount': 2.4, 'dead_heat': False, 'jockey': 'Nicky Figueroa', 'program_number': 5}, {'runner_name': 'Indian Myth', 'finish_position': 3, 'win_amount': None, 'place_amount': None, 'show_amount': 2.6, 'dead_heat': False, 'jockey': 'Gerardo Corrales', 'program_number': 1}], 'dividends': [{'bet_type': 'EXA', 'finishers': '2-5', 'base_amount': 200, 'amount': 16.6}, {'bet_type': 'TRI', 'finishers': '2-5-1', 'base_amount': 100, 'amount': 18.0}, {'bet_type': 'SFC', 'finishers': '2-5-1-6', 'base_amount': 100, 'amount': 90.7}]} for path results. Must be list or null.

Any suggestions? Thanks in advance.

You can use the meta argument of pd.json_normalize to add additional columns to your data frame:

runners = pd.json_normalize(my_dict, "runners", meta="event_no")
finishers = pd.json_normalize(my_dict, ["results", "finisher"], meta="event_no")

Now you can join the two data frames on (runners.event_no, runners.number) == (finishers.event_no, finishers.program_number).
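
As a rough sketch of that join, something like the following should work. The dictionary here is a trimmed-down copy of the one in the question, and the names merged and event, the choice of a left join, and the handling of 'post_time' are my own assumptions rather than anything the question requires. Note that record_path has to point all the way down to a list (["results", "finisher"]); pointing it at "results" alone hits a dict, which is what the TypeError is complaining about.

```python
import pandas as pd

# Trimmed copy of the dictionary from the question (fewer runners and keys).
my_dict = {
    "event_no": 1,
    "post_time": {"$date": 1501780860000},
    "status": "Unknown",
    "runners": [
        {"number": 1, "name": "Indian Myth", "current_odds": 3},
        {"number": 2, "name": "Helen Rose", "current_odds": "3/2"},
        {"number": 4, "name": "Awesome Omi", "current_odds": 10},
    ],
    "results": {
        "finisher": [
            {"runner_name": "Helen Rose", "finish_position": 1, "program_number": 2},
            {"runner_name": "Indian Myth", "finish_position": 3, "program_number": 1},
        ],
    },
}

# One row per runner / per finisher, each carrying event_no as a meta column.
runners = pd.json_normalize(my_dict, "runners", meta="event_no")
finishers = pd.json_normalize(my_dict, ["results", "finisher"], meta="event_no")

# Left join (an assumption): runners without a finisher record are kept,
# with NaN in the finisher columns.
merged = runners.merge(
    finishers,
    how="left",
    left_on=["event_no", "number"],
    right_on=["event_no", "program_number"],
)

# Event-level fields: without record_path, nested dicts are flattened into
# dotted column names such as "post_time.$date"; the list columns are dropped.
event = pd.json_normalize(my_dict).drop(columns=["runners", "results.finisher"])
# Assumes the "$date" values are epoch milliseconds.
event["post_time"] = pd.to_datetime(event.pop("post_time.$date"), unit="ms")

print(merged)
print(event)
```

If you also need the event-level columns on every runner row, merging event onto merged on event_no is one option; alternatively you can list more keys in meta.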


 