I am trying to aggregate my dataset multiple times and I can't seem to figure out the right way to do so with pandas. Given a dataset like so:
donations = [
    {
        "amount": 100,
        "organization": {
            "name": "Org 1",
            "total_budget": 8000,
            "states": [
                {
                    "name": "Maine",
                    "code": "ME"
                },
                {
                    "name": "Massachusetts",
                    "code": "MA"
                }
            ]
        }
    },
    {
        "amount": 5000,
        "organization": {
            "name": "Org 2",
            "total_budget": 10000,
            "states": [
                {
                    "name": "Massachusetts",
                    "code": "MA"
                }
            ]
        }
    },
    {
        "amount": 5000,
        "organization": {
            "name": "Org 1",
            "total_budget": 8000,
            "states": [
                {
                    "name": "Maine",
                    "code": "ME"
                },
                {
                    "name": "Massachusetts",
                    "code": "MA"
                }
            ]
        }
    }
]
My desired output is a single aggregation by state of the total_budget and amount columns. I have gotten pretty close with the following:
import pandas as pd

# json_normalize already returns a DataFrame: one row per (donation, state) pair
df = pd.json_normalize(donations, record_path=['organization', 'states'], meta=['amount', ['organization', 'total_budget'], ['organization', 'name']], record_prefix='states.')
grouped_df = df.groupby(['states.code', 'states.name', 'organization.name', 'organization.total_budget']).sum()
Though what this gives me is a breakdown by state, with the organization names still included:
MA  Massachusetts  Org 1   8000  5100
                   Org 2  10000  5000
ME  Maine          Org 1   8000  5100
I know I need to keep my initial aggregation as it is in order to produce correct results, but I am not sure what the final step is to then group these results by state and get my expected output:
MA  Massachusetts  18000  10100
ME  Maine           8000   5100
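I don't know whether this applies to your actual data, since I only have the sample to go on. The approach splits the data frame in two, one per value you want to aggregate. The amounts are summed per state as they are. For the budgets, duplicate rows are removed first, so that an organization's total_budget is counted once per state rather than once per donation, and then they are summed per state. Finally, the two results are merged on the state columns.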
# Amounts: every donation row counts, so sum them per state directly.
df_a = df[['states.code', 'states.name', 'organization.name', 'amount']]
df_amount = df_a.groupby(['states.code', 'states.name'])['amount'].sum().reset_index()
# Budgets: keep one row per state and organization (avoids inplace changes on a slice).
df_o = df[['states.code', 'states.name', 'organization.name', 'organization.total_budget']].drop_duplicates()
df_budget = df_o.groupby(['states.code', 'states.name'])['organization.total_budget'].sum().reset_index()
# Combine the two per-state totals.
df_budget.merge(df_amount, on=['states.code', 'states.name'], how='inner')
   states.code    states.name  organization.total_budget  amount
0           MA  Massachusetts                      18000   10100
1           ME          Maine                       8000    5100
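For anyone who wants to run the whole thing end to end, here is a minimal self-contained sketch of the same idea, written as one script with `pd.concat` in place of the merge. It assumes that an organization's name identifies it uniquely and that its total_budget is the same on every donation, which holds for the sample data but may not hold for real data. The names `state_keys`, `amounts`, `budgets` and `result` are just illustrative.

```python
import pandas as pd

# Sample data from the question, written compactly.
donations = [
    {"amount": 100, "organization": {"name": "Org 1", "total_budget": 8000,
        "states": [{"name": "Maine", "code": "ME"},
                   {"name": "Massachusetts", "code": "MA"}]}},
    {"amount": 5000, "organization": {"name": "Org 2", "total_budget": 10000,
        "states": [{"name": "Massachusetts", "code": "MA"}]}},
    {"amount": 5000, "organization": {"name": "Org 1", "total_budget": 8000,
        "states": [{"name": "Maine", "code": "ME"},
                   {"name": "Massachusetts", "code": "MA"}]}},
]

# One row per (donation, state) pair.
df = pd.json_normalize(
    donations,
    record_path=["organization", "states"],
    meta=["amount", ["organization", "total_budget"], ["organization", "name"]],
    record_prefix="states.",
)
# Make sure the two value columns are numeric before summing.
df = df.astype({"amount": "int64", "organization.total_budget": "int64"})

state_keys = ["states.code", "states.name"]

# Every donation counts towards the amount of each state it touches.
amounts = df.groupby(state_keys)["amount"].sum()

# Assumption: one budget per organization name, so keep a single row per
# (state, organization) before summing the budgets.
budgets = (
    df.drop_duplicates(subset=state_keys + ["organization.name"])
      .groupby(state_keys)["organization.total_budget"]
      .sum()
)

# Both Series share the same state index, so they line up column-wise.
result = pd.concat([budgets, amounts], axis=1).reset_index()
print(result)
```

If two different organizations could share a name, it would probably be safer to deduplicate on a real organization identifier instead of `organization.name`.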