
Applying aggregate function after using groupby and agg

I am trying to aggregate my dataset multiple times, and I can't figure out the right way to do so with pandas. Given a dataset like so:

donations = [
  {
    "amount": 100,
    "organization": {
      "name": "Org 1",
      "total_budget": 8000,
      "states": [
        {
          "name": "Maine",
          "code": "ME"
        },
        {
          "name": "Massachusetts",
          "code": "MA"
        }
      ]
    }
  },
  {
    "amount": 5000,
    "organization": {
      "name": "Org 2",
      "total_budget": 10000,
      "states": [
        {
          "name": "Massachusetts",
          "code": "MA"
        }
      ]
    }
  },
  {
    "amount": 5000,
    "organization": {
      "name": "Org 1",
      "total_budget": 8000,
      "states": [
        {
          "name": "Maine",
          "code": "ME"
        },
        {
          "name": "Massachusetts",
          "code": "MA"
        }
      ]
    }
  }
]

My desired output is a single aggregation by state of the total_budget and amount columns. I have gotten pretty close with the following:

import pandas as pd
# json_normalize already returns a DataFrame, with one row per (donation, state) pair
df = pd.json_normalize(donations, record_path=['organization', 'states'], meta=['amount', ['organization', 'total_budget'], ['organization', 'name']], record_prefix='states.')
grouped_df = df.groupby(['states.code', 'states.name', 'organization.name', 'organization.total_budget']).sum()

Though what this gives me is a breakdown by state, with the organization names still included:

states.code states.name   organization.name organization.total_budget    amount
MA          Massachusetts Org 1             8000                         5100
                          Org 2             10000                        5000
ME          Maine         Org 1             8000                         5100

I know that I need to keep this initial aggregation as it is to produce correct results, but I am not sure what final step would then group these results by state to give my expected output:

MA          Massachusetts     18000              10100
ME          Maine             8000               5100

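I don't know whether this carries over to your actual data, so treat it as limited to the sample data. The approach splits the data frame in two, one per value you want to aggregate. The amount should be summed over every donation, but the total_budget repeats on every donation row of the same organization, so the duplicate rows are removed from the budget frame before summing. Each frame is then grouped by state and summed, and the two results are merged.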

# amount is summed over every donation row
df_a = df[['states.code', 'states.name', 'organization.name', 'amount']]
# total_budget must count once per organization and state, so drop repeated rows
# (drop_duplicates() on a fresh selection avoids an inplace edit of a slice)
df_o = df[['states.code', 'states.name', 'organization.name', 'organization.total_budget']].drop_duplicates()
amount_by_state = df_a.groupby(['states.code', 'states.name'])['amount'].sum().reset_index()
budget_by_state = df_o.groupby(['states.code', 'states.name'])['organization.total_budget'].sum().reset_index()
budget_by_state.merge(amount_by_state, on=['states.code', 'states.name'], how='inner')

  states.code    states.name  organization.total_budget  amount
0          MA  Massachusetts                      18000   10100
1          ME          Maine                       8000    5100
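
An alternative worth considering is to chain two groupby steps instead of splitting and merging. The sketch below is not from the original answer. It assumes that total_budget is constant per organization, so taking the first value within each state and organization is safe. The data frame is built by hand to mirror the flattened rows from the question, with integer columns; the meta columns that json_normalize produces may come back as object dtype, in which case an astype(int) beforehand may be needed. The names state_keys, per_org and per_state are made up for illustration.

```python
import pandas as pd

# Hand-built equivalent of the flattened frame in the question:
# one row per (donation, state) pair.
df = pd.DataFrame({
    "states.code": ["ME", "MA", "MA", "ME", "MA"],
    "states.name": ["Maine", "Massachusetts", "Massachusetts", "Maine", "Massachusetts"],
    "organization.name": ["Org 1", "Org 1", "Org 2", "Org 1", "Org 1"],
    "organization.total_budget": [8000, 8000, 10000, 8000, 8000],
    "amount": [100, 100, 5000, 5000, 5000],
})

state_keys = ["states.code", "states.name"]

# Stage 1: one row per state and organization. Donations are summed,
# while the budget is taken once (assumption: it is the same on every
# row of a given organization).
per_org = (
    df.groupby(state_keys + ["organization.name"], as_index=False)
      .agg(total_budget=("organization.total_budget", "first"),
           amount=("amount", "sum"))
)

# Stage 2: collapse the organizations, summing both columns per state.
per_state = per_org.groupby(state_keys, as_index=False)[["total_budget", "amount"]].sum()
print(per_state)
```

On the sample data this gives 18000 and 10100 for MA and 8000 and 5100 for ME, matching the expected output in the question. If an organization could carry different budgets on different rows, "first" would silently pick one of them, so the drop_duplicates approach above may be the safer choice in that case.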

