Pandas division of two columns with groupby

This is obviously simple, but as a pandas newbie I'm getting stuck.

I have a CSV file that contains 3 columns: State, bene_1_count, and bene_2_count.

I want to calculate the ratio of 'bene_1_count' to 'bene_2_count' in a given state.

import numpy as np
import pandas as pd

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
           'bene_1_count': [np.random.randint(10000, 99999)
                     for _ in range(12)],
           'bene_2_count': [np.random.randint(10000, 99999)
                     for _ in range(12)]})

I am trying the following, but it gives me an error: 'No objects to concatenate'

df['ratio'] = df.groupby(['state']).agg(df['bene_1_count']/df['bene_2_count'])

I am not able to figure out how to "reach up" to the state level of the groupby to take the ratio of columns.

I want the ratio of the columns with respect to each state, with output like the following:

    State       ratio

    CA  
    WA  
    CO  
    AZ  

Alternatively stated: you can create custom functions that accept a dataframe. The groupby will return sub-dataframes. You can then use the apply function to apply your custom function to each sub-dataframe.

import numpy as np
import pandas as pd

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
           'bene_1_count': [np.random.randint(10000, 99999)
                     for _ in range(12)],
           'bene_2_count': [np.random.randint(10000, 99999)
                     for _ in range(12)]})

def divide_two_cols(df_sub):
    # df_sub holds the rows for one state; return the ratio of the column totals
    return df_sub['bene_1_count'].sum() / df_sub['bene_2_count'].sum()

# One ratio per state, indexed by state
df.groupby('state').apply(divide_two_cols)
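As an alternative sketch (not part of the original answer), the same per-state ratio can be computed without a custom function by summing both columns per group and dividing the results. The seeded generator and the names `sums` and `ratio` are assumptions made for illustration; the final reset_index is only there to produce the two-column State/ratio layout the question asks for.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible (an assumption; the
# original post uses unseeded np.random.randint).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
    'bene_1_count': rng.integers(10000, 99999, size=12),
    'bene_2_count': rng.integers(10000, 99999, size=12),
})

# Sum both count columns within each state.
sums = df.groupby('state')[['bene_1_count', 'bene_2_count']].sum()

# Divide the two totals and turn the result into a state/ratio table.
ratio = (sums['bene_1_count'] / sums['bene_2_count']).rename('ratio').reset_index()
print(ratio)
```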

Now say you want each row to be divided by the sum of each group (e.g., the total sum of AZ) and also retain all the original columns. Just adjust the above function (change the calculation and return the whole sub-dataframe):

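def divide_two_cols(df_sub):
    # Work on a copy so the group passed in is not modified in place
    df_sub = df_sub.copy()
    # Each row's bene_1_count divided by the group's total bene_2_count
    df_sub['divs'] = df_sub['bene_1_count'] / df_sub['bene_2_count'].sum()
    return df_sub

df.groupby('state').apply(divide_two_cols)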
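A possible shortcut for the row-wise case, sketched here under the assumption that only the new column is needed: groupby(...).transform('sum') returns the group total aligned to every original row, so the division can be done directly on the original dataframe. The seeded generator and the name `group_totals` are illustrative and do not come from the answer.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible (an assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
    'bene_1_count': rng.integers(10000, 99999, size=12),
    'bene_2_count': rng.integers(10000, 99999, size=12),
})

# transform('sum') repeats each state's total on every row of that state,
# keeping the original row order and index.
group_totals = df.groupby('state')['bene_2_count'].transform('sum')

# Row value divided by its state's total; all original columns are kept.
df['divs'] = df['bene_1_count'] / group_totals
print(df)
```

Because the result keeps the original index, there is no need to reassemble sub-dataframes afterwards.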

I believe what you first need to do is sum the counts by state before finding the ratio. You can use apply to access the other columns in the df, and then store the results in a dictionary to map to the corresponding state in the original dataframe.

import pandas as pd
import numpy as np
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
            'bene_1_count': [np.random.randint(10000, 99999)
                      for _ in range(12)],
            'bene_2_count': [np.random.randint(10000, 99999)
                      for _ in range(12)]})

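# Ratio of the summed counts per state, collected as {state: ratio}
ratios = df.groupby('state').apply(lambda x: x['bene_1_count'].sum() /
                                   x['bene_2_count'].sum()).to_dict()

# Look up each row's state in the dictionary to fill the new column
df['ratio'] = df['state'].map(ratios)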
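The advice to sum first matters because the ratio of the summed counts is generally not the same as the average of the row-by-row ratios. A minimal sketch with made-up numbers for a single state (the values and the names `toy`, `ratio_of_sums`, and `mean_of_ratios` are purely illustrative):

```python
import pandas as pd

# Made-up counts for one state, chosen only to make the difference visible.
toy = pd.DataFrame({'state': ['CA', 'CA'],
                    'bene_1_count': [10, 90],
                    'bene_2_count': [10, 190]})

# Sum first, then divide: 100 / 200.
sums = toy.groupby('state')[['bene_1_count', 'bene_2_count']].sum()
ratio_of_sums = sums['bene_1_count'] / sums['bene_2_count']

# Divide row by row, then average within the state: (10/10 + 90/190) / 2.
row_ratios = toy['bene_1_count'] / toy['bene_2_count']
mean_of_ratios = row_ratios.groupby(toy['state']).mean()

print(ratio_of_sums['CA'])
print(mean_of_ratios['CA'])
```

Which of the two is wanted depends on the question being asked; the answer above computes the first one.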
