
Unable to use .size() .div() methods inside pandas groupby with lambda function

I'm using the following lines of code to compute conditional probabilities:

    variable = 'variable_name'
    probs = df.groupby(variable).size().div(len(df))
    cond_probs = df.groupby([variable, 'has_income']).size().div(len(df)).div(probs, axis=0, level=variable)

This results in the following output:

    variable_name         has_income
    (0.999, 2.0]          False          0.756323
                          True           0.243677
    (2.0, 3.0]            False          0.798372
                          True           0.201628
    (3.0, 16.0]           False          0.809635
                          True           0.190365

I would like to add a column to the output with the sample size of each group, but I can't rewrite the formula inside the lambda function, because the group object passed to the lambda doesn't have the same methods as the object returned by df.groupby(). Example:

    cond_probs =df.groupby([variable, 'has_income']).apply(lambda x: 
    pd.Series({
        'probs': x.size().div(len(df)).div(probs, axis=0, level=variable),
        'size': x.size()
    }))

Error: TypeError: 'numpy.int32' object is not callable

Is there an alternative that achieves these results more elegantly, without computing two groupbys and joining the data frames at the end?

When you use apply with groupby, you don't get a group object but a slice of the DataFrame that corresponds to the relevant group. So x in your case is a DataFrame, not a GroupBy object; treat it the same way you'd treat df.

    cond_probs = df.groupby([variable, 'has_income']).apply(lambda x:
        pd.Series({
            'probs': (len(x) / len(df)) / probs[x.iloc[0][variable]],
            'size': len(x)
        })
    )
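If the goal is only to avoid apply and a join, one possible alternative (not part of the original answer) is to compute the group sizes once and normalise them within each level of the index. The sketch below assumes a small made-up DataFrame that reuses the question's column names; `sizes`, `totals` and the toy values are illustrative only.

```python
import pandas as pd

# Toy data; column names mirror the question, the values are made up.
df = pd.DataFrame({
    "variable_name": ["a", "a", "a", "b", "b", "b", "b", "b"],
    "has_income": [True, False, False, True, True, False, False, False],
})
variable = "variable_name"

# Number of rows for each (variable, has_income) pair.
sizes = df.groupby([variable, "has_income"]).size()

# Number of rows for each value of `variable`, aligned to every pair.
totals = sizes.groupby(level=variable).transform("sum")

# Conditional probability and sample size side by side.
cond_probs = pd.DataFrame({"probs": sizes / totals, "size": sizes})
print(cond_probs)
```

Here the second groupby runs on the small Series of counts rather than on df, and both columns share the same index, so no join is needed.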

NB: if you use .size on a DataFrame, it returns the total number of cells (rows times columns), so it is not the same as GroupBy.size (see the docs). It is also an integer attribute rather than a method, which is why calling x.size() raises the TypeError above.
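The difference can be checked on a tiny example. This is a minimal sketch with made-up data; the variable names (`n_cells`, `group_sizes`, `error_name`) are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "variable_name": ["a", "a", "b"],
    "has_income": [True, False, True],
})

# DataFrame.size is an integer attribute: rows * columns.
n_cells = df.size   # 3 rows * 2 columns
n_rows = len(df)    # number of rows only

# GroupBy.size() is a method that returns the number of rows per group.
group_sizes = df.groupby("variable_name").size()

# Calling the attribute as if it were a method fails with a TypeError,
# as in the question.
try:
    df.size()
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

print(n_cells, n_rows, group_sizes.to_dict(), error_name)
```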

