Count occurrences of a string/category in a pandas groupby aggregation

Question

I have data in a tabular format with an id on each row. One column holds a flag that has one or more categorical values, e.g. condition_one, condition_two.

I'm generating summary statistics using the code below:


function_count_certain_condition = lambda x: x.str.count("condition_two").sum()

function_count_certain_condition.__name__ = 'number_of_two_conditions'

# Aggregations to apply to each column
aggregations = {
    'column_one': ['count', 'first', 'last', 'nunique'],
    'conditions_column': [function_count_certain_condition]
}

df_aggregate_stats = df.groupby(['id_column']).agg(aggregations)

This works, but it doesn't seem particularly pythonic or performant. I tried using value_counts() but got a KeyError.

Answer 1

particularly pythonic

Indeed, it is not: you store a lambda in a variable (which defeats the whole point of a lambda, namely that it is anonymous) and then assign a name to it by hand. Just use def to define the function, that is, replace

function_count_certain_condition = lambda x: x.str.count("condition_two").sum()

function_count_certain_condition.__name__ = 'number_of_two_conditions'

with

def number_of_two_conditions(x):
    return x.str.count("condition_two").sum()
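
As a minimal sketch of how the def version slots into the question's aggregation dictionary: pandas takes the result column label from the function's own name, so the manual `__name__` assignment is no longer needed. The sample DataFrame is made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "id_column": [1, 1, 2, 2, 2],
    "column_one": ["a", "b", "c", "c", "d"],
    "conditions_column": [
        "condition_one",
        "condition_one,condition_two",
        "condition_two",
        "condition_one",
        "condition_two,condition_two",
    ],
})


def number_of_two_conditions(x):
    # Count "condition_two" in every string of the group, then total the counts.
    return x.str.count("condition_two").sum()


aggregations = {
    "column_one": ["count", "first", "last", "nunique"],
    "conditions_column": [number_of_two_conditions],
}

df_aggregate_stats = df.groupby(["id_column"]).agg(aggregations)

# The function's name becomes the second level of the column label.
print(df_aggregate_stats[("conditions_column", "number_of_two_conditions")])
```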

performant

First, be warned against premature optimization. If the code runs fast enough for your purposes, do not try to force it to be faster. As for this particular function, I do not see anything that would cause excessive execution time, since both substring counting and addition are generally fast operations.
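
If measurement does show that this step matters, one option (a sketch, not something the question or this answer tested) is to count the substring once over the whole column and let groupby sum the result using the built-in "sum" aggregation, so that no Python function is called per group. The sample data and the helper column name n_condition_two are assumptions for illustration. Whether this is actually faster depends on the data, so time both versions first.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "id_column": [1, 1, 2, 2, 2],
    "column_one": ["a", "b", "c", "c", "d"],
    "conditions_column": [
        "condition_one",
        "condition_one,condition_two",
        "condition_two",
        "condition_one",
        "condition_two,condition_two",
    ],
})

# Count the substring once for the whole column, outside the groupby.
# "n_condition_two" is a made-up helper column name.
df = df.assign(
    n_condition_two=df["conditions_column"].str.count("condition_two")
)

# Named aggregation: each keyword is an output column,
# each value is an (input column, aggregation) pair.
df_aggregate_stats = df.groupby("id_column").agg(
    column_one_count=("column_one", "count"),
    column_one_first=("column_one", "first"),
    column_one_last=("column_one", "last"),
    column_one_nunique=("column_one", "nunique"),
    number_of_two_conditions=("n_condition_two", "sum"),
)

print(df_aggregate_stats)
```

Named aggregation also yields flat column names instead of the two-level columns produced by the dictionary-of-lists form.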
