
Pandas groupby object unique count performance

I have a large dataset of transaction data which looks like:

| cust_no | acct_no | trans_id | product_id | ..... |

I tried several ways to count how many unique accounts each customer has, how many unique products each customer buys, and so on.

  • Method 1.a

transaction_df[['cust_no','acct_no']].groupby('cust_no')['acct_no'].nunique()

which runs in 91.5 ms on average

  • Method 1.b

transaction_df.groupby('cust_no')['acct_no'].nunique()

which runs in 85.5 ms on average

  • Method 2.a

transaction_df[['cust_no','acct_no']].groupby(['cust_no','acct_no']).size().groupby('cust_no').size()

which runs in 61.5 ms

  • Method 2.b

transaction_df.groupby(['cust_no','acct_no']).size().groupby('cust_no').size()

which runs in 55.3 ms
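Before comparing timings, it is worth confirming that the two families of methods return the same counts. The following is a minimal sketch on synthetic data; the row count, value ranges, and random seed are assumptions made for illustration, not properties of the real transaction_df.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for transaction_df; sizes and value ranges are made up.
rng = np.random.default_rng(0)
n = 10_000
transaction_df = pd.DataFrame({
    "cust_no": rng.integers(0, 500, size=n),
    "acct_no": rng.integers(0, 50, size=n),
    "trans_id": np.arange(n),
    "product_id": rng.integers(0, 20, size=n),
})

# Method 1: ask groupby for the distinct count directly.
m1 = transaction_df.groupby("cust_no")["acct_no"].nunique()

# Method 2: one row per observed (cust_no, acct_no) pair,
# then count how many pairs each customer has.
m2 = (
    transaction_df.groupby(["cust_no", "acct_no"]).size()
    .groupby("cust_no").size()
)

# Both results are indexed by cust_no, so the values can be compared directly.
same = bool((m1.to_numpy() == m2.to_numpy()).all())
print("same counts:", same)
```

On data without missing values the two results should agree, so the timing comparison is between two ways of computing the same thing.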

I have two questions:

  1. Why does the sliced DataFrame run slower, i.e. why is transaction_df[['cust_no','acct_no']] slower than just transaction_df?

  2. Why is the .nunique() method much slower than simply stacking two groupby calls?

1) Slicing requires a memory allocation and/or a copy of the object, depending on the operation. Here you are creating a new DataFrame before starting your operations.
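A small sketch of this point, using a made-up five-row frame (the values are illustrative only): selecting columns with a list produces a separate DataFrame object, while selecting the column on the groupby object gives the same result without building that intermediate frame.

```python
import pandas as pd

# Tiny illustrative frame; the values are invented for this example.
df = pd.DataFrame({
    "cust_no": [1, 1, 2, 2, 2],
    "acct_no": [10, 11, 20, 20, 21],
    "trans_id": [100, 101, 102, 103, 104],
})

# Column selection with a list builds a new DataFrame first.
sliced = df[["cust_no", "acct_no"]]
print(sliced is df)  # False: a different object was created

# Selecting the column on the groupby object skips the intermediate frame.
via_slice = sliced.groupby("cust_no")["acct_no"].nunique()
direct = df.groupby("cust_no")["acct_no"].nunique()
print(via_slice.equals(direct))  # True: same counts either way
```

Whether the new frame actually copies the underlying data can depend on the pandas version and its copy settings, so the extra cost is best confirmed by timing on your own data.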

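2) nunique has to either implement its own deduplication logic or directly call something like a set, which runs in O(N) time and has to hash or sort the values within each group. size only counts the rows in each group, which is a much cheaper operation, although it is not literally O(1): the grouping itself still has to pass over every row.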
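Because the relative cost depends on the pandas version and on the shape of the data, measuring is more reliable than reasoning about complexity alone. Below is a rough timing sketch using the standard-library timeit module; the synthetic data, the helper names with_nunique and with_two_sizes, and the repeat counts are all assumptions for illustration.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data; the sizes and cardinalities are arbitrary assumptions.
rng = np.random.default_rng(1)
n = 20_000
df = pd.DataFrame({
    "cust_no": rng.integers(0, 1_000, size=n),
    "acct_no": rng.integers(0, 100, size=n),
})


def with_nunique():
    """Distinct accounts per customer via nunique."""
    return df.groupby("cust_no")["acct_no"].nunique()


def with_two_sizes():
    """Distinct accounts per customer via two stacked groupby/size calls."""
    return df.groupby(["cust_no", "acct_no"]).size().groupby("cust_no").size()


# Take the best of 3 runs of 5 calls each and report the time per call.
timings = {
    f.__name__: min(timeit.repeat(f, number=5, repeat=3)) / 5
    for f in (with_nunique, with_two_sizes)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```

The ordering of the two timings may differ from the figures in the question, since the implementation of nunique has changed across pandas releases.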

Knowing the structure of your dataset in advance can help you choose the faster function, as you are experimenting with here. See https://en.wikipedia.org/wiki/Time_complexity if you're interested.
