
Averaging over different levels in pandas

I have a large dataset of music tagging data in a MySQL database that I am attempting to analyze with pandas. I exported it to .tsv from MySQL and am now reading it in as a dataframe for analysis.

Each row in the data is a tuple indicating that a given user (indicated by numeric user ID) tagged a particular artist with a particular tag (represented here as a numeric ID) at a particular time. So with no indexes a sample of the data would look like this:

       uid  artist   tag        date
0  2096963     559    46  2005-07-01
1  2096963     584  1053  2005-07-01
2  2096963     584  2044  2005-07-01
3  2096963     584  2713  2005-07-01
4  2096963     596   236  2005-07-01
...
       uid  artist   tag        date
99995  2656262    8095    57  2005-08-01
99996  2656262    8095    79  2005-08-01
99997  2656262    8095  4049  2005-08-01
99998  2656262    8095  8290  2005-08-01
99999  2610168    8095  1054  2005-08-01

To facilitate analyses, I've indexed everything and added a dummy annotations variable (each row in the data represents one tagging instance, or annotation). So now we have:

data = pd.read_table(filename, header=None, names=('uid','artist','tag','date'), index_col=['date','uid','artist','tag'], parse_dates=['date'])
data['annotations'] = 1

In [41]: data.head()
Out[41]:
                                annotations
date       uid     artist tag
2005-07-01 2096963 559    46              1
                   584    1053            1
                          2044            1
                          2713            1
                   596    236             1
...
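
For anyone following along without the original .tsv, a minimal synthetic frame with hypothetical values copied from the sample rows above reproduces the same structure:

import pandas as pd

# Hypothetical stand-in for the .tsv export, built from the sample rows shown above.
raw = pd.DataFrame({
    'uid':    [2096963, 2096963, 2096963, 2096963, 2096963],
    'artist': [559, 584, 584, 584, 596],
    'tag':    [46, 1053, 2044, 2713, 236],
    'date':   pd.to_datetime(['2005-07-01'] * 5),
})

data = raw.set_index(['date', 'uid', 'artist', 'tag']).sort_index()
data['annotations'] = 1   # dummy column: one annotation per row
print(data.head())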

With the data formatted like this, it's trivial to calculate simple frequency distributions. For instance, if I want to determine the number of times each user tagged something (in descending freq. order), it's as simple as:

data.sum(level='uid').sort_values('annotations', ascending=False)

Similarly, I can determine the total number of annotations each month (across all users and tags) with:

data.sum(level='date')

But I'm having trouble with more complex calculations. In particular, what if I want the mean number of annotations per user each month? If I call:

data.sum(level=['date','uid']).head()

I get the number of annotations per user each month, i.e.:

                    annotations
date       uid
2005-07-01 1040740           10
           1067454           23
           2096963          136
           2115894            1
           2163842            4
...

but what's a simple way to then get a monthly average of those values across users? That is, for each month, what is the average across users of the "annotations" column? I have various metrics like this I want to calculate, so I'm hoping the solution generalizes.

Big MultiIndexes can be a hassle. I suggest abandoning your dummy column, 'annotations', and using count instead of sum.

To start, read in the data without assigning an index, i.e.,

pd.read_table(filename, header=None, names=['uid','artist','tag','date'], parse_dates=['date'])

To count each user's annotations:

data.groupby('uid').size().sort_values(ascending=False)

To total annotations per day:

data.groupby('date').count()

To count unique users each day:

daily_users = data.groupby('date').uid.nunique()

To total annotations each day:

daily_annotations = data.groupby('date').size()

The average daily annotations per user is just the daily total annotations divided by the number of users that day. As a result of the groupby operation, both of these Series are indexed by date, so they will align automatically.

mean_daily_annotations_per_user = daily_annotations/daily_users

To average annotations per month across users, it is most convenient to use resample, a nice feature for grouping by different time frequencies.

mean_monthly_annotations_per_user = mean_daily_annotations_per_user.resample('M').mean()
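
For reference, the whole chain above can be collected into one self-contained sketch; filename is a placeholder for the .tsv export, the column names follow the question's layout, and size() simply counts rows per group, matching the counts used above:

import pandas as pd

filename = 'tags.tsv'   # placeholder path to the .tsv export

# Read the raw annotations without assigning an index.
data = pd.read_table(filename, header=None,
                     names=['uid', 'artist', 'tag', 'date'],
                     parse_dates=['date'])

daily_users = data.groupby('date').uid.nunique()   # distinct taggers per date
daily_annotations = data.groupby('date').size()    # rows (annotations) per date

# Both Series are indexed by date, so the division aligns automatically.
mean_daily_annotations_per_user = daily_annotations / daily_users

# Roll the daily figures up to calendar months.
mean_monthly_annotations_per_user = mean_daily_annotations_per_user.resample('M').mean()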

I figured out an alternative approach that fits my original multi-index format, and I think it is faster than the method proposed by @DanAllan.

Recalling that we're calculating the mean annotations per user per month, let's build two dataframes (I'm using just a subset of the data here, hence the nrows argument). data1 is the multi-index version with the dummy variable, and data2 is the unindexed version proposed by @DanAllan.

indexes=['date','uid','artist','iid','tag']
data1 = pd.read_table(filename, header=None, nrows=1000000, names=('uid','iid','artist','tag','date'), index_col=indexes, parse_dates=['date'])
data1['anno'] = 1
data2 = pd.read_table(filename, header=None, nrows=1000000, names=('uid','iid','artist','tag','date'), parse_dates=['date'])

With the unindexed (data2) version the process is:

daily_users = data2.groupby('date').uid.nunique()
daily_annotations = data2.groupby('date').count().uid
anno_per_user_perday2 = daily_annotations / daily_users.map(float)

With the multi-index version (data1), we can do:

anno_per_user_perday = data1.sum(level=['date','uid']).mean(level='date').anno

The result is exactly the same, but more than twice as fast with the indexed version (performance will be more of an issue with the full, 50 million row dataset):

%timeit -n100 daily_users = data2.groupby('date').uid.nunique() ; daily_annotations = data2.groupby('date').count().uid ; anno_per_user_perday2 = daily_annotations / daily_users.map(float)
100 loops, best of 3: 387 ms per loop

%timeit -n100 anno_per_user_perday1 = data1.sum(level=['date','uid']).mean(level='date').anno
100 loops, best of 3: 149 ms per loop

Generating the dataframe is slower with the indexed version, but the flexibility it affords seems worth it.
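
If a per-calendar-month figure is wanted from this indexed approach as well, rather than one value per distinct date in the file, the resample step from the answer above should carry over, since the 'date' level was parsed as datetimes; a one-line sketch:

anno_per_user_per_month = anno_per_user_perday.resample('M').mean()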
