How to count categorical time series data in pandas

Question

This week I decided to dive a bit into pandas. I have a pandas DataFrame with historical IRC logs that looks like this:

timestamp           action   nick        message
2005-11-04 01:44:33 False    hack-cclub  lex, hey!
2005-11-04 01:44:43 False    hack-cclub  lol, yea thats broke
2005-11-04 01:44:56 False    lex         Slashdot - Updated 2005-11-04 00:23:00 | Micro...
2005-11-04 01:44:56 False    hack-cclub  lex slashdot
2005-11-04 01:45:12 False    lex         port 666 is doom - doom Id Software (or mdqs o..
2005-11-04 01:45:12 False    hack-cclub  lex, port 666
2005-11-04 01:45:21 False    hitokiri    lex, port 23485
2005-11-04 01:45:45 False    hitokiri    lex, port 1024
2005-11-04 01:45:46 True     hack-cclub  slaps lex around with a wet fish

There are roughly 5.5M rows, and I'm trying to make some basic visualizations, such as rank over time for the top 25 nicks. I know I can get the top 25 nicks like this:

df['nick'].value_counts()[:25]

What I want is a rolling count like this:

hack-cclub lex hitokiri
1          0   0
2          0   0
2          1   0
3          1   0
3          2   0
4          2   0
4          2   1
4          2   2
5          2   2

That way I can plot an area graph of cumulative messages since the start of the log for the top 25 nicks. I know I can do this by just iterating over the entire DataFrame and keeping a count, but since the whole point of doing this is to learn to use pandas, I was hoping there would be a more idiomatic way to do it. It would also be nice to have the same data with ranks rather than running counts, like this:

hack-cclub lex hitokiri
1          2   2
1          2   2
1          2   3
1          2   3
1          2   3
1          2   3
1          2   3
1          2   2
1          2   2

Answer 1

IIUC you need crosstab and cumsum:

import pandas as pd
print(df[['timestamp', 'nick']])
             timestamp        nick
0  2005-11-04 01:44:33  hack-cclub
1  2005-11-04 01:44:43  hack-cclub
2  2005-11-04 01:44:56         lex
3  2005-11-04 01:44:56  hack-cclub
4  2005-11-04 01:45:12         lex
5  2005-11-04 01:45:12  hack-cclub
6  2005-11-04 01:45:21    hitokiri
7  2005-11-04 01:45:45    hitokiri
8  2005-11-04 01:45:46  hack-cclub

# Count messages per nick for each distinct timestamp
df = pd.crosstab(df.timestamp, df.nick)
print(df)
nick                 hack-cclub  hitokiri  lex
timestamp                                     
2005-11-04 01:44:33           1         0    0
2005-11-04 01:44:43           1         0    0
2005-11-04 01:44:56           1         0    1
2005-11-04 01:45:12           1         0    1
2005-11-04 01:45:21           0         1    0
2005-11-04 01:45:45           0         1    0
2005-11-04 01:45:46           1         0    0
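
Note that crosstab groups by timestamp, so messages sent in the same second share one row (the 9 sample messages become 7 rows above). If you want one row per message, as in the question's desired output, something along these lines might work. It is only a sketch on the sample data: it swaps crosstab for pd.get_dummies, and the names nicks, indicators and per_message_counts are made up for illustration.

```python
import pandas as pd

# Nicks in message order, copied from the sample log in the question.
nicks = pd.Series([
    "hack-cclub", "hack-cclub", "lex", "hack-cclub", "lex",
    "hack-cclub", "hitokiri", "hitokiri", "hack-cclub",
], name="nick")

# One 0/1 indicator column per nick, one row per message
# (nothing is grouped by timestamp here).
indicators = pd.get_dummies(nicks, dtype=int)

# Running message count per nick, in message order.
per_message_counts = indicators.cumsum()
print(per_message_counts)
```

On the full 5.5M-row log this builds one column per distinct nick, so it is probably worth filtering down to the top nicks first.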

# Running total of messages per nick over time
df = df.cumsum()
print(df)
nick                 hack-cclub  hitokiri  lex
timestamp                                     
2005-11-04 01:44:33           1         0    0
2005-11-04 01:44:43           2         0    0
2005-11-04 01:44:56           3         0    1
2005-11-04 01:45:12           4         0    2
2005-11-04 01:45:21           4         1    2
2005-11-04 01:45:45           4         2    2
2005-11-04 01:45:46           5         2    2
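
The steps above stop at running counts. For the other two things the question asks about (keeping only the top 25 nicks, and ranks instead of counts), one possible approach is to filter on value_counts before the crosstab and then call rank across each row. This is a sketch on the sample data, not a tested solution for the full log; the names log, top_n, top_nicks, top_only, running and ranks are my own, and method="min" is an assumption chosen because it reproduces the tie handling in the question's rank table.

```python
import pandas as pd

# Sample rows copied from the question's log excerpt.
log = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2005-11-04 01:44:33", "2005-11-04 01:44:43", "2005-11-04 01:44:56",
        "2005-11-04 01:44:56", "2005-11-04 01:45:12", "2005-11-04 01:45:12",
        "2005-11-04 01:45:21", "2005-11-04 01:45:45", "2005-11-04 01:45:46",
    ]),
    "nick": [
        "hack-cclub", "hack-cclub", "lex", "hack-cclub", "lex",
        "hack-cclub", "hitokiri", "hitokiri", "hack-cclub",
    ],
})

# Keep only the most active nicks (25 in the question; the sample has 3).
top_n = 25
top_nicks = log["nick"].value_counts().index[:top_n]
top_only = log[log["nick"].isin(top_nicks)]

# Same crosstab + cumsum as above, restricted to the top nicks.
running = pd.crosstab(top_only["timestamp"], top_only["nick"]).cumsum()

# Rank within each row: 1 = most messages so far, ties share the lowest rank.
ranks = running.rank(axis=1, method="min", ascending=False).astype(int)
print(ranks)
```

With matplotlib installed, running.plot.area() should then give the stacked area graph of message counts over time.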
