简体   繁体   English

如何计算熊猫中的分类时间序列数据

[英]How to count categorical timeseries data in pandas

This week I decided to dive a bit into pandas. 本周我决定潜入大熊猫。 I have a pandas DataFrame with historical IRC logs that looks like this: 我有一个带有历史IRC日志的pandas DataFrame,如下所示:

timestamp           action   nick        message
2005-11-04 01:44:33 False    hack-cclub  lex, hey!
2005-11-04 01:44:43 False    hack-cclub  lol, yea thats broke
2005-11-04 01:44:56 False    lex         Slashdot - Updated 2005-11-04 00:23:00 | Micro...
2005-11-04 01:44:56 False    hack-cclub  lex slashdot
2005-11-04 01:45:12 False    lex         port 666 is doom - doom Id Software (or mdqs o..
2005-11-04 01:45:12 False    hack-cclub  lex, port 666
2005-11-04 01:45:21 False    hitokiri    lex, port 23485
2005-11-04 01:45:45 False    hitokiri    lex, port 1024
2005-11-04 01:45:46 True     hack-cclub  slaps lex around with a wet fish

There are roughly 5.5M rows and I'm trying to make some basic visualizations like rank over time for the top 25 nicks and that sort of thing. 有大约5.5M的行,我正在尝试制作一些基本的可视化,如排名前25位的尼克斯等等。 I know I can get the top 25 nicks like this: 我知道我可以得到这样的前25个缺口:

df['nick'].value_counts()[:25]

What I want is a rolling count like this: 我想要的是滚动计数如下:

hack-cclub lex hitokiri
1          0   0
2          0   0
2          1   0
3          1   0
3          2   0
4          2   0
4          2   1
4          2   2
5          2   2

So that I can plot an area graph of messages from the beginning of time for the top 25 nicks. 因此,我可以从前25个刻痕开始绘制消息的区域图。 I know I can do this by just iterating over the entire dataframe and keeping a count but since the whole point of doing this is to learn to use pandas I was hoping there would be a more idiomatic way to do it. 我知道我可以通过迭代整个数据框并保持计数来做到这一点但是因为这样做的全部意义是学习使用pandas我希望有更多的惯用方法来做到这一点。 It would also be nice to have the same data but with ranks rather than running counts like this: 拥有相同的数据但使用排名而不是像这样运行计数也是很好的:

hack-cclub lex hitokiri
1          2   2
1          2   2
1          2   3
1          2   3
1          2   3
1          2   3
1          2   3
1          2   2
1          2   2

IIUC you need crosstab and cumsum : IIUC你需要crosstabcumsum

print df[['timestamp', 'nick']]
             timestamp        nick
0  2005-11-04 01:44:33  hack-cclub
1  2005-11-04 01:44:43  hack-cclub
2  2005-11-04 01:44:56         lex
3  2005-11-04 01:44:56  hack-cclub
4  2005-11-04 01:45:12         lex
5  2005-11-04 01:45:12  hack-cclub
6  2005-11-04 01:45:21    hitokiri
7  2005-11-04 01:45:45    hitokiri
8  2005-11-04 01:45:46  hack-cclub

df = pd.crosstab(df.timestamp, df.nick)
print df
nick                 hack-cclub  hitokiri  lex
timestamp                                     
2005-11-04 01:44:33           1         0    0
2005-11-04 01:44:43           1         0    0
2005-11-04 01:44:56           1         0    1
2005-11-04 01:45:12           1         0    1
2005-11-04 01:45:21           0         1    0
2005-11-04 01:45:45           0         1    0
2005-11-04 01:45:46           1         0    0

df = df.cumsum()
print df
nick                 hack-cclub  hitokiri  lex
timestamp                                     
2005-11-04 01:44:33           1         0    0
2005-11-04 01:44:43           2         0    0
2005-11-04 01:44:56           3         0    1
2005-11-04 01:45:12           4         0    2
2005-11-04 01:45:21           4         1    2
2005-11-04 01:45:45           4         2    2
2005-11-04 01:45:46           5         2    2

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM