How to count categorical timeseries data in pandas
This week I decided to dive a bit into pandas. I have a pandas DataFrame with historical IRC logs that looks like this:
timestamp action nick message
2005-11-04 01:44:33 False hack-cclub lex, hey!
2005-11-04 01:44:43 False hack-cclub lol, yea thats broke
2005-11-04 01:44:56 False lex Slashdot - Updated 2005-11-04 00:23:00 | Micro...
2005-11-04 01:44:56 False hack-cclub lex slashdot
2005-11-04 01:45:12 False lex port 666 is doom - doom Id Software (or mdqs o..
2005-11-04 01:45:12 False hack-cclub lex, port 666
2005-11-04 01:45:21 False hitokiri lex, port 23485
2005-11-04 01:45:45 False hitokiri lex, port 1024
2005-11-04 01:45:46 True hack-cclub slaps lex around with a wet fish
There are roughly 5.5M rows and I'm trying to make some basic visualizations, like rank over time for the top 25 nicks and that sort of thing. I know I can get the top 25 nicks like this:
df['nick'].value_counts()[:25]
What I want is a running count like this:
hack-cclub lex hitokiri
1 0 0
2 0 0
2 1 0
3 1 0
3 2 0
4 2 0
4 2 1
4 2 2
5 2 2
That way I can plot an area graph of messages from the beginning of time for the top 25 nicks. I know I can do this by just iterating over the entire dataframe and keeping a count, but since the whole point of doing this is to learn to use pandas, I was hoping there would be a more idiomatic way to do it. It would also be nice to have the same data but with ranks rather than running counts, like this:
hack-cclub lex hitokiri
1 2 2
1 2 2
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 2
1 2 2
IIUC you need crosstab and cumsum:
print(df[['timestamp', 'nick']])
timestamp nick
0 2005-11-04 01:44:33 hack-cclub
1 2005-11-04 01:44:43 hack-cclub
2 2005-11-04 01:44:56 lex
3 2005-11-04 01:44:56 hack-cclub
4 2005-11-04 01:45:12 lex
5 2005-11-04 01:45:12 hack-cclub
6 2005-11-04 01:45:21 hitokiri
7 2005-11-04 01:45:45 hitokiri
8 2005-11-04 01:45:46 hack-cclub
df = pd.crosstab(df.timestamp, df.nick)
print(df)
nick hack-cclub hitokiri lex
timestamp
2005-11-04 01:44:33 1 0 0
2005-11-04 01:44:43 1 0 0
2005-11-04 01:44:56 1 0 1
2005-11-04 01:45:12 1 0 1
2005-11-04 01:45:21 0 1 0
2005-11-04 01:45:45 0 1 0
2005-11-04 01:45:46 1 0 0
df = df.cumsum()
print(df)
nick hack-cclub hitokiri lex
timestamp
2005-11-04 01:44:33 1 0 0
2005-11-04 01:44:43 2 0 0
2005-11-04 01:44:56 3 0 1
2005-11-04 01:45:12 4 0 2
2005-11-04 01:45:21 4 1 2
2005-11-04 01:45:45 4 2 2
2005-11-04 01:45:46 5 2 2
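The question also asks for the top 25 nicks only and for ranks instead of counts. A minimal sketch of both follows, using the sample rows from the question. The names TOP_N, top_nicks, subset, running and ranks are made up for illustration, and ranking ties with method="min" is an assumption chosen to match the rank table in the question.

```python
import pandas as pd

# Sample rows copied from the question (timestamp and nick only).
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2005-11-04 01:44:33",
        "2005-11-04 01:44:43",
        "2005-11-04 01:44:56",
        "2005-11-04 01:44:56",
        "2005-11-04 01:45:12",
        "2005-11-04 01:45:12",
        "2005-11-04 01:45:21",
        "2005-11-04 01:45:45",
        "2005-11-04 01:45:46",
    ]),
    "nick": [
        "hack-cclub", "hack-cclub", "lex", "hack-cclub", "lex",
        "hack-cclub", "hitokiri", "hitokiri", "hack-cclub",
    ],
})

TOP_N = 25  # hypothetical constant; the question asks for the top 25 nicks

# Keep only the busiest nicks so the crosstab stays narrow.
top_nicks = df["nick"].value_counts().index[:TOP_N]
subset = df[df["nick"].isin(top_nicks)]

# Messages per (timestamp, nick), then running totals down the rows.
running = pd.crosstab(subset["timestamp"], subset["nick"]).cumsum()

# Rank nicks within each row: the highest running count gets rank 1,
# and tied nicks share the lowest rank of their group (method="min").
ranks = running.rank(axis=1, method="min", ascending=False).astype(int)

print(running)
print(ranks)
```

Note that crosstab produces one row per distinct timestamp, so messages sent in the same second share a row. With matplotlib installed, running.plot.area() should draw the stacked area graph described in the question.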