
Creating summary statistics from timestamped traffic counters

I am collecting traffic information for a special use case where, approximately every 10 minutes (but not at precise intervals), I get a timestamped value of the traffic counter, such as:

11:45 100
11:56 110
12:05 120
12:18 130
...

This is the data I have, and I cannot improve it.

I would like to produce some sort of hourly/daily statistics from this input. Could you suggest some ready-made functions or algorithms in Python?

I am thinking of binning the timestamped counters into hours and taking the difference between the first and last value within each hour as the traffic flow for that hour. However, since the data may not start precisely on the hour (e.g. with the above data, the 12:00 bin starts with 120 @ 12:05), this could be quite far off, and it would be nice to also include the previous interval proportionally (e.g. ((120-110)/9)*5, since 5 of the 9 minutes between 11:56 and 12:05 fall after 12:00). However, I do not want to reinvent the wheel. A rough sketch of what I have in mind is shown below.
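A minimal sketch of this proportional-binning idea in plain Python (the helper name hourly_flow and its structure are just for illustration):

from datetime import datetime, timedelta

def hourly_flow(samples):
    # samples: list of (datetime, counter) tuples sorted by time
    bins = {}
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        delta = v1 - v0                       # traffic in this interval
        span = (t1 - t0).total_seconds()      # interval length in seconds
        cur = t0
        while cur < t1:
            hour = cur.replace(minute=0, second=0, microsecond=0)
            edge = min(hour + timedelta(hours=1), t1)    # end of this hour slice
            frac = (edge - cur).total_seconds() / span   # proportional share
            bins[hour] = bins.get(hour, 0.0) + delta * frac
            cur = edge
    return bins

samples = [(datetime(2013, 12, 10, 11, 45), 100),
           (datetime(2013, 12, 10, 11, 56), 110),
           (datetime(2013, 12, 10, 12, 5), 120),
           (datetime(2013, 12, 10, 12, 18), 130)]
for hour, flow in sorted(hourly_flow(samples).items()):
    print(hour, round(flow, 2))
# 2013-12-10 11:00:00 14.44
# 2013-12-10 12:00:00 15.56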

-- UPDATE --

Based on the suggestions below, I have looked into pandas and produced the code below. As a clarification to the background written above, the timestamped values have second resolution and are distributed irregularly within the minute (e.g. 11:45:03, 11:56:34, etc.). So the code below takes the input, reindexes it to second resolution, performs linear interpolation (assuming that traffic is evenly distributed between measurement points), cuts off the first and last fractional minutes (so that if the first data point is at 11:45:03, it is not distorted by the missing first 3 seconds), and resamples the second-level data to minute resolution. This now works as expected, but it is very slow, I guess due to the second-level interpolation, as the data spans several months in total. Any ideas on how to further improve or speed up the code?

import datetime
import pandas as pd
import numpy as np
import math

COLUMNS = ['date', 'lan_in', 'inet_in', 'lan_out', 'inet_out']

ts_converter = lambda x: datetime.datetime.fromtimestamp(int(x))
td = pd.read_table("traffic_log",
                   names = COLUMNS,
                   delim_whitespace = True,
                   header = None,
                   converters = { 'date' : ts_converter }).set_index('date')

# reindex to second-level data
td = td.reindex(pd.date_range(min(td.index), max(td.index), freq="s"))
# linear interpolation to fill data for all seconds
td = td.apply(pd.Series.interpolate)
# cut first and last fractional minute data
td = td[td.index.min().ceil("min"): td.index.max().floor("min")]
# resample to minute-level taking the minimum value for each minute
td = td.resample("min").min()
# change absolute values to differences
td = td.apply(pd.Series.diff)
# create daily statistics in gigabytes
ds = td.resample("D").sum() / 1024 / 1024 / 1024
# create speed columns
for i in COLUMNS[1:]:
    td[i+'_speed'] = td[i] / 60 / 1024
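One direction I am considering to speed this up (a rough sketch, untested on the full dataset): since the interpolation is linear anyway, the counters can be interpolated directly at the minute boundaries with numpy's np.interp, skipping the ~60x larger second-level grid entirely; the diff/resample steps would then run on the minute-level result as before. The helper name to_minutes is just illustrative, and td is assumed to be the raw irregular frame as read above, before any reindexing:

import numpy as np
import pandas as pd

def to_minutes(td):
    # nanosecond positions of the original (irregular) samples
    x = td.index.values.astype('int64')
    # minute grid covering the data, trimmed to whole minutes
    grid = pd.date_range(td.index.min().ceil('min'),
                         td.index.max().floor('min'),
                         freq='min')
    out = pd.DataFrame(index=grid)
    for col in td.columns:
        # linear interpolation of each counter onto the minute grid
        out[col] = np.interp(grid.values.astype('int64'), x, td[col].values)
    return out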

If I understood your problem correctly, maybe this will help:

import pandas as pd
import numpy as np

df = pd.DataFrame([['11:45', 100], ['11:56', 110], ['12:05', 120], ['12:18', 130]],
                  columns=['tick', 'val'])
df.tick = df.tick.map(pd.Timestamp)

So df looks like this:

                 tick  val
0 2013-12-10 11:45:00  100
1 2013-12-10 11:56:00  110
2 2013-12-10 12:05:00  120
3 2013-12-10 12:18:00  130

Now you can compute the length of each interval and find the hourly average:

df['period'] = df.tick - df.tick.shift(1)
df.period = df.period.div(np.timedelta64(1, 'h'))
df['chval'] = df.val - df.val.shift(1)
df['havg'] = df.chval / df.period

Output:

                 tick  val  period  chval     havg
0 2013-12-10 11:45:00  100     NaN    NaN      NaN
1 2013-12-10 11:56:00  110  0.1833     10  54.5455
2 2013-12-10 12:05:00  120  0.1500     10  66.6667
3 2013-12-10 12:18:00  130  0.2167     10  46.1538

To take into account that some periods span more than one hour, I think one solution is to change the frequency to minutes, backward-fill all the NaN values, and then resample hourly with a mean calculation:

df = df.set_index('tick').asfreq(freq='min', method='bfill')
df = df.shift(-1).resample('h').mean()

Output:

                          val  period  chval     havg
2013-12-10 11:00:00  112.6667  0.1744     10  57.7778
2013-12-10 12:00:00  127.2222  0.1981     10  51.8519

Now I think the havg values are correct, as

(10 + 10 * 4 / 9) / 15 * 60 = 57.7778
(10 * 5 / 9 + 10) / 18 * 60 = 51.8519
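A quick sanity check of that arithmetic in plain Python (hour 11 covers 15 minutes of data, from 11:45 to 12:00; hour 12 covers 18 minutes, from 12:00 to 12:18):

# 15 minutes carrying 10 + 10*4/9 units, scaled to an hourly rate
print((10 + 10 * 4 / 9) / 15 * 60)   # 57.7777...
# 18 minutes carrying 10*5/9 + 10 units, scaled to an hourly rate
print((10 * 5 / 9 + 10) / 18 * 60)   # 51.8518...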
