
Creating summary statistics from timestamped traffic counters

I am collecting traffic information for a special use case where, approximately every 10 minutes (but not precisely), I get a timestamped value of the traffic counter, such as:

11:45 100
11:56 110
12:05 120
12:18 130
...

This is the data I have and I cannot improve that.

I would like to produce some sort of hourly/daily statistics from this input. Could you suggest some ready-made functions or algorithms in Python?

I am thinking of binning the timestamped counters into hours and taking the difference between the first and last value within each hour as the traffic flow for that hour. However, since an hour's first sample may not fall exactly on the hour boundary (e.g. with the above data, the 12:00 hour starts with 120 @ 12:05), this could be quite off, and it would be nice to also include a proportional share of the preceding interval (e.g. ((120 - 110) / 9) * 5 for the 5 minutes from 12:00 to 12:05). However, I do not want to reinvent the wheel.
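The proportional carry-over described above is just linear pro-rating of the counter delta across the hour boundary. A tiny plain-Python check with the sample numbers (variable names are mine, purely illustrative):

```python
# Pro-rate the 11:56 -> 12:05 interval at the 12:00 boundary,
# using the sample data from above.
prev_val, curr_val = 110, 120   # counter at 11:56 and at 12:05
interval_min = 9                # minutes between the two samples
minutes_past_hour = 5           # 12:00 -> 12:05
carry = (curr_val - prev_val) / interval_min * minutes_past_hour
print(carry)   # 50/9 ≈ 5.56 counter units attributed to the 12:00 hour
```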

-- UPDATE --

Based on the suggestions below I have looked into pandas and produced the code below. To clarify the background above: the timestamped values have second resolution and are distributed irregularly within the minute (e.g. 11:45:03, 11:56:34, etc.). The code below takes the input, reindexes it to second resolution, performs linear interpolation (assuming that traffic is evenly distributed between measurement points), cuts off the first and last fractional minutes (so that if the first data point is at 11:45:03, the statistics are not distorted by the missing first 3 seconds), and resamples the second-level data to minute resolution. This now works as expected, but it is very slow, I guess due to the second-level interpolation, as the data spans several months in total. Any ideas how to further improve or speed up the code?

import datetime
import pandas as pd

COLUMNS = ['date', 'lan_in', 'inet_in', 'lan_out', 'inet_out']

# input lines are whitespace-separated, first field is an epoch timestamp
ts_converter = lambda x: datetime.datetime.fromtimestamp(int(x))
td = pd.read_csv("traffic_log",
                 names = COLUMNS,
                 sep = r"\s+",
                 header = None,
                 converters = { 'date' : ts_converter }).set_index('date')

# reindex to second-level data
td = td.reindex(pd.date_range(td.index.min(), td.index.max(), freq="s"))
# linear interpolation to fill data for all seconds
td = td.interpolate()
# cut first and last fractional minute data
td = td[td.index.min().ceil("min") : td.index.max().floor("min")]
# resample to minute level, taking the minimum value for each minute
td = td.resample("min").min()
# change absolute counter values to per-minute differences
td = td.diff()
# create daily statistics in gigabytes
ds = td.resample("d").sum() / 1024 ** 3
# create speed columns (KiB/s)
for i in COLUMNS[1:]:
    td[i + '_speed'] = td[i] / 60 / 1024
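One possible speedup, sketched under the assumption that the counters increase monotonically: taking the per-minute minimum of second-level interpolated data is then the same as evaluating the linear interpolant directly at each minute boundary, so the second-level reindex can be skipped entirely. The function name `minute_interp` and the setup are mine, not from the original code:

```python
import numpy as np
import pandas as pd

def minute_interp(td):
    """Evaluate the linear interpolant of each counter column directly at
    the minute boundaries. For monotonically increasing counters this is
    equivalent to the reindex-to-seconds / interpolate / trim / resample-min
    pipeline, but touches ~60x less data."""
    grid = pd.date_range(td.index.min().ceil("min"),
                         td.index.max().floor("min"), freq="min")
    x = td.index.astype("int64")    # sample times as ns since the epoch
    xg = grid.astype("int64")       # minute boundaries on the same scale
    return pd.DataFrame({c: np.interp(xg, x, td[c].to_numpy())
                         for c in td.columns}, index=grid)
```

`td = minute_interp(td)` would then replace the reindex / interpolate / trim / resample steps above, and `td.diff()` proceeds as before.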

If I understood your problem correctly, maybe this will help:

import pandas as pd
import numpy as np

df = pd.DataFrame( [ ['11:45', 100 ], ['11:56', 110], ['12:05', 120], ['12:18', 130]], 
                   columns=['tick', 'val'] )
df.tick = df.tick.map( pd.Timestamp )

So df looks like this:

                 tick  val
0 2013-12-10 11:45:00  100
1 2013-12-10 11:56:00  110
2 2013-12-10 12:05:00  120
3 2013-12-10 12:18:00  130

Now you can compute the length of each interval and find the hourly average:

df[ 'period' ] = df.tick - df.tick.shift( 1 )
df.period = df.period.div( np.timedelta64( '1', 'h' ) )
df[ 'chval' ] = df.val - df.val.shift( 1 )
df[ 'havg' ] = df.chval / df.period  

output:

                 tick  val  period  chval     havg
0 2013-12-10 11:45:00  100     NaN    NaN      NaN
1 2013-12-10 11:56:00  110  0.1833     10  54.5455
2 2013-12-10 12:05:00  120  0.1500     10  66.6667
3 2013-12-10 12:18:00  130  0.2167     10  46.1538

To take into account that some intervals span an hour boundary, I think one solution is to change the frequency to one minute, backward-fill all the NaN values, and then resample hourly with a mean:

df = df.set_index( 'tick' ).asfreq( freq='min', method='bfill' )
df = df.shift( -1 ).resample( 'h' ).mean()

output:

                          val  period  chval     havg
2013-12-10 11:00:00  112.6667  0.1744     10  57.7778
2013-12-10 12:00:00  127.2222  0.1981     10  51.8519

Now I think the havg values are correct, as

( 10 + 10 * 4 / 9 ) / 15 * 60 = 57.7778
(      10 * 5 / 9 + 10 ) / 18 * 60 = 51.8519
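The two lines of arithmetic above can be re-derived in plain Python (same numbers as in the table, nothing assumed beyond it):

```python
# 11:00 hour: the full 11:45 -> 11:56 delta plus 4/9 of the 11:56 -> 12:05
# delta, spread over the 15 observed minutes (11:45-12:00), scaled to per hour.
havg_11 = (10 + 10 * 4 / 9) / 15 * 60
# 12:00 hour: 5/9 of the 11:56 -> 12:05 delta plus the full 12:05 -> 12:18
# delta, spread over the 18 observed minutes (12:00-12:18).
havg_12 = (10 * 5 / 9 + 10) / 18 * 60
print(round(havg_11, 4), round(havg_12, 4))   # 57.7778 51.8519
```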
