Using R to get volatility and peak-to-average ratio of internet traffic data

I have network traffic data in an R dataset, recorded for each hour of a ten-day period, in the following form:

   Day   Hour         Volume          Category
    0    00            100            P2P
    0    00            50             email
    0    00            200            gaming
    0    00            200            video
    0    00            150            web
    0    00            120            P2P
    0    00            180            web
    0    00            80             email
    ....
    0    01            150            P2P
    0    01            200            P2P
    0    01             50            Web
    ...
    ...
    10   23            100            web
    10   23            200            email
    10   23            300            gaming
    10   23            300            gaming

As you can see, a Category can also be repeated within a single hour. I need to calculate the volatility and the peak-hour-to-average-hour ratio for each of these application categories.

Volatility: standard deviation of the hourly volumes divided by the hourly average.

Peak hour to avg. hour ratio: ratio of the volume of the maximum hour to the volume of the average hour for that application.
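
To make the two definitions concrete, here is a minimal sketch applied to a single category. The vector `hourly_volume` and its values are hypothetical and only illustrate the arithmetic. Note that R's `sd()` computes the sample standard deviation (denominator n - 1).

```r
# Hypothetical hourly totals for one category (illustrative values only)
hourly_volume <- c(100, 150, 200, 150)

# Volatility: standard deviation of the hourly volumes over their mean
volatility <- sd(hourly_volume) / mean(hourly_volume)

# Peak-to-average ratio: volume of the busiest hour over the mean hourly volume
peak_to_avg <- max(hourly_volume) / mean(hourly_volume)

print(c(volatility = volatility, peak_to_avg = peak_to_avg))
```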

So how do I aggregate the data and calculate these two statistics for each category? I am new to R and don't know much about how to aggregate and compute the averages described above.

So the final result would look something like this, where the volume for each category is first aggregated onto a single 24-hour period by summing the volumes, and the two statistics are then calculated:

Category    Volatility      Peak to Avg. Ratio
Web            0.55            1.5
P2P            0.30            2.1
email          0.6             1.7
gaming         0.4             2.9

Edit: plyr got me as far as this.

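library(plyr)

# Groups by Hour and Category, so each statistic is computed over the raw
# rows of one hour rather than over aggregated hourly totals
stats <- ddply(
    .data = my_data,
    .variables = .(Hour, Category),
    .fun = function(x) {
        data.frame(
            volatility = sd(x$Volume) / mean(x$Volume),
            pa_ratio   = max(x$Volume) / mean(x$Volume)
        )
    }
)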

But this is not what I was hoping for. I want the statistics per Category, where the hours of all the days are first aggregated into a single 24 hours by summing the volumes, and the volatility and PA ratio are then calculated. Any suggestions for improvement?

You'd need to do it in two stages (using the plyr package). First, as you pointed out, there can be multiple Day-Hour combinations for the same category, so we first aggregate, for each category, its totals within each Hour, regardless of the day:

df1 <- ddply( df, .(Hour, Category), summarise, Volume = sum(Volume))

Then you get your stats:

> ddply(df1, .(Category), summarise,
+            Volatility = sd(Volume)/mean(Volume),
+            PeakToAvg = max(Volume)/mean(Volume) )

  Category Volatility PeakToAvg
1      P2P  0.3225399  1.228070
2      Web         NA  1.000000
3    email  0.2999847  1.212121
4   gaming  0.7071068  1.500000
5    video         NA  1.000000
6      web  0.7564398  1.534884
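
Two things are worth noting in that output. `sd()` returns NA for a single value, so a category with only one hourly total gets NA for Volatility. Also, `Web` and `web` are treated as separate categories because the labels differ in case. Below is a minimal sketch of the same two-stage approach in base R (`aggregate()` in place of `ddply()`, a swapped-in technique rather than the answer's method). The data frame `traffic` and its values are hypothetical, and lowercasing `Category` assumes the case difference is unintentional.

```r
# Toy data in the question's layout (values are illustrative, not real traffic)
traffic <- data.frame(
  Day      = c(0, 0, 0, 0, 1, 1, 1, 1),
  Hour     = c("00", "00", "01", "01", "00", "00", "01", "01"),
  Volume   = c(100, 50, 150, 50, 120, 80, 200, 100),
  Category = c("P2P", "web", "P2P", "Web", "P2P", "web", "P2P", "web"),
  stringsAsFactors = FALSE
)

# Assumption: "Web" and "web" are the same category, so normalise the case
traffic$Category <- tolower(traffic$Category)

# Stage 1: sum volumes per category and hour of day, across all days
hourly <- aggregate(Volume ~ Hour + Category, data = traffic, FUN = sum)

# Stage 2: per-category statistics computed over the hourly totals
stat_list <- lapply(split(hourly$Volume, hourly$Category), function(v) {
  data.frame(Volatility = sd(v) / mean(v), PeakToAvg = max(v) / mean(v))
})
stats <- do.call(rbind, stat_list)
print(stats)
```

With the full dataset, stage 1 would leave at most 24 rows per category (one per hour of the day), which is the single 24-hour profile the question asks for.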
