Calculating a consecutive streak in data

I'm trying to calculate the maximum winning and losing streak in a dataset (i.e. the highest number of consecutive positive or negative values). I've found a somewhat related question here on StackOverflow, and even though that gave me some good suggestions, the angle of that question is different, and I'm not (yet) experienced enough to translate and apply that information to this problem. So I was hoping you could help me out; even a suggestion would be great.

My data set looks like this:

> subRes
   Instrument TradeResult.Currency.
1         JPM                    -3
2         JPM                   264
3         JPM                   284
4         JPM                    69
5         JPM                   283
6         JPM                  -219
7         JPM                   -91
8         JPM                   165
9         JPM                   -35
10        JPM                  -294
11        KFT                    -8
12        KFT                   -48
13        KFT                   125
14        KFT                  -150
15        KFT                  -206
16        KFT                   107
17        KFT                   107
18        KFT                    56
19        KFT                   -26
20        KFT                   189
> split(subRes[,2],subRes[,1])
$JPM
 [1]   -3  264  284   69  283 -219  -91  165  -35 -294
$KFT
 [1]   -8  -48  125 -150 -206  107  107   56  -26  189

In this case, the maximum (winning) streak for JPM is four (namely the consecutive positive results 264, 284, 69 and 283) and for KFT this value is three (107, 107, 56).

My goal is to create a function which gives the maximum winning streak per instrument (i.e. JPM: 4, KFT: 3). To achieve that:

R needs to compare the current result with the previous result, and if it is higher then there is a streak of at least 2 consecutive positive results. Then R needs to look at the next value, and if this is also higher: add 1 to the already found value of 2. If this value isn't higher, R needs to move on to the next value, while remembering 2 as the intermediate maximum.

I've tried cumsum and cummax in combination with conditional summing (like cumsum(c(TRUE, diff(subRes[,2]) > 0))), which didn't work out. Using rle together with lapply (like lapply(rle(subRes$TradeResult.Currency.), function(x) diff(x) > 0)) didn't work either.

How can I make this work?

Edit 19 January 2011

Calculating the size of a streak

Besides the length of the streak, I would also like to incorporate the size of the streak in my analysis. With the answers provided below, I thought I would be able to do it by myself; sadly I was mistaken and ran into the following problem:

With the following data frame:

> subRes
   Instrument TradeResult.Currency.
1         JPM                    -3
2         JPM                   264
3         JPM                   284
4         JPM                    69
5         JPM                   283
6         JPM                  -219
7         JPM                   -91
8         JPM                   165
9         JPM                   -35
10        JPM                  -294
11        KFT                    -8
12        KFT                   -48
13        KFT                   125
14        KFT                  -150
15        KFT                  -206
16        KFT                   107
17        KFT                   107
18        KFT                    56
19        KFT                   -26
20        KFT                   189
> lapply(split(subRes[,2], subRes[,1]), function(x) {
+             df.rle <- ifelse(x > 0, 1, 0)
+             df.rle <- rle(df.rle)
+ 
+             wh <- which(df.rle$lengths == max(df.rle$lengths))
+             mx <- df.rle$lengths[wh]
+             suma <- df.rle$lengths[1:wh]
+             out <- x[(sum(suma) - (suma[length(suma)] - 1)):sum(suma)]
+             return(out)
+         })
$JPM
[1] 264 284  69 283

$KFT
[1] 107 107  56

This result is correct, and by changing the last line to return(sum(out)) I can get the total size of the streak:

$JPM
[1] 900

$KFT
[1] 270

However, the function does not seem to count the losing streaks when I change the ifelse condition:

lapply(split(subRes[,2], subRes[,1]), function(x) {
            df.rle <- ifelse(x < 0, 1, 0)
            df.rle <- rle(df.rle)

            wh <- which(df.rle$lengths == max(df.rle$lengths))
            mx <- df.rle$lengths[wh]
            suma <- df.rle$lengths[1:wh]
            out <- x[(sum(suma) - (suma[length(suma)] - 1)):sum(suma)]
            return(out)
        })
$JPM
[1] 264 284  69 283

$KFT
[1] 107 107  56

I don't see what I need to change about this function to ultimately arrive at the total sum of the losing streak. However I tweak or change the function, I get the same result or an error. The ifelse function confuses me, because it seems the obvious part of the function to change, yet changing it doesn't alter the result. What obvious point am I missing?

This will work:

FUN <- function(x, negate = FALSE, na.rm = FALSE) {
    rles <- rle(x > 0)
    if(negate) {
        max(rles$lengths[!rles$values], na.rm = na.rm)
    } else {
        max(rles$lengths[rles$values], na.rm = na.rm)
    }
}
wins <- lapply(split(subRes[,2],subRes[,1]), FUN)
loses <- lapply(split(subRes[,2],subRes[,1]), FUN, negate = TRUE)

Giving this:

> wins
$JPM
[1] 4

$KFT
[1] 3
> loses
$JPM
[1] 2

$KFT
[1] 2

or:

> sapply(split(subRes[,2],subRes[,1]), FUN)
JPM KFT 
  4   3
> sapply(split(subRes[,2],subRes[,1]), FUN, negate = TRUE)
JPM KFT 
  2   2 

You were close, but you needed to apply rle() to each element of your list separately, and also convert TradeResult.Currency. to a logical vector indicating whether each value is above 0 or not. Our function FUN takes the lengths component of the object returned by rle, keeps the lengths of the winning runs, and applies max() to that vector to find the longest winning run.
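To make the role of rle() concrete, here is a small sketch using the JPM values from the question. The variable names (jpm, runs, longest_win, longest_loss) are only illustrative and are not part of the answer's code.

```r
# JPM results copied from the question
jpm <- c(-3, 264, 284, 69, 283, -219, -91, 165, -35, -294)

# Encode the series as runs of TRUE (win) and FALSE (loss)
runs <- rle(jpm > 0)
runs$lengths  # 1 4 2 1 2
runs$values   # FALSE TRUE FALSE TRUE FALSE

# Longest run of wins and longest run of losses
longest_win  <- max(runs$lengths[runs$values])   # 4
longest_loss <- max(runs$lengths[!runs$values])  # 2
```

The $values component is what tells the two kinds of run apart; $lengths alone does not carry the sign.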

Note that split isn't necessary here; you can use the other subset-by-factor-and-apply-function functions (tapply, aggregate, etc.) instead:

> with(subRes, aggregate(`TradeResult.Currency.`, 
+                        by = list(Instrument = Instrument), FUN))
  Instrument x
1        JPM 4
2        KFT 3
> with(subRes, tapply(`TradeResult.Currency.`, Instrument, FUN))
JPM KFT 
  4   3

The reason the earlier version wasn't right is that if you had a longer series of losses than wins (a longer series of negative values), the length of the losses series would be selected instead.
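A small made-up vector illustrates the problem; x and its values are assumptions chosen for illustration, not data from the question.

```r
# Hypothetical series: a 2-long winning run, a 5-long losing run, one more win
x <- c(10, 20, -1, -2, -3, -4, -5, 30)
runs <- rle(x > 0)

# Ignoring $values picks the longest run of either sign
longest_any <- max(runs$lengths)               # 5, which is a losing run

# Subsetting by $values keeps only the winning runs
longest_win <- max(runs$lengths[runs$values])  # 2
```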

The modified function adds a 'negate' argument to swap the meaning of the test. If we want wins, we leave TRUE and FALSE in $values as they are. If we want losses, we swap TRUE and FALSE. We can then use this $values component to select only the runs that correspond to wins (negate = FALSE) or the runs that correspond to losses (negate = TRUE).

Nowhere nearly as slick as Gavin's solution, but here goes. My function returns the actual sequence of the longest streak.

inst.split <- split(inst[, 2], inst[, 1])

inst <- lapply(inst.split, function(x) {
            df.rle <- ifelse(x > 0, 1, 0)
            df.rle <- rle(df.rle)

            wh <- which(df.rle$lengths == max(df.rle$lengths))
            mx <- df.rle$lengths[wh]
            suma <- df.rle$lengths[1:wh]
            out <- x[(sum(suma) - (suma[length(suma)] - 1)):sum(suma)]
            return(out)
        })

$JPM
[1] 264 284  69 283

$KFT
[1] 107 107  56

If you want to know the longest streak per instrument, just do

lapply(inst, length)

$JPM
[1] 4

$KFT
[1] 3

FOR NEGATIVE VALUES

Notice that there's a long losing streak for KFT. I've left the values for JPM (JP Morgan?) alone.

> inst
   Instrument TradeResult.Currency.
1         JPM                    -3
2         JPM                   264
3         JPM                   284
4         JPM                    69
5         JPM                   283
6         JPM                  -219
7         JPM                   -91
8         JPM                   165
9         JPM                   -35
10        JPM                  -294
11        KFT                    -8
12        KFT                   -48
13        KFT                  -125
14        KFT                  -150
15        KFT                  -206
16        KFT                  -107
17        KFT                  -107
18        KFT                    56
19        KFT                   -26
20        KFT                   189

And this is the result of running the split data.frame through the above function.

$JPM
[1] 264 284  69 283

$KFT
[1]   -8  -48 -125 -150 -206 -107 -107
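As the output shows, the function returns the longest run whatever its sign. One possible variant, sketched below, restricts the search to runs of the wanted sign, which also makes the total of the longest losing streak available via sum(). The function name longest_streak and its losing argument are hypothetical, and the sketch assumes x contains no missing values.

```r
# Sketch: values of the longest winning (or losing) streak in x.
# `longest_streak` and `losing` are hypothetical names; assumes no NAs in x.
longest_streak <- function(x, losing = FALSE) {
    runs <- rle(if (losing) x < 0 else x > 0)
    if (!any(runs$values)) return(x[0])   # no run of the wanted sign
    ends <- cumsum(runs$lengths)          # last index of each run
    starts <- ends - runs$lengths + 1     # first index of each run
    # Runs of the other sign count as length 0; which.max() takes the first on ties
    i <- which.max(runs$lengths * runs$values)
    x[starts[i]:ends[i]]
}

# Modified KFT values from the data frame above
kft <- c(-8, -48, -125, -150, -206, -107, -107, 56, -26, 189)
win <- longest_streak(kft)                  # 56 (first of two 1-long winning runs)
loss <- longest_streak(kft, losing = TRUE)  # -8 -48 -125 -150 -206 -107 -107
sum(loss)                                   # -751
```

With split data, something like lapply(inst.split, longest_streak, losing = TRUE) should apply it per instrument.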

I've written a loop to calculate the winning and losing streaks for any length of data (in this example, x is a vector of the numbers you are interested in). The difficulty is that the streak with the largest total win or loss may not coincide with the longest streak. Therefore, there needs to be a separate, independent calculation:

rout <- rle (x>=0) # In this calculation, 0 is considered a "win"

losel <- max(rout$lengths[!rout$values]) # Length of max losing streak
winl <- max(rout$lengths[rout$values]) # Length of max winning streak

xpostemp <- cumsum(rout$lengths)
xpos <- c(0,xpostemp)
looplength <- length(xpos)-1
tot <- rep (0,looplength)

for(j in 1:looplength){
    start <- xpos[j]+1
    end <- xpos[j+1]
    tot[j] <- sum(x[start:end])                
}
winmax <- max(tot) # Sum of largest winning streak
losemax <- min(tot) # Sum of largest losing streak

Apologies if it looks cumbersome; I'm not a full-time programmer, but I think you will find that this works.
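For what it's worth, the per-run totals can probably also be computed without an explicit loop. The sketch below labels each observation with the index of its run and sums within runs using tapply(); run_id is a name introduced here for illustration, and the JPM values from the question stand in for x.

```r
# JPM results from the question, used as the example vector x
x <- c(-3, 264, 284, 69, 283, -219, -91, 165, -35, -294)

rout <- rle(x >= 0)  # as above, 0 is considered a "win"

# Label every observation with the index of the run it belongs to
run_id <- rep(seq_along(rout$lengths), times = rout$lengths)

# Total of each run, in order of appearance
tot <- as.vector(tapply(x, run_id, sum))  # -3 900 -310 165 -329

winmax <- max(tot[rout$values])    # 900, largest winning-streak total
losemax <- min(tot[!rout$values])  # -329, most negative losing-streak total
```

Subsetting tot by rout$values keeps the winning and losing totals apart, so the result stays sensible even when a series contains only wins or only losses of small size.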
