How to find max and min within sequence of values in a column in R?

This problem might be trivial, but I am finding it difficult to solve. Please guide me.

Data

Following is sample data:

structure(list(Vehicle.ID2 = c("39-25", "39-25", "39-25", "39-25", 
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25", 
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25", 
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25", 
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25", 
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25"
), OC_DV = c(".", ".", ".", ".", ".", "CLDV", ".", ".", ".", 
".", ".", ".", ".", ".", ".", "OPDV", ".", ".", ".", ".", ".", 
".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", 
".", "CLDV", ".", ".", "."), frspacing = c(35.83373, 35.75742, 
35.70391, 35.67694, 35.67792, 35.70669, 35.7619, 35.84096, 35.93962, 
36.05109, 36.16704, 36.28056, 36.3861, 36.47762, 36.5485, 36.59359, 
36.61402, 36.61791, 36.61383, 36.60651, 36.59694, 36.58372, 36.56525, 
36.54044, 36.50771, 36.46458, 36.40831, 36.33713, 36.25086, 36.15089, 
36.04004, 35.92236, 35.80322, 35.68935, 35.58883, 35.51032, 35.4618, 
35.4492, 35.47479)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-39L), .Names = c("Vehicle.ID2", "OC_DV", "frspacing"))  

What I want to do

I want to find the max and min of the set of values in frspacing that lie between the labels CLDV and OPDV in the column OC_DV. Then I want to find their difference.

Desired Output

Following are the max and min values:

  Group      Max    Min
1 CLDV-OPDV 36.54   35.70
2 OPDV-CLDV 36.62   35.59  

Following are the absolute differences (max of the 1st group minus min of the 2nd group, and vice versa):

1 0.95
2 0.92
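
The two differences can be checked with a few lines of arithmetic. The sketch below is only an illustration: it hard-codes the unrounded Max and Min values of the two groups in a small data frame called res (a name chosen here, not part of the data) and assumes there are exactly two groups.

```r
# Max/Min per group (unrounded values for the sample data).
res <- data.frame(
  Group = c("CLDV-OPDV", "OPDV-CLDV"),
  Max   = c(36.54850, 36.61791),
  Min   = c(35.70669, 35.58883)
)

# Max of each group minus Min of the other group.
# rev() swaps the two Min values, so this only works for exactly two groups.
res$Diff <- res$Max - rev(res$Min)
res$Diff
# 0.95967 0.91122
```

With the rounded values from the table (36.54, 35.59, 36.62, 35.70) the same arithmetic gives 0.95 and 0.92.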

I don't have any code to show what I tried because honestly I don't know how to approach this problem. Obviously a simple max or min by column won't work. I am using dplyr and didn't find anything relevant.

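library(zoo)   # for na.locf
library(dplyr) # for lead, group_by, summarise, filter

# Treat "." as missing so that na.locf can fill across it
df[df == "."] <- NA
# Label each row "<previous marker>-<next marker>"
df$group <- paste(na.locf(df$OC_DV, na.rm = FALSE),
                  lead(na.locf(df$OC_DV, na.rm = FALSE, fromLast = TRUE)),
                  sep = "-")

# Max and Min per segment; drop rows before the first or after the last marker
df %>% group_by(group) %>%
  summarise(Max = max(frspacing), Min = min(frspacing)) %>%
  filter(!grepl("NA", group))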

Source: local data frame [2 x 3]

      group      Max      Min
      (chr)    (dbl)    (dbl)
1 CLDV-OPDV 36.54850 35.70669
2 OPDV-CLDV 36.61791 35.58883

With multiple values I would count the changes and use the count as another grouping variable (I duplicated the data in this example):

df$group2 = NA
df$group2[which(df$group != lag(df$group))] = 1:length(which(df$group != lag(df$group)))
df$group2 = na.locf(df$group2, na.rm = FALSE)

df %>% group_by(group, group2) %>% 
  summarise(Max = max(frspacing), Min = min(frspacing)) %>% 
   filter(!grepl("NA",group ))

Source: local data frame [5 x 4]
Groups: group [3]

      group group2      Max      Min
      (chr)  (int)    (dbl)    (dbl)
1 CLDV-CLDV      3 38.09082 34.30454
2 CLDV-OPDV      1 36.54850 35.70669
3 CLDV-OPDV      4 38.90356 34.08951
4 OPDV-CLDV      2 36.61791 35.58883
5 OPDV-CLDV      5 38.18983 34.27874
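
As a side note, the same kind of run counter can be sketched in base R with cumsum. This is an alternative to the which/na.locf steps above, not the code that produced the output: the group vector below is made up, and the numbering starts at 1 for the very first run rather than leaving it NA.

```r
# Made-up sequence of group labels, in row order.
group <- c("NA-CLDV", "NA-CLDV", "CLDV-OPDV", "CLDV-OPDV",
           "OPDV-CLDV", "CLDV-OPDV", "CLDV-OPDV")

# TRUE wherever the label differs from the previous row (first row counts
# as a change); the cumulative sum then numbers the consecutive runs.
run_id <- cumsum(c(TRUE, group[-1] != group[-length(group)]))
run_id
# 1 1 2 2 3 4 4
```

Grouping by both the label and run_id keeps the two separate CLDV-OPDV stretches apart.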

But if the combination of OC_DV is distinct within every Vehicle.ID2, you can simply paste the ID into group.
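
A minimal sketch of that idea, in base R so it runs without extra packages. It assumes the group column has already been built as shown earlier; the second vehicle ID ("40-12"), the frspacing values and the column name group_id are all made up for illustration.

```r
# Toy data with a precomputed `group` column (values are invented).
toy <- data.frame(
  Vehicle.ID2 = rep(c("39-25", "40-12"), each = 4),
  group       = rep(c("CLDV-OPDV", "CLDV-OPDV", "OPDV-CLDV", "OPDV-CLDV"), 2),
  frspacing   = c(35.7, 36.5, 36.6, 35.6, 20.1, 22.4, 25.0, 21.3),
  stringsAsFactors = FALSE
)

# Paste the vehicle ID into the label so that identical label pairs
# from different vehicles are not pooled together.
toy$group_id <- paste(toy$Vehicle.ID2, toy$group, sep = "_")

# Max and Min of frspacing per combined key.
out <- aggregate(frspacing ~ group_id, data = toy,
                 FUN = function(x) c(Max = max(x), Min = min(x)))
out <- do.call(data.frame, out)  # flatten the matrix column from aggregate()
names(out) <- c("group_id", "Max", "Min")
out
```

With dplyr the equivalent would be to group by both Vehicle.ID2 and group instead of building a pasted key.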

d <- your_dput
# Build your subsetted dataframes
e <- d[grep("CLDV", d$OC_DV)[1]: grep("OPDV", d$OC_DV),]
f <- d[(grep("OPDV", d$OC_DV): grep("CLDV", d$OC_DV)[2]),]
# Make the diff() calls
diff(c(max(e$frspacing), min(f$frspacing)))
diff(c(max(f$frspacing), min(e$frspacing)))

My values are not quite the same as yours; you can adjust the grep indices manually depending on how you want to handle boundary inclusion/exclusion.
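
To make the boundary question concrete, here is a small sketch on made-up data (the frame d and its values below are invented, not the sample data). It contrasts a segment that includes both marker rows with one that keeps the opening marker row but drops the closing one; the latter appears to be the convention behind the 36.54850 / 35.70669 figures in the question.

```r
# Toy data: one CLDV, one OPDV, then a second CLDV (values are invented).
d <- data.frame(
  OC_DV     = c(".", "CLDV", ".", ".", "OPDV", ".", "CLDV", "."),
  frspacing = c(5, 1, 4, 3, 9, 2, 8, 6),
  stringsAsFactors = FALSE
)

cl <- which(d$OC_DV == "CLDV")  # rows 2 and 7
op <- which(d$OC_DV == "OPDV")  # row 5

# Inclusive: both boundary rows belong to the segment.
incl <- d[cl[1]:op[1], ]
# Half-open: keep the CLDV row, drop the OPDV row.
half <- d[cl[1]:(op[1] - 1), ]

range(incl$frspacing)  # 1 9 (the OPDV row contributes the 9)
range(half$frspacing)  # 1 4
```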

Below is a base R solution:

MaxMinSeq <- function(df) {
    myInd <- which(df$OC_DV != ".")
    myVals <- df$frspacing
    myTitles <- df$OC_DV[myInd]
    myLen <- length(myInd)-1L
    NewDf <- as.data.frame(t(sapply(1:myLen, function(x) {
               list(Group = paste(c(myTitles[x],"-",myTitles[x+1L]), collapse = ""),
                   Max = max(myVals[myInd[x]:(myInd[x+1L]-1L)]),
                   Min = min(myVals[myInd[x]:(myInd[x+1L]-1L)]))})))
    for (i in 1:3) {NewDf[,i] <- unlist(NewDf[,i])}
    NewDf
}

df2 <- MaxMinSeq(df)
df2
      Group      Max      Min
1 CLDV-OPDV 36.54850 35.70669
2 OPDV-CLDV 36.61791 35.58883

This is a good bit faster than the dplyr solution posted above. Observe:

TestDplyr <- function(df) {
    df[df=="."] <- NA
    df$group <- paste((na.locf(df$OC_DV, na.rm = FALSE)), lead(na.locf(df$OC_DV, na.rm = FALSE, fromLast = TRUE)), sep = "-")

    df$group2 <- NA
    df$group2[which(df$group != lag(df$group))] <- 1:length(which(df$group != lag(df$group)))
    df$group2 <- na.locf(df$group2, na.rm = FALSE)

    df %>% group_by(group, group2) %>% 
        summarise(Max = max(frspacing), Min = min(frspacing)) %>% 
        filter(!grepl("NA",group ))
}

microbenchmark(Joseph = MaxMinSeq(df), Cabana = TestDplyr(df))
Unit: microseconds
expr      min        lq      mean    median       uq      max neval
Joseph  338.671  377.6695  405.0257  405.9945  429.188  496.718   100
Cabana 2622.336 2698.2810 2890.5430 2765.6045 2977.427 7772.180   100

Here is a much larger example:

myDfs <- lapply(1:10000, function(x) df)
bigDf <- do.call(rbind, myDfs)
bigDf$frspacing[40:nrow(bigDf)] <- runif((nrow(bigDf)-39), 10, 100)

a <- MaxMinSeq(bigDf)
b <- TestDplyr(bigDf)
b <- b[order(b$group2),]

identical(a$Max, b$Max)
[1] TRUE
identical(a$Min, b$Min)
[1] TRUE

system.time(TestDplyr(bigDf))
 user  system elapsed 
 1.54    0.00    1.54 
system.time(MaxMinSeq(bigDf))
 user  system elapsed 
  0.3     0.0     0.3

As for the second part of the question, I'm not sure how general the OP would like the answer to be, especially when there are more than two final pairings. For example, does the OP want to take the max of one row and compare it to the smallest min among all the other rows, or do we simply compare neighbors? The function below takes the first approach (i.e. the general approach).

GetDiff <- function(df) {
    df2 <- cbind(df, t(sapply(1:nrow(df), function(x) {
                        c(rowMin = min(df[x,2:3]),
                          rowMax = max(df[x,2:3]))})))
    myRows <- 1:nrow(df)
    sapply(myRows, function(x) df2$rowMax[x] - min(df2$rowMin[-x]))
}

GetDiff(df2)   ## df2 comes from above
[1] 0.95967 0.91122
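
For comparison, the neighbor reading could look like the sketch below. It is only one possible interpretation: each segment's Max is compared with the Min of the segment that follows it. The table s and the column DiffNext are names chosen here, and the third row reuses values from the duplicated-data output purely for illustration.

```r
# Summary table shaped like df2; the third row is illustrative only.
s <- data.frame(
  Group = c("CLDV-OPDV", "OPDV-CLDV", "CLDV-OPDV"),
  Max   = c(36.54850, 36.61791, 38.90356),
  Min   = c(35.70669, 35.58883, 34.08951)
)

n <- nrow(s)
# Max of each segment minus Min of the segment that follows it.
# The last segment has no successor, so its entry is NA.
s$DiffNext <- c(s$Max[-n] - s$Min[-1], NA)
s$DiffNext
# 0.95967 2.52840 NA
```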
