How to find the max and min within a sequence of values in a column in R?
This problem might be trivial, but I am finding it difficult to solve. Please guide me.
The following is sample data:
structure(list(Vehicle.ID2 = c("39-25", "39-25", "39-25", "39-25",
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25",
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25",
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25",
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25",
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25"
), OC_DV = c(".", ".", ".", ".", ".", "CLDV", ".", ".", ".",
".", ".", ".", ".", ".", ".", "OPDV", ".", ".", ".", ".", ".",
".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".",
".", "CLDV", ".", ".", "."), frspacing = c(35.83373, 35.75742,
35.70391, 35.67694, 35.67792, 35.70669, 35.7619, 35.84096, 35.93962,
36.05109, 36.16704, 36.28056, 36.3861, 36.47762, 36.5485, 36.59359,
36.61402, 36.61791, 36.61383, 36.60651, 36.59694, 36.58372, 36.56525,
36.54044, 36.50771, 36.46458, 36.40831, 36.33713, 36.25086, 36.15089,
36.04004, 35.92236, 35.80322, 35.68935, 35.58883, 35.51032, 35.4618,
35.4492, 35.47479)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-39L), .Names = c("Vehicle.ID2", "OC_DV", "frspacing"))
I want to find the max and min of the set of values in frspacing between the labels CLDV and OPDV in the column OC_DV. Then I want to find their difference.
The following are the maxes and mins:
Group Max Min
1 CLDV-OPDV 36.54 35.70
2 OPDV-CLDV 36.62 35.59
The following are the absolute differences (max of the 1st group minus min of the 2nd group, and vice versa):
1 0.95
2 0.92
I don't have any code to show what I tried because, honestly, I don't know how to approach this problem. Obviously a simple max or min by column won't work. I am using dplyr and didn't find anything relevant.
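library(zoo)   # for na.locf
library(dplyr)
# Treat "." as missing so that na.locf can fill the gaps between labels
df[df == "."] <- NA
# Label each row as "<most recent label>-<next label>"
df$group <- paste(na.locf(df$OC_DV, na.rm = FALSE), lead(na.locf(df$OC_DV, na.rm = FALSE, fromLast = TRUE)), sep = "-")
# Summarise per group and drop the open-ended groups (those containing "NA")
df %>% group_by(group) %>%
summarise(Max = max(frspacing), Min = min(frspacing)) %>%
filter(!grepl("NA", group))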
Source: local data frame [2 x 3]
group Max Min
(chr) (dbl) (dbl)
1 CLDV-OPDV 36.54850 35.70669
2 OPDV-CLDV 36.61791 35.58883
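For what it's worth, the same grouping idea can be sketched in base R without zoo, by numbering the segments with cumsum() over the label rows. The toy data frame and its values below are made up for illustration, and the boundary convention (start label row included, end label row excluded) is an assumption chosen to match the output above.

```r
# Toy data (hypothetical values, not the data from the question)
toy <- data.frame(
  OC_DV = c(".", "CLDV", ".", ".", "OPDV", ".", "CLDV", "."),
  frspacing = c(1, 5, 7, 6, 9, 3, 4, 2),
  stringsAsFactors = FALSE
)

is_marker <- toy$OC_DV != "."
seg <- cumsum(is_marker)        # 0 before the first label, then 1, 2, ...
n_markers <- sum(is_marker)

# Keep only segments that are closed by a following label
keep <- seg >= 1 & seg < n_markers

labels <- toy$OC_DV[is_marker]
result <- data.frame(
  Group = paste(head(labels, -1), tail(labels, -1), sep = "-"),
  Max = as.vector(tapply(toy$frspacing[keep], seg[keep], max)),
  Min = as.vector(tapply(toy$frspacing[keep], seg[keep], min)),
  stringsAsFactors = FALSE
)
result
```

Because each segment gets its own number, repeated CLDV-OPDV stretches stay separate without a second grouping variable.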
With multiple values I would count the changes and use the count as another grouping variable (I duplicated the data in this example):
df$group2 = NA
df$group2[which(df$group != lag(df$group))] = 1:length(which(df$group != lag(df$group)))
df$group2 = na.locf(df$group2, na.rm = FALSE)
df %>% group_by(group, group2) %>%
summarise(Max = max(frspacing), Min = min(frspacing)) %>%
filter(!grepl("NA",group ))
Source: local data frame [5 x 4]
Groups: group [3]
group group2 Max Min
(chr) (int) (dbl) (dbl)
1 CLDV-CLDV 3 38.09082 34.30454
2 CLDV-OPDV 1 36.54850 35.70669
3 CLDV-OPDV 4 38.90356 34.08951
4 OPDV-CLDV 2 36.61791 35.58883
5 OPDV-CLDV 5 38.18983 34.27874
But if the combination of OC_DV is distinct in every Vehicle.ID2, you can simply paste the ID in group...
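As a rough illustration of keying the groups by vehicle, here is a base R sketch that restarts the segment count inside each Vehicle.ID2 and pastes the ID into the key. This is a variant of the idea, not the code above: the toy data, the seg and key columns, and the choice to drop each vehicle's last (unclosed) segment are all assumptions.

```r
# Toy data with two vehicles (hypothetical IDs and values)
toy <- data.frame(
  Vehicle.ID2 = rep(c("A", "B"), each = 5),
  OC_DV = c("CLDV", ".", "OPDV", ".", "CLDV",
            "OPDV", ".", ".", "CLDV", "."),
  frspacing = c(4, 6, 5, 2, 3, 8, 9, 7, 1, 0),
  stringsAsFactors = FALSE
)

is_marker <- toy$OC_DV != "."
# Number the segments within each vehicle, so the count restarts per ID
toy$seg <- ave(as.integer(is_marker), toy$Vehicle.ID2, FUN = cumsum)
# Paste the ID into the grouping key
toy$key <- paste(toy$Vehicle.ID2, toy$seg, sep = "_")

# Drop rows before a vehicle's first label and its last, unclosed segment
last_seg <- ave(toy$seg, toy$Vehicle.ID2, FUN = max)
closed <- toy$seg >= 1 & toy$seg < last_seg

maxes <- tapply(toy$frspacing[closed], toy$key[closed], max)
mins  <- tapply(toy$frspacing[closed], toy$key[closed], min)
maxes
mins
```

Counting per vehicle keeps a label in one vehicle from closing a segment that started in another.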
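d <- your_dput  # the data frame built from the dput() output in the question
# Build your subsetted dataframes (both boundary rows are included)
e <- d[grep("CLDV", d$OC_DV)[1]:grep("OPDV", d$OC_DV)[1], ]
f <- d[grep("OPDV", d$OC_DV)[1]:grep("CLDV", d$OC_DV)[2], ]
# Make the diff() calls; diff(c(a, b)) returns b - a, so wrap in abs() for absolute differences
diff(c(max(e$frspacing), min(f$frspacing)))
diff(c(max(f$frspacing), min(e$frspacing)))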
My values are not quite the same as yours; you can adjust the grep values manually depending on how you want to handle boundary inclusion/exclusion.
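To make the boundary choice explicit, something like the following could work. segment_range() and its include_start / include_end arguments are hypothetical names made up for this sketch, and the values and label positions are toy data.

```r
# Hypothetical helper: max/min of values between two label rows,
# with switches for including or excluding each boundary row
segment_range <- function(values, start, end,
                          include_start = TRUE, include_end = FALSE) {
  lo <- if (include_start) start else start + 1L
  hi <- if (include_end) end else end - 1L
  stopifnot(lo <= hi)
  seg <- values[lo:hi]
  c(Max = max(seg), Min = min(seg))
}

vals <- c(1, 5, 7, 6, 9, 3, 4, 2)  # toy values
markers <- c(2L, 5L, 7L)           # assumed row positions of the labels

half_open <- segment_range(vals, markers[1], markers[2])                     # rows 2..4
closed    <- segment_range(vals, markers[1], markers[2], include_end = TRUE) # rows 2..5
half_open
closed
```

Here the result changes with the boundary setting (the max goes from 7 to 9 once row 5 is included), which is the same kind of mismatch as between my values and yours.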
Below is a base R solution:
MaxMinSeq <- function(df) {
myInd <- which(df$OC_DV != ".")
myVals <- df$frspacing
myTitles <- df$OC_DV[myInd]
myLen <- length(myInd)-1L
NewDf <- as.data.frame(t(sapply(1:myLen, function(x) {
list(Group = paste(c(myTitles[x],"-",myTitles[x+1L]), collapse = ""),
Max = max(myVals[myInd[x]:(myInd[x+1L]-1L)]),
Min = min(myVals[myInd[x]:(myInd[x+1L]-1L)]))})))
for (i in 1:3) {NewDf[,i] <- unlist(NewDf[,i])}
NewDf
}
df2 <- MaxMinSeq(df)
df2
Group Max Min
1 CLDV-OPDV 36.54850 35.70669
2 OPDV-CLDV 36.61791 35.58883
This is a good bit faster than the dplyr solution posted above. Observe:
TestDplyr <- function(df) {
df[df=="."] <- NA
df$group <- paste((na.locf(df$OC_DV, na.rm = FALSE)), lead(na.locf(df$OC_DV, na.rm = FALSE, fromLast = TRUE)), sep = "-")
df$group2 <- NA
df$group2[which(df$group != lag(df$group))] <- 1:length(which(df$group != lag(df$group)))
df$group2 <- na.locf(df$group2, na.rm = FALSE)
df %>% group_by(group, group2) %>%
summarise(Max = max(frspacing), Min = min(frspacing)) %>%
filter(!grepl("NA",group ))
}
library(microbenchmark)
microbenchmark(Joseph = MaxMinSeq(df), Cabana = TestDplyr(df))
Unit: microseconds
expr min lq mean median uq max neval
Joseph 338.671 377.6695 405.0257 405.9945 429.188 496.718 100
Cabana 2622.336 2698.2810 2890.5430 2765.6045 2977.427 7772.180 100
Here is a really big example:
myDfs <- lapply(1:10000, function(x) df)
bigDf <- do.call(rbind, myDfs)
bigDf$frspacing[40:nrow(bigDf)] <- runif((nrow(bigDf)-39), 10, 100)
a <- MaxMinSeq(bigDf)
b <- TestDplyr(bigDf)
b <- b[order(b$group2),]
identical(a$Max, b$Max)
[1] TRUE
identical(a$Min, b$Min)
[1] TRUE
system.time(TestDplyr(bigDf))
user system elapsed
1.54 0.00 1.54
system.time(MaxMinSeq(bigDf))
user system elapsed
0.3 0.0 0.3
As for the second part of the question, I'm not sure how general the OP would like the answer to be, especially when there are more than two different final pairings. For example, does the OP want to find the max of one row and compare that to the min of all other rows, or do we simply compare neighbors? The function below takes the first approach (i.e. the general approach).
GetDiff <- function(df) {
df2 <- cbind(df, t(sapply(1:nrow(df), function(x) {
c(rowMin = min(df[x,2:3]),
rowMax = max(df[x,2:3]))})))
myRows <- 1:nrow(df)
sapply(myRows, function(x) df2$rowMax[x] - min(df2$rowMin[-x]))
}
GetDiff(df2) ## df2 comes from above
[1] 0.95967 0.91122
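If the second reading (comparing neighbors only) is what is wanted, a sketch could look like this. neighbor_diffs() and its output column names are hypothetical; the summary values are copied from the Group/Max/Min output shown earlier.

```r
# Hypothetical helper: compare each segment only with the one right after it
neighbor_diffs <- function(summary_df) {
  n <- nrow(summary_df)
  stopifnot(n >= 2L)
  i <- seq_len(n - 1L)  # index of the first segment in each adjacent pair
  data.frame(
    Pair = paste(summary_df$Group[i], summary_df$Group[i + 1L], sep = " | "),
    MaxFirst_MinSecond = summary_df$Max[i] - summary_df$Min[i + 1L],
    MaxSecond_MinFirst = summary_df$Max[i + 1L] - summary_df$Min[i],
    stringsAsFactors = FALSE
  )
}

# Summary values copied from the Group/Max/Min output earlier in this answer
summary_df <- data.frame(
  Group = c("CLDV-OPDV", "OPDV-CLDV"),
  Max = c(36.54850, 36.61791),
  Min = c(35.70669, 35.58883),
  stringsAsFactors = FALSE
)
nd <- neighbor_diffs(summary_df)
nd
```

With only two segments this gives the same two numbers as GetDiff; the two approaches only diverge once there are three or more segments.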