
Time difference between two rows with NA

I have a dataframe similar to the following (although with 80000 rows), where the first column is "Date.Time" and the remaining columns are variables that contain some NA values. As a reproducible example:

df <- data.frame(
Date= c("2020-01-01 09:50:00", "2020-01-01 09:51:30", "2020-01-01 09:53:00", "2020-01-01 09:54:00",
"2020-01-01 09:55:00", "2020-01-01 09:57:30", "2020-01-01 09:59:00", "2020-01-01 10:01:00"),
Variable1 = c(10,15,NA,25,22,10,11,NA),
Variable2 = c(1,NA,2,5,8,6,8,NA))

What I need, for each variable, is the longest time interval spanned by consecutive rows without NA. In the example above, the value I need for Variable1 is Date[7,1]-Date[4,1] (since Date[2,1]-Date[1,1] is a shorter interval), while for Variable2 it would be Date[7,1]-Date[3,1].
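To make the target concrete, the two expected intervals can be computed by hand. This is only a sketch on the example data: the `tz = "UTC"` argument and the names `dt`, `expected_var1` and `expected_var2` are assumptions introduced here, not part of the original question.

```r
df <- data.frame(
  Date = c("2020-01-01 09:50:00", "2020-01-01 09:51:30", "2020-01-01 09:53:00", "2020-01-01 09:54:00",
           "2020-01-01 09:55:00", "2020-01-01 09:57:30", "2020-01-01 09:59:00", "2020-01-01 10:01:00"),
  Variable1 = c(10, 15, NA, 25, 22, 10, 11, NA),
  Variable2 = c(1, NA, 2, 5, 8, 6, 8, NA))

# Parse the timestamps once; tz = "UTC" is an assumption that keeps the
# result independent of the local time zone.
dt <- as.POSIXct(df$Date, tz = "UTC")

# Variable1: the longest NA-free stretch is rows 4 to 7
expected_var1 <- difftime(dt[7], dt[4], units = "mins")
# Variable2: the longest NA-free stretch is rows 3 to 7
expected_var2 <- difftime(dt[7], dt[3], units = "mins")

expected_var1  # Time difference of 5 mins
expected_var2  # Time difference of 6 mins
```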

I've been trying the rle() function, which gives me, for each variable, the lengths of the NA and non-NA runs:

is.na.rle222 <- rle(is.na(df[, "Variable1"]))

But that only gives me the length of the longest run (in rows), with no link to the dates.

I hope my question is clear.

Thanks in advance.

You can split Date into runs of consecutive NA / non-NA values and get the maximum difference per group using range and diff, like this:

i <- cumsum(c(1, abs(diff(is.na(df$Variable1)))))
x <- lapply(split(as.POSIXct(df$Date), i), function(x) diff(range(x)))
x[[which.max(x)]]
#Time difference of 5 mins
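Two edge cases are worth keeping in mind with the snippet above. Runs of NA values are also treated as groups, so a long NA stretch could be picked as the maximum. In addition, diff() on POSIXct values chooses its units automatically, so which.max() may end up comparing a value in secs with one in mins. The following is a minimal sketch that restricts the search to non-NA runs and compares everything in seconds. The helper name `longest_non_na_span`, the `tz = "UTC"` setting and the assumption that the timestamps are sorted in ascending order are all introduced here for illustration.

```r
# Hypothetical helper: longest time span covered by consecutive non-NA values.
# Assumes `time` is sorted in ascending order.
longest_non_na_span <- function(time, value) {
  time <- as.POSIXct(time, tz = "UTC")               # tz = "UTC" is an assumption
  run_id <- cumsum(c(1, abs(diff(is.na(value)))))    # same run labelling as above
  keep <- !is.na(value)                              # drop the rows holding NA
  if (!any(keep)) return(NULL)                       # nothing to measure

  # Row indices of each non-NA run, then the first and last row of every run
  runs <- split(which(keep), run_id[keep])
  first_row <- vapply(runs, function(idx) idx[1], integer(1))
  last_row  <- vapply(runs, function(idx) idx[length(idx)], integer(1))

  # Fixed units make the durations directly comparable
  secs <- as.numeric(difftime(time[last_row], time[first_row], units = "secs"))
  best <- which.max(secs)

  list(start = time[first_row[best]],
       end = time[last_row[best]],
       duration = difftime(time[last_row[best]], time[first_row[best]], units = "secs"))
}

df <- data.frame(
  Date = c("2020-01-01 09:50:00", "2020-01-01 09:51:30", "2020-01-01 09:53:00", "2020-01-01 09:54:00",
           "2020-01-01 09:55:00", "2020-01-01 09:57:30", "2020-01-01 09:59:00", "2020-01-01 10:01:00"),
  Variable1 = c(10, 15, NA, 25, 22, 10, 11, NA),
  Variable2 = c(1, NA, 2, 5, 8, 6, 8, NA))

res1 <- longest_non_na_span(df$Date, df$Variable1)
res2 <- longest_non_na_span(df$Date, df$Variable2)
```

Returning the start and end timestamps together with the duration also provides the link back to the dates that the question asks for.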

Using the logic from @GKi with dplyr, and trying to be more explicit:

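library(dplyr)
# Give every run of consecutive NA / non-NA values its own group id,
# then compute the time range covered by each group.
(
  df
  %>% mutate(Var1_interval_grp = cumsum(c(1, abs(diff(is.na(Variable1))))),
             Var2_interval_grp = cumsum(c(1, abs(diff(is.na(Variable2))))))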
  %>% group_by(Var1_interval_grp)
  %>% mutate(Range_Var1 = diff(range(as.POSIXct(Date))))
  %>% ungroup
  %>% group_by(Var2_interval_grp)
  %>% mutate(Range_Var2 = diff(range(as.POSIXct(Date))))
  %>% ungroup
  %>% select(! contains("grp"))
) -> df

The output is now:

> df 
# A tibble: 8 x 5
  Date                Variable1 Variable2 Range_Var1 Range_Var2
  <chr>                   <dbl>     <dbl> <drtn>     <drtn>    
1 2020-01-01 09:50:00        10         1  90 secs     0 secs  
2 2020-01-01 09:51:30        15        NA  90 secs     0 secs  
3 2020-01-01 09:53:00        NA         2   0 secs   360 secs  
4 2020-01-01 09:54:00        25         5 300 secs   360 secs  
5 2020-01-01 09:55:00        22         8 300 secs   360 secs  
6 2020-01-01 09:57:30        10         6 300 secs   360 secs  
7 2020-01-01 09:59:00        11         8 300 secs   360 secs  
8 2020-01-01 10:01:00        NA        NA   0 secs     0 secs  

It is then easy to get the dates at which the maximum is reached:

(
  df
  %>% filter(Range_Var1 == max(Range_Var1))
  %>% pull(Date)
) 

which produces:

[1] "2020-01-01 09:54:00" "2020-01-01 09:55:00" "2020-01-01 09:57:30"
[4] "2020-01-01 09:59:00"
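The rle() attempt from the question can also be linked back to the dates: the cumulative sum of the run lengths gives the last row of each run, and subtracting the run length gives its first row. Below is a rough base R sketch of that idea, applied to every variable column. The helper name `non_na_runs`, the `tz = "UTC"` setting and the assumption that rows are sorted by time are illustrative choices, not part of the answers above.

```r
# Hypothetical helper: one row per run of non-NA values, with its start time,
# end time and duration in seconds. Assumes `time` is sorted ascending.
non_na_runs <- function(time, value) {
  time <- as.POSIXct(time, tz = "UTC")      # tz = "UTC" is an assumption
  r <- rle(!is.na(value))                   # TRUE runs are the non-NA stretches
  end_row <- cumsum(r$lengths)              # last row of every run
  start_row <- end_row - r$lengths + 1      # first row of every run
  out <- data.frame(start = time[start_row], end = time[end_row])[r$values, ]
  out$secs <- as.numeric(difftime(out$end, out$start, units = "secs"))
  out
}

df <- data.frame(
  Date = c("2020-01-01 09:50:00", "2020-01-01 09:51:30", "2020-01-01 09:53:00", "2020-01-01 09:54:00",
           "2020-01-01 09:55:00", "2020-01-01 09:57:30", "2020-01-01 09:59:00", "2020-01-01 10:01:00"),
  Variable1 = c(10, 15, NA, 25, 22, 10, 11, NA),
  Variable2 = c(1, NA, 2, 5, 8, 6, 8, NA))

# Longest non-NA run for every column except the timestamp column
value_cols <- setdiff(names(df), "Date")
longest <- lapply(df[value_cols], function(v) {
  runs <- non_na_runs(df$Date, v)
  runs[which.max(runs$secs), ]
})
```

Here `longest` is a named list with one single-row data frame per variable, so `longest$Variable1` holds the start, end and duration of the longest NA-free stretch of Variable1.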
