Aggregate by consecutive values and group

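In the following data set, I have filtered the JSON intervals down to the instances where the bike count equals zero. station_summary_id represents an interval in time and increments in consecutive integers (in the example, 64129 is associated with "2014-10-01 07:00:00", 64130 would be associated with "2014-10-01 07:10:00", and so on). station_id is the unique ID of the station.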

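My goal: find the longest chain of consecutive integers by station_id, in other words, find the longest period during which each station was empty. As far as I can tell, this requires first grouping by station_id and then counting the longest consecutive sequence of station_summary_id, but I don't know how to do this automatically for all station IDs.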

> dim(data)
[1] 307039      7


> head(data)
      station_id status available_bike_count          created_at station_summary_id month year
13694          2 Active                    0 2014-10-01 07:00:00              64129    10 2014
13702         10 Active                    0 2014-10-01 07:00:00              64129    10 2014
13706         14 Active                    0 2014-10-01 07:00:00              64129    10 2014
13710         18 Active                    0 2014-10-01 07:00:00              64129    10 2014
13713         21 Active                    0 2014-10-01 07:00:00              64129    10 2014
13728         36 Active                    0 2014-10-01 07:00:00              64129    10 2014

Reproducible example:

> dput(dat)
structure(list(station_id = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L), status = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L), .Label = "Active", class = "factor"), available_bike_count = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), station_summary_id = c(64129L, 
64130L, 64131L, 64132L, 64133L, 64134L, 64136L, 64138L, 64139L, 
64140L, 64141L, 64142L, 64143L, 64144L, 64145L, 64146L, 64147L, 
64148L, 64149L, 64150L, 64152L, 64161L, 64162L, 64170L, 64273L, 
64322L, 64324L, 64341L, 64884L, 64886L, 64896L, 64897L, 64898L, 
64899L, 64900L, 64901L, 64902L, 64903L, 64904L, 64905L, 64906L, 
64907L, 64908L, 64909L, 64910L, 64911L, 64912L, 64913L, 64917L, 
64918L, 65214L, 65219L, 66314L, 66439L, 66450L, 66583L, 66587L, 
66589L, 66600L, 66872L, 66880L, 67037L, 67048L, 82854L, 82855L, 
82856L, 82857L, 82858L, 82859L, 82860L, 82861L, 82862L, 82863L, 
82867L, 82868L), month = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
10L, 10L, 10L), year = c(2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 
2014L, 2014L, 2014L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 
2015L, 2015L, 2015L, 2015L, 2015L, 2015L)), .Names = c("station_id", 
"status", "available_bike_count", "station_summary_id", "month", 
"year"), row.names = c(NA, -75L), class = "data.frame")

See ?rle for a better understanding of the possible uses of run-length encoding.
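As a quick illustration of what rle returns, here is a minimal sketch on a made-up vector (x is hypothetical and not taken from the question's data):

```r
# Hypothetical toy ids: 1..4 are consecutive, then a gap, then 10..11
x <- c(1L, 2L, 3L, 4L, 10L, 11L)

d <- diff(x)   # differences between neighbours: 1 1 1 6 1
r <- rle(d)    # run-length encoding of those differences

r$values       # 1 6 1  (the value of each run, in order)
r$lengths      # 3 1 1  (how many times each value repeats)
```

The longest run of 1s has length 3, which corresponds to the four consecutive ids 1 to 4.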

With the new data:

> max( rle( diff(dat$station_summary_id) )$lengths )
[1] 12
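One caveat worth keeping in mind: rle(diff(x)) measures runs of any repeated difference, not only differences of 1, and a run of n differences spans n + 1 ids. The following is a minimal sketch of a stricter helper, assuming the ids are already sorted within a station; the name longest_consecutive is made up here for illustration.

```r
# Hypothetical helper: number of ids in the longest chain of consecutive integers
longest_consecutive <- function(ids) {
  if (length(ids) < 2L) return(length(ids))
  r <- rle(diff(ids) == 1L)        # TRUE runs are steps of exactly 1
  steps <- r$lengths[r$values]     # keep the lengths of the TRUE runs only
  if (length(steps) == 0L) 1L else max(steps) + 1L
}

# ids 5, 7, 9, 11 have equal gaps of 2 but are not consecutive
longest_consecutive(c(5L, 7L, 9L, 11L))       # 1
max(rle(diff(c(5L, 7L, 9L, 11L)))$lengths)    # 3, which overstates the chain
longest_consecutive(c(1L, 2L, 3L, 7L, 8L))    # 3
```

On the question's data the longest runs happen to be runs of 1s, so the run lengths reported here (12, 17 and 9) would correspond to chains of 13, 18 and 10 ids.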

In the revised example with multiple station_id values, I find that aggregate works well:

 aggregate( dat$station_summary_id, dat['station_id'], FUN= function(d) max( rle( diff(d) )$lengths ) )
#---------
  station_id  x
1          2 12
2          3 17
3          4  9

This can also be done with data.table syntax:

library(data.table)
dat <- setDT(dat)  # converts dat to a data.table by reference
dat[,   max( rle( diff(station_summary_id) )$lengths ) , by='station_id']
#-----
   station_id V1
1:          2 12
2:          3 17
3:          4  9

You can find the maximum duration by station ID using dplyr, data.table, or base R, using the function rle as mentioned by @42.

#dplyr
library(dplyr)
data %>% group_by(station_id) %>% 
  summarise(with(rle(station_summary_id), values[which.max(lengths)]))

#data.table
library(data.table)
setDT(data)[,list(with(rle(station_summary_id),
               values[which.max(lengths)])),by=station_id]

#base R
lapply(split(data$station_summary_id, data$station_id), 
       function(x) with(rle(x), values[which.max(lengths)]))

Edit

With the new data:

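# dt is assumed to be the data.table version of dat, e.g. dt <- as.data.table(dat)
# runs where the step is not > 1 (values == FALSE) are the consecutive stretches
dt[, with(rle(diff(station_summary_id) > 1), max(lengths[!values])), by = station_id]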
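If you also want to know where the longest stretch starts and ends, the following base R sketch may help. It is only an illustration under stated assumptions: longest_chain and the toy data frame are hypothetical, each station is assumed to have at least one row, and the 10-minute spacing is taken from the question's example timestamps.

```r
# Hypothetical helper: first id, last id and size of the longest consecutive chain
longest_chain <- function(ids) {
  ids <- sort(unique(ids))
  # start a new chain whenever the step from the previous id is not exactly 1
  chains <- split(ids, cumsum(c(TRUE, diff(ids) != 1L)))
  best <- chains[[which.max(lengths(chains))]]   # first chain wins on ties
  c(first_id = best[1], last_id = best[length(best)], n_intervals = length(best))
}

# Made-up toy data, not the question's data
toy <- data.frame(
  station_id         = c(2L, 2L, 2L, 2L, 3L, 3L, 3L),
  station_summary_id = c(10L, 11L, 12L, 20L, 10L, 30L, 31L)
)

# One row per station: station 2 -> ids 10..12 (3 intervals), station 3 -> ids 30..31 (2 intervals)
res <- t(sapply(split(toy$station_summary_id, toy$station_id), longest_chain))
res

# Assuming one snapshot every 10 minutes: time between first and last empty snapshot
minutes <- (res[, "n_intervals"] - 1) * 10
```

Splitting on cumsum(diff(ids) != 1) is a swapped-in alternative to rle here, chosen because it keeps the ids themselves rather than only the run lengths.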

