R data.table: adjust min and max years only when a group has at least one incrementing observation

Question

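I have a dataset containing id, location, start year, end year, age1 and age2. For each group defined by id, location, age1 and age2, I want to create a new start and end year. For example, I might have 3 entries for china covering ages 0-4: one is 2000-2000, another is 2001-2001, and the last is 2005-2005. Because the years of the first two entries increase by 1, I want their corresponding newstart and newend to be 2000-2001. The third entry would have newstart==2005 and newend==2005, since it is not part of a run of consecutive years.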

The data table I have looks like the following, except that it has thousands of entries across many combinations:

    id    location   start   end   age1   age2
    1     brazil     2000    2000  0      4
    1     brazil     2001    2001  0      4
    1     brazil     2002    2002  0      4
    2     argentina  1990    1991  1      1
    2     argentina  1991    1991  2      2
    2     argentina  1992    1992  2      2
    2     argentina  1993    1993  2      2
    3     belize     2001    2001  0.5    1
    3     belize     2005    2005  1      2
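
For reproducibility, the table above could be built as follows. This is only a sketch: the object name `dat` is an assumption (the second answer uses `dat`, the first calls the same table `df1`), and the column types are inferred from the printed output in the answers.

```r
# Sample data copied from the table above.
# The name `dat` is an assumption; assign it to `df1` instead to follow the first answer.
dat <- data.frame(
  id       = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L),
  location = c("brazil", "brazil", "brazil",
               "argentina", "argentina", "argentina", "argentina",
               "belize", "belize"),
  start    = c(2000L, 2001L, 2002L, 1990L, 1991L, 1992L, 1993L, 2001L, 2005L),
  end      = c(2000L, 2001L, 2002L, 1991L, 1991L, 1992L, 1993L, 2001L, 2005L),
  age1     = c(0, 0, 0, 1, 2, 2, 2, 0.5, 1),
  age2     = c(4L, 4L, 4L, 1L, 2L, 2L, 2L, 1L, 2L),
  stringsAsFactors = FALSE
)
str(dat)
```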

I want to change the data table so that it looks like this:

    id    location   start   end   age1   age2  newstart   newend
    1     brazil     2000    2000  0      4     2000       2002
    1     brazil     2001    2001  0      4     2000       2002
    1     brazil     2002    2002  0      4     2000       2002
    2     argentina  1990    1991  1      1     1991       1991
    2     argentina  1991    1991  2      2     1991       1993
    2     argentina  1992    1992  2      2     1991       1993
    2     argentina  1993    1993  2      2     1991       1993
    3     belize     2001    2001  0.5    1     2001       2001
    3     belize     2005    2005  1      2     2005       2005

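I tried creating a variable that uses a lag to keep track of the previous year, and then computing the difference between the previous and the current year. I then created newstart and newend by taking the minimum start and the maximum end. I found that this only works when a run of consecutive years has exactly 2 entries. With a larger set it fails, because it does not keep track of how many observations in each group have years increasing by 1. I believe I need some kind of loop.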

Is there a more efficient way to achieve this?

Answer 1

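We can use dplyr. After grouping by 'id', take the difference between 'start' and the lag of 'start', and start a new run wherever that difference is not 1; the cumulative sum of these breaks is the run id 'grp'. Then create 'newstart' and 'newend' as the min of 'start' and the max of 'end' within each run.

library(dplyr)
library(tidyr)   # for replace_na()
df1 %>% 
   group_by(id) %>%
   # compute the run id in its own mutate() so that it is evaluated per id
   mutate(grp = cumsum(replace_na(start - lag(start), 0) != 1)) %>%
   group_by(grp, .add = TRUE) %>%
   mutate(newstart = min(start), newend = max(end))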

Output:

# A tibble: 9 x 9
# Groups:   id, grp [4]
#     id location  start   end  age1  age2   grp newstart newend
#  <int> <chr>     <int> <int> <dbl> <int> <int>    <int>  <int>
#1     1 brazil     2000  2000   0       4     1     2000   2002
#2     1 brazil     2001  2001   0       4     1     2000   2002
#3     1 brazil     2002  2002   0       4     1     2000   2002
#4     2 argentina  1990  1991   1       1     1     1990   1993
#5     2 argentina  1991  1991   2       2     1     1990   1993
#6     2 argentina  1992  1992   2       2     1     1990   1993
#7     2 argentina  1993  1993   2       2     1     1990   1993
#8     3 belize     2001  2001   0.5     1     1     2001   2001
#9     3 belize     2005  2005   1       2     2     2005   2005

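Or with data.table:

library(data.table)
# number the runs of consecutive start years within each id, then take min/max per (id, run)
setDT(df1)[, grp := cumsum(c(TRUE, diff(start) != 1)), by = id
         ][, c('newstart', 'newend') := .(min(start), max(end)), by = .(id, grp)][, grp := NULL]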
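
To see how a run identifier separates consecutive years, here is a minimal base R sketch on a bare vector. The years are hypothetical (they are not from the question's table) and were chosen so that a gap is followed by a second run; the names `is_break` and `run_id` are illustrative only.

```r
# Hypothetical start years for a single id, containing two separate runs.
years <- c(2000, 2001, 2005, 2006)

# TRUE marks the first year of each run: the first element, and any year
# that does not follow its predecessor by exactly 1.
is_break <- c(TRUE, diff(years) != 1)

# The cumulative sum of the breaks gives one identifier per run.
run_id <- cumsum(is_break)

# Smallest and largest year within each run, repeated for every row of the run.
newstart <- ave(years, run_id, FUN = min)
newend   <- ave(years, run_id, FUN = max)
data.frame(years, run_id, newstart, newend)
```

A cumulative sum over the breaks is used because the identifier should advance only at a gap, so 2005 and 2006 end up in the same run.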

Answer 2

data.table

You tagged this with data.table, so my first suggestion is:

library(data.table)
# start a new run whenever the gap to the previous start is not exactly 1
dat[, contiguous := cumsum(c(TRUE, diff(start) != 1)), by = .(id)]
dat[, c("newstart", "newend") := .(min(start), max(end)), by = .(id, contiguous)]
dat[, contiguous := NULL]
dat
#    id  location start  end age1 age2 newstart newend
# 1:  1    brazil  2000 2000  0.0    4     2000   2002
# 2:  1    brazil  2001 2001  0.0    4     2000   2002
# 3:  1    brazil  2002 2002  0.0    4     2000   2002
# 4:  2 argentina  1990 1991  1.0    1     1990   1993
# 5:  2 argentina  1991 1991  2.0    2     1990   1993
# 6:  2 argentina  1992 1992  2.0    2     1990   1993
# 7:  2 argentina  1993 1993  2.0    2     1990   1993
# 8:  3    belize  2001 2001  0.5    1     2001   2001
# 9:  3    belize  2005 2005  1.0    2     2005   2005
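
The output above groups by id alone, so argentina rows with different age bands share one range. The question defines a group as id, location, age1 and age2, so a variant that groups on all four columns may be closer to the intent. The following is a sketch under that assumption; the names `keys` and `run` are illustrative, and the rows are assumed to be sorted by start within each group.

```r
library(data.table)

# Sample data copied from the question.
dat <- data.table(
  id       = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L),
  location = c("brazil", "brazil", "brazil",
               "argentina", "argentina", "argentina", "argentina",
               "belize", "belize"),
  start    = c(2000L, 2001L, 2002L, 1990L, 1991L, 1992L, 1993L, 2001L, 2005L),
  end      = c(2000L, 2001L, 2002L, 1991L, 1991L, 1992L, 1993L, 2001L, 2005L),
  age1     = c(0, 0, 0, 1, 2, 2, 2, 0.5, 1),
  age2     = c(4L, 4L, 4L, 1L, 2L, 2L, 2L, 1L, 2L)
)

# Grouping columns as described in the question (an assumption about the intent).
keys <- c("id", "location", "age1", "age2")

# Number the runs of consecutive start years within each key combination
# (assumes rows are already sorted by start within each group).
dat[, run := cumsum(c(TRUE, diff(start) != 1)), by = keys]

# Range of each run, written back to every row of that run.
dat[, c("newstart", "newend") := .(min(start), max(end)), by = c(keys, "run")]
dat[, run := NULL]
dat
```

With the sample data, the first argentina row (ages 1-1) gets its own range of 1990-1991, while the three rows for ages 2-2 get 1991-1993. The question's desired table lists 1991-1991 for that first row even though it starts in 1990.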

Base R

If instead you really just mean a data.frame, then:

dat <- transform(dat, contiguous = ave(start, id, FUN = function(a) cumsum(c(TRUE, diff(a) != 1))))
dat <- transform(dat,
  newstart = ave(start, id, contiguous, FUN = min),
  newend   = ave(end  , id, contiguous, FUN = max)
)
# Warning in FUN(X[[i]], ...) :
#   no non-missing arguments to min; returning Inf
# Warning in FUN(X[[i]], ...) :
#   no non-missing arguments to min; returning Inf
# Warning in FUN(X[[i]], ...) :
#   no non-missing arguments to max; returning -Inf
# Warning in FUN(X[[i]], ...) :
#   no non-missing arguments to max; returning -Inf

dat
#   id  location start  end age1 age2 newstart newend contiguous
# 1  1    brazil  2000 2000  0.0    4     2000   2002          1
# 2  1    brazil  2001 2001  0.0    4     2000   2002          1
# 3  1    brazil  2002 2002  0.0    4     2000   2002          1
# 4  2 argentina  1990 1991  1.0    1     1990   1993          1
# 5  2 argentina  1991 1991  2.0    2     1990   1993          1
# 6  2 argentina  1992 1992  2.0    2     1990   1993          1
# 7  2 argentina  1993 1993  2.0    2     1990   1993          1
# 8  3    belize  2001 2001  0.5    1     2001   2001          1
# 9  3    belize  2005 2005  1.0    2     2005   2005          2
dat$contiguous <- NULL

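I just learned something interesting about ave: it uses interaction(...) over all grouping variables, which yields every possible combination, not just the ones observed in the data. FUN can therefore be called with zero-length data. Here that is what happened, which produced the warnings. One can suppress them by using function(a) suppressWarnings(min(a)) instead of just min.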
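
The suppression described above could look like the following sketch. The wrapper names `quiet_min` and `quiet_max` are hypothetical, and the data is reduced to the three columns the computation needs.

```r
# Reduced sample data: only the columns used by the computation.
dat <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  start = c(2000, 2001, 2002, 1990, 1991, 1992, 1993, 2001, 2005),
  end   = c(2000, 2001, 2002, 1991, 1991, 1992, 1993, 2001, 2005)
)

# Run id within each id: a new run starts when the gap to the previous start is not 1.
dat$contiguous <- ave(dat$start, dat$id,
                      FUN = function(a) cumsum(c(TRUE, diff(a) != 1)))

# Hypothetical wrappers that silence the warning min()/max() give on empty input,
# which can occur for id/contiguous combinations that are absent from the data.
quiet_min <- function(a) suppressWarnings(min(a))
quiet_max <- function(a) suppressWarnings(max(a))

dat$newstart <- ave(dat$start, dat$id, dat$contiguous, FUN = quiet_min)
dat$newend   <- ave(dat$end,   dat$id, dat$contiguous, FUN = quiet_max)
dat$contiguous <- NULL
dat
```

The values computed for empty combinations are never assigned to any row, so only the warnings are affected, not the result.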
