標記所有成員在R中滿足特定要求的組

Question

假設以下數據：

GroupId <-          c(1,1,1,1,2,2,2,3,3)
IndId <-            c(1,1,2,2,3,4,4,5,5)
IndGroupProperty <- c(1,2,1,2,3,3,4,5,6)
PropertyType <-     c(1,2,1,2,2,2,1,2,2)

df <- data.frame(GroupId, IndId, IndGroupProperty, PropertyType)
df

這些是多級數據，其中每個組GroupId由一個或多個個體IndId組成，其具有對一個或多個屬性IndGroupProperty訪問IndGroupProperty ，這些屬性對於它們各自的組是唯一的（即屬性1屬於組1而沒有其他組）。 這些屬性均屬於PropertyType類型。

任務是使用虛擬變量標記每一行，其中至少有一個類型1屬性屬於組中的每個個體。

對於我們的示例數據，這只是：

ValidGroup <-       c(1,1,1,1,0,0,0,0,0)
df <- data.frame(df, ValidGroup)
df

前四行標記為1，因為組（1）的每個（1,2）都可以訪問類型1屬性（1）。 后續三行屬於組（2），其中只有個人（4）可以訪問類型1屬性（4）。 因此這些都沒有標記（0）。 最后兩行也沒有收到任何標志。 組（3）僅由單個個體（5）組成，可訪問兩個類型2屬性（5,6）。

我已經研究了幾個命令： levels似乎缺乏團隊支持; nlme包中的getGroups不喜歡我的真實數據的輸入; 我想在doBy可能有一些有用的doBy ，但是summaryBy似乎並沒有把levels作為一個函數。

解決方案編輯：Henrik的dplyr解決方案包含在一個函數中：

foobar <- function(object, group, ind, type){
groupvar <- deparse(substitute(group)) 
indvar <- deparse(substitute(ind)) 
typevar <- deparse(substitute(type)) 
eval(substitute(
object[, c(groupvar, indvar, typevar)] %.%
  group_by(group, ind) %.%
  mutate(type1 = any(type == 1))  %.%
  group_by(group, add = FALSE) %.%
  mutate(ValidGroup = all(type1) * 1) %.%
  select(-type1)
  ))
}

Answer 1

你也可以嘗試ave ：

# for each individual within group, calculate number of 1s in PropertyType
v1 <- with(df, ave(PropertyType, list(GroupId, IndId), FUN = function(x) sum(x == 1)))

# within each group, check if all v1 is 1.
# The boolean result is coerced to 1 and 0 by ave.  
df$ValidGroup <- ave(v1, df$GroupId, FUN = function(x) all(x == 1))

#   GroupId IndId IndGroupProperty PropertyType ValidGroup
# 1       1     1                1            1          1
# 2       1     1                2            2          1
# 3       1     2                1            1          1
# 4       1     2                2            2          1
# 5       2     3                3            2          0
# 6       2     4                3            2          0
# 7       2     4                4            1          0
# 8       3     5                5            2          0
# 9       3     5                6            2          0

編輯為不同大小的數據集添加了dplyr替代和基准：原始數據和比原始數據大10倍和100倍的數據。

首先結束函數中的替代方案：

fun_ave <- function(df){
  v1 <- with(df, ave(PropertyType, list(GroupId, IndId), FUN = function(x) sum(x == 1)))
df$ValidGroup <- ave(v1, list(df$GroupId), FUN = function(x) all(x == 1))
df  
}

library(dplyr)
fun_dp <- function(df){
df %.%
  group_by(GroupId, IndId) %.%
  mutate(
    type1 = any(PropertyType == 1)) %.%
  group_by(GroupId, add = FALSE) %.%
  mutate(
    ValidGroup = all(type1) * 1) %.%
  select(-type1)
}


fun_by <- function(df){
  bar <- by(data=df,INDICES=df$GroupId,FUN=function(xx){
    foo <- table(xx$IndId,xx$PropertyType)
    if ( !("1" %in% colnames(foo)) ) {
      return(FALSE)   # no PropertyType=1 at all in this group
    } else {
      return(all(foo[,"1"]>0))    # return whether all IndId have an 1 entry
    }})
  cbind(df,ValidGroup = as.integer(bar[as.character(df$GroupId)]))
}

基准

原始數據：

microbenchmark(
  fun_ave(df),
  fun_dp(df),
  fun_by(df))

# Unit: microseconds
#        expr      min        lq    median        uq       max neval
# fun_ave(df)  497.964  519.8215  538.8275  563.5355   651.535   100
#  fun_dp(df)  851.861  870.6765  931.1170  968.5590  1760.360   100
#  fun_by(df) 1343.743 1412.5455 1464.6225 1581.8915 12588.607   100

在一個很小的數據集ave大約快兩倍， dplyr和速度比2.5倍以上by 。

生成一些更大的數據; 團體和個人數量的10倍

GroupId <- sample(1:30, 100, replace = TRUE)
IndId <- sample(1:50, 100, replace = TRUE)
PropertyType <- sample(1:2, 100, replace = TRUE)
df2 <- data.frame(GroupId, IndId, PropertyType)

microbenchmark(
  fun_ave(df2),
  fun_dp(df2),
  fun_by(df2))
# Unit: milliseconds
#          expr      min       lq    median        uq       max neval
#  fun_ave(df2) 2.928865 3.185259  3.270978  3.435002  5.151457   100
#   fun_dp(df2) 1.079176 1.231226  1.273610  1.352866  2.717896   100
#   fun_by(df2) 9.464359 9.855317 10.137180 10.484994 12.445680   100

dplyr比ave快三倍，比by快近10倍。

團體和個人數量的100倍

GroupId <- sample(1:300, 1000, replace = TRUE)
IndId <- sample(1:500, 1000, replace = TRUE)
PropertyType <- sample(1:2, 1000, replace = TRUE)
df2 <- data.frame(GroupId, IndId, PropertyType)

microbenchmark(
  fun_ave(df2),
  fun_dp(df2),
  fun_by(df2))

# Unit: milliseconds
# expr        min         lq    median        uq      max neval
# fun_ave(df2) 337.889895 392.983915 413.37554 441.58179 549.5516   100
#  fun_dp(df2)   3.253872   3.477195   3.58173   3.73378  75.8730   100
#  fun_by(df2)  92.248791 102.122733 104.09577 109.99285 186.6829   100

ave現在真的很失落。 dplyr比快了近30倍by ，並且要快100倍以上ave 。

Answer 2

嘗試這個：

bar <- by(data=df,INDICES=df$GroupId,FUN=function(xx){
    foo <- table(xx$IndId,xx$PropertyType)
    if ( !("1" %in% colnames(foo)) ) {
        return(FALSE)   # no PropertyType=1 at all in this group
    } else {
        return(all(foo[,"1"]>0))    # return whether all IndId have an 1 entry
    }})
cbind(df,bar[as.character(df$GroupId)])

關鍵是使用by() 通過分組變量應用函數，這里是你的df$GroupId 。 要應用的函數是匿名函數。 對於每個大塊（由分組變量定義），它創建一個table的的IndId和PropertyType的條目。 然后查看PropertyType是否出現“1” - 如果不是，則返回FALSE ，如果是，則查看每個IndId是否至少有一個“1”條目（即，是否所有條目都在“1”列中table > 0）。

我們將by()調用的結果存儲在結構bar ，該結構bar根據分組變量中的級別命名。 這反過來允許我們將結果回滾到原始data.frame 。 注意我在這里使用as.character()來確保整數被解釋為條目名稱，而不是條目號。 當事物具有可被解釋為數字的名稱時，經常會發生壞事。

如果你真的想要0-1結果而不是TRUE-FALSE ，只需添加一個as.numeric() 。

編輯。 讓我們把它變成一個函數。

foobar <- function(object, group, ind, type) {
    bar <- by(data=object,INDICES=object[,group],FUN=function(xx){
        foo <- table(xx[,ind],xx[,type])
        if ( !("1" %in% colnames(foo)) ) {
            return(FALSE)   # no PropertyType=1 at all in this group
        } else {
            return(all(foo[,"1"]>0))    # return whether all IndId have an 1 entry
        }})
    cbind(object,bar[as.character(object[,group])])
}

foobar(df,"GroupId","IndId","PropertyType")

這仍然需要目標正好是“1”，但當然這也可以作為參數包含在函數定義中。 只需確保保持包含列名的 列名和變量。

標記所有成員在R中滿足特定要求的組

問題描述

2 個解決方案

解決方案1
2 已采納 2014-04-04 10:26:16

解決方案2
1 2014-04-04 08:40:13

標記所有成員在R中滿足特定要求的組

問題描述

2 個解決方案

解決方案1 2 已采納 2014-04-04 10:26:16

解決方案2 1 2014-04-04 08:40:13

解決方案1
2 已采納 2014-04-04 10:26:16

解決方案2
1 2014-04-04 08:40:13