Comparison between groups in a grouped data frame

Question

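I am trying to compare the items in consecutive groups of a data frame. I imagine this is easy when you know what you are doing...

My data set can be represented as follows: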

set.seed(1)
data <- data.frame(
 date = c(rep('2015-02-01',15), rep('2015-02-02',16), rep('2015-02-03',15)),
 id = as.character(c(1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,16,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE)))
)

The resulting data frame looks like this:

date    id
1/02/2015   1008
1/02/2015   1009
1/02/2015   1011
1/02/2015   1015
1/02/2015   1008
1/02/2015   1014
1/02/2015   1015
1/02/2015   1012
1/02/2015   1012
1/02/2015   1006
1/02/2015   1008
1/02/2015   1007
1/02/2015   1012
1/02/2015   1009
1/02/2015   1013
2/02/2015   1010
2/02/2015   1013
2/02/2015   1015
2/02/2015   1009
2/02/2015   1013
2/02/2015   1015
2/02/2015   1008
2/02/2015   1012
2/02/2015   1007
2/02/2015   1008
2/02/2015   1009
2/02/2015   1006
2/02/2015   1009
2/02/2015   1014
2/02/2015   1009
2/02/2015   1010
3/02/2015   1011
3/02/2015   1010
3/02/2015   1007
3/02/2015   1014
3/02/2015   1012
3/02/2015   1013
3/02/2015   1007
3/02/2015   1013
3/02/2015   1010

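I then want to group the data by date (group_by) and filter out the duplicates (distinct) before comparing between groups. What I want to do is determine, day by day, which new IDs were added and which IDs departed. So day 1 and day 2 would be compared to find the IDs in day 2 that were not in day 1 and the IDs in day 1 that are no longer present in day 2; the same comparison would then be made between day 2 and day 3, and so on.
The comparison itself is easy to do with anti_join (dplyr), but I don't know how to refer to the individual groups within the data set.

My attempt (or one of my attempts) looks like this: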

data %>%
  group_by(date) %>%
  distinct(id) %>%
  do(lost = anti_join(., lag(.), by="id"))

But of course this does not work, and I get:

Error in anti_join_impl(x, y, by$x, by$y) : Can't join on 'id' x 'id' because of incompatible types (factor / logical)

Is what I am trying to do even possible? Or should I write a clunky function to do it?

Answer 1

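I am sure I am not supposed to vote for my own answer, but I must say I like my answer best. I was hoping for an answer that used dplyr tools to solve the problem, so I kept researching, and I think I now have a (semi-)elegant solution (apart from the for loop in the function).

Generate the sample data set in the same way, but with more data to make it more interesting: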

set.seed(1)
data <- data.frame(
  date = c(rep('2015-02-01',15), rep('2015-02-02',16), rep('2015-02-03',15), rep('2015-02-04',15), rep('2015-02-05',15)),
  id = as.character(c(1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,16,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE)))
)

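While searching the internet I found the nest() function (from the tidyr package), which looked like it would solve all of my grouping problems. nest() takes the groups created by group_by() and rolls them up into a list of data frames, so you end up with one entry for each value of the variable you grouped by, plus a data frame holding all the remaining variables that belong to that group. That is: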

dataNested <- data %>%
  group_by(date) %>%
  distinct(id) %>%
  nest()

This produces a rather strange-looking data frame, as follows:

     date          data
1    2015-02-01    list(id = c(3, 4, 6, 10, 9, 7, 1, 2, 8))
2    2015-02-02    list(id = c(5, 8, 10, 4, 3, 7, 2, 1, 9))
3    2015-02-03    list(id = c(6, 5, 2, 9, 7, 8))
4    2015-02-04    list(id = c(1, 5, 8, 7, 9, 3, 4, 6, 10))
5    2015-02-05    list(id = c(3, 5, 4, 7, 8, 1, 9))

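So the list holds indices that refer to the IDs (strange but true). Since id is a factor here, the numbers shown are the factor's integer codes rather than the ID values themselves.

This now lets us refer to a group by its index number, viz.:

dataNested$data[[2]]

which returns: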

# A tibble: 9 × 1
      id
  <fctr>
1   1010
2   1013
3   1015
4   1009
5   1008
6   1012
7   1007
8   1006
9   1014

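From here it is just a matter of writing a function that does the anti_join, leaving us with only the differences between each consecutive pair of groups (this is the part I am not proud of, and it really starts to show my lack of R skills, so please feel free to suggest improvements):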

## Function departed() - returns the IDs that were dropped from each subsequent time period
departed <- function(groups) {
  tempList <- vector("list", nrow(groups))
  # Loop through consecutive pairs of groups and anti_join each with the next;
  # seq_len() gives an empty loop when there is only one group
  for (i in seq_len(nrow(groups) - 1)) {
    tempList[[i + 1]] <- anti_join(
      data.frame(groups$data[[i]]),
      data.frame(groups$data[[i + 1]]),
      by = "id"
    )
  }
  return(tempList)
}

Applying this function to our nested data produces a list of the departed IDs for each period:

> departedIDs <- dataNested %>% departed()

> departedIDs
[[1]]
NULL

[[2]]
    id
1 1011

[[3]]
    id
1 1006
2 1008
3 1009
4 1015

[[4]]
    id
1 1007

[[5]]
    id
1 1011
2 1015

I hope this answer helps others whose brains work the same way mine does.
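
For anyone who would rather avoid the explicit for loop, the same pairwise comparison can be sketched with Map(), which walks two lists in step. This is only a sketch and not part of the original answer: the name departed_map and the toy object are made up for illustration, and plain subsetting with %in% stands in for anti_join() so that the example runs without any packages.

```r
# Hypothetical loop-free variant of departed(). `groups` is assumed to have a
# list column `data` whose elements are data frames with an `id` column,
# like the output of nest() above.
departed_map <- function(groups) {
  n <- nrow(groups)
  prev <- groups$data[-n]  # groups 1 .. n-1
  curr <- groups$data[-1]  # groups 2 .. n
  # Keep the rows of the earlier group whose id is absent from the later one
  # (the same rows anti_join(prev, curr, by = "id") would keep).
  gone <- Map(function(p, q) p[!p$id %in% q$id, , drop = FALSE], prev, curr)
  c(list(NULL), gone)  # the first group has nothing before it
}

# Toy nested data built by hand (made-up IDs, not the data set above)
toy <- data.frame(date = c("d1", "d2", "d3"), stringsAsFactors = FALSE)
toy$data <- list(
  data.frame(id = c("a", "b", "c"), stringsAsFactors = FALSE),
  data.frame(id = c("b", "c", "d"), stringsAsFactors = FALSE),
  data.frame(id = c("d"), stringsAsFactors = FALSE)
)

toy_departed <- departed_map(toy)
```

On the toy data, the second element holds "a" (present on d1, gone on d2) and the third holds "b" and "c". Pairing groups$data[-n] with groups$data[-1] is what lines each group up with its successor.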

Answer 2

Just add the argument stringsAsFactors = FALSE to your data.frame() call. That will make your code run, although I am not sure whether the output is what you want. To see the whole result, pipe it into data.frame and check whether it is what you are after. Hope this helps.

set.seed(1)
data <- data.frame(
  date = c(rep('2015-02-01',15), rep('2015-02-02',16), rep('2015-02-03',15)),
  id = as.character(c(1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,16,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE))),
  stringsAsFactors = FALSE
)


data %>%
  group_by(date) %>%
  distinct(id) %>%
  do(lost = anti_join(., lag(.), by="id")) %>%
  data.frame()

Answer 3

Some manipulation of the data followed by a merge may do what you want. Something like this:

df <- unique(data)
df$date <- as.Date(df$date)
df$leftdate <- df$date + 1
df$prevdate <- df$date - 1
df2 <- cbind(df[,c("date","id")],flag =  1)

# merge the dataframe so that each day would attempt to join the next day
dfleft <- merge(df,df2,by.x = c("leftdate","id"),by.y = c("date","id"),all.x= TRUE)
# if an id is not present on the next day, the merge leaves flag as NA, which is how we identify those who left
dfleft <- dfleft[is.na(dfleft$flag),c("leftdate","id")]

# Here, you reverse the logic to find those who show up today but weren't there yesterday
dfnew <- merge(df,df2,by.x = c("prevdate","id"),by.y = c("date","id"),all.x= TRUE)
dfnew <- dfnew[is.na(dfnew$flag),c("date","id")]
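
To see what the two merges return, here is a small worked sketch on a made-up data set (the toy data frame and its IDs are assumptions for illustration, not the question's data). It also makes one side effect visible: every ID on the last date has no next day to match, and every ID on the first date has no previous day, so both end up with an NA flag unless those boundary rows are dropped. The two filtering lines at the end of each step are an addition to the answer's code.

```r
# Made-up data: A and B on day 1, B and C on day 2, only C on day 3
toy <- data.frame(
  date = as.Date(c("2015-02-01", "2015-02-01", "2015-02-02",
                   "2015-02-02", "2015-02-03")),
  id = c("A", "B", "B", "C", "C"),
  stringsAsFactors = FALSE
)

df <- unique(toy)
df$leftdate <- df$date + 1
df$prevdate <- df$date - 1
df2 <- cbind(df[, c("date", "id")], flag = 1)

# IDs that are missing on the following day
dfleft <- merge(df, df2, by.x = c("leftdate", "id"), by.y = c("date", "id"), all.x = TRUE)
dfleft <- dfleft[is.na(dfleft$flag), c("leftdate", "id")]
# Added step: drop the artificial "departures" after the last observed date
dfleft <- dfleft[dfleft$leftdate <= max(df$date), ]

# IDs that were not there on the previous day
dfnew <- merge(df, df2, by.x = c("prevdate", "id"), by.y = c("date", "id"), all.x = TRUE)
dfnew <- dfnew[is.na(dfnew$flag), c("date", "id")]
# Added step: drop the artificial "arrivals" on the first observed date
dfnew <- dfnew[dfnew$date > min(df$date), ]
```

On this toy data dfleft ends up with A (missing from 2015-02-02) and B (missing from 2015-02-03), and dfnew with C (first seen on 2015-02-02). Like the answer's code, the sketch assumes one group per calendar day with no gaps between dates.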

Answer 4

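My understanding of the question is that the data shows the IDs present on each date, so we want to iterate over the dates, comparing each date's IDs with those of the previous date.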

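First take the unique rows, u, and convert id to numeric. Then split id by date, giving s, and define a function diffs that produces a numeric vector of the added IDs followed by the negatives of the removed IDs. lapply it over seq_along(s), excluding the first component since it has no previous component. No packages are used.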

u <- unique(data)
u$id <- as.numeric(as.character(u$id))
s <- split(u$id, u$date)
diffs <- function(i) c(setdiff(s[[i]], s[[i-1]]), - setdiff(s[[i-1]], s[[i]]))
diffs_list <- setNames(lapply(seq_along(s)[-1], diffs), names(s)[-1])

giving:

> diffs_list
$`2015-02-02`
[1]  1010 -1011

$`2015-02-03`
[1]  1011 -1015 -1009 -1008 -1006

Or, if you want a data frame as output:

setNames(stack(diffs_list), c("id", "date"))

giving:

     id       date
1  1010 2015-02-02
2 -1011 2015-02-02
3  1011 2015-02-03
4 -1015 2015-02-03
5 -1009 2015-02-03
6 -1008 2015-02-03
7 -1006 2015-02-03
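
The sign trick relies on the IDs being numeric. If they are not, one alternative is to keep the two set differences in separate, named components. The following is a sketch along the same lines and not part of the original answer; compare_days and the toy list s_toy are hypothetical names.

```r
# Hypothetical helper: `s` is a list of ID vectors, one per date, in date
# order (for example the result of split(u$id, u$date)).
compare_days <- function(s) {
  idx <- seq_along(s)[-1]  # skip the first date: nothing to compare it with
  out <- lapply(idx, function(i) {
    list(
      added    = setdiff(s[[i]], s[[i - 1]]),  # present now, absent before
      departed = setdiff(s[[i - 1]], s[[i]])   # present before, absent now
    )
  })
  setNames(out, names(s)[idx])
}

# Made-up character IDs for three dates
s_toy <- list(
  `2015-02-01` = c("a", "b", "c"),
  `2015-02-02` = c("b", "c", "d"),
  `2015-02-03` = c("d")
)
changes <- compare_days(s_toy)
```

Here changes[["2015-02-02"]] has added = "d" and departed = "a", while changes[["2015-02-03"]] has nothing added and "b" and "c" departed.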

magrittr

This can also be expressed using the magrittr package, like this, where diffs is as defined above.

library(magrittr)

data %>%
     unique %>%
     transform(id = as.numeric(as.character(id))) %>%
     { split(.$id, .$date) } %>%
     { setNames(lapply(seq_along(.)[-1], diffs), names(.)[-1]) }

Note: I have replaced -3 with -03 in data$date.

4 answers

Answer 1
1 accepted 2017-08-22 09:47:21

Answer 2
0 2017-08-20 07:02:28

Answer 3
0 2017-08-20 08:18:02

Answer 4
0 2017-08-20 12:54:03
