Comparing between groups in grouped dataframe

I am trying to compare the items in consecutive groups of a dataframe. I guess this is pretty easy when you know what you are doing...

My data set can be represented as follows:

set.seed(1)
data <- data.frame(
 date = c(rep('2015-02-01',15), rep('2015-02-02',16), rep('2015-02-03',15)),
 id = as.character(c(1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,16,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE)))
)

Which yields a dataframe that looks like:

date    id
1/02/2015   1008
1/02/2015   1009
1/02/2015   1011
1/02/2015   1015
1/02/2015   1008
1/02/2015   1014
1/02/2015   1015
1/02/2015   1012
1/02/2015   1012
1/02/2015   1006
1/02/2015   1008
1/02/2015   1007
1/02/2015   1012
1/02/2015   1009
1/02/2015   1013
2/02/2015   1010
2/02/2015   1013
2/02/2015   1015
2/02/2015   1009
2/02/2015   1013
2/02/2015   1015
2/02/2015   1008
2/02/2015   1012
2/02/2015   1007
2/02/2015   1008
2/02/2015   1009
2/02/2015   1006
2/02/2015   1009
2/02/2015   1014
2/02/2015   1009
2/02/2015   1010
3/02/2015   1011
3/02/2015   1010
3/02/2015   1007
3/02/2015   1014
3/02/2015   1012
3/02/2015   1013
3/02/2015   1007
3/02/2015   1013
3/02/2015   1010

Then I want to group the data by date (group_by) and remove duplicates (distinct) before comparing between the groups. What I want to do is determine, from day to day, which new ids are added and which ids leave. So day 1 and day 2 would be compared to find the ids in day 2 that were not in day 1 and the ids that were in day 1 but not in day 2, then the same comparison would be made between day 2 and day 3, and so on.
The comparison between two groups can be done very easily using an anti_join (dplyr), but I don't know how to reference individual groups in the dataset.
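To make the intended comparison concrete, here is a minimal sketch for just two days, assuming dplyr is installed. The toy data frame and the names toy, day1, day2, new_ids and left_ids are made up for illustration and are not the sample data above.

```r
library(dplyr)

# Toy data (hypothetical values, not the sample data from the question)
toy <- data.frame(
  date = c("2015-02-01", "2015-02-01", "2015-02-02", "2015-02-02"),
  id   = c("1006", "1007", "1007", "1008"),
  stringsAsFactors = FALSE
)

# Pull out the unique ids for each of the two days by hand
day1 <- toy %>% filter(date == "2015-02-01") %>% distinct(id)
day2 <- toy %>% filter(date == "2015-02-02") %>% distinct(id)

new_ids  <- anti_join(day2, day1, by = "id")  # in day 2 but not in day 1
left_ids <- anti_join(day1, day2, by = "id")  # in day 1 but not in day 2
```

The open question is how to do the same thing for every pair of consecutive dates without filtering each day by hand.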

My attempt (or one of my attempts) looks like:

data %>%
  group_by(date) %>%
  distinct(id) %>%
  do(lost = anti_join(., lag(.), by="id"))

But of course this does not work, I just get:

Error in anti_join_impl(x, y, by$x, by$y) : Can't join on 'id' x 'id' because of incompatible types (factor / logical)

Is what I am attempting even possible, or should I be looking at writing a clunky function to do it?

I'm sure I don't get to vote for my own answer, but I must say that I like mine the best. I was hoping to get an answer that used the dplyr tools to solve the problem, so I kept researching, and I think I now have a (semi) elegant solution (apart from the for loop in my function).

Generating the sample data set the same way, but with more data to make it more interesting:

set.seed(1)
data <- data.frame(
  date = c(rep('2015-02-01',15), rep('2015-02-02',16), rep('2015-02-03',15), rep('2015-02-04',15), rep('2015-02-05',15)),
  id = as.character(c(1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,16,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE)))
)

Searching the web, I found the tidyr function nest(), which looked like it would solve all my grouping issues. nest() takes the groups created by group_by() and rolls them into a list column of data frames, so you end up with one row for each value of the variable you grouped on, plus a data frame holding the remaining variables for that group. Here it is:

library(dplyr)
library(tidyr)

dataNested <- data %>%
  group_by(date) %>%
  distinct(id) %>%
  nest()

Which yields a fairly strange dataframe that looks like:

     date          data
1    2015-02-01    list(id = c(3, 4, 6, 10, 9, 7, 1, 2, 8))
2    2015-02-02    list(id = c(5, 8, 10, 4, 3, 7, 2, 1, 9))
3    2015-02-03    list(id = c(6, 5, 2, 9, 7, 8))
4    2015-02-04    list(id = c(1, 5, 8, 7, 9, 3, 4, 6, 10))
5    2015-02-05    list(id = c(3, 5, 4, 7, 8, 1, 9))

The integers in the lists are the factor codes of id, i.e. positions in the list of id levels rather than the ids themselves (strange but true).
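The integers shown in the nested output behave like factor codes. A small base R sketch, using a made-up factor ids rather than the real column, shows how codes map back to labels:

```r
# Hypothetical factor of ids, standing in for the id column
ids <- factor(c("1010", "1013", "1006"))

codes  <- as.integer(ids)   # integer codes: positions within levels(ids)
lvls   <- levels(ids)       # sorted labels: "1006" "1010" "1013"
labels <- lvls[codes]       # indexing the levels by code recovers the labels
```

Here codes is 2, 3, 1 because the levels are sorted, and labels gives back "1010", "1013", "1006", the same as as.character(ids).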

This now allows us to reference the groups by index number, viz:

dataNested$data[[2]]

returns:

# A tibble: 9 × 1
      id
  <fctr>
1   1010
2   1013
3   1015
4   1009
5   1008
6   1012
7   1007
8   1006
9   1014

From here it's a simple matter of writing a function that does the anti_join, leaving us with just the differences between each pair of consecutive groups (though this is the part I'm not proud of, and it really starts to show my lack of R skills, so please feel free to suggest improvements):

## Function departed() - returns the ids that were dropped from each subsequent time period
departed <- function(groups) {
  tempList <- vector("list", nrow(groups))
  # Loop through consecutive pairs of groups; entry i + 1 holds the ids
  # that are in group i but missing from group i + 1
  for (i in seq_len(nrow(groups) - 1)) {
    tempList[[i + 1]] <-
      anti_join(data.frame(groups$data[[i]]), data.frame(groups$data[[i + 1]]), by = "id")
  }
  return(tempList)
}

Applying this function to our nested data yields a list of data frames of departed ids, one per date:

> departedIDs <- dataNested %>% departed()

> departedIDs
[[1]]
NULL

[[2]]
    id
1 1011

[[3]]
    id
1 1006
2 1008
3 1009
4 1015

[[4]]
    id
1 1007

[[5]]
    id
1 1011
2 1015
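As one possible way to avoid the for loop, the consecutive pairs can be walked with Map() in base R. This is only a sketch: groups here is a made-up plain list of data frames standing in for dataNested$data, and it returns character vectors of ids rather than data frames.

```r
# Hypothetical stand-in for dataNested$data: one data frame of ids per date
groups <- list(
  data.frame(id = c("1006", "1007", "1011"), stringsAsFactors = FALSE),
  data.frame(id = c("1006", "1007", "1010"), stringsAsFactors = FALSE),
  data.frame(id = c("1007", "1010"), stringsAsFactors = FALSE)
)

# Pair each group with its successor and keep the ids missing from the successor.
# The leading NULL keeps the result aligned with the dates, as departed() does.
departed_ids <- c(
  list(NULL),
  Map(function(prev, curr) setdiff(prev$id, curr$id),
      head(groups, -1), tail(groups, -1))
)
```

With this toy input the second entry is "1011" and the third is "1006". Swapping prev and curr inside setdiff() would give the newly arrived ids instead.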

I hope this answer will help others whose brains work the same way as mine.

Just add the argument stringsAsFactors = FALSE to your data.frame() call. This makes your code run, although I am not sure whether the output is what you are looking for. To view the whole result, pipe it into data.frame() and check. Hope this helps.

set.seed(1)
data <- data.frame(
  date = c(rep('2015-02-01',15), rep('2015-02-02',16), rep('2015-02-03',15)),
  id = as.character(c(1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,16,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE))),
  stringsAsFactors = FALSE
)


data %>%
  group_by(date) %>%
  distinct(id) %>%
  do(lost = anti_join(., lag(.), by = "id")) %>%
  data.frame()
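To see what the stringsAsFactors argument changes, here is a small check on two made-up data frames (with_factors and without_factors are illustrative names). Note that from R 4.0.0 onward the default is already FALSE, so the factor conversion only happens by default in older versions.

```r
# Same input, built with and without conversion of strings to factors
with_factors    <- data.frame(id = as.character(1006:1008), stringsAsFactors = TRUE)
without_factors <- data.frame(id = as.character(1006:1008), stringsAsFactors = FALSE)

class_with    <- class(with_factors$id)     # "factor"
class_without <- class(without_factors$id)  # "character"
```

Keeping id as character means joins compare the ids themselves rather than factors, which is what the error message in the question was complaining about.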

Some manipulation of the data and a merge might do what you want. Something like this:

df <- unique(data)
df$date <- as.Date(df$date)
df$leftdate <- df$date + 1
df$prevdate <- df$date - 1
df2 <- cbind(df[,c("date","id")],flag =  1)

# Merge so that each day's rows try to match the same id on the following day
dfleft <- merge(df,df2,by.x = c("leftdate","id"),by.y = c("date","id"),all.x= TRUE)
# Where an id is absent on the following day, the merge leaves flag as NA; those are the ids that left
# (every id on the last date is flagged too, because there is no following day to match)
dfleft <- dfleft[is.na(dfleft$flag),c("leftdate","id")]

# Reverse the logic to find the ids that show up on a day but were not there the day before
# (every id on the first date is flagged, because there is no previous day to match)
dfnew <- merge(df,df2,by.x = c("prevdate","id"),by.y = c("date","id"),all.x= TRUE)
dfnew <- dfnew[is.na(dfnew$flag),c("date","id")]
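Here is a small worked version of the shifted-date merge on toy data, for the "left" direction only. The names toy, present, joined and left_ids are made up, and the final filter on the last date is my own addition, not part of the code above; it drops the rows that are flagged only because there is no following day in the data.

```r
# Toy data with unique rows (hypothetical values)
toy <- data.frame(
  date = as.Date(c("2015-02-01", "2015-02-01", "2015-02-02", "2015-02-02")),
  id   = c("1006", "1007", "1007", "1008"),
  stringsAsFactors = FALSE
)
toy$nextdate <- toy$date + 1

# Lookup table of (date, id) pairs that exist, with a marker column
present <- cbind(toy[, c("date", "id")], flag = 1)

# Left join each row against the same id on the following day
joined <- merge(toy, present,
                by.x = c("nextdate", "id"), by.y = c("date", "id"),
                all.x = TRUE)

# flag is NA where the id is missing on the following day
left_ids <- joined[is.na(joined$flag), c("nextdate", "id")]

# Assumption: drop rows that point past the last observed date
left_ids <- left_ids[left_ids$nextdate <= max(toy$date), ]
```

With this input only id 1006 remains, reported against 2015-02-02, the first day on which it is missing.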

My understanding of the question is that data shows the ids at each date, and we want to iterate through the dates, comparing the ids on each date to the ids on the immediately prior date.

First get the unique rows u and convert id to numeric. Then split id by date, giving s, and define a function diffs which produces a numeric vector of the added ids, using negative numbers for the removed ids. lapply that over seq_along(s), except for the first component, as it has no prior component. No packages are used.

u <- unique(data)
u$id <- as.numeric(as.character(u$id))
s <- split(u$id, u$date)
diffs <- function(i) c(setdiff(s[[i]], s[[i-1]]), - setdiff(s[[i-1]], s[[i]]))
diffs_list <- setNames(lapply(seq_along(s)[-1], diffs), names(s)[-1])

giving:

> diffs_list
$`2015-02-02`
[1]  1010 -1011

$`2015-02-03`
[1]  1011 -1015 -1009 -1008 -1006

or, if you want a data frame as output:

setNames(stack(diffs_list), c("id", "date"))

giving:

     id       date
1  1010 2015-02-02
2 -1011 2015-02-02
3  1011 2015-02-03
4 -1015 2015-02-03
5 -1009 2015-02-03
6 -1008 2015-02-03
7 -1006 2015-02-03
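If the sign convention is awkward (for example when the ids are not numeric), a variant is to label each change explicitly. This is a sketch on toy data, again with no packages; toy, s and changes are illustrative names and the change column is my own addition.

```r
# Toy data (hypothetical values); ids stay as character
toy <- unique(data.frame(
  date = c("2015-02-01", "2015-02-01", "2015-02-02", "2015-02-02", "2015-02-03"),
  id   = c("1006", "1007", "1007", "1008", "1008"),
  stringsAsFactors = FALSE
))
s <- split(toy$id, toy$date)

# For each date after the first, list the added and removed ids with a label
changes <- do.call(rbind, lapply(seq_along(s)[-1], function(i) {
  added   <- setdiff(s[[i]], s[[i - 1]])
  removed <- setdiff(s[[i - 1]], s[[i]])
  ids     <- c(added, removed)
  data.frame(
    date   = rep(names(s)[i], length(ids)),
    id     = ids,
    change = rep(c("added", "removed"), c(length(added), length(removed))),
    stringsAsFactors = FALSE
  )
}))
```

For this input changes has three rows: 1008 added and 1006 removed on 2015-02-02, and 1007 removed on 2015-02-03.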

magrittr

This could also be expressed using the magrittr package as follows, where diffs is as defined above. Because diffs looks up s in the global environment, s must already exist when the pipeline runs.

library(magrittr)

data %>%
     unique %>%
     transform(id = as.numeric(as.character(id))) %>%
     { split(.$id, .$date) } %>%
     { setNames(lapply(seq_along(.)[-1], diffs), names(.)[-1]) }
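The diffs function reads s from the global environment rather than from the value flowing through the pipeline. A possible way to make the last step self-contained is to pass the split list in as an argument. The names diffs_between and all_diffs below are made up for this sketch, which uses toy numeric ids.

```r
# Variant of diffs that takes the split list s explicitly
diffs_between <- function(s, i) {
  c(setdiff(s[[i]], s[[i - 1]]), -setdiff(s[[i - 1]], s[[i]]))
}

# Apply it to every date after the first and name the results by date
all_diffs <- function(s) {
  setNames(lapply(seq_along(s)[-1], function(i) diffs_between(s, i)),
           names(s)[-1])
}

# Toy data (hypothetical values)
toy <- data.frame(
  date = c("2015-02-01", "2015-02-01", "2015-02-02", "2015-02-02"),
  id   = c(1006, 1007, 1007, 1008)
)
result <- all_diffs(split(toy$id, toy$date))
```

Here result has a single component named 2015-02-02 containing 1008 and -1006. In a pipeline, all_diffs could take the place of the final braces, since it only depends on its argument.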

Note: I have replaced -3 in data$date with -03.
