R data.table finding min and max within groups
Hi, I have a dataset in which a person works at different companies, and I want to find the duration of each spell at each company. Some people later go back to a company they worked at before. Here is my dataset and my implementation, but it doesn't work when a person returns to a previous company.
library(data.table)
data <- data.table(person = c(1,1,1,1,1,1,1,1), company = c(1,1,1,2,2,2,1,1),
year = c(1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997))
As you can see, person == 1 works at company 1 from 1990 to 1992, then switches to company 2 from 1993 to 1995, and then goes back to company 1 from 1996 to 1997.
I thought about using
min <- data[data[, .I[year == min(year)], by=.(person, company)]$V1]
setnames(min, "year", "start")
max <- data[data[, .I[year == max(year)], by=.(person, company)]$V1]
setnames(max, "year", "end")
duration <- merge(min, max, all = TRUE)
which gives:
person company start end
1 1 1990 1997
1 2 1993 1995
But what I want is:
person company start end
1 1 1990 1992
1 2 1993 1995
1 1 1996 1997
Any idea how to get that?
Thanks.
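We can use rleid as a grouping variable. rleid(company) starts a new id every time the value of company changes, so each consecutive spell at a company becomes its own group, even when the same company appears again later: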
library(data.table)
data[, .(start = min(year), end = max(year)),
.(person, grp = rleid(company), company)][, grp := NULL][]
Output:
person company start end
1: 1 1 1990 1992
2: 1 2 1993 1995
3: 1 1 1996 1997
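To see why grouping on rleid separates the two spells at company 1, it may help to look at what it returns for the company column on its own. This is only a small illustration using the values from the question:

```r
library(data.table)

# Company column from the question: 1990-1992 at company 1,
# 1993-1995 at company 2, 1996-1997 back at company 1
company <- c(1, 1, 1, 2, 2, 2, 1, 1)

# rleid() gives every run of consecutive equal values its own id,
# so the second spell at company 1 gets id 3 rather than reusing id 1
grp <- rleid(company)
print(grp)
# [1] 1 1 1 2 2 2 3 3
```

Grouping by company alone would put rows 1-3 and 7-8 together, which is what produced the 1990 to 1997 range in the question.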
Alternatively, we can use the collapse package:
library(collapse)
data[, grp := rleid(company)]
collap(data, ~ person + company + grp, list(fmin, fmax))
person company fmin.year fmax.year grp
1: 1 1 1990 1992 1
2: 1 1 1996 1997 3
3: 1 2 1993 1995 2
There are probably better ways to do it, but here is one:
library(data.table)
data = data.table(person = c(1,1,1,1,1,1,1,1), company = c(1,1,1,2,2,2,1,1),
year = c(1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997))
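# Placeholder columns; group will hold the id of each consecutive spell
data[, c('start', 'end', 'group') := 0]
group_count = 0
for (i in seq_len(nrow(data))) {
  if (i == 1) {
    next  # the first row keeps group 0
  } else if (data[i, company] != data[i-1, company]) {
    # company differs from the previous row: start a new group
    group_count = group_count + 1
    data[i, group := group_count]
  } else {
    data[i, group := group_count]
  }
}
# Fill in the first and last year of each spell, then keep one row per spell
data[, c('start', 'end') := .(min(year), max(year)), by = group]
data = unique(data[, .(person, company, start, end)])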
> data
person company start end
1: 1 1 1990 1992
2: 1 2 1993 1995
3: 1 1 1996 1997
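The loop above effectively counts how many times company has changed up to each row. As a sketch of the same idea without a loop, the counter can be built with shift and cumsum. Computing it within each person is my own assumption here, so that a change of person does not carry a group over; with the single person in the example it makes no difference:

```r
library(data.table)

data <- data.table(person  = c(1, 1, 1, 1, 1, 1, 1, 1),
                   company = c(1, 1, 1, 2, 2, 2, 1, 1),
                   year    = c(1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997))

# TRUE wherever company differs from the previous row of the same person;
# filling with the first value means the first row never counts as a change.
# The running total of those changes is the group id (0, 1, 2, ...).
data[, group := cumsum(company != shift(company, fill = company[1])), by = person]

# One row per spell with its first and last year
res <- data[, .(start = min(year), end = max(year)), by = .(person, company, group)]
print(res)
```

This assumes the rows are already sorted by person and year, as they are in the question.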
Adapting @akrun's answer, if your dataset is large:
data[, grp := rleid(company), by=.(person)]
min <- data[data[, .I[year == min(year)], by=.(person, company, grp)]$V1]
setnames(min, "year", "start")
max <- data[data[, .I[year == max(year)], by=.(person, company, grp)]$V1]
setnames(max, "year", "end")
duration <- merge(min, max, all = TRUE)
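Since the original goal was the duration at each company, here is a small sketch of how the per-spell table could be extended with a length in years. The two-person toy data and the years column are made up for illustration, and counting both the start and end year (end - start + 1) is an assumption about how duration should be measured:

```r
library(data.table)

# Hypothetical toy data with two people; person 2 starts at the same
# company that person 1 ends at
data <- data.table(person  = c(1, 1, 1, 2, 2),
                   company = c(1, 1, 2, 2, 2),
                   year    = c(1990, 1991, 1992, 1990, 1991))

# Run ids are computed within each person, so a new person always starts at 1
data[, grp := rleid(company), by = person]

# First and last year of every spell
duration <- data[, .(start = min(year), end = max(year)),
                 by = .(person, company, grp)]

# Assumed definition: a spell covering 1990 and 1991 counts as 2 years
duration[, years := end - start + 1]
print(duration)
```

Whether a single-year spell should count as 1 or 0 years depends on what the year column represents, so the + 1 may need adjusting.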