
R data.table finding min and max within groups

Hi, I have a dataset in which a person works at different companies over time. I want to find the duration of each period he worked at each company. Some people go back to a previous company later on. Here is my dataset and my implementation, but it doesn't work when a person returns to a previous company.

library(data.table)
data <- data.table(person = c(1,1,1,1,1,1,1,1), company = c(1,1,1,2,2,2,1,1),
               year = c(1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997))

You can see that person == 1 works at company 1 from 1990 to 1992, then switches to company 2 from 1993 to 1995, and then goes back to company 1 from 1996 to 1997.

I thought about using

min <- data[data[, .I[year == min(year)], by=.(person, company)]$V1]
setnames(min, "year", "start")

max <- data[data[, .I[year == max(year)], by=.(person, company)]$V1]
setnames(max, "year", "end")

duration <- merge(min, max, all = TRUE)

which gives:

person company start  end
     1       1  1990 1997
     1       2  1993 1995

But what I want is:

person company start  end
     1       1  1990 1992
     1       2  1993 1995
     1       1  1996 1997

Any idea how to get that?

Thanks.

We can use rleid as a grouping variable:

library(data.table)
data[, .(start =  min(year), end = max(year)),
    .(person, grp = rleid(company), company)][, grp := NULL][]

Output:

   person company start  end
1:      1       1  1990 1992
2:      1       2  1993 1995
3:      1       1  1996 1997
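
For anyone wondering why this works, here is a minimal sketch of what rleid returns for the company column from the question. The variable names `company` and `run_id` are only for illustration.

```r
library(data.table)

# Same company sequence as in the question.
company <- c(1, 1, 1, 2, 2, 2, 1, 1)

# rleid() numbers consecutive runs of equal values, so the second
# stint at company 1 gets a new id instead of reusing the first one.
run_id <- rleid(company)
print(run_id)
#> [1] 1 1 1 2 2 2 3 3
```

Grouping by this id keeps the 1990-1992 and 1996-1997 stints at company 1 apart, which plain grouping by company cannot do.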

Or we may also use collapse:

library(collapse)
data[, grp := rleid(company)]
collap(data,  ~ person + company + grp, list(fmin, fmax))

Output:

   person company fmin.year fmax.year grp
1:      1       1      1990      1992   1
2:      1       1      1996      1997   3
3:      1       2      1993      1995   2

There are probably better ways to do it, but here it goes:

library(data.table)
data = data.table(person = c(1,1,1,1,1,1,1,1), company = c(1,1,1,2,2,2,1,1),
                   year = c(1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997))

data[, c('start', 'end', 'group') := 0]
group_count = 0

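# Walk the rows in order and bump the group counter whenever the company changes
for (i in seq_len(nrow(data))) {
  if (i == 1) {
    next  # the first row keeps group 0
  } else if (data[i, company] != data[i-1, company]) {
    group_count = group_count + 1
    data[i, group := group_count]
  } else {
    data[i, group := group_count]
  }
}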

data[, c('start', 'end') := .(min(year), max(year)), by = group]

data = unique(data[, .(person, company, start, end)])

> data
   person company start  end
1:      1       1  1990 1992
2:      1       2  1993 1995
3:      1       1  1996 1997

Adopting @akrun's answer, in case your dataset is large:

data[, grp := rleid(company), by=.(person)]

min <- data[data[, .I[year == min(year)], by=.(person, company, grp)]$V1]
setnames(min, "year", "start")

max <- data[data[, .I[year == max(year)], by=.(person, company, grp)]$V1]
setnames(max, "year", "end")

duration <- merge(min, max, all = TRUE)
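
If the table holds several people, one cautious variant is to pass both columns to rleid so that a run also breaks whenever the person changes. This is only a sketch, assuming a made-up two-person table. The names `dt2` and `stints` and the values are hypothetical, not from the question.

```r
library(data.table)

# Hypothetical data: person 1 leaves company 1 and returns; person 2 stays put.
dt2 <- data.table(person  = c(1, 1, 1, 2, 2),
                  company = c(1, 2, 1, 1, 1),
                  year    = c(1990, 1991, 1992, 1993, 1994))

# rleid(person, company) starts a new run when either column changes,
# so each (person, stint) pair becomes its own group.
stints <- dt2[, .(start = min(year), end = max(year)),
              by = .(person, company, grp = rleid(person, company))]
stints[, grp := NULL]  # drop the helper column once the grouping is done
print(stints)
```

This does the min and max in a single grouped step, so no merge is needed afterwards.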
