How can I rank observations in-group faster?
I have a really simple problem, but I'm probably not thinking vector-y enough to solve it efficiently. I tried two different approaches and they've been looping on two different computers for a long time now. I wish I could say the competition made it more exciting, but... bleh.
I have long data (many rows per person, one row per person-observation) and I basically want a variable that tells me how often the person has been observed already.
I have the first two columns and want the third one:
person wave obs
pers1 1999 1
pers1 2000 2
pers1 2003 3
pers2 1998 1
pers2 2001 2
Now I'm using two loop approaches. Both are excruciatingly slow (150k rows). I'm sure I'm missing something, but my search queries didn't really help me yet (hard to phrase the problem).
Thanks for any pointers!
# ordered dataset by persnr and year of observation
person.obs <- person.obs[order(person.obs$PERSNR, person.obs$wave), ]
person.obs$n.obs = 0

# first approach: loop through people and assign range
unp = unique(person.obs$PERSNR)
unplength = length(unp)
for(i in 1:unplength) {
  print(unp[i])
  person.obs[which(person.obs$PERSNR==unp[i]),]$n.obs =
    1:length(person.obs[which(person.obs$PERSNR==unp[i]),]$n.obs)
  i=i+1
  gc()
}

# second approach: loop through rows and reset counter at new person
pnr = 0
for(i in 1:length(person.obs[,2])) {
  if(pnr!=person.obs[i,]$PERSNR) {
    pnr = person.obs[i,]$PERSNR
    e = 0
  }
  e=e+1
  person.obs[i,]$n.obs = e
  i=i+1
  gc()
}
The answer from Marek in this question has proven very useful in the past. I wrote it down and use it almost daily because it is fast and efficient. We'll use ave() and seq_along().
foo <- data.frame(person = c(rep("pers1", 3), rep("pers2", 2)), year = c(1999, 2000, 2003, 1998, 2011))
foo <- transform(foo, obs = ave(rep(NA, nrow(foo)), person, FUN = seq_along))
foo
person year obs
1 pers1 1999 1
2 pers1 2000 2
3 pers1 2003 3
4 pers2 1998 1
5 pers2 2011 2
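As a minimal sketch of how the same idea might be applied to the data frame from the question: the column names PERSNR, wave and n.obs come from the question, while the five rows below are invented for illustration. The data is sorted first so that the counter follows the order of the waves.

```r
# Toy stand-in for the asker's data (values are invented for illustration)
person.obs <- data.frame(
  PERSNR = c(2, 1, 1, 2, 1),
  wave   = c(2001, 2000, 1999, 1998, 2003)
)

# Sort by person and wave so the counter follows observation order
person.obs <- person.obs[order(person.obs$PERSNR, person.obs$wave), ]

# Within each person, number the rows 1, 2, 3, ...
person.obs$n.obs <- ave(seq_along(person.obs$PERSNR),
                        person.obs$PERSNR,
                        FUN = seq_along)

person.obs
```

This replaces both loops with a single vectorised call, so no per-row subsetting or gc() is involved.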
Another option using plyr:
library(plyr)
ddply(foo, "person", transform, obs2 = seq_along(person))
person year obs obs2
1 pers1 1999 1 1
2 pers1 2000 2 2
3 pers1 2003 3 3
4 pers2 1998 1 1
5 pers2 2011 2 2
A few alternatives with the data.table and dplyr packages.
data.table:
library(data.table)
# setDT(foo) is needed to convert to a data.table
# option 1:
setDT(foo)[, rn := rowid(person)]
# option 2:
setDT(foo)[, rn := 1:.N, by = person]
Both give:
> foo
   person year rn
1:  pers1 1999  1
2:  pers1 2000  2
3:  pers1 2003  3
4:  pers2 1998  1
5:  pers2 2011  2
If you want a true rank, you should use the frank function:
setDT(foo)[, rn := frank(year, ties.method = 'dense'), by = person]
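To see why a rank is not the same thing as a running counter, here is a base R sketch with a tied year. The data is invented, and the dense rank is computed with match() against the sorted unique values. That is an assumed stand-in for frank(..., ties.method = 'dense'), chosen so the example needs no packages.

```r
# Invented data: pers1 has two rows with the same year
foo <- data.frame(person = c("pers1", "pers1", "pers1", "pers2", "pers2"),
                  year   = c(1999, 1999, 2003, 1998, 2011))

# Running counter: always 1, 2, 3, ... within each person
foo$counter <- ave(seq_along(foo$person), foo$person, FUN = seq_along)

# Dense rank: tied years share a rank and no rank is skipped afterwards
foo$dense <- ave(foo$year, foo$person,
                 FUN = function(x) match(x, sort(unique(x))))

foo
```

With unique years per person the two columns agree, so the distinction only matters when ties are possible.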
dplyr:
library(dplyr)
# method 1
foo <- foo %>% group_by(person) %>% mutate(rn = row_number())
# method 2
foo <- foo %>% group_by(person) %>% mutate(rn = 1:n())
Both give a similar result:
> foo
Source: local data frame [5 x 3]
Groups: person [2]

  person  year    rn
  (fctr) (dbl) (int)
1  pers1  1999     1
2  pers1  2000     2
3  pers1  2003     3
4  pers2  1998     1
5  pers2  2011     2
Would by do the trick?
> foo <-data.frame(person=c(rep("pers1",3),rep("pers2",2)),year=c(1999,2000,2003,1998,2011),obs=c(1,2,3,1,2))
> foo
person year obs
1 pers1 1999 1
2 pers1 2000 2
3 pers1 2003 3
4 pers2 1998 1
5 pers2 2011 2
> by(foo, foo$person, nrow)
foo$person: pers1
[1] 3
------------------------------------------------------------
foo$person: pers2
[1] 2
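Note that by(foo, foo$person, nrow) returns the number of rows per person rather than a running counter. One possible way to get from group sizes to the counter, assuming the rows are already sorted by person, is to feed the run lengths to sequence(). This is only a sketch and the name sizes is introduced here for illustration.

```r
foo <- data.frame(person = c(rep("pers1", 3), rep("pers2", 2)),
                  year   = c(1999, 2000, 2003, 1998, 2011))

# Length of each consecutive run of the same person (requires sorted rows)
sizes <- rle(as.character(foo$person))$lengths

# sequence() concatenates 1:n for every n in sizes
foo$obs <- sequence(sizes)

foo
```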
Another option using aggregate and rank in base R:
foo$obs <- unlist(aggregate(.~person, foo, rank)[,2])
# person year obs
# 1 pers1 1999 1
# 2 pers1 2000 2
# 3 pers1 2003 3
# 4 pers2 1998 1
# 5 pers2 2011 2