More efficient way to recode groups?
My goal is to recode group_old to look like group_desired:
group_old <- c(58,58,57,57,57,56,56,56,59,59,56)
group_desired <- c(1,1,2,2,2,3,3,3,4,4,3)
df <- data.frame(group_old, group_desired)
> df
group_old group_desired
1 58 1
2 58 1
3 57 2
4 57 2
5 57 2
6 56 3
7 56 3
8 56 3
9 59 4
10 59 4
11 56 3
I was able to do it:
codex <- data.frame(old = unique(df$group_old), new = 1:length(unique(df$group_old)))
df$group_new <- sapply(df$group_old, FUN = function(x) codex$new[codex$old == x] )
> df
group_old group_desired group_new
1 58 1 1
2 58 1 1
3 57 2 2
4 57 2 2
5 57 2 2
6 56 3 3
7 56 3 3
8 56 3 3
9 59 4 4
10 59 4 4
11 56 3 3
However, this code runs very slowly on a dataset with 8 million observations and 400k groups. Is there a more efficient way to accomplish the same thing for large data?
Using data.table:
We group by group_old and then create a new column by reference. .GRP is a special symbol in data.table. It's a simple grouping counter: it assigns 1 to the first group, 2 to the second, and so on.
group_old <- c(58,58,57,57,57,56,56,56,59,59,56)
df <- data.frame(group_old = group_old)
library(data.table)
setDT(df)[,group_desired := .GRP, by = group_old]
# group_old group_desired
#1: 58 1
#2: 58 1
#3: 57 2
#4: 57 2
#5: 57 2
#6: 56 3
#7: 56 3
#8: 56 3
#9: 59 4
#10: 59 4
#11: 56 3
Or using dplyr:
library(dplyr)
# numbers the groups in ascending order of group_old, so 56 becomes 1
df$group_desired <- group_indices(df, group_old)
To get the same result as above (groups numbered by first appearance), we first define the factor levels for group_old:
df$group_old <- factor(df$group_old, levels = unique(df$group_old))
df$group_desired <- group_indices(df, group_old)
Note: group_indices assigns group numbers based on ascending order (for numbers) or on factor level order (if the variable is a factor).
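The ordering difference described in the note can be illustrated without dplyr. This is only a base R sketch of the same two orderings, using the question's sample vector; it is not how group_indices is implemented.

```r
group_old <- c(58, 58, 57, 57, 57, 56, 56, 56, 59, 59, 56)

# Default factor levels are sorted, so ids follow ascending value:
# 56 -> 1, 57 -> 2, 58 -> 3, 59 -> 4
ids_sorted <- as.integer(factor(group_old))

# Levels given in order of first appearance:
# 58 -> 1, 57 -> 2, 56 -> 3, 59 -> 4
ids_first_seen <- as.integer(factor(group_old, levels = unique(group_old)))

ids_sorted      # 3 3 2 2 2 1 1 1 4 4 1
ids_first_seen  # 1 1 2 2 2 3 3 3 4 4 3
```

Only the second numbering matches group_desired from the question.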
I am not sure about performance, but you could try recode from the new version of the dplyr package:
df$group_desired <-
dplyr::recode(df$group_old, `58` = 1, `57` = 2, `56` = 3, `59` = 4)
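Typing out one pair per group is not practical with 400k groups. As a possible alternative, the mapping can be built from the data and applied in one vectorized step with base R's match(). The sketch below reuses the question's codex idea; it is an illustration, not part of the recode answer.

```r
group_old <- c(58, 58, 57, 57, 57, 56, 56, 56, 59, 59, 56)

# Lookup table with one row per distinct old code, in order of first appearance
codex <- data.frame(old = unique(group_old))
codex$new <- seq_len(nrow(codex))

# match() returns, for every element of group_old, its row position in codex$old,
# so the whole vector is recoded in a single call instead of one sapply() call per row
group_new <- codex$new[match(group_old, codex$old)]

group_new  # 1 1 2 2 2 3 3 3 4 4 3
```

Because codex is an ordinary data frame, the new column could hold any codes, not only consecutive integers.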
A more general data.table approach.
library(data.table)
dt1 <- data.table(old = LETTERS[1:6], new = 1:6)
set.seed(1234)
dt2 <- data.table(old = sample(LETTERS[1:6], 6, replace = TRUE))
setkey(dt1, old)
setkey(dt2, old)
dt2[dt1]
# old new
# 1: A 1
# 2: B 2
# 3: C 3
# 4: D 4
# 5: D 4
# 6: D 4
# 7: D 4
# 8: E 5
# 9: F 6
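Note that dt2[dt1] returns one row per key in dt1, including keys that never occur in dt2, and sorts the result by key. If the aim is to attach the new code to every row of the original data while keeping the row order, an update join is one option. The sketch below assumes a hypothetical lookup table named lookup and applies it to the question's sample data.

```r
library(data.table)

# Hypothetical lookup table: old code -> new code
lookup <- data.table(old = c(58, 57, 56, 59), new = 1:4)

dt <- data.table(group_old = c(58, 58, 57, 57, 57, 56, 56, 56, 59, 59, 56))

# Update join: for rows of dt whose group_old matches lookup$old,
# add group_new by reference; i.new refers to the 'new' column of lookup
dt[lookup, on = c(group_old = "old"), group_new := i.new]

dt$group_new  # 1 1 2 2 2 3 3 3 4 4 3
```

Rows of dt with no match in lookup would get NA in group_new, and no keys need to be set beforehand.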
I discovered another base R way that's a bit faster than my original:
# inside within(), columns can be referenced directly, without df$
df <- within(df, { group_new <- as.numeric(as.factor(group_old)) } )
df <- within(df, { group_new <- match(group_new, unique(group_new)) } )
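The factor conversion in the first step may not be needed, since match() against unique() already numbers values by first appearance. The sketch below checks this on simulated data; the vector x and its size are made up for illustration, and timings will vary by machine.

```r
set.seed(1)
# Simulated group codes (hypothetical data, smaller than the real 8 million rows)
x <- sample(1000:5000, 1e5, replace = TRUE)

# Two-step version, as in the answer above
two_step <- as.numeric(as.factor(x))
two_step <- match(two_step, unique(two_step))

# One-step version: number each value by the position of its first appearance
one_step <- match(x, unique(x))

identical(two_step, one_step)  # TRUE

# Rough timing of the one-step version
system.time(match(x, unique(x)))
```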