[英]R function categorize by column?
I would like to write a function that takes a data frame, counts occurrences across multiple columns, and then assigns the row with a "Category" based on column name occurrence. 我想编写一个函数,它接受一个数据框,计算多列的出现次数,然后根据列名称的出现为该行分配一个“类别”。
Taking this df as an example: 以此df为例:
df <- data.frame(k1 = c(0,0,3,4,5,1),
k2 = c(1,0,0,4,5,0),
k3 = c(0,0,0,8,0,0),
k4 = c(2,5,0,3,4,5))
I'd like the output to look like this: 我希望输出看起来像这样:
df.final<-data.frame(k1 = c(0,0,3,4,5,1),
k2 = c(1,0,0,4,5,0),
k3 = c(0,0,0,8,0,0),
k4 = c(2,5,0,3,4,5),
Category = c("k2_k4","k4","k1","k1_k2_k3_k4","k1_k2_k4","k1_k4"))
Of course, my actual data is many, many more lines and I'm hoping this function can be used to evaluate data frames with any number of columns. 当然,我的实际数据是很多很多行,我希望这个函数可以用来评估任意列数的数据帧。 I'm just not sure how to write the function.
我只是不确定如何编写该函数。 I'm a function writing newbie!
我是一个写新手的功能!
You can use data.table::transpose()
function to make each row a vector, then use sapply
to loop through the list and paste corresponding column names where the values are not zero: 您可以使用
data.table::transpose()
函数使每一行成为一个向量,然后使用sapply
循环遍历列表并粘贴值不为零的相应列名:
df$category = sapply(data.table::transpose(df),
function(r) paste0(names(df)[r != 0], collapse = "_"))
df
# k1 k2 k3 k4 category
#1 0 1 0 2 k2_k4
#2 0 0 0 5 k4
#3 3 0 0 0 k1
#4 4 4 8 3 k1_k2_k3_k4
#5 5 5 0 4 k1_k2_k4
#6 1 0 0 5 k1_k4
In base R, there are a lot of options. 在基地R,有很多选择。 One:
一:
df$Category <- apply(df > 0, 1, function(x){toString(names(df)[x])})
df
## k1 k2 k3 k4 Category
## 1 0 1 0 2 k2, k4
## 2 0 0 0 5 k4
## 3 3 0 0 0 k1
## 4 4 4 8 3 k1, k2, k3, k4
## 5 5 5 0 4 k1, k2, k4
## 6 1 0 0 5 k1, k4
or to use underscores, 或使用下划线,
df$Category <- apply(df > 0, 1, function(x){paste(names(df)[x], collapse = '_')})
df
## k1 k2 k3 k4 Category
## 1 0 1 0 2 k2_k4
## 2 0 0 0 5 k4
## 3 3 0 0 0 k1
## 4 4 4 8 3 k1_k2_k3_k4
## 5 5 5 0 4 k1_k2_k4
## 6 1 0 0 5 k1_k4
A sort of interesting alternative is purrr::by_row
: 一种有趣的替代方法是
purrr::by_row
:
library(purrr)
df %>% by_row(~toString(names(.)[.x > 0]), .collate = 'cols', .to = 'Category')
## # A tibble: 6 × 5
## k1 k2 k3 k4 Category
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 0 1 0 2 k2, k4
## 2 0 0 0 5 k4
## 3 3 0 0 0 k1
## 4 4 4 8 3 k1, k2, k3, k4
## 5 5 5 0 4 k1, k2, k4
## 6 1 0 0 5 k1, k4
df$Category = paste(ifelse(df$k1>0, 'k1_',''), ifelse(df$k2>0, 'k2_',''), ifelse(df$k3>0, 'k3_',''), ifelse(df$k4>0, 'k4_',''), sep='')
result: 结果:
k1 k2 k3 k4 Category
1 0 1 0 2 k2_k4_
2 0 0 0 5 k4_
3 3 0 0 0 k1_
4 4 4 8 3 k1_k2_k3_k4_
5 5 5 0 4 k1_k2_k4_
6 1 0 0 5 k1_k4_
Maybe got more efficient way. 也许得到更有效的方式。 I'm too newbie.
我太新手了。
Using data.table
: 使用
data.table
:
library(data.table)
setDT(df)
df[ , I := .I]
df[melt(df, id.vars = "I")[value != 0,
paste(variable, collapse = "_"),
keyby = I],
Category := i.V1, on = "I"][]
# k1 k2 k3 k4 I Category
# 1: 0 1 0 2 1 k2_k4
# 2: 0 0 0 5 2 k4
# 3: 3 0 0 0 3 k1
# 4: 4 4 8 3 4 k1_k2_k3_k4
# 5: 5 5 0 4 5 k1_k2_k4
# 6: 1 0 0 5 6 k1_k4
Sketch of approach: 方法草图:
melt
the data -- lets us do things "row-wise" without converting to matrix melt
数据 - 让我们在不转换为矩阵的情况下“按行”进行操作 We can do this in a vectorized form in base R
(no packages used). 我们可以在
base R
中以矢量化形式执行此操作(不使用包)。
df$category <- gsub('^NA_|NA_+|_NA', '', do.call(paste,
c(as.data.frame(`dim<-`(names(df)[(NA^!df)*col(df)], dim(df))), sep="_")))
df$category
#[1] "k2_k4" "k4" "k1" "k1_k2_k3_k4" "k1_k2_k4" "k1_k4"
1) The idea is convert the dataset into a logical vector ( !df
- returns TRUE for 0 and FALSE for other values) 1)想法是将数据集转换为逻辑向量(
!df
- 为0时返回TRUE,其他值返回FALSE)
2) Change the TRUE values to NA ( NA^
) 2)将TRUE值更改为NA(
NA^
)
3) Then multiply with the column index ( col(df)
) 3)然后乘以列索引(
col(df)
)
4) Use this index to populate the column names 4)使用此索引填充列名称
5) The output is a vector
, so we change that to data.frame
after assigning the dimensions of the original dataset 5)输出是一个
vector
,因此我们在分配原始数据集的维度后将其更改为data.frame
6) paste
the row elements with do.call(paste
6)
paste
与行元素do.call(paste
7) Finally replace the strings with NA
to blank ( ""
) using gsub
7)最后使用
gsub
NA
替换为空白( ""
)
Dataset 数据集
set.seed(24)
df <- data.frame(k1 = sample(0:5, 1e6, replace=TRUE),
k2 = sample(0:7, 1e6, replace = TRUE),
k3 = sample(0:8, 1e6, replace=TRUE),
k4 = sample(0:4, 1e6, replace = TRUE))
df2 <- copy(df)
setDT(df2)
Functions 职能
psidom <- function(){
sapply(data.table::transpose(df),
function(r) paste0(names(df)[r != 0], collapse = "_"))}
akrun <- function(){
gsub('^NA_|NA_+|_NA', '', do.call(paste,
c(as.data.frame(`dim<-`(names(df)[(NA^!df)*col(df)], dim(df))), sep="_")))
}
ae <- function(){
apply(df > 0, 1, function(x){toString(names(df)[x])})}
ae2 <- function(){
df %>%
by_row(~toString(names(.)[.x > 0]),
.collate = 'cols', .to = 'Category')
}
MC <- function(){
df2[ , I := .I]
df2[melt(df2, id.vars = "I")[value != 0,
paste(variable, collapse = "_"),
keyby = I],
Category := i.V1, on = "I"][]
}
Eric <- function() {
paste(ifelse(df$k1>0, 'k1_',''),
ifelse(df$k2>0, 'k2_',''),
ifelse(df$k3>0, 'k3_',''),
ifelse(df$k4>0, 'k4_',''), sep='')
}
Benchmark Output - system.time 基准输出 - system.time
system.time(psidom())
# user system elapsed
# 7.91 0.06 7.97
system.time(ae())
# user system elapsed
# 10.22 0.00 10.22
system.time(ae2())
# user system elapsed
# 100.60 0.27 101.44
system.time(MC())
# user system elapsed
# 4.22 0.03 4.25
system.time(Eric())
# user system elapsed
# 1.40 0.00 1.41
system.time(akrun())
# user system elapsed
# 1.53 0.00 1.53
Benchmark Output - microbenchmark 基准输出 - 微基准测试
library(microbenchmark)
microbenchmark(psidom(), akrun(), ae(), ae2(), MC(), Eric(), unit = "relative",
times = 10)
#Unit: relative
# expr min lq mean median uq max neval
# psidom() 4.0824126 4.1283338 3.9332463 4.237229 3.4060509 4.2147045 10
# akrun() 1.0000000 1.0000000 1.0000000 1.000000 1.0000000 1.0000000 10
# ae() 6.7507093 6.9175511 6.0683960 6.725867 5.1087104 5.1901925 10
# ae2() 62.4294377 61.4709644 53.7637154 59.873279 44.9316386 44.9233634 10
# MC() 3.1439541 3.4666872 3.1479070 3.559120 2.7554062 2.8741309 10
# Eric() 0.9091862 0.9628939 0.9702425 1.042875 0.9878793 0.9686051 10
Discussion/Comments 讨论/评论
@Eric's method is the fastest, but when the number of columns are more, then nested ifelse
statements will also be more.. @ Eric的方法是最快的,但是当列数更多时,嵌套的
ifelse
语句也会更多..
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.