![](/img/trans.png)
[英]Is there a function in R tidyverse to categorize character values of a column based on key words and assign a category?
[英]R function categorize by column?
我想編寫一個函數,它接受一個數據框,計算多列的出現次數,然后根據列名稱的出現為該行分配一個“類別”。
以此df為例:
df <- data.frame(k1 = c(0,0,3,4,5,1),
k2 = c(1,0,0,4,5,0),
k3 = c(0,0,0,8,0,0),
k4 = c(2,5,0,3,4,5))
我希望輸出看起來像這樣:
df.final<-data.frame(k1 = c(0,0,3,4,5,1),
k2 = c(1,0,0,4,5,0),
k3 = c(0,0,0,8,0,0),
k4 = c(2,5,0,3,4,5),
Category = c("k2_k4","k4","k1","k1_k2_k3_k4","k1_k2_k4","k1_k4"))
當然,我的實際數據是很多很多行,我希望這個函數可以用來評估任意列數的數據幀。 我只是不確定如何編寫該函數。 我是一個寫新手的功能!
您可以使用data.table::transpose()
函數使每一行成為一個向量,然后使用sapply
循環遍歷列表並粘貼值不為零的相應列名:
df$category = sapply(data.table::transpose(df),
function(r) paste0(names(df)[r != 0], collapse = "_"))
df
# k1 k2 k3 k4 category
#1 0 1 0 2 k2_k4
#2 0 0 0 5 k4
#3 3 0 0 0 k1
#4 4 4 8 3 k1_k2_k3_k4
#5 5 5 0 4 k1_k2_k4
#6 1 0 0 5 k1_k4
在基地R,有很多選擇。 一:
df$Category <- apply(df > 0, 1, function(x){toString(names(df)[x])})
df
## k1 k2 k3 k4 Category
## 1 0 1 0 2 k2, k4
## 2 0 0 0 5 k4
## 3 3 0 0 0 k1
## 4 4 4 8 3 k1, k2, k3, k4
## 5 5 5 0 4 k1, k2, k4
## 6 1 0 0 5 k1, k4
或使用下划線,
df$Category <- apply(df > 0, 1, function(x){paste(names(df)[x], collapse = '_')})
df
## k1 k2 k3 k4 Category
## 1 0 1 0 2 k2_k4
## 2 0 0 0 5 k4
## 3 3 0 0 0 k1
## 4 4 4 8 3 k1_k2_k3_k4
## 5 5 5 0 4 k1_k2_k4
## 6 1 0 0 5 k1_k4
一種有趣的替代方法是purrr::by_row
:
library(purrr)
df %>% by_row(~toString(names(.)[.x > 0]), .collate = 'cols', .to = 'Category')
## # A tibble: 6 × 5
## k1 k2 k3 k4 Category
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 0 1 0 2 k2, k4
## 2 0 0 0 5 k4
## 3 3 0 0 0 k1
## 4 4 4 8 3 k1, k2, k3, k4
## 5 5 5 0 4 k1, k2, k4
## 6 1 0 0 5 k1, k4
df$Category = paste(ifelse(df$k1>0, 'k1_',''), ifelse(df$k2>0, 'k2_',''), ifelse(df$k3>0, 'k3_',''), ifelse(df$k4>0, 'k4_',''), sep='')
結果:
k1 k2 k3 k4 Category
1 0 1 0 2 k2_k4_
2 0 0 0 5 k4_
3 3 0 0 0 k1_
4 4 4 8 3 k1_k2_k3_k4_
5 5 5 0 4 k1_k2_k4_
6 1 0 0 5 k1_k4_
也許得到更有效的方式。 我太新手了。
使用data.table
:
library(data.table)
setDT(df)
df[ , I := .I]
df[melt(df, id.vars = "I")[value != 0,
paste(variable, collapse = "_"),
keyby = I],
Category := i.V1, on = "I"][]
# k1 k2 k3 k4 I Category
# 1: 0 1 0 2 1 k2_k4
# 2: 0 0 0 5 2 k4
# 3: 3 0 0 0 3 k1
# 4: 4 4 8 3 4 k1_k2_k3_k4
# 5: 5 5 0 4 5 k1_k2_k4
# 6: 1 0 0 5 6 k1_k4
方法草圖:
melt
數據 - 讓我們在不轉換為矩陣的情況下“按行”進行操作 我們可以在base R
中以矢量化形式執行此操作(不使用包)。
df$category <- gsub('^NA_|NA_+|_NA', '', do.call(paste,
c(as.data.frame(`dim<-`(names(df)[(NA^!df)*col(df)], dim(df))), sep="_")))
df$category
#[1] "k2_k4" "k4" "k1" "k1_k2_k3_k4" "k1_k2_k4" "k1_k4"
1)想法是將數據集轉換為邏輯向量( !df
- 為0時返回TRUE,其他值返回FALSE)
2)將TRUE值更改為NA( NA^
)
3)然后乘以列索引( col(df)
)
4)使用此索引填充列名稱
5)輸出是一個vector
,因此我們在分配原始數據集的維度后將其更改為data.frame
6) paste
與行元素do.call(paste
7)最后使用gsub
NA
替換為空白( ""
)
數據集
set.seed(24)
df <- data.frame(k1 = sample(0:5, 1e6, replace=TRUE),
k2 = sample(0:7, 1e6, replace = TRUE),
k3 = sample(0:8, 1e6, replace=TRUE),
k4 = sample(0:4, 1e6, replace = TRUE))
df2 <- copy(df)
setDT(df2)
職能
psidom <- function(){
sapply(data.table::transpose(df),
function(r) paste0(names(df)[r != 0], collapse = "_"))}
akrun <- function(){
gsub('^NA_|NA_+|_NA', '', do.call(paste,
c(as.data.frame(`dim<-`(names(df)[(NA^!df)*col(df)], dim(df))), sep="_")))
}
ae <- function(){
apply(df > 0, 1, function(x){toString(names(df)[x])})}
ae2 <- function(){
df %>%
by_row(~toString(names(.)[.x > 0]),
.collate = 'cols', .to = 'Category')
}
MC <- function(){
df2[ , I := .I]
df2[melt(df2, id.vars = "I")[value != 0,
paste(variable, collapse = "_"),
keyby = I],
Category := i.V1, on = "I"][]
}
Eric <- function() {
paste(ifelse(df$k1>0, 'k1_',''),
ifelse(df$k2>0, 'k2_',''),
ifelse(df$k3>0, 'k3_',''),
ifelse(df$k4>0, 'k4_',''), sep='')
}
基准輸出 - system.time
system.time(psidom())
# user system elapsed
# 7.91 0.06 7.97
system.time(ae())
# user system elapsed
# 10.22 0.00 10.22
system.time(ae2())
# user system elapsed
# 100.60 0.27 101.44
system.time(MC())
# user system elapsed
# 4.22 0.03 4.25
system.time(Eric())
# user system elapsed
# 1.40 0.00 1.41
system.time(akrun())
# user system elapsed
# 1.53 0.00 1.53
基准輸出 - 微基准測試
library(microbenchmark)
microbenchmark(psidom(), akrun(), ae(), ae2(), MC(), Eric(), unit = "relative",
times = 10)
#Unit: relative
# expr min lq mean median uq max neval
# psidom() 4.0824126 4.1283338 3.9332463 4.237229 3.4060509 4.2147045 10
# akrun() 1.0000000 1.0000000 1.0000000 1.000000 1.0000000 1.0000000 10
# ae() 6.7507093 6.9175511 6.0683960 6.725867 5.1087104 5.1901925 10
# ae2() 62.4294377 61.4709644 53.7637154 59.873279 44.9316386 44.9233634 10
# MC() 3.1439541 3.4666872 3.1479070 3.559120 2.7554062 2.8741309 10
# Eric() 0.9091862 0.9628939 0.9702425 1.042875 0.9878793 0.9686051 10
討論/評論
@ Eric的方法是最快的,但是當列數更多時,嵌套的ifelse
語句也會更多..
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.