I would like to write a function that takes a data frame, counts occurrences across multiple columns, and then assigns the row with a "Category" based on column name occurrence.
Taking this df as an example:
df <- data.frame(k1 = c(0,0,3,4,5,1),
k2 = c(1,0,0,4,5,0),
k3 = c(0,0,0,8,0,0),
k4 = c(2,5,0,3,4,5))
I'd like the output to look like this:
df.final<-data.frame(k1 = c(0,0,3,4,5,1),
k2 = c(1,0,0,4,5,0),
k3 = c(0,0,0,8,0,0),
k4 = c(2,5,0,3,4,5),
Category = c("k2_k4","k4","k1","k1_k2_k3_k4","k1_k2_k4","k1_k4"))
Of course, my actual data is many, many more lines and I'm hoping this function can be used to evaluate data frames with any number of columns. I'm just not sure how to write the function. I'm a function writing newbie!
You can use data.table::transpose()
function to make each row a vector, then use sapply
to loop through the list and paste corresponding column names where the values are not zero:
df$category = sapply(data.table::transpose(df),
function(r) paste0(names(df)[r != 0], collapse = "_"))
df
# k1 k2 k3 k4 category
#1 0 1 0 2 k2_k4
#2 0 0 0 5 k4
#3 3 0 0 0 k1
#4 4 4 8 3 k1_k2_k3_k4
#5 5 5 0 4 k1_k2_k4
#6 1 0 0 5 k1_k4
In base R, there are a lot of options. One:
df$Category <- apply(df > 0, 1, function(x){toString(names(df)[x])})
df
## k1 k2 k3 k4 Category
## 1 0 1 0 2 k2, k4
## 2 0 0 0 5 k4
## 3 3 0 0 0 k1
## 4 4 4 8 3 k1, k2, k3, k4
## 5 5 5 0 4 k1, k2, k4
## 6 1 0 0 5 k1, k4
or to use underscores,
df$Category <- apply(df > 0, 1, function(x){paste(names(df)[x], collapse = '_')})
df
## k1 k2 k3 k4 Category
## 1 0 1 0 2 k2_k4
## 2 0 0 0 5 k4
## 3 3 0 0 0 k1
## 4 4 4 8 3 k1_k2_k3_k4
## 5 5 5 0 4 k1_k2_k4
## 6 1 0 0 5 k1_k4
A sort of interesting alternative is purrr::by_row
:
library(purrr)
df %>% by_row(~toString(names(.)[.x > 0]), .collate = 'cols', .to = 'Category')
## # A tibble: 6 × 5
## k1 k2 k3 k4 Category
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 0 1 0 2 k2, k4
## 2 0 0 0 5 k4
## 3 3 0 0 0 k1
## 4 4 4 8 3 k1, k2, k3, k4
## 5 5 5 0 4 k1, k2, k4
## 6 1 0 0 5 k1, k4
df$Category = paste(ifelse(df$k1>0, 'k1_',''), ifelse(df$k2>0, 'k2_',''), ifelse(df$k3>0, 'k3_',''), ifelse(df$k4>0, 'k4_',''), sep='')
result:
k1 k2 k3 k4 Category
1 0 1 0 2 k2_k4_
2 0 0 0 5 k4_
3 3 0 0 0 k1_
4 4 4 8 3 k1_k2_k3_k4_
5 5 5 0 4 k1_k2_k4_
6 1 0 0 5 k1_k4_
Maybe got more efficient way. I'm too newbie.
Using data.table
:
library(data.table)
setDT(df)
df[ , I := .I]
df[melt(df, id.vars = "I")[value != 0,
paste(variable, collapse = "_"),
keyby = I],
Category := i.V1, on = "I"][]
# k1 k2 k3 k4 I Category
# 1: 0 1 0 2 1 k2_k4
# 2: 0 0 0 5 2 k4
# 3: 3 0 0 0 3 k1
# 4: 4 4 8 3 4 k1_k2_k3_k4
# 5: 5 5 0 4 5 k1_k2_k4
# 6: 1 0 0 5 6 k1_k4
Sketch of approach:
melt
the data -- lets us do things "row-wise" without converting to matrix We can do this in a vectorized form in base R
(no packages used).
df$category <- gsub('^NA_|NA_+|_NA', '', do.call(paste,
c(as.data.frame(`dim<-`(names(df)[(NA^!df)*col(df)], dim(df))), sep="_")))
df$category
#[1] "k2_k4" "k4" "k1" "k1_k2_k3_k4" "k1_k2_k4" "k1_k4"
1) The idea is convert the dataset into a logical vector ( !df
- returns TRUE for 0 and FALSE for other values)
2) Change the TRUE values to NA ( NA^
)
3) Then multiply with the column index ( col(df)
)
4) Use this index to populate the column names
5) The output is a vector
, so we change that to data.frame
after assigning the dimensions of the original dataset
6) paste
the row elements with do.call(paste
7) Finally replace the strings with NA
to blank ( ""
) using gsub
Dataset
set.seed(24)
df <- data.frame(k1 = sample(0:5, 1e6, replace=TRUE),
k2 = sample(0:7, 1e6, replace = TRUE),
k3 = sample(0:8, 1e6, replace=TRUE),
k4 = sample(0:4, 1e6, replace = TRUE))
df2 <- copy(df)
setDT(df2)
Functions
psidom <- function(){
sapply(data.table::transpose(df),
function(r) paste0(names(df)[r != 0], collapse = "_"))}
akrun <- function(){
gsub('^NA_|NA_+|_NA', '', do.call(paste,
c(as.data.frame(`dim<-`(names(df)[(NA^!df)*col(df)], dim(df))), sep="_")))
}
ae <- function(){
apply(df > 0, 1, function(x){toString(names(df)[x])})}
ae2 <- function(){
df %>%
by_row(~toString(names(.)[.x > 0]),
.collate = 'cols', .to = 'Category')
}
MC <- function(){
df2[ , I := .I]
df2[melt(df2, id.vars = "I")[value != 0,
paste(variable, collapse = "_"),
keyby = I],
Category := i.V1, on = "I"][]
}
Eric <- function() {
paste(ifelse(df$k1>0, 'k1_',''),
ifelse(df$k2>0, 'k2_',''),
ifelse(df$k3>0, 'k3_',''),
ifelse(df$k4>0, 'k4_',''), sep='')
}
Benchmark Output - system.time
system.time(psidom())
# user system elapsed
# 7.91 0.06 7.97
system.time(ae())
# user system elapsed
# 10.22 0.00 10.22
system.time(ae2())
# user system elapsed
# 100.60 0.27 101.44
system.time(MC())
# user system elapsed
# 4.22 0.03 4.25
system.time(Eric())
# user system elapsed
# 1.40 0.00 1.41
system.time(akrun())
# user system elapsed
# 1.53 0.00 1.53
Benchmark Output - microbenchmark
library(microbenchmark)
microbenchmark(psidom(), akrun(), ae(), ae2(), MC(), Eric(), unit = "relative",
times = 10)
#Unit: relative
# expr min lq mean median uq max neval
# psidom() 4.0824126 4.1283338 3.9332463 4.237229 3.4060509 4.2147045 10
# akrun() 1.0000000 1.0000000 1.0000000 1.000000 1.0000000 1.0000000 10
# ae() 6.7507093 6.9175511 6.0683960 6.725867 5.1087104 5.1901925 10
# ae2() 62.4294377 61.4709644 53.7637154 59.873279 44.9316386 44.9233634 10
# MC() 3.1439541 3.4666872 3.1479070 3.559120 2.7554062 2.8741309 10
# Eric() 0.9091862 0.9628939 0.9702425 1.042875 0.9878793 0.9686051 10
Discussion/Comments
@Eric's method is the fastest, but when the number of columns are more, then nested ifelse
statements will also be more..
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.