R function categorize by column?

Question

I would like to write a function that takes a data frame, counts occurrences across multiple columns, and then assigns the row with a "Category" based on column name occurrence.

Taking this df as an example:

df <- data.frame(k1 = c(0,0,3,4,5,1), 
                 k2 = c(1,0,0,4,5,0), 
                 k3 = c(0,0,0,8,0,0), 
                 k4 = c(2,5,0,3,4,5))

I'd like the output to look like this:

df.final<-data.frame(k1 = c(0,0,3,4,5,1), 
                     k2 = c(1,0,0,4,5,0), 
                     k3 = c(0,0,0,8,0,0), 
                     k4 = c(2,5,0,3,4,5), 
                     Category = c("k2_k4","k4","k1","k1_k2_k3_k4","k1_k2_k4","k1_k4"))

Of course, my actual data is many, many more lines and I'm hoping this function can be used to evaluate data frames with any number of columns. I'm just not sure how to write the function. I'm a function writing newbie!

Answer 1

You can use data.table::transpose() function to make each row a vector, then use sapply to loop through the list and paste corresponding column names where the values are not zero:

df$category = sapply(data.table::transpose(df), 
                     function(r) paste0(names(df)[r != 0], collapse = "_"))

df
#  k1 k2 k3 k4    category
#1  0  1  0  2       k2_k4
#2  0  0  0  5          k4
#3  3  0  0  0          k1
#4  4  4  8  3 k1_k2_k3_k4
#5  5  5  0  4    k1_k2_k4
#6  1  0  0  5       k1_k4

Answer 2

In base R, there are a lot of options. One:

df$Category <- apply(df > 0, 1, function(x){toString(names(df)[x])})

df
##   k1 k2 k3 k4       Category
## 1  0  1  0  2         k2, k4
## 2  0  0  0  5             k4
## 3  3  0  0  0             k1
## 4  4  4  8  3 k1, k2, k3, k4
## 5  5  5  0  4     k1, k2, k4
## 6  1  0  0  5         k1, k4

or to use underscores,

df$Category <- apply(df > 0, 1, function(x){paste(names(df)[x], collapse = '_')})

df
##   k1 k2 k3 k4    Category
## 1  0  1  0  2       k2_k4
## 2  0  0  0  5          k4
## 3  3  0  0  0          k1
## 4  4  4  8  3 k1_k2_k3_k4
## 5  5  5  0  4    k1_k2_k4
## 6  1  0  0  5       k1_k4

A sort of interesting alternative is purrr::by_row :

library(purrr)

df %>% by_row(~toString(names(.)[.x > 0]), .collate = 'cols', .to = 'Category')

## # A tibble: 6 × 5
##      k1    k2    k3    k4       Category
##   <dbl> <dbl> <dbl> <dbl>          <chr>
## 1     0     1     0     2         k2, k4
## 2     0     0     0     5             k4
## 3     3     0     0     0             k1
## 4     4     4     8     3 k1, k2, k3, k4
## 5     5     5     0     4     k1, k2, k4
## 6     1     0     0     5         k1, k4

Answer 3

df$Category = paste(ifelse(df$k1>0, 'k1_',''), ifelse(df$k2>0, 'k2_',''), ifelse(df$k3>0, 'k3_',''), ifelse(df$k4>0, 'k4_',''), sep='')

result:

  k1 k2 k3 k4     Category
1  0  1  0  2       k2_k4_
2  0  0  0  5          k4_
3  3  0  0  0          k1_
4  4  4  8  3 k1_k2_k3_k4_
5  5  5  0  4    k1_k2_k4_
6  1  0  0  5       k1_k4_

Maybe got more efficient way. I'm too newbie.

Answer 4

Using data.table :

library(data.table)
setDT(df)

df[ , I := .I]

df[melt(df, id.vars = "I")[value != 0,
                           paste(variable, collapse = "_"),
                           keyby = I],
   Category := i.V1, on = "I"][]
#    k1 k2 k3 k4 I    Category
# 1:  0  1  0  2 1       k2_k4
# 2:  0  0  0  5 2          k4
# 3:  3  0  0  0 3          k1
# 4:  4  4  8  3 4 k1_k2_k3_k4
# 5:  5  5  0  4 5    k1_k2_k4
# 6:  1  0  0  5 6       k1_k4

Sketch of approach:

Add row ID for keeping track of it
melt the data -- lets us do things "row-wise" without converting to matrix
eliminate "empty" row/column combinations
within each row ID, paste together all remaining column names
merge this back to the original data

Answer 5

We can do this in a vectorized form in base R (no packages used).

df$category <- gsub('^NA_|NA_+|_NA', '', do.call(paste, 
      c(as.data.frame(`dim<-`(names(df)[(NA^!df)*col(df)], dim(df))), sep="_")))
df$category
#[1] "k2_k4"       "k4"          "k1"          "k1_k2_k3_k4" "k1_k2_k4"    "k1_k4"

Explanation

1) The idea is convert the dataset into a logical vector ( !df - returns TRUE for 0 and FALSE for other values)

2) Change the TRUE values to NA ( NA^ )

3) Then multiply with the column index ( col(df) )

4) Use this index to populate the column names

5) The output is a vector , so we change that to data.frame after assigning the dimensions of the original dataset

6) paste the row elements with do.call(paste

7) Finally replace the strings with NA to blank ( "" ) using gsub

Benchmarks

Dataset

set.seed(24)
df <- data.frame(k1 = sample(0:5, 1e6, replace=TRUE),
                 k2 = sample(0:7, 1e6, replace = TRUE),
                 k3 = sample(0:8, 1e6, replace=TRUE),
                 k4 = sample(0:4, 1e6, replace = TRUE))

df2 <- copy(df)
setDT(df2)

Functions

psidom <- function(){
         sapply(data.table::transpose(df), 
                     function(r) paste0(names(df)[r != 0], collapse = "_"))}

akrun <- function(){
     gsub('^NA_|NA_+|_NA', '', do.call(paste, 
      c(as.data.frame(`dim<-`(names(df)[(NA^!df)*col(df)], dim(df))), sep="_")))
    }

ae <- function(){
     apply(df > 0, 1, function(x){toString(names(df)[x])})}

ae2 <- function(){
 df %>%
     by_row(~toString(names(.)[.x > 0]), 
     .collate = 'cols', .to = 'Category')
 }

MC <- function(){
    df2[ , I := .I]

df2[melt(df2, id.vars = "I")[value != 0,
                           paste(variable, collapse = "_"),
                           keyby = I],
   Category := i.V1, on = "I"][]
  }

Eric <- function() {
         paste(ifelse(df$k1>0, 'k1_',''),
          ifelse(df$k2>0, 'k2_',''),
          ifelse(df$k3>0, 'k3_',''),
          ifelse(df$k4>0, 'k4_',''), sep='')
    }

Benchmark Output - system.time

system.time(psidom())
#   user  system elapsed 
#   7.91    0.06    7.97 

system.time(ae())
#   user  system elapsed 
#  10.22    0.00   10.22 

system.time(ae2())
#   user  system elapsed 
# 100.60    0.27  101.44 


system.time(MC())
#   user  system elapsed 
#   4.22    0.03    4.25 

system.time(Eric())
#   user  system elapsed 
#   1.40    0.00    1.41 

system.time(akrun())
#   user  system elapsed 
#   1.53    0.00    1.53

Benchmark Output - microbenchmark

library(microbenchmark)
microbenchmark(psidom(), akrun(), ae(), ae2(), MC(), Eric(), unit = "relative",
       times = 10)
#Unit: relative
#     expr        min         lq       mean    median         uq        max neval
# psidom()  4.0824126  4.1283338  3.9332463  4.237229  3.4060509  4.2147045    10
#  akrun()  1.0000000  1.0000000  1.0000000  1.000000  1.0000000  1.0000000    10
#     ae()  6.7507093  6.9175511  6.0683960  6.725867  5.1087104  5.1901925    10
#    ae2() 62.4294377 61.4709644 53.7637154 59.873279 44.9316386 44.9233634    10
#     MC()  3.1439541  3.4666872  3.1479070  3.559120  2.7554062  2.8741309    10
#   Eric()  0.9091862  0.9628939  0.9702425  1.042875  0.9878793  0.9686051    10

Discussion/Comments

@Eric's method is the fastest, but when the number of columns are more, then nested ifelse statements will also be more..

R function categorize by column?

Question

5 answers

solution1
3 ACCPTED 2017-01-20 02:40:42

solution2
3 2017-01-20 02:43:25

solution3
2 2017-01-20 02:29:22

solution4
2 2017-01-20 02:36:19

solution5
2 2017-01-20 03:45:33

Explanation

Benchmarks

R function categorize by column?

Question

5 answers

solution1 3 ACCPTED 2017-01-20 02:40:42

solution2 3 2017-01-20 02:43:25

solution3 2 2017-01-20 02:29:22

solution4 2 2017-01-20 02:36:19

solution5 2 2017-01-20 03:45:33

Explanation

Benchmarks

solution1
3 ACCPTED 2017-01-20 02:40:42

solution2
3 2017-01-20 02:43:25

solution3
2 2017-01-20 02:29:22

solution4
2 2017-01-20 02:36:19

solution5
2 2017-01-20 03:45:33