简体   繁体   中英

R function categorize by column?

I would like to write a function that takes a data frame, counts occurrences across multiple columns, and then assigns the row with a "Category" based on column name occurrence.

Taking this df as an example:

df <- data.frame(k1 = c(0,0,3,4,5,1), 
                 k2 = c(1,0,0,4,5,0), 
                 k3 = c(0,0,0,8,0,0), 
                 k4 = c(2,5,0,3,4,5))

I'd like the output to look like this:

df.final<-data.frame(k1 = c(0,0,3,4,5,1), 
                     k2 = c(1,0,0,4,5,0), 
                     k3 = c(0,0,0,8,0,0), 
                     k4 = c(2,5,0,3,4,5), 
                     Category = c("k2_k4","k4","k1","k1_k2_k3_k4","k1_k2_k4","k1_k4"))

Of course, my actual data is many, many more lines and I'm hoping this function can be used to evaluate data frames with any number of columns. I'm just not sure how to write the function. I'm a function writing newbie!

You can use data.table::transpose() function to make each row a vector, then use sapply to loop through the list and paste corresponding column names where the values are not zero:

df$category = sapply(data.table::transpose(df), 
                     function(r) paste0(names(df)[r != 0], collapse = "_"))

df
#  k1 k2 k3 k4    category
#1  0  1  0  2       k2_k4
#2  0  0  0  5          k4
#3  3  0  0  0          k1
#4  4  4  8  3 k1_k2_k3_k4
#5  5  5  0  4    k1_k2_k4
#6  1  0  0  5       k1_k4

In base R, there are a lot of options. One:

df$Category <- apply(df > 0, 1, function(x){toString(names(df)[x])})

df
##   k1 k2 k3 k4       Category
## 1  0  1  0  2         k2, k4
## 2  0  0  0  5             k4
## 3  3  0  0  0             k1
## 4  4  4  8  3 k1, k2, k3, k4
## 5  5  5  0  4     k1, k2, k4
## 6  1  0  0  5         k1, k4

or to use underscores,

df$Category <- apply(df > 0, 1, function(x){paste(names(df)[x], collapse = '_')})

df
##   k1 k2 k3 k4    Category
## 1  0  1  0  2       k2_k4
## 2  0  0  0  5          k4
## 3  3  0  0  0          k1
## 4  4  4  8  3 k1_k2_k3_k4
## 5  5  5  0  4    k1_k2_k4
## 6  1  0  0  5       k1_k4

A sort of interesting alternative is purrr::by_row :

library(purrr)

df %>% by_row(~toString(names(.)[.x > 0]), .collate = 'cols', .to = 'Category')

## # A tibble: 6 × 5
##      k1    k2    k3    k4       Category
##   <dbl> <dbl> <dbl> <dbl>          <chr>
## 1     0     1     0     2         k2, k4
## 2     0     0     0     5             k4
## 3     3     0     0     0             k1
## 4     4     4     8     3 k1, k2, k3, k4
## 5     5     5     0     4     k1, k2, k4
## 6     1     0     0     5         k1, k4
df$Category = paste(ifelse(df$k1>0, 'k1_',''), ifelse(df$k2>0, 'k2_',''), ifelse(df$k3>0, 'k3_',''), ifelse(df$k4>0, 'k4_',''), sep='')

result:

  k1 k2 k3 k4     Category
1  0  1  0  2       k2_k4_
2  0  0  0  5          k4_
3  3  0  0  0          k1_
4  4  4  8  3 k1_k2_k3_k4_
5  5  5  0  4    k1_k2_k4_
6  1  0  0  5       k1_k4_

Maybe got more efficient way. I'm too newbie.

Using data.table :

library(data.table)
setDT(df)

df[ , I := .I]

df[melt(df, id.vars = "I")[value != 0,
                           paste(variable, collapse = "_"),
                           keyby = I],
   Category := i.V1, on = "I"][]
#    k1 k2 k3 k4 I    Category
# 1:  0  1  0  2 1       k2_k4
# 2:  0  0  0  5 2          k4
# 3:  3  0  0  0 3          k1
# 4:  4  4  8  3 4 k1_k2_k3_k4
# 5:  5  5  0  4 5    k1_k2_k4
# 6:  1  0  0  5 6       k1_k4

Sketch of approach:

  • Add row ID for keeping track of it
  • melt the data -- lets us do things "row-wise" without converting to matrix
  • eliminate "empty" row/column combinations
  • within each row ID, paste together all remaining column names
  • merge this back to the original data

We can do this in a vectorized form in base R (no packages used).

df$category <- gsub('^NA_|NA_+|_NA', '', do.call(paste, 
      c(as.data.frame(`dim<-`(names(df)[(NA^!df)*col(df)], dim(df))), sep="_")))
df$category
#[1] "k2_k4"       "k4"          "k1"          "k1_k2_k3_k4" "k1_k2_k4"    "k1_k4"  

Explanation

1) The idea is convert the dataset into a logical vector ( !df - returns TRUE for 0 and FALSE for other values)

2) Change the TRUE values to NA ( NA^ )

3) Then multiply with the column index ( col(df) )

4) Use this index to populate the column names

5) The output is a vector , so we change that to data.frame after assigning the dimensions of the original dataset

6) paste the row elements with do.call(paste

7) Finally replace the strings with NA to blank ( "" ) using gsub

Benchmarks

Dataset

set.seed(24)
df <- data.frame(k1 = sample(0:5, 1e6, replace=TRUE),
                 k2 = sample(0:7, 1e6, replace = TRUE),
                 k3 = sample(0:8, 1e6, replace=TRUE),
                 k4 = sample(0:4, 1e6, replace = TRUE))

df2 <- copy(df)
setDT(df2)

Functions

psidom <- function(){
         sapply(data.table::transpose(df), 
                     function(r) paste0(names(df)[r != 0], collapse = "_"))}

akrun <- function(){
     gsub('^NA_|NA_+|_NA', '', do.call(paste, 
      c(as.data.frame(`dim<-`(names(df)[(NA^!df)*col(df)], dim(df))), sep="_")))
    }

ae <- function(){
     apply(df > 0, 1, function(x){toString(names(df)[x])})}

ae2 <- function(){
 df %>%
     by_row(~toString(names(.)[.x > 0]), 
     .collate = 'cols', .to = 'Category')
 }

MC <- function(){
    df2[ , I := .I]

df2[melt(df2, id.vars = "I")[value != 0,
                           paste(variable, collapse = "_"),
                           keyby = I],
   Category := i.V1, on = "I"][]
  }

Eric <- function() {
         paste(ifelse(df$k1>0, 'k1_',''),
          ifelse(df$k2>0, 'k2_',''),
          ifelse(df$k3>0, 'k3_',''),
          ifelse(df$k4>0, 'k4_',''), sep='')
    }

Benchmark Output - system.time

system.time(psidom())
#   user  system elapsed 
#   7.91    0.06    7.97 

system.time(ae())
#   user  system elapsed 
#  10.22    0.00   10.22 

system.time(ae2())
#   user  system elapsed 
# 100.60    0.27  101.44 


system.time(MC())
#   user  system elapsed 
#   4.22    0.03    4.25 

system.time(Eric())
#   user  system elapsed 
#   1.40    0.00    1.41 

system.time(akrun())
#   user  system elapsed 
#   1.53    0.00    1.53 

Benchmark Output - microbenchmark

library(microbenchmark)
microbenchmark(psidom(), akrun(), ae(), ae2(), MC(), Eric(), unit = "relative",
       times = 10)
#Unit: relative
#     expr        min         lq       mean    median         uq        max neval
# psidom()  4.0824126  4.1283338  3.9332463  4.237229  3.4060509  4.2147045    10
#  akrun()  1.0000000  1.0000000  1.0000000  1.000000  1.0000000  1.0000000    10
#     ae()  6.7507093  6.9175511  6.0683960  6.725867  5.1087104  5.1901925    10
#    ae2() 62.4294377 61.4709644 53.7637154 59.873279 44.9316386 44.9233634    10
#     MC()  3.1439541  3.4666872  3.1479070  3.559120  2.7554062  2.8741309    10
#   Eric()  0.9091862  0.9628939  0.9702425  1.042875  0.9878793  0.9686051    10

Discussion/Comments

@Eric's method is the fastest, but when the number of columns are more, then nested ifelse statements will also be more..

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM