
Removing columns from a data.table in R based on conditions

How do I remove columns based on values in a data.table in R?

If I have a data.table row,

dt = data.table("col1" = "a", "col2" = "b", "col3" = "c", 
"col4" = 'd', "col5" = "e", "col6" = 9, "col7" = 0, "col8" = 7,
"col9" = 0, "col10" = 99)

The first five columns are categorical and columns 6-10 are numeric. In the numeric columns, the same value is repeated in every row.

I have two questions:

  1. How do I remove the columns containing 0s? The affected column can vary with the input, i.e. sometimes col7 might be 0, sometimes col8, and so on.

  2. Once I remove the columns with 0 values, how do I concatenate the rest of the numbers into a single column - in this case, the new column will contain the number 9799

Is there a way to do this without removing the 0 value columns?

For the first part, I tried

cols_chosen = c("col6", "col7","col8","col9","col10")

condition = c(FALSE, dt[, lapply(.SD, function(x) sum(x)< 1), .SDcols = cols_chosen])

dt[, which(condition) := NULL]

While I am getting the correct value for the conditions (a list of 5 logical values), the last command is failing with the error

Error in which(condition) : argument to 'which' is not logical

I took the statements above from an earlier answer, "Remove columns of dataframe based on conditions in R".
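A note on the error itself: `lapply()` inside `j` returns a list (a one-row data.table), and `c(FALSE, <list>)` is still a list, which `which()` rejects. The sketch below is one possible repair, not the approach from the linked answer. It flattens the result with `unlist()` and deletes columns by name, so the leading `FALSE` offset is no longer needed. The names `is_zero` and `drop_cols` are introduced here for illustration only.

```r
library(data.table)

dt <- data.table(col1 = "a", col2 = "b", col3 = "c", col4 = "d", col5 = "e",
                 col6 = 9, col7 = 0, col8 = 7, col9 = 0, col10 = 99)
cols_chosen <- c("col6", "col7", "col8", "col9", "col10")

# lapply() in j yields a list; unlist() turns it into a named logical vector
is_zero <- unlist(dt[, lapply(.SD, function(x) sum(x) < 1), .SDcols = cols_chosen])

# Delete by name instead of by position, so positions within cols_chosen
# are never confused with positions within dt
drop_cols <- names(is_zero)[is_zero]
if (length(drop_cols) > 0) dt[, (drop_cols) := NULL]

names(dt)
```

The `sum(x) < 1` test is kept from the question; it assumes the numeric columns hold no negative values.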

dt = data.table("col1" = "a", "col2" = "b", "col3" = "c", 
"col4" = 'd', "col5" = "e", "col6" = 9, "col7" = 0, "col8" = 7,
"col9" = 0, "col10" = 99)

not0 = function(x) is.numeric(x) && !anyNA(x) && all(x!=0)
dt[, .(
    ## your categorical columns
    col1, col2, col3, col4, col5,
    ## new column pasted from non-0 numeric columns
    new = as.numeric(paste0(unlist(.SD), collapse=""))
  ),
  ## this filters columns to be provided in .SD column subset
  .SDcols = not0,
  ## we group by each row so it will handle input of multiple rows
  by = .(row=seq_len(nrow(dt)))
  ][, row:=NULL ## this removes extra grouping column
    ][] ## this prints
#   col1 col2 col3 col4 col5  new
#1:    a    b    c    d    e 9799

Alternatively, if you want to update the existing table in place:

is0 = function(x) is.numeric(x) && !anyNA(x) && all(x==0)
## remove numeric columns that contain only 0s
dt[, which(sapply(dt, is0)) := NULL]

## add new column
dt[, new := as.numeric(
    paste0(unlist(.SD), collapse="")
  ), .SDcols=is.numeric, by=.(row=seq_len(nrow(dt)))
  ][]
#   col1 col2 col3 col4 col5 col6 col8 col10  new
#1:    a    b    c    d    e    9    7    99 9799
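For the last part of the question (building the new column while keeping the zero columns), a minimal sketch along the same lines is shown below. It reuses the `not0` predicate from above only to choose which columns feed the concatenation and deletes nothing. It also uses `do.call(paste0, .SD)`, which pastes row-wise across columns, so no `by` is needed. The two-row table and the name `keep` are assumptions made for illustration.

```r
library(data.table)

# Two rows, with the numeric values repeated in every row as in the question
dt <- data.table(col1 = c("a", "b"),
                 col6 = 9, col7 = 0, col8 = 7, col9 = 0, col10 = 99)

not0 <- function(x) is.numeric(x) && !anyNA(x) && all(x != 0)

# Names of the numeric columns that contain no zeros
keep <- names(dt)[sapply(dt, not0)]

# paste0() is vectorised over rows, so each row gets its own concatenation;
# the zero columns stay in the table untouched
dt[, new := as.numeric(do.call(paste0, .SD)), .SDcols = keep]
dt
```

Like the answers here, this assumes whole-number values. Very long concatenations may exceed what a double can represent exactly, in which case leaving `new` as a character column is safer.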
  1. To remove the columns containing zeros (assuming the numbers are repeated in every row, as you say), it is enough to check the first row for elements equal to 0 and keep the columns that are not:
dt <- data.frame("col1" = "a", "col2" = "b", "col3" = "c", 
"col4" = 'd', "col5" = "e", "col6" = 9, "col7" = 0, "col8" = 7,
"col9" = 0, "col10" = 99)
dt <- dt[,dt[1,] != 0]

This leaves us with dt as:

  col1 col2 col3 col4 col5 col6 col8 col10
1    a    b    c    d    e    9    7    99
  2. To concatenate the remaining numeric columns (assuming they are all integers) into a new column, you can use lapply on the first row to get a logical vector indicating the columns with numbers. Then you can convert them to strings and paste them together in the new column.
numTag <- unlist(lapply(X = dt[1,], FUN = is.numeric))
dt$new_col <- rep(as.numeric(paste(as.character(dt[1,numTag]), collapse = '', sep = '')), nrow(dt))

Now dt looks like:

  col1 col2 col3 col4 col5 col6 col8 col10 new_col
1    a    b    c    d    e    9    7    99    9799
  3. To do this without removing the zero-valued columns, the only necessary twist is to filter out the zeroes from our initial logical vector:
numTag <- unlist(lapply(X = dt[1,], FUN = is.numeric))
numTag <- numTag & (dt[1,] != 0)

dt$new_col <- rep(as.numeric(paste(as.character(dt[1,numTag]), collapse = '', sep = '')), nrow(dt))
dt
  col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 new_col
1    a    b    c    d    e    9    0    7    0    99    9799
library(data.table)
library(dplyr)
library(tidyr)

dt = data.table("col1" = "a", "col2" = "b", "col3" = "c", 
                "col4" = 'd', "col5" = "e", "col6" = 9, "col7" = 0, "col8" = 7,
                "col9" = 0, "col10" = 99)


## which columns contain only zeros?
zero_vars <-  dt %>% 
  dplyr::select_if(~max(.x) == 0) %>% 
  colnames()


## which columns are non-zero numeric vars?
numeric_vars <- dt %>% 
  dplyr::select(-all_of(zero_vars)) %>% 
  dplyr::select_if(is.numeric) %>% 
  colnames()
                  

## create new table
collapsed_dt <- 
  dt %>% 
  dplyr::select(all_of(numeric_vars)) %>%   ## select only non-zero numeric vars
  mutate_all(as.character) %>% 
  unite( col = "collapsed_var", sep = "") ## unite them to new var 'collapsed_var'


## re-join the collapsed var to the original table
dt %>% 
  dplyr::select_if(is.character) %>% ## only character variables
  cbind(collapsed_dt) ## bind the collapsed_dt 

            
