
How to combine multiple character columns into one column and remove NA without knowing the number of columns

I would like to create a column that combines the character values of the other columns, skipping NA. I have tried paste, str_c and unite, but could not get the expected result; maybe I used them incorrectly.

In the real case I do not know the number of columns in advance, because the datasets cover different numbers of years.

I.e., some datasets contain 10 years, while others contain 20.

Here is the input data:

input <- tibble(
  id = c('aa', 'ss', 'dd', 'qq'),
  '2017' = c('tv', NA, NA, 'web'),
  '2018' = c(NA, 'web', NA, NA),
  '2019' = c(NA, NA, 'book', 'tv')
)

# A tibble: 4 x 4
  id    `2017` `2018` `2019`
  <chr> <chr>  <chr>  <chr> 
1 aa    tv     NA     NA    
2 ss    NA     web    NA    
3 dd    NA     NA     book  
4 qq    web    NA     tv    

The desired output with the ALL column is:

> output
# A tibble: 4 x 5
  id    `2017` `2018` `2019` ALL   
  <chr> <chr>  <chr>  <chr>  <chr> 
1 aa    tv     NA     NA     tv    
2 ss    NA     web    NA     web   
3 dd    NA     NA     book   book  
4 qq    web    NA     tv     web tv

Thanks for the help!

This is actually a duplicate of (or very close to) this question, but things have changed since then: unite now has an na.rm parameter which drops NAs.

As for selecting the columns, the code below selects every column except the first without naming any of them, so it should work for your case with any number of years.

library(tidyverse)

input %>%
    unite("ALL", names(input)[-1], remove = FALSE, sep = " ", na.rm = TRUE)

# A tibble: 4 x 5
#  id    ALL    `2017` `2018` `2019`
#  <chr> <chr>  <chr>  <chr>  <chr> 
#1 aa    tv     tv     NA     NA    
#2 ss    web    NA     web    NA    
#3 dd    book   NA     NA     book  
#4 qq    web tv web    NA     tv    

It worked for me after installing the development version of tidyr:

devtools::install_github("tidyverse/tidyr")
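Note that the unite() call above places ALL directly before the year columns, whereas the desired output in the question has it as the last column. A minimal sketch of one way to move it, assuming a dplyr version that provides relocate() (this step is not part of the original answer); the year columns are selected here with -id instead of names(input)[-1], which should be equivalent for this data:

```r
library(tidyr)
library(dplyr)

input <- tibble::tibble(
  id = c("aa", "ss", "dd", "qq"),
  `2017` = c("tv", NA, NA, "web"),
  `2018` = c(NA, "web", NA, NA),
  `2019` = c(NA, NA, "book", "tv")
)

output <- input %>%
  # Combine every column except id, dropping NA values
  unite("ALL", -id, remove = FALSE, sep = " ", na.rm = TRUE) %>%
  # Move the new column to the end, as in the desired output
  relocate(ALL, .after = last_col())

output
```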

Here is a base R method:

input$ALL <- apply(input[-1], 1, function(x) paste(na.omit(x), collapse=" "))
input$ALL
#[1] "tv"     "web"    "book"   "web tv"
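If the data might contain other non-year columns besides id, selecting by position with input[-1] would paste those as well. A rough sketch of a variant that picks the year columns by name instead; the helper combine_years and the assumption that year columns are named with exactly four digits are hypothetical, not taken from the question:

```r
# Hypothetical helper: pastes the non-NA values of all columns whose
# names look like a four-digit year (an assumption about the data).
combine_years <- function(df, sep = " ") {
  year_cols <- grep("^[0-9]{4}$", names(df), value = TRUE)
  df$ALL <- apply(df[year_cols], 1, function(x) paste(na.omit(x), collapse = sep))
  df
}

input <- data.frame(
  id = c("aa", "ss", "dd", "qq"),
  `2017` = c("tv", NA, NA, "web"),
  `2018` = c(NA, "web", NA, NA),
  `2019` = c(NA, NA, "book", "tv"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

result <- combine_years(input)
result$ALL
```

Because the columns are matched by pattern, the same call should work whether a dataset covers 10 years or 20.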

For the sake of completeness (and to supplement LocoGris' data.table answer), here are three other approaches which update input by reference, i.e., without copying the whole data object.

All approaches return the same result and can handle an arbitrary number of years.

Note that id is assumed to be a unique key, i.e., without any duplicates.

Reshape, na.omit(), aggregate

library(data.table)
setDT(input)[, ALL := melt(input, id.var = "id")[, toString(na.omit(value)), by = id]$V1][]
   id 2017 2018 2019     ALL
1: aa   tv <NA> <NA>      tv
2: ss <NA>  web <NA>     web
3: dd <NA> <NA> book    book
4: qq  web <NA>   tv web, tv

BTW, reshaping from wide to long format is a more concise way to store sparsely populated data.

melt(input, id.var = "id", na.rm = TRUE)
   id variable value
1: aa     2017    tv
2: qq     2017   web
3: ss     2018   web
4: dd     2019  book
5: qq     2019    tv

Reshape, aggregate, join

library(data.table)
setDT(input)[melt(input, id.var = "id", na.rm = TRUE)[, toString(value), by = id],
             on = "id", ALL := V1][]

Here the NA values are dropped during the reshape step, so the aggregated result no longer follows the original row order. Hence, an update join is required.

Filter(), aggregate
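To make the row-order issue concrete, the one-liner above can be split into its steps. This is only an illustrative sketch on the sample data from the question; the intermediate names long and agg are introduced here for clarity:

```r
library(data.table)

input <- data.table(
  id = c("aa", "ss", "dd", "qq"),
  `2017` = c("tv", NA, NA, "web"),
  `2018` = c(NA, "web", NA, NA),
  `2019` = c(NA, NA, "book", "tv")
)

# Step 1: reshape to long format, dropping the NA values
long <- melt(input, id.vars = "id", na.rm = TRUE)

# Step 2: aggregate per id; groups appear in order of first occurrence in long
agg <- long[, .(ALL = toString(value)), by = id]
agg$id   # no longer in the row order of input

# Step 3: update join, matching on id so each value lands in the right row
input[agg, on = "id", ALL := i.ALL]
input$ALL
```

Assigning agg$ALL to input directly would pair values with the wrong ids, which is why the join on id is used.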

library(data.table)
setDT(input)[, ALL := .SD[, toString(Filter(Negate(is.na), .SD)), by = id]$V1][]

A data.table approach:

library(data.table)
input <- data.table(
  id = c('aa', 'ss', 'dd', 'qq'),
  '2017' = c('tv', NA, NA, 'web'),
  '2018' = c(NA, 'web', NA, NA),
  '2019' = c(NA, NA, 'book', 'tv')
)

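# Paste the non-NA values of each row; the NAs in the year columns stay untouched,
# and no stray spaces are left where values were missing
input[, ALL := paste(na.omit(unlist(.SD)), collapse = " "), .SDcols = 2:ncol(input), by = seq_len(nrow(input))]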
