简体   繁体   中英

Combine rows based on multiple columns and keep all unique values

I have a dataset with User Information. For a specific user I have often multiple rows with more or less complete information. I want to summarize all rows that belong to a customer on the basis of First_Name, Last_Name, Street while keeping all information of the other columns and if there are two unique observation for a specific column I want to collapse them with ",".

This is what the df looks like

First_Name Last_Name Street Column1 Colum2 Colum_n

Mike       Smith      X     abc     ab     a
Mike       Smith      X     abc     ad     b
John       Smith      Y     xyz     xy     n
John       Smith      Y     xyz     xm     NA

My desired output would be

 First_Name Last_Name Street Column1 Colum2 Colum_n

 Mike       Smith      X     abc     ab,ad     a,b
 John       Smith      Y     xyz     xy,xm       n

I would like using dplyr and tried something with

df %>% 
group_by(First_Name,Last_Name, Street) %>%
summarise_all(funs())

The problem with that function is that I only had the option of using something like the mean or the first occuring value for a column and this would mean the loss of values. What I would like are columns with all unique values without NA's

You can write your own summarization function like

concat_unique <- function(x){paste(unique(x), collapse=',')}

and then apply it using summarize_all(concat_unique)

A solution using tidyverse .

library(tidyverse)

dat2 <- dat %>%
  group_by(First_Name, Last_Name, Street) %>%
  # Replace NA with ""
  mutate_all(funs(replace(., is.na(.), ""))) %>%
  # Combine all strings
  summarize_all(funs(toString(unique(.)))) %>%
  # Replace the strings ended with ", "
  mutate_all(funs(str_replace(., ", $", ""))) %>%
  ungroup()
dat2
# # A tibble: 2 x 6
#   First_Name Last_Name Street Column1 Colum2 Colum_n
#   <chr>      <chr>     <chr>  <chr>   <chr>  <chr>  
# 1 John       Smith     Y      xyz     xy, xm n      
# 2 Mike       Smith     X      abc     ab, ad a, b    

After seeing others answer, I realized that we don't have to deal with NA and , as strings. The following is more efficient.

dat2 <- dat %>%
  group_by(First_Name, Last_Name, Street) %>%
  # Combine all strings
  summarize_all(funs(toString(unique(.[!is.na(.)])))) %>%
  ungroup()
dat2
# # A tibble: 2 x 6
#   First_Name Last_Name Street Column1 Colum2 Colum_n
#   <chr>      <chr>     <chr>  <chr>   <chr>  <chr>  
# 1 John       Smith     Y      xyz     xy, xm n      
# 2 Mike       Smith     X      abc     ab, ad a, b  

DATA

dat <- read.table(text = 'First_Name Last_Name Street Column1 Colum2 Colum_n
Mike       Smith      X     abc     ab     a
Mike       Smith      X     abc     ad     b
John       Smith      Y     xyz     xy     n
John       Smith      Y     xyz     xm     NA',
                  header = TRUE, stringsAsFactors = FALSE)

If you want to keep them as a vector, instead of converting them to a single character string, you can do

library(dplyr)

df %>% 
  group_by(First_Name,Last_Name, Street) %>%
  summarise_all(~list(unique(.[!is.na(.)]))) %>% 
  print.data.frame

#   First_Name Last_Name Street Column1 Colum2 Colum_n
# 1       John     Smith      Y     xyz xy, xm       n
# 2       Mike     Smith      X     abc ab, ad    a, b

or with data.table

library(data.table)
setDT(df)

df[, lapply(.SD, function(x) .(unique(x[!is.na(x)])))
   , by = .(First_Name,Last_Name, Street)]

#    First_Name Last_Name Street Column1 Colum2 Colum_n
# 1:       Mike     Smith      X     abc  ab,ad     a,b
# 2:       John     Smith      Y     xyz  xy,xm       n

Using tidyverse :

df %>%
 group_by(First_Name, Last_Name, Street) %>%
 summarise_all(funs(paste0(unique(.[!is.na(.)]), collapse= ",")))

  First_Name Last_Name Street Column1 Colum2 Colum_n
  <fct>      <fct>     <fct>  <chr>   <chr>  <chr>  
1 John       Smith     Y      xyz     xy,xm  n      
2 Mike       Smith     X      abc     ab,ad  a,b 

First, it is grouping by "First_Name", "Last_Name" and "Street". Then, it takes all the unique non-NA values and collapses them into one string.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM