Consolidating multiple entries per cell from 2 columns in R

I am trying to clean up some data in R, but I am struggling to get it done. Currently, I have multiple columns, some of which contain multiple values per cell. However, I only care about the names and their matching numbers.

Here's my data as of now:

ID  Name(s) Number(s) ...
#1  X, Y    123, 456
#2  Z       789
#3  Y, Z    456, 789
#4  W       0
...

What I want to achieve is a clean list of names matched with the corresponding number, like this:

Name  Number
W     0
X     123
Y     456
Z     789

The same number always corresponds to the same name; I simply don't have a clean version of this data. I would appreciate your help!

We can use separate_rows to split the comma-separated values into separate rows, drop the ID column, sort the rows with arrange_all, and keep only the unique rows with distinct.

library(dplyr)

df %>%
  tidyr::separate_rows(Name, Number, sep = ",") %>%
  select(-ID) %>%
  arrange_all() %>%
  distinct()

#  Name Number
#1    W      0
#2    X    123
#3    Y    456
#4    Z    789

data

df <- structure(list(ID = 1:4, Name = c("X,Y", "Z", "Y,Z", "W"), 
      Number = c("123,456", "789", "456,789", "0")), 
      class = "data.frame", row.names = c(NA, -4L))
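
The sample data in the question shows a space after each comma ("X, Y"), whereas the data above has none. If the real data contains those spaces, splitting on a bare comma would leave leading blanks in the values. A minimal sketch of one way to handle that, assuming the input looks like the hypothetical df_spaced below (the names df_spaced and lookup are illustrative, not from the original post):

```r
library(dplyr)
library(tidyr)

# Hypothetical input mirroring the question's layout: a space follows each comma
df_spaced <- data.frame(
  ID = 1:4,
  Name = c("X, Y", "Z", "Y, Z", "W"),
  Number = c("123, 456", "789", "456, 789", "0"),
  stringsAsFactors = FALSE
)

lookup <- df_spaced %>%
  # sep is a regular expression: a comma followed by optional whitespace;
  # convert = TRUE turns the split Number strings into numbers
  separate_rows(Name, Number, sep = ",\\s*", convert = TRUE) %>%
  select(Name, Number) %>%
  distinct() %>%
  arrange(Name)

print(lookup)
```

With convert = TRUE the Number column is numeric rather than character, which is usually what you want for a lookup table.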

We can use cSplit from splitstackshape to split both columns into 'long' format, then order the rows and keep the unique Name/Number pairs.

library(splitstackshape)
library(data.table)
unique(cSplit(df, c("Name", "Number"), ",", "long")[order(Name, Number),
     .(Name, Number)])
#   Name Number
#1:    W      0
#2:    X    123
#3:    Y    456
#4:    Z    789

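
If adding a package is not an option, the same idea (split each cell, flatten, deduplicate) can probably be sketched in base R as well. This is only an illustration under the assumption that each row has as many names as numbers; the object names name_parts, number_parts and lookup_base are made up for this example:

```r
df <- data.frame(
  ID = 1:4,
  Name = c("X,Y", "Z", "Y,Z", "W"),
  Number = c("123,456", "789", "456,789", "0"),
  stringsAsFactors = FALSE
)

# Split every cell on commas; each list element holds the pieces of one row
name_parts <- strsplit(df$Name, ",")
number_parts <- strsplit(df$Number, ",")

# Guard: every row must have as many names as numbers, otherwise pairs misalign
stopifnot(identical(lengths(name_parts), lengths(number_parts)))

# Flatten, trim stray spaces, and pair names with numbers position by position
lookup_base <- data.frame(
  Name = trimws(unlist(name_parts)),
  Number = as.numeric(trimws(unlist(number_parts))),
  stringsAsFactors = FALSE
)

# Keep unique pairs, sort by name, and reset the row names
lookup_base <- unique(lookup_base)
lookup_base <- lookup_base[order(lookup_base$Name), ]
rownames(lookup_base) <- NULL

print(lookup_base)
```

The stopifnot guard is there because pairing by position silently produces wrong matches if a row lists, say, two names but only one number.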
