简体   繁体   中英

Rearrange data by matching columns

I am having issue with rearranging some data.

The original data is:

structure(list(id = 1:3, artery.1 = structure(c(1L, 1L, 2L), .Label = c("a", 
"b"), class = "factor"), artery.2 = structure(c(1L, NA, 2L), .Label = c("b", 
"c"), class = "factor"), artery.3 = structure(c(1L, NA, 2L), .Label = c("c", 
"d"), class = "factor"), artery.4 = structure(c(NA, NA, 1L), .Label = "e", class = "factor"), artery.5 = structure(c(NA, NA, 1L), .Label = "f", class = "factor"), 
diameter.1 = c(3L, 2L, 1L), diameter.2 = c(2L, NA, 2L), diameter.3 = c(3L, 
NA, 3L), diameter.4 = c(NA, NA, 4L), diameter.5 = c(NA, NA, 
5L)), .Names = c("id", "artery.1", "artery.2", "artery.3", 
"artery.4", "artery.5", "diameter.1", "diameter.2", "diameter.3", 
"diameter.4", "diameter.5"), class = "data.frame", row.names = c(NA, 
-3L))

#   id artery.1 artery.2 artery.3 artery.4 artery.5 diameter.1 diameter.2 diameter.3 diameter.4 diameter.5
# 1  1        a        b        c     <NA>     <NA>          3          2          3         NA         NA
# 2  2        a     <NA>     <NA>     <NA>     <NA>          2         NA         NA         NA         NA
# 3  3        b        c        d        e        f          1          2          3          4          5

I would like to get to this:

structure(list(id = 1:3, a = c(3L, 2L, NA), b = c(2L, NA, 1L), 
c = c(3L, NA, 2L), d = c(NA, NA, 3L), e = c(NA, NA, 4L), 
f = c(NA, NA, 5L)), .Names = c("id", "a", "b", "c", "d", 
"e", "f"), class = "data.frame", row.names = c(NA, -3L))

#   id  a  b  c  d  e  f
# 1  1  3  2  3 NA NA NA
# 2  2  2 NA NA NA NA NA
# 3  3 NA  1  2  3  4  5

Basically, a to f represents arteries and the numerical values represent the corresponding diameter. Each row represents a patient.

Is there a neat way to sort this dataframe out?

Modern tidyr makes the solution even more succinct via the pivot_ functions:

library(dplyr)
library(tidyr)

df %>% 
  pivot_longer(-id, names_pattern = '(artery|diameter)\\.(\\d+)', names_to = c('.value', NA)) %>% 
  filter(!is.na(artery)) %>% 
  pivot_wider(names_from = artery, values_from = diameter)

     id     a     b     c     d     e     f
  <int> <int> <int> <int> <int> <int> <int>
1     1     3     2     3    NA    NA    NA
2     2     2    NA    NA    NA    NA    NA
3     3    NA     1     2     3     4     5

Here is the older solution, which uses the deprecated gather and spread functions:

library(dplyr)
library(tidyr)

new.df <- gather(df, variable, value, artery.1:diameter.5) %>% 
    separate(variable, c('variable', 'num')) %>% 
    spread(variable, value) %>% 
    subset(!is.na(artery)) %>%
    mutate(diameter = as.numeric(diameter)) %>% 
    select(-num) %>% 
    spread(artery, diameter)

Output:

  id  a  b  c  d  e  f
1  1  3  2  3 NA NA NA
2  2  2 NA NA NA NA NA
3  3 NA  1  2  3  4  5

Or using melt / dcast combination with data.table while selecting variables using regex in the patterns function

library(data.table) #v>=1.9.6
dcast(melt(setDT(df), 
           id = "id", 
           measure = patterns("artery", "diameter")),
      id ~ value1, 
      sum, 
      value.var = "value2", 
      subset = .(!is.na(value2)), 
      fill = NA)
#    id  a  b  c  d  e  f
# 1:  1  3  2  3 NA NA NA
# 2:  2  2 NA NA NA NA NA
# 3:  3 NA  1  2  3  4  5

As you can see, both melt and dcast are very flexible and you can use regex, specify a subset, pass multiple functions and specify how you want to fill missing values.

You can use xtabs with reshape from base R. Use the latter to transform data to long format and use the former to get the count table:

xtabs(diameter ~ id + artery, reshape(df, varying = 2:11, sep = '.', dir = "long"))

#   artery
#id  a b c d e f
#  1 3 2 3 0 0 0
#  2 2 0 0 0 0 0
#  3 0 1 2 3 4 5

This can be done with two reshape() calls. First, we can longify both artery and diameter on id , then widen with artery as the time variable. To prevent a column of NAs, we also must subset out rows with NA values for artery in the intermediate frame.

reshape(subset(reshape(df,dir='l',varying=setdiff(names(df),'id'),timevar=NULL),!is.na(artery)),dir='w',timevar='artery');
##     id diameter.a diameter.b diameter.c diameter.d diameter.e diameter.f
## 1.1  1          3          2          3         NA         NA         NA
## 2.1  2          2         NA         NA         NA         NA         NA
## 3.1  3         NA          1          2          3          4          5

The diameter. prefixes can be removed afterward, if desired. However, an advantage of this solution is that it would be capable of preserving multiple column sets, whereas the xtabs() solution cannot. The prefixes would be essential to distinguish the column sets in that case.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM