
Reshaping dataframe in R for twin data

I have a data frame like the one shown below. Each pair of consecutive ID numbers (e.g. 2891, 2892) corresponds to a pair of twins.

    ID zyg.x CDsum
1 2891     2     0            
2 2892     2     5            
3 4000     1     0           
4 4001     1     0            
5 4006     2     0
6 4007     2     3

I would like to reshape this data frame so that each twin pair occupies a single row, like this. Note that the zyg.x (zygosity) value is the same for both twins in a pair.

           Twin Pair     zyg   CDsumTwin1   CDsumTwin2
1   pair1(2891,2892)       2            0            5
2   pair2(4000,4001)       1            0            0
3   pair3(4006,4007)       2            0            3

Any help would be much appreciated.

Data:

df <- read.table(text = "    ID zyg.x CDsum
1 2891     2     0            
2 2892     2     5            
3 4000     1     0           
4 4001     1     0            
5 4006     2     0
6 4007     2     3")

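Arrange by ID and create a variable "twin" to distinguish the two twins in each pair. This assumes that the two twins of a pair sit in adjacent rows once the data is sorted by ID:

library(dplyr)

df <- df %>%
  arrange(ID) %>%
  mutate(twin = rep(c(1, 2), length.out = n()))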

df
    ID zyg.x CDsum twin
1 2891     2     0    1
2 2892     2     5    2
3 4000     1     0    1
4 4001     1     0    2
5 4006     2     0    1
6 4007     2     3    2
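
Since the pairing above relies purely on row order, it may be worth checking it before reshaping. The following is only a sketch under that assumption: the helper columns `pair`, `n_twins`, `n_zyg` and `id_gap` are names made up here for illustration, and the check that IDs within a pair differ by 1 holds for the sample data but may not hold for a real dataset.

```r
library(dplyr)

df <- read.table(text = "    ID zyg.x CDsum
1 2891     2     0
2 2892     2     5
3 4000     1     0
4 4001     1     0
5 4006     2     0
6 4007     2     3")

# Assumption: after sorting by ID, rows 1-2 form a pair, rows 3-4 the next, etc.
checked <- df %>%
  arrange(ID) %>%
  mutate(pair = ceiling(row_number() / 2)) %>%
  group_by(pair) %>%
  summarise(n_twins = n(),                 # members per pair, should be 2
            n_zyg   = n_distinct(zyg.x),   # distinct zygosity values, should be 1
            id_gap  = max(ID) - min(ID),   # ID difference within the pair
            .groups = "drop")

# Stop early if a pair is incomplete or its twins disagree on zygosity
stopifnot(all(checked$n_twins == 2), all(checked$n_zyg == 1))
checked
```

If `stopifnot()` fails, or `id_gap` shows a value you do not expect, a twin is probably missing and the row-order pairing would shift every later pair.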

Split df into two data frames, one for twin 1 and one for twin 2:

twin1 <- df %>%
  filter(twin == 1) %>%
  select(-twin) %>%
  rename(CDsumTwin1 = CDsum, 
         ID1 = ID)

twin1
   ID1 zyg.x CDsumTwin1
1 2891     2          0
3 4000     1          0
5 4006     2          0

twin2 <- df %>%
  filter(twin == 2) %>%
  select(-twin) %>%
  rename(CDsumTwin2 = CDsum,
         ID2 = ID)

twin2
   ID2 zyg.x CDsumTwin2
2 2892     2          5
4 4001     1          0
6 4007     2          3

Bind the two data frames column-wise with cbind, build the pair label, and reorder the columns:

twin1 %>% cbind(twin2 %>% select(-zyg.x)) %>%
  mutate(`Twin Pair` = paste0("pair (", ID1, ", ", ID2, ")")) %>%
  select(`Twin Pair`, zyg.x, CDsumTwin1, CDsumTwin2)
    
          Twin Pair zyg.x CDsumTwin1 CDsumTwin2
1 pair (2891, 2892)     2          0          5
3 pair (4000, 4001)     1          0          0
5 pair (4006, 4007)     2          0          3
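
As a possible alternative to splitting and binding, the same steps could be done in one pass with `pivot_wider()` from tidyr. This is a sketch, not part of the answer above: it assumes the tidyr package is available, and the `pair` column and the `wide` object are illustrative names introduced here.

```r
library(dplyr)
library(tidyr)

df <- read.table(text = "    ID zyg.x CDsum
1 2891     2     0
2 2892     2     5
3 4000     1     0
4 4001     1     0
5 4006     2     0
6 4007     2     3")

wide <- df %>%
  arrange(ID) %>%
  # pair index (1, 1, 2, 2, ...) and twin index (1, 2, 1, 2, ...)
  mutate(pair = ceiling(row_number() / 2),
         twin = rep(c(1, 2), length.out = n())) %>%
  # one row per pair; ID and CDsum each become two columns
  pivot_wider(id_cols = c(pair, zyg.x),
              names_from = twin,
              values_from = c(ID, CDsum),
              names_glue = "{.value}Twin{twin}") %>%
  mutate(`Twin Pair` = paste0("pair (", IDTwin1, ", ", IDTwin2, ")")) %>%
  select(`Twin Pair`, zyg.x, CDsumTwin1, CDsumTwin2)

wide
```

The `names_glue` pattern is what produces the `CDsumTwin1` and `CDsumTwin2` column names directly, so no renaming step is needed afterwards.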

Here is how we could achieve this with dplyr only:

library(dplyr)
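df %>%
  # number the pairs: rows 1-2 get 1, rows 3-4 get 2, and so on
  mutate(rn = ceiling(row_number()/2)) %>%
  group_by(rn) %>%
  # build the label, e.g. "pair1(2891,2892)"
  mutate(Twin_Pair = paste0(ID, collapse = ","),
         Twin_Pair = paste0("pair", rn, "(", Twin_Pair, ")")) %>%
  # put each pair's two CDsum values side by side and drop CDsum itself
  mutate(CDsumTwin1 = first(CDsum),
         CDsumTwin2 = last(CDsum), .keep = "unused") %>%
  # both rows of a pair are now identical in the columns we keep; retain one
  slice(2) %>%
  ungroup() %>%
  select(Twin_Pair, zyg = zyg.x, CDsumTwin1, CDsumTwin2)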

Output:

  Twin_Pair          zyg CDsumTwin1 CDsumTwin2
  <chr>            <dbl>      <dbl>      <dbl>
1 pair1(2891,2892)     2          0          5
2 pair2(4000,4001)     1          0          0
3 pair3(4006,4007)     2          0          3
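
The same grouping idea can probably be written more compactly with `summarise()`, which collapses each pair to one row without the `slice()` step. This is a sketch under the same assumption as above (twins are adjacent after sorting by ID); the names `pair` and `wide` are illustrative and not from the original answer.

```r
library(dplyr)

df <- read.table(text = "    ID zyg.x CDsum
1 2891     2     0
2 2892     2     5
3 4000     1     0
4 4001     1     0
5 4006     2     0
6 4007     2     3")

wide <- df %>%
  arrange(ID) %>%
  # group rows 1-2, 3-4, 5-6 into pairs 1, 2, 3
  group_by(pair = ceiling(row_number() / 2)) %>%
  summarise(Twin_Pair  = paste0("pair", first(pair),
                                "(", paste(ID, collapse = ","), ")"),
            zyg        = first(zyg.x),   # same for both twins by assumption
            CDsumTwin1 = first(CDsum),
            CDsumTwin2 = last(CDsum),
            .groups = "drop") %>%
  select(-pair)

wide
```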
