Replace different values in one column, according to the row information in another column

I am working with genomic data, and I have a data frame whose first three rows are shown in the table below:

Chrom |   POS    |     ID      | REF | ALT | HapA | HapB |
----------------------------------------------------------
 22   | 16495833 | rs116911124 |  A  |  C  |   1  |  0   |
 22   | 19873357 | rs116378360 |  T  |  A  |   0  |  1   |
 22   | 21416404 | rs117982183 |  T  |  T  |   0  |  .   |

I would like to replace the values "0", "1" and "." in the "HapA" and "HapB" columns according to the REF and ALT columns, for every row in the data frame. For example:

a) in the first row, replace the "1" in the HapA column with the "C" from the ALT column, and the "0" in the HapB column with the "A" from the REF column;

b) in the second row, replace the "0" with the "T" from the REF column and the "1" with the "A" from the ALT column;

c) finally, replace every "." with NA.

I think that this could be achieved using "if else" or with data.table.

Thank you very much.

I think if_else(), recode(), or case_when() could all work for this. Here I use mutate_at() to apply the same function to both HapA and HapB. If a value in those columns is not "1", "0", or ".", the function returns it unchanged as a character string.

library(dplyr)

mutate_at(df, vars(HapA, HapB),
    function(x) {
      # Compare as strings: "." must be quoted, and a column containing "." is character.
      # df$ALT and df$REF are used because `.` is not defined outside a pipe.
      case_when(x == "1" ~ df$ALT,
                x == "0" ~ df$REF,
                x == "." ~ NA_character_,
                TRUE ~ as.character(x)) } )
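The same idea can be sketched with across(), which supersedes mutate_at() in recent dplyr releases. This assumes dplyr 1.0.0 or later and that HapA and HapB are stored as character columns. The example data frame is rebuilt here only so the snippet runs on its own, and the name `result` is just for illustration.

```r
library(dplyr)

# Example data mirroring the question. HapA and HapB are character
# because "." cannot be stored in a numeric column.
df <- data.frame(
  Chrom = c(22, 22, 22),
  POS   = c(16495833, 19873357, 21416404),
  ID    = c("rs116911124", "rs116378360", "rs117982183"),
  REF   = c("A", "T", "T"),
  ALT   = c("C", "A", "T"),
  HapA  = c("1", "0", "0"),
  HapB  = c("0", "1", "."),
  stringsAsFactors = FALSE
)

result <- df %>%
  mutate(across(
    c(HapA, HapB),
    ~ case_when(
      .x == "1" ~ ALT,            # 1 means the alternate allele
      .x == "0" ~ REF,            # 0 means the reference allele
      .x == "." ~ NA_character_,  # missing call
      TRUE      ~ as.character(.x) # anything else is kept as-is
    )
  ))

result
```

Inside the across() lambda, REF and ALT are looked up in the data frame being mutated, so the same code also works in the middle of a longer pipe.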

It's a bit unclear what you want exactly, since you don't specify what should happen to the 0 in the third row of the HapA column, but given what you said, this is a dplyr solution:

library(dplyr)

df <- read.table(text = "
'Chrom'     'POS'      'ID'       'REF'  'ALT' 'HapA' 'HapB'
22     16495833   'rs116911124'    'A'     'C'      1     0  
22     19873357   'rs116378360'    'T'     'A'      0     1  
22     21416404   'rs117982183'    'T'     'T'      0     .", header = TRUE, stringsAsFactors = FALSE)

df %>%
  mutate(HapA = ifelse(HapA == 1, ALT, ifelse(HapA == 0, REF, NA)),
         HapB = ifelse(HapB == 1, ALT, ifelse(HapB == 0, REF, NA)))

##   Chrom      POS          ID REF ALT HapA HapB
## 1    22 16495833 rs116911124   A   C    C    A
## 2    22 19873357 rs116378360   T   A    T    A
## 3    22 21416404 rs117982183   T   T    T <NA>

There wasn't really a question, but I'm going to guess what it was:

How can I replace the values of HapA and HapB following these rules:

  1. If "0" , then replace with the value of REF .
  2. If "1" , then replace with the value of ALT .
  3. If "." , then replace with NA .

Note that I'm also assuming HapA and HapB are character columns, since . can't be a numeric value.
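As a minimal sketch, the three rules can also be written as a small base R helper. The function name `recode_hap` and its arguments are hypothetical, chosen for illustration. One assumption that goes beyond the rules above: any value other than "0" or "1" (not only ".") ends up as NA.

```r
# Hypothetical helper: recode one haplotype column using REF and ALT.
# hap, ref and alt are vectors of the same length.
recode_hap <- function(hap, ref, alt) {
  hap <- as.character(hap)
  out <- rep(NA_character_, length(hap))  # ".", and anything unrecognised, stays NA
  is_ref <- which(hap == "0")             # which() drops NA comparisons safely
  is_alt <- which(hap == "1")
  out[is_ref] <- ref[is_ref]              # rule 1: "0" becomes REF
  out[is_alt] <- alt[is_alt]              # rule 2: "1" becomes ALT
  out
}

ref  <- c("A", "T", "T")
alt  <- c("C", "A", "T")
hapA <- recode_hap(c("1", "0", "0"), ref, alt)
hapB <- recode_hap(c("0", "1", "."), ref, alt)
hapA
hapB
```

Because the helper takes plain vectors, it can be called on data frame or data.table columns alike.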

If this is the right interpretation, there's no need to use fancy tricks. This is an "if-else" problem. Here's a solution using data.table , which I think is common in genomic analysis. First I'll create the example dataset:

library(data.table)

dt <- fread(
  header = TRUE,
  colClasses = c(
    Chrom = "character",
    POS   = "integer",
    ID    = "character",
    REF   = "character",
    ALT   = "character",
    HapA  = "character",
    HapB  = "character"
  ),
  input = "
Chrom  POS        ID               REF     ALT      HapA HapB
22     16495833   'rs116911124'    'A'     'C'      1     0  
22     19873357   'rs116378360'    'T'     'A'      0     1  
22     21416404   'rs117982183'    'T'     'T'      0     ."
)
dt
#    Chrom      POS            ID REF ALT HapA HapB
# 1:    22 16495833 'rs116911124' 'A' 'C'    1    0
# 2:    22 19873357 'rs116378360' 'T' 'A'    0    1
# 3:    22 21416404 'rs117982183' 'T' 'T'    0    .

That was the long part. Here's the short part.

dt[HapA == "0", HapA := REF]
dt[HapA == "1", HapA := ALT]
dt[HapA == ".", HapA := NA]
dt[HapB == "0", HapB := REF]
dt[HapB == "1", HapB := ALT]
dt[HapB == ".", HapB := NA]
dt
#    Chrom      POS            ID REF ALT HapA HapB
# 1:    22 16495833 'rs116911124' 'A' 'C'  'C'  'A'
# 2:    22 19873357 'rs116378360' 'T' 'A'  'T'  'A'
# 3:    22 21416404 'rs117982183' 'T' 'T'  'T'   NA

I strongly suggest writing this out in a simple way, like the above. It's short, has little repetition, and is easily understood at a glance. However, generalizing it to many columns would mean writing a lot of repetitive lines, so here's a loop version, run on a freshly created dt. Note that the loop only substitutes REF and ALT, so "." is left unchanged unless you also add the NA step:

replaced_columns <- c("HapA", "HapB")  # Switch these out for any
source_columns   <- c("REF", "ALT")    # number of columns

for (rr in replaced_columns) {
  for (source_i in seq_along(source_columns)) {
    target_rows <- which(dt[[rr]] == source_i - 1)
    dt[
      target_rows,
      (rr) := .SD,
      .SDcols = source_columns[source_i]
    ]
  }
}

dt
#    Chrom      POS            ID REF ALT HapA HapB
# 1:    22 16495833 'rs116911124' 'A' 'C'  'C'  'A'
# 2:    22 19873357 'rs116378360' 'T' 'A'  'T'  'A'
# 3:    22 21416404 'rs117982183' 'T' 'T'  'T'    .
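If the "." values should also become NA when several columns are handled at once, one possible alternative to the loop is data.table's fcase(). This is only a sketch: it assumes a data.table version that provides fcase() (1.13.0 or later, as far as I know), the example table is rebuilt without the literal quote characters, and `hap_cols` is a hypothetical name. Any value that matches neither "0" nor "1" becomes NA, since that is what fcase() returns when no condition matches.

```r
library(data.table)

# Example table mirroring the question, with all Hap columns as character
dt <- data.table(
  Chrom = c("22", "22", "22"),
  POS   = c(16495833L, 19873357L, 21416404L),
  ID    = c("rs116911124", "rs116378360", "rs117982183"),
  REF   = c("A", "T", "T"),
  ALT   = c("C", "A", "T"),
  HapA  = c("1", "0", "0"),
  HapB  = c("0", "1", ".")
)

hap_cols <- c("HapA", "HapB")  # hypothetical name; list any haplotype columns here

# Recode every listed column in one pass; REF and ALT are read from dt itself
dt[, (hap_cols) := lapply(.SD, function(x) {
  fcase(
    x == "0", REF,  # rule 1
    x == "1", ALT   # rule 2; no match (e.g. ".") gives NA
  )
}), .SDcols = hap_cols]

dt
```

Adding more columns then only requires extending `hap_cols`.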
