Complement a DNA sequence

Suppose I have a DNA sequence and want to get its complement. I tried the following code, but it does not work. What am I doing wrong?

s=readline()
ATCTCGGCGCGCATCGCGTACGCTACTAGC
p=unlist(strsplit(s,""))
h=rep("N",nchar(s))
unlist(lapply(p,function(d){
for b in (1:nchar(s)) {    
    if (p[b]=="A") h[b]="T"
    if (p[b]=="T") h[b]="A"
    if (p[b]=="G") h[b]="C"
    if (p[b]=="C") h[b]="G"
}

Use chartr which is built for this purpose:

> s
[1] "ATCTCGGCGCGCATCGCGTACGCTACTAGC"
> chartr("ATGC","TACG",s)
[1] "TAGAGCCGCGCGTAGCGCATGCGATGATCG"

Just give it two equal-length character strings (the characters to replace and their replacements) and your string. It is also vectorised over the string being translated:

> chartr("ATGC","TACG",c("AAAACG","TTTTT"))
[1] "TTTTGC" "AAAAA" 
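chartr leaves any character that is not listed in its first argument untouched, so ambiguous bases such as N or gap characters pass through unchanged. A small sketch (the input string here is made up for illustration):

```r
# Characters not named in the 'old' argument are copied through as-is.
x <- "ATNNGC-"                       # hypothetical sequence with N and a gap
comp_x <- chartr("ATGC", "TACG", x)  # only A, T, G and C are swapped
print(comp_x)
# [1] "TANNCG-"
```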

Note that I'm doing the replacement on the string representation of the DNA rather than on the vector of single characters. To convert the vector, I'd create a lookup map as a named vector and index into it:

> p
 [1] "A" "T" "C" "T" "C" "G" "G" "C" "G" "C" "G" "C" "A" "T" "C" "G" "C" "G" "T"
[20] "A" "C" "G" "C" "T" "A" "C" "T" "A" "G" "C"
> map=c("A"="T", "T"="A","G"="C","C"="G")
> unname(map[p])
 [1] "T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C" "A"
[20] "T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G"
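As for the loop in the question, it appears to fail for syntactic reasons: R's for needs the form for (b in ...), and the surrounding unlist(lapply(...)) call is never closed (nor is it needed, since the loop already visits every position). A minimal corrected sketch, keeping the question's variable names and assuming a hard-coded s in place of readline():

```r
s <- "ATCTCGGCGCGCATCGCGTACGCTACTAGC"
p <- unlist(strsplit(s, ""))   # split into single characters
h <- rep("N", length(p))       # anything unrecognised stays "N"

for (b in seq_along(p)) {
  if (p[b] == "A") h[b] <- "T"
  if (p[b] == "T") h[b] <- "A"
  if (p[b] == "G") h[b] <- "C"
  if (p[b] == "C") h[b] <- "G"
}

comp_loop <- paste(h, collapse = "")  # join back into one string
print(comp_loop)
# [1] "TAGAGCCGCGCGTAGCGCATGCGATGATCG"
```

The four tests read from p and write to h, so an earlier assignment never affects a later test.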

The Bioconductor package Biostrings has many useful functions for this sort of operation. Install once (current Bioconductor releases use BiocManager; the older biocLite script is no longer supported):

install.packages("BiocManager")
BiocManager::install("Biostrings")

then use

library(Biostrings)
dna = DNAStringSet(c("ATCTCGGCGCGCATCGCGTACGCTACTAGC", "ACCGCTA"))
complement(dna)

To complement, in both upper and lower case, you can use chartr() :

n <- "ACCTGccatGCATC"
chartr("acgtACGT", "tgcaTGCA", n)
# [1] "TGGACggtaCGTAG"

To take it a step further and reverse complement the nucleotide sequence, you can use the following function:

library(stringi)

rc <- function(nucSeq)
  return(stri_reverse(chartr("acgtACGT", "tgcaTGCA", nucSeq)))

rc("AcACGTgtT")
# [1] "AacACGTgT"

Another option for the character vector p is sapply with switch:

sapply(p, switch,  "A"="T", "T"="A","G"="C","C"="G")
  A   T   C   T   C   G   G   C   G   C   G   C   A   T   C   G   C   G   T 
"T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C" "A" 
  A   C   G   C   T   A   C   T   A   G   C 
"T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G" 

If you do not want the names, you can always strip them with unname:

unname(sapply(p, switch,  "A"="T", "T"="A","G"="C","C"="G") )
 [1] "T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C"
[19] "A" "T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G"
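One caveat with switch: for a character it does not match (say N) it returns NULL, so sapply can no longer simplify the result to a character vector. A possible workaround, sketched here with a hypothetical helper comp_base, is to add an unnamed default and use vapply to enforce the return type:

```r
# comp_base is a hypothetical helper; the trailing unnamed "N" is the
# default that switch returns when no named alternative matches.
comp_base <- function(b) {
  switch(b, "A" = "T", "T" = "A", "G" = "C", "C" = "G", "N")
}

bases <- c("A", "T", "N", "G")  # made-up input containing an ambiguous base
comp_bases <- unname(vapply(bases, comp_base, character(1)))
print(comp_bases)
# [1] "T" "A" "N" "C"
```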

There is also the seqinr package, whose comp() works on a vector of single characters:

library(seqinr)
comp(seq) # gives complement
rev(comp(seq)) # gives the reverse complement

Biostrings has a much smaller memory footprint, but seqinr is also nice because you can choose the case of the bases (including mixed case) and change them to anything you want, for example a mix of T and U in the same sequence. Biostrings forces you to have either T or U.

Here is an answer using base R. The formatting is deliberately spread out to make the nesting clear while keeping it a single expression. It supports upper and lower case.

revc = function(s){
       paste0(
           rev(
            unlist(
             strsplit(
                chartr("ATGCatgc","TACGtacg",s)
                      , "")                        # from strsplit
                   )                               # from unlist
               )                                   # from rev
             , collapse='')                        # from paste0
       }
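Because of the unlist call, revc as written treats its input as one string; passing a vector of several sequences would merge them into a single result. A possible vectorised variant, still base R only (revc_vec is a hypothetical name, not part of the original answer):

```r
# Reverse-complement each element of a character vector separately.
revc_vec <- function(s) {
  comp <- chartr("ATGCatgc", "TACGtacg", s)  # complement, case preserved
  vapply(strsplit(comp, ""),                 # one character vector per input
         function(chars) paste(rev(chars), collapse = ""),
         character(1))
}

revc_vec(c("TATAAT", "TTTCGC", "atgcat"))
# [1] "ATTATA" "GCGAAA" "atgcat"
```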

I've generalised the solution rev(comp(seq)) with the seqinr package:

install.packages("devtools")
devtools::install_github("TomKellyGenetics/tktools")
tktools::revcomp(seq)

This version is compatible with string inputs and is vectorised to handle list or vector input of multiple strings. The output class should match the input, including case and type. It also supports inputs containing "U" for RNA, and RNA output sequences.

> seq <- "ATCTCGGCGCGCATCGCGTACGCTACTAGC"
> revcomp(seq)
[1] "GCTAGTAGCGTACGCGATGCGCGCCGAGAT"

> seq <- c("TATAAT", "TTTCGC", "atgcat")
> revcomp(seq)
  TATAAT   TTTCGC   atgcat 
 "ATTATA" "GCGAAA" "atgcat" 

See the manual or the TomKellyGenetics/tktools github package repository.
