Complement a DNA sequence

Suppose I have a DNA sequence and want to get its complement. I used the following code, but it does not work. What am I doing wrong?

s=readline()
ATCTCGGCGCGCATCGCGTACGCTACTAGC
p=unlist(strsplit(s,""))
h=rep("N",nchar(s))
unlist(lapply(p,function(d){
for b in (1:nchar(s)) {    
    if (p[b]=="A") h[b]="T"
    if (p[b]=="T") h[b]="A"
    if (p[b]=="G") h[b]="C"
    if (p[b]=="C") h[b]="G"
}
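
For reference, the loop in the question can be made to run with a few changes. The sketch below is only one reading of what the code was meant to do. It hard-codes the sequence instead of calling readline() (an assumption, so that the snippet is self-contained), writes the loop as for (b in ...) with the parentheses R requires, and drops the unclosed lapply() wrapper, because assignments to h inside the anonymous function would only change a local copy.

```r
# Assumed input: the sequence from the question, hard-coded instead of readline()
s <- "ATCTCGGCGCGCATCGCGTACGCTACTAGC"
p <- unlist(strsplit(s, ""))   # one element per base
h <- rep("N", nchar(s))        # placeholder result, same length as the input

# R's for loop needs parentheses around "b in ..."
for (b in seq_along(p)) {
  if (p[b] == "A") h[b] <- "T"
  if (p[b] == "T") h[b] <- "A"
  if (p[b] == "G") h[b] <- "C"
  if (p[b] == "C") h[b] <- "G"
}

# Join the per-base result back into a single string
h_str <- paste(h, collapse = "")
print(h_str)
```

The explicit loop works, but it does by hand what the vectorised approaches in the answers do in a single call.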

Use chartr, which is built for this purpose:

> s
[1] "ATCTCGGCGCGCATCGCGTACGCTACTAGC"
> chartr("ATGC","TACG",s)
[1] "TAGAGCCGCGCGTAGCGCATGCGATGATCG"

Just give it two equal-length character strings and your string. It is also vectorised over the argument being translated:

> chartr("ATGC","TACG",c("AAAACG","TTTTT"))
[1] "TTTTGC" "AAAAA" 

Note that I'm doing the replacement on the string representation of the DNA rather than on the vector. To convert the vector, I'd create a lookup map as a named vector and index into it:

> p
 [1] "A" "T" "C" "T" "C" "G" "G" "C" "G" "C" "G" "C" "A" "T" "C" "G" "C" "G" "T"
[20] "A" "C" "G" "C" "T" "A" "C" "T" "A" "G" "C"
> map=c("A"="T", "T"="A","G"="C","C"="G")
> unname(map[p])
 [1] "T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C" "A"
[20] "T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G"
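
One difference between the two approaches is worth seeing in a small example: chartr leaves characters outside its translation table unchanged, whereas indexing a named vector with a name it does not contain yields NA. The sketch below uses "N" as an example of an unmapped symbol and falls back to the original character; the variable names looked_up and safe are illustrative only.

```r
map <- c(A = "T", T = "A", G = "C", C = "G")
p <- c("A", "T", "N", "G", "C")   # "N" is an example symbol with no entry in map

looked_up <- unname(map[p])       # the unmapped "N" becomes NA here
# Keep the original symbol wherever the lookup found nothing
safe <- ifelse(is.na(looked_up), p, looked_up)
print(safe)

# chartr, by contrast, passes unmapped characters through untouched
via_chartr <- chartr("ATGC", "TACG", "ATNGC")
print(via_chartr)
```

Whether an unknown symbol should pass through, become NA, or raise an error depends on the data, so treat the fallback as a design choice rather than a rule.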

The Bioconductor package Biostrings has many useful functions for this sort of operation. Install once:

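# biocLite() has been retired; current Bioconductor releases install via BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Biostrings")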

then use

library(Biostrings)
dna = DNAStringSet(c("ATCTCGGCGCGCATCGCGTACGCTACTAGC", "ACCGCTA"))
complement(dna)

To complement a sequence in both upper and lower case, you can use chartr():

n <- "ACCTGccatGCATC"
chartr("acgtACGT", "tgcaTGCA", n)
# [1] "TGGACggtaCGTAG"

To take it a step further and reverse complement the nucleotide sequence, you can use the following function:

library(stringi)

rc <- function(nucSeq)
  return(stri_reverse(chartr("acgtACGT", "tgcaTGCA", nucSeq)))

rc("AcACGTgtT")
# [1] "AacACGTgT"

Another option is to apply switch to each base with sapply:

sapply(p, switch,  "A"="T", "T"="A","G"="C","C"="G")
  A   T   C   T   C   G   G   C   G   C   G   C   A   T   C   G   C   G   T 
"T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C" "A" 
  A   C   G   C   T   A   C   T   A   G   C 
"T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G" 

If you do not want the names on the result, you can always strip them with unname.

unname(sapply(p, switch,  "A"="T", "T"="A","G"="C","C"="G") )
 [1] "T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C"
[19] "A" "T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G"

There is also the seqinr package:

library(seqinr)
comp(seq) # gives complement
rev(comp(seq)) # gives the reverse complement

Biostrings has a much smaller memory profile, but seqinr is also nice because you can choose the case of the bases (including mixed case) and change them to anything you want, for example if you want a mix of T and U in the same sequence. Biostrings forces you to have either T or U.

Here is an answer using base R. It is written with horrible formatting to make things clear while keeping it a one-liner. It supports both upper and lower case.

revc = function(s){
       paste0(
           rev(
            unlist(
             strsplit(
                chartr("ATGCatgc","TACGtacg",s)
                      , "")                        # from strsplit
                   )                               # from unlist
               )                                   # from rev
             , collapse='')                        # from paste0
       }
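
Note that the one-liner above is meant for a single string: given a vector of several sequences, unlist() merges all their characters before reversing, so the result collapses into one string. A minimal vectorised sketch in base R, assuming you want one reverse complement per input element, could look like this (revc_vec is a hypothetical name, not part of the answer above):

```r
# Reverse complement each element of a character vector separately
revc_vec <- function(s) {
  vapply(
    s,
    function(x) {
      comp <- chartr("ATGCatgc", "TACGtacg", x)      # complement, case preserved
      paste(rev(strsplit(comp, "")[[1]]), collapse = "")  # reverse the characters
    },
    character(1),
    USE.NAMES = FALSE
  )
}

res <- revc_vec(c("AcACGTgtT", "ATGC"))
print(res)
```

vapply is used here instead of sapply so that the return type is fixed to one string per input.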

I've generalised the rev(comp(seq)) solution based on the seqinr package:

install.packages("devtools")
devtools::install_github("TomKellyGenetics/tktools")
tktools::revcomp(seq)

This version is compatible with string inputs and is vectorised to handle list or vector input of multiple strings. The output class should match the input, including case and type. It also supports inputs containing "U" for RNA, and RNA output sequences.

> seq <- "ATCTCGGCGCGCATCGCGTACGCTACTAGC"
> revcomp(seq)
[1] "GCTAGTAGCGTACGCGATGCGCGCCGAGAT"

> seq <- c("TATAAT", "TTTCGC", "atgcat")
> revcomp(seq)
  TATAAT   TTTCGC   atgcat 
 "ATTATA" "GCGAAA" "atgcat" 

See the manual or the TomKellyGenetics/tktools GitHub package repository.

Note: technical posts on this site are licensed under CC BY-SA 4.0. If you repost them, please credit this site's URL or the original address.
