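Excluding specific strings (DNA strings) from a background (DNA sequences) and shuffling (i.e. generating a negative set from positive DNA sequences)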

Question

I have a fasta file containing DNA strings. I want to generate a negative dataset from my positive data. One way is to exclude some specific sequences from my data and then shuffle the data.
Let's say my dataset is a list:

1)
DNAlst:
ACTATACGCTAATATCGATCTACGTACGATCG
CAGCAGCAGCGAGACTATCCTACCGCA
ATATCGATCGCAAAAATCG

I want to exclude these sequences:

ATAT,CGCA

so the result would be:

ACTATACGCTACGATCTACGTACGATCG
CAGCAGCAGCGAGACTATCCTAC
CGATAAAATCG

2) Then I want to shuffle each sequence in blocks of a specific length (e.g. 5). That is, cut the DNA string into parts of length 5 (5-mers) and shuffle the order of the parts. For example:

ATATACGCGAAAAAATCTCTC => result after shuffle by 5 ==> AAAAACTCTCCGCAATATA

I would be grateful if you could tell me how to do this in R.

Answer 1

Use the stringi package:

library(stringi)

dna <- c("ACTATACGCTAATATCGATCTACGTACGATCG","CAGCAGCAGCGAGACTATCCTACCGCA","ATATCGATCGCAAAAATCG")

# stri_replace_all_regex replaces every match of ATAT or CGCA with an empty string
stri_replace_all_regex(dna, "ATAT|CGCA","")
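
If stringi is not available, the same removal can be sketched in base R with gsub. This is only an illustration, not part of the original answer: remove_motifs and its repeat_until_clean argument are hypothetical names. The optional loop is there because deleting a motif can join its two flanks into a new match, which a single pass would leave behind.

```r
# Base R sketch of the motif-removal step (no stringi needed).
dna <- c("ACTATACGCTAATATCGATCTACGTACGATCG",
         "CAGCAGCAGCGAGACTATCCTACCGCA",
         "ATATCGATCGCAAAAATCG")
motifs <- c("ATAT", "CGCA")

# remove_motifs is a hypothetical helper, not a function from any package.
# The motifs are assumed to be plain letters, so they are safe to use
# directly inside a regular expression.
remove_motifs <- function(x, motifs, repeat_until_clean = FALSE) {
  pattern <- paste(motifs, collapse = "|")
  out <- gsub(pattern, "", x)
  # Each extra pass strictly shortens any string that still matches,
  # so this loop always terminates.
  while (repeat_until_clean && any(grepl(pattern, out))) {
    out <- gsub(pattern, "", out)
  }
  out
}

cleaned <- remove_motifs(dna, motifs)
print(cleaned)
print(nchar(dna) - nchar(cleaned))  # characters removed per sequence
```

For the three example sequences a single pass is enough: none of the cleaned strings contains ATAT or CGCA afterwards.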

Now for the shuffle part. The seq and stri_sub functions will be useful. First we need to cut the DNA sequence into pieces at most 5 characters long. The seq function gives us the starting positions:

seq(1,24,5)
## [1]  1  6 11 16 21
seq(1,27,5)
## [1]  1  6 11 16 21 26

Use stri_sub to extract substrings of length 5 starting at the positions generated by seq:

y <- stri_sub(dna[1], seq(from=1,to=stri_length(dna[1]),by=5), length = 5)
y
## [1] "ACTAT" "ACGCT" "AATAT" "CGATC" "TACGT" "ACGAT" "CG"
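
To apply the same cut to every sequence rather than only dna[1], the step can be wrapped in a small function. The sketch below uses base R's substring instead of stri_sub; split_kmers is a hypothetical name chosen for illustration, and the guard for empty strings is an added assumption (seq(1, 0, by = 5) would otherwise raise an error).

```r
# Base R sketch: cut a string into consecutive pieces of at most k characters.
# split_kmers is a hypothetical helper name, not part of stringi or base R.
split_kmers <- function(s, k = 5) {
  n <- nchar(s)
  if (n == 0) return(character(0))   # nothing to cut
  starts <- seq(1, n, by = k)        # 1, 1 + k, 1 + 2k, ...
  ends <- pmin(starts + k - 1, n)    # last piece may be shorter than k
  substring(s, starts, ends)
}

dna <- c("ACTATACGCTAATATCGATCTACGTACGATCG",
         "CAGCAGCAGCGAGACTATCCTACCGCA",
         "ATATCGATCGCAAAAATCG")

# One character vector of pieces per input sequence
chunks <- lapply(dna, split_kmers, k = 5)
print(chunks[[1]])
```

Pasting the pieces of any one sequence back together in order gives the original string, which is a quick way to check that nothing was lost in the cut.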

sample shuffles the vector, and stri_flatten pastes the pieces back together into one string. The order is random, so your output will differ from the one shown here.

stri_flatten(y[sample(length(y))])
## [1] "TACGTACGATCGATCAATATACGCTACTATCG"
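
Putting both steps together for the whole dataset might look like the sketch below. It is written in base R so it runs without extra packages; shuffle_kmers and make_negative are hypothetical helper names, and the call to set.seed is only there so that a run can be repeated.

```r
# Sketch of the full pipeline: remove motifs, then shuffle in blocks of k.
dna <- c("ACTATACGCTAATATCGATCTACGTACGATCG",
         "CAGCAGCAGCGAGACTATCCTACCGCA",
         "ATATCGATCGCAAAAATCG")

# Hypothetical helper: shuffle one string in blocks of at most k characters.
shuffle_kmers <- function(s, k = 5) {
  n <- nchar(s)
  if (n <= k) return(s)              # zero or one block: nothing to shuffle
  starts <- seq(1, n, by = k)
  pieces <- substring(s, starts, pmin(starts + k - 1, n))
  paste(pieces[sample(length(pieces))], collapse = "")
}

# Hypothetical helper: apply both steps to every sequence in x.
make_negative <- function(x, motifs, k = 5) {
  cleaned <- gsub(paste(motifs, collapse = "|"), "", x)
  vapply(cleaned, shuffle_kmers, character(1), k = k, USE.NAMES = FALSE)
}

set.seed(42)                         # only for reproducibility of this example
negatives <- make_negative(dna, c("ATAT", "CGCA"), k = 5)
print(negatives)
```

Two caveats with this approach: a random permutation can occasionally return the blocks in their original order, and joining shuffled blocks can recreate an excluded motif across a block boundary. If either matters for the negative set, check the output and reshuffle when needed.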

Answer 1 (accepted, 2014-12-23 14:36:17)