
Extract part of string that matches between two substrings

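I have three files, each containing a set of strings. File1 and File2 contain substrings of the strings in File3. I want to extract from each string in File3 the part that lies between the substring from File1 and the substring from File2. Please see the example below: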

File1(substring 1):

 head(fivep$V2)
[1] UGAGGUAGUAGUUUGUACAGUU  UGAGGUAGUAGUUUGUGCUGUU  ACAUACUUCUUUAUAUGCCCAUA UAGCAGCACAUCAUGGUUUACA 
[5] GGGUUCCUGGCAUGCUGAUUU   AGAGCUUAGCUGAUUGGUGAAC 

File2(substring 2):

 head(threep$V2)
[1] ACUGUACAGGCCACUGCCUUGC CUGCGCAAGCUACUGCCUUGCU UGGAAUGUAAAGAAGUAUGUAU CGAAUCAUUAUUUGCUGCUCUA
[5] AUCACAUUGCCAGGGAUUACC  UUCACAGUGGCUAAGUUCUGC 

File3:

head(hairpin$V2)
[1] UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAUUACCACCGGUGAACUAUGCAAUUUUCUACCUUACCGGAGACAGAACUCUUCGA
[2] AUGCUUCCGGCCUGUUCCCUGAGACCUCAAGUGUGAGUGUACUAUUGAUGCUUCACACCUGGGCUCUCCGGGUACCAGGACGGUUUGAGCAGAU     
[3] AAAGUGACCGUACCGAGCUGCAUACUUCCUUACAUGCCCAUACUAUAUCAUAAAUGGAUAUGGAAUGUAAAGAAGUAUGUAGAACGGGGUGGUAGU   
[4] UAAACAGUAUACAGAAAGCCAUCAAAGCGGUGGUUGAUGUGUUGCAAAUUAUGACUUUCAUAUCACAGCCAGCUUUGAUGUGCUGCCUGUUGCACUGU 
[5] CGGACAAUGCUCGAGAGGCAGUGUGGUUAGCUGGUUGCAUAUUUCCUUGACAACGGCUACCUUCACUGCCACCCCGAACAUGUCGUCCAUCUUUGAA  
[6] UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGCAGUGGCUCGAUCUUUUCC  

Example:

                                 String in File1                       String in  File2
                              AGGGCUUAGCUGCUUGUGAGCA                   UUCACAGUGGCUAAGUUCCGC
String in File3      CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG

Output for this example:

GGGUCCACACCAAGUCGUG

In Perl, you can try the following code:

use strict;
use warnings;

my $file1 = "AGGGCUUAGCUGCUUGUGAGCA";
my $file2 = "UUCACAGUGGCUAAGUUCCGC";
my $file3 = "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG";

my ($result) = $file3 =~ /$file1(.*?)$file2/;

print $result;

Output:

GGGUCCACACCAAGUCGUG

Here is a solution in R:

file1 <- "AGGGCUUAGCUGCUUGUGAGCA"
file2 <- "UUCACAGUGGCUAAGUUCCGC"
file3 <- "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG"

# create a regular expression
pattern <- paste0(".*", file1, "(.*)", file2, ".*")

# extract the substring
sub(pattern, "\\1", file3)
# [1] "GGGUCCACACCAAGUCGUG"

In Python:

>>> import re
>>> a='AGGGCUUAGCUGCUUGUGAGCA'
>>> b='UUCACAGUGGCUAAGUUCCGC'
>>> c='CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG'
>>> regex = a + '(.*?)' + b
>>> regex
'AGGGCUUAGCUGCUUGUGAGCA(.*?)UUCACAGUGGCUAAGUUCCGC'
>>> re.findall(regex,c)
['GGGUCCACACCAAGUCGUG']
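
The question has whole files of strings rather than a single triple, so here is a minimal sketch of applying the same lazy regex row by row. It assumes that the i-th entries of the three files belong together, which the question does not state explicitly. The helper name `extract_between` and the one-element lists are made up for illustration. `re.escape` is not needed for plain RNA sequences, but it keeps the pattern safe if a substring ever contains a regex metacharacter.

```python
import re

def extract_between(left, right, text):
    """Return the text between the first `left` and the next `right`, or None."""
    pattern = re.escape(left) + "(.*?)" + re.escape(right)
    match = re.search(pattern, text)
    return match.group(1) if match else None

# Hypothetical stand-ins for the V2 columns of the three files,
# filled with the single example from the question.
fivep = ["AGGGCUUAGCUGCUUGUGAGCA"]
threep = ["UUCACAGUGGCUAAGUUCCGC"]
hairpin = ["CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG"]

# Assumption: rows are aligned, so zip pairs the i-th entries together.
results = [extract_between(l, r, t) for l, r, t in zip(fivep, threep, hairpin)]
print(results)  # ['GGGUCCACACCAAGUCGUG']
```

Rows where either substring is missing come back as `None` instead of raising an error, so they are easy to filter out afterwards.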

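Try this using strapplyc in the gsubfn package. We assume there is only one instance each of s1 and s2, or, if there are multiple instances, that you want the string between the first instance of s1 and the last instance of s2. If there may be multiple instances and you want something different, please add that to the question.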

s1 <- "AGGGCUUAGCUGCUUGUGAGCA"
s2 <- "UUCACAGUGGCUAAGUUCCGC"
s3 <- "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG"

library(gsubfn)
fn$strapplyc(s3, "$s1(.*)$s2", simplify = TRUE)
##  [1] "GGGUCCACACCAAGUCGUG"
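
To make the "first instance of s1, last instance of s2" point concrete, here is a small Python sketch that contrasts a greedy `(.*)` with a lazy `(.*?)`. The toy string and the markers `AB` and `CD` are invented for illustration and are not from the question's data.

```python
import re

left, right = "AB", "CD"
text = "xxABmidCDyyABotherCDzz"  # two left/right pairs in one string

# Greedy: from the first `left` up to the LAST `right`.
greedy = re.search(re.escape(left) + "(.*)" + re.escape(right), text).group(1)

# Lazy: from the first `left` up to the NEXT `right`.
lazy = re.search(re.escape(left) + "(.*?)" + re.escape(right), text).group(1)

# Lazy with findall: every non-overlapping left/right pair.
all_lazy = re.findall(re.escape(left) + "(.*?)" + re.escape(right), text)

print(greedy)    # midCDyyABother
print(lazy)      # mid
print(all_lazy)  # ['mid', 'other']
```

With a single instance of each substring, as in the question's example, the greedy and lazy forms give the same result.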

In Python:

string1 = "AGGGCUUAGCUGCUUGUGAGCA"
string2 = "UUCACAGUGGCUAAGUUCCGC"
string_main = "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG"
# slice from the end of string1 to the start of string2
print(string_main[string_main.find(string1)+len(string1):string_main.find(string2)])
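
The slicing approach above assumes both substrings are present and that string2 occurs after string1. `str.find` returns -1 when nothing is found, which would silently produce a wrong slice. A slightly more defensive sketch follows; the function name `between` is hypothetical.

```python
def between(text, left, right):
    """Return the text between the first `left` and the next `right`, or None."""
    start = text.find(left)
    if start == -1:
        return None              # left boundary not present
    start += len(left)           # move past the left boundary
    end = text.find(right, start)  # only look for `right` after `left`
    if end == -1:
        return None              # right boundary not present after left
    return text[start:end]

string1 = "AGGGCUUAGCUGCUUGUGAGCA"
string2 = "UUCACAGUGGCUAAGUUCCGC"
string_main = "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG"

print(between(string_main, string1, string2))  # GGGUCCACACCAAGUCGUG
```

Passing `start` as the second argument to `find` also covers the case where string2 happens to appear somewhere before string1.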

Based on your given input, the following will work.

f1 <- "AGGGCUUAGCUGCUUGUGAGCA"
f2 <- "UUCACAGUGGCUAAGUUCCGC"
f3 <- "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG"
strsplit(f3, paste(f1, f2, sep='|'))[[1]][2]
# [1] "GGGUCCACACCAAGUCGUG"

Using qdapRegex in R:

f1 <- "AGGGCUUAGCUGCUUGUGAGCA"
f2 <- "UUCACAGUGGCUAAGUUCCGC"
f3 <- "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG"

library(qdapRegex)
rm_between(f3, f1, f2, extract=TRUE)

## [[1]]
## [1] "GGGUCCACACCAAGUCGUG"

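As the name suggests, rm_between removes or grabs the items between a left and a right boundary. Use extract = TRUE to get the strings between the boundaries. The returned value is a list because there may be multiple extractions per string. If that is undesirable, wrap the call in unlist: unlist(rm_between(f3, f1, f2, extract=TRUE))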
