Extract part of string that matches between two substrings
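I have three files, each containing a set of strings. File1 and File2 contain substrings of the strings in File3. For each string in File3, I want to extract the part that lies between the substring from File1 and the substring from File2. Please see the example below:
File1 (substring 1):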
head(fivep$V2)
[1] UGAGGUAGUAGUUUGUACAGUU UGAGGUAGUAGUUUGUGCUGUU ACAUACUUCUUUAUAUGCCCAUA UAGCAGCACAUCAUGGUUUACA
[5] GGGUUCCUGGCAUGCUGAUUU AGAGCUUAGCUGAUUGGUGAAC
File2 (substring 2):
head(threep$V2)
[1] ACUGUACAGGCCACUGCCUUGC CUGCGCAAGCUACUGCCUUGCU UGGAAUGUAAAGAAGUAUGUAU CGAAUCAUUAUUUGCUGCUCUA
[5] AUCACAUUGCCAGGGAUUACC UUCACAGUGGCUAAGUUCUGC
File3:
head(hairpin$V2)
[1] UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAUUACCACCGGUGAACUAUGCAAUUUUCUACCUUACCGGAGACAGAACUCUUCGA
[2] AUGCUUCCGGCCUGUUCCCUGAGACCUCAAGUGUGAGUGUACUAUUGAUGCUUCACACCUGGGCUCUCCGGGUACCAGGACGGUUUGAGCAGAU
[3] AAAGUGACCGUACCGAGCUGCAUACUUCCUUACAUGCCCAUACUAUAUCAUAAAUGGAUAUGGAAUGUAAAGAAGUAUGUAGAACGGGGUGGUAGU
[4] UAAACAGUAUACAGAAAGCCAUCAAAGCGGUGGUUGAUGUGUUGCAAAUUAUGACUUUCAUAUCACAGCCAGCUUUGAUGUGCUGCCUGUUGCACUGU
[5] CGGACAAUGCUCGAGAGGCAGUGUGGUUAGCUGGUUGCAUAUUUCCUUGACAACGGCUACCUUCACUGCCACCCCGAACAUGUCGUCCAUCUUUGAA
[6] UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGCAGUGGCUCGAUCUUUUCC
Example:
String in File1: AGGGCUUAGCUGCUUGUGAGCA
String in File2: UUCACAGUGGCUAAGUUCCGC
String in File3: CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG
Output for this example:
GGGUCCACACCAAGUCGUG
In Perl, you can try the following code:
use strict;
use warnings;
my $file1 = "AGGGCUUAGCUGCUUGUGAGCA";
my $file2 = "UUCACAGUGGCUAAGUUCCGC";
my $file3 = "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG";
my ($result) = $file3 =~ /$file1(.*?)$file2/;
print $result;
Output:
GGGUCCACACCAAGUCGUG
Here is a solution in R:
file1 <- "AGGGCUUAGCUGCUUGUGAGCA"
file2 <- "UUCACAGUGGCUAAGUUCCGC"
file3 <- "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG"
# create a regular expression
pattern <- paste0(".*", file1, "(.*)", file2, ".*")
# extract the substring
sub(pattern, "\\1", file3)
# [1] "GGGUCCACACCAAGUCGUG"
In Python:
>>> import re
>>> a='AGGGCUUAGCUGCUUGUGAGCA'
>>> b='UUCACAGUGGCUAAGUUCCGC'
>>> c='CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG'
>>> regex = a + '(.*?)' + b
>>> regex
'AGGGCUUAGCUGCUUGUGAGCA(.*?)UUCACAGUGGCUAAGUUCCGC'
>>> re.findall(regex,c)
['GGGUCCACACCAAGUCGUG']
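The question has many rows rather than a single triple, so the same pattern can be applied pairwise. Below is a minimal sketch, assuming the three columns have already been read into equal-length Python lists. The helper name `extract_between` and the toy second row are made up for illustration. `re.escape` makes the boundaries match literally; this makes no difference for plain RNA letters but is safer in general.

```python
import re

def extract_between(left, right, text):
    """Return the text between the first `left` and the next `right`, or None."""
    pattern = re.escape(left) + "(.*?)" + re.escape(right)
    match = re.search(pattern, text)
    return match.group(1) if match else None

# Hypothetical columns standing in for fivep$V2, threep$V2 and hairpin$V2.
fivep = ["AGGGCUUAGCUGCUUGUGAGCA", "AAA"]
threep = ["UUCACAGUGGCUAAGUUCCGC", "CCC"]
hairpin = [
    "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG",
    "GGGUUU",  # toy row with neither boundary present
]

# One result per row; rows without both boundaries give None.
results = [extract_between(a, b, c) for a, b, c in zip(fivep, threep, hairpin)]
print(results)  # ['GGGUCCACACCAAGUCGUG', None]
```

Returning `None` for rows without a match keeps the output aligned with the input rows, which `re.findall` on its own would not do.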
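Try strapplyc in the gsubfn package. We assume that there is only one instance each of s1 and s2 or, if there are multiple instances, that you want the string between the first instance of s1 and the last instance of s2. If there can be multiple instances and you want something different, please add that to the question.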
s1 <- "AGGGCUUAGCUGCUUGUGAGCA"
s2 <- "UUCACAGUGGCUAAGUUCCGC"
s3 <- "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG"
library(gsubfn)
fn$strapplyc(s3, "$s1(.*)$s2", simplify = TRUE)
## [1] "GGGUCCACACCAAGUCGUG"
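The "first instance of s1, last instance of s2" behaviour comes from the greedy `(.*)`. For comparison, here is a small Python sketch of how greedy and lazy capture groups differ when the right boundary occurs more than once. The strings `START`, `END` and `text` are toy values chosen for illustration, not data from the question.

```python
import re

# Toy string (hypothetical): the right boundary "END" occurs twice.
text = "xxSTARTaaaENDbbbENDyy"

# Lazy: stop at the first "END" after "START".
lazy = re.search("START(.*?)END", text).group(1)
# Greedy: extend to the last "END" in the string.
greedy = re.search("START(.*)END", text).group(1)

print(lazy)    # aaa
print(greedy)  # aaaENDbbb
```

When each boundary occurs exactly once, as in the example from the question, both forms return the same result.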
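In Python:
string1 = "AGGGCUUAGCUGCUUGUGAGCA"
string2 = "UUCACAGUGGCUAAGUUCCGC"
string_main = "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG"
# The gap starts just past the end of string1
start = string_main.find(string1) + len(string1)
# Search for string2 only after string1, so an earlier occurrence is ignored
print(string_main[start:string_main.find(string2, start)])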
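The slicing approach assumes both boundaries are present: `str.find` returns -1 when a substring is missing, which silently produces a wrong slice. Below is a minimal sketch of a regex-free variant that reports a missing boundary instead, using `str.partition`. The function name `between` is hypothetical.

```python
def between(text, left, right):
    """Text between the first `left` and the first `right` after it, or None."""
    # partition returns (before, separator, after); separator is "" if not found.
    _, found_left, rest = text.partition(left)
    if not found_left:
        return None
    middle, found_right, _ = rest.partition(right)
    return middle if found_right else None

string1 = "AGGGCUUAGCUGCUUGUGAGCA"
string2 = "UUCACAGUGGCUAAGUUCCGC"
string_main = "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG"

print(between(string_main, string1, string2))  # GGGUCCACACCAAGUCGUG
```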
Based on your given input, the following will work.
f1 <- "AGGGCUUAGCUGCUUGUGAGCA"
f2 <- "UUCACAGUGGCUAAGUUCCGC"
f3 <- "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG"
strsplit(f3, paste(f1, f2, sep='|'))[[1]][2]
# [1] "GGGUCCACACCAAGUCGUG"
Using qdapRegex in R:
f1 <- "AGGGCUUAGCUGCUUGUGAGCA"
f2 <- "UUCACAGUGGCUAAGUUCCGC"
f3 <- "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG"
library(qdapRegex)
rm_between(f3, f1, f2, extract=TRUE)
## [[1]]
## [1] "GGGUCCACACCAAGUCGUG"
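As the name suggests, rm_between removes or grabs the items between a left and a right boundary. Use extract = TRUE to get the strings between the boundaries. The returned value is a list because each string may yield multiple extractions. If that is undesirable, wrap the call in unlist: unlist(rm_between(f3, f1, f2, extract=TRUE)).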
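To illustrate why a single string can yield several extractions, here is a small Python sketch in which the boundary pair occurs twice. It uses `re.findall` rather than qdapRegex, and the boundaries and string are toy values made up for this example.

```python
import re

left, right = "AA", "CC"     # hypothetical boundaries
text = "AAgggCCuuAAtttCC"    # hypothetical string with two bounded regions

# Lazy group: each match stops at the nearest right boundary.
pattern = re.escape(left) + "(.*?)" + re.escape(right)
extractions = re.findall(pattern, text)
print(extractions)  # ['ggg', 'ttt']
```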