Extract substring pattern

I have a long file with around 1200 sequences, like these:

>3fm8|A|A0JLQ2
CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTP
QKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP


>2ht9|A|A0JLT0
LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDA
LYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCYL

I want to extract every possible pattern that has a cysteine (C) in the middle, with five characters before it and five characters after it, such as xxxxxCxxxxx.

The output should look like this:

  • QDIQLCGMGIL
  • ILPEHCIIDIT
  • TISDNCVVIFS
  • FSKTSCSYCTM

This is my program, but it only gives the position of C. It does not work the way I want:

pos=[]

def find(ch,string1):

    for i in range(len(string1)):
        if ch == string1[i]:
            pos.append(i)
            return pos



z=find('C','AWERQRTCWERTYCTAAAACTTCTTT')

print z

You need to return outside the loop. You are returning on the first match, so you only ever get a single index in your list:

def find(ch,string1):  
    pos = []
    for i in range(len(string1)):
        if ch == string1[i]:
            pos.append(i)
    return pos # outside

You can also use enumerate with a list comprehension in place of your range logic:

def indexes(ch, s1):
    # keep only indexes that leave room for five chars on each side
    return [index for index, char in enumerate(s1) if char == ch and 5 <= index <= len(s1) - 6]

Each index in the list comprehension is the character's position and each char is the actual character, so we keep each index where char is equal to ch.

If you want the five characters on both sides:

In [24]: s="CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTP QKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP"

In [25]: inds = indexes("C",s)

In [26]: [s[i-5:i+6] for i in inds]
Out[26]: ['QDIQLCGMGIL', 'ILPEHCIIDIT']

I added a check on the index because we obviously cannot get five characters before C if the index is < 5, and likewise cannot get five characters after it when C is too close to the end.

You can do it all in a single function, yielding a slice each time you find a match:

def find(ch, s):
    ln = len(s)
    for i, char in enumerate(s):
        if ch == char and 5 <= i <= ln - 6:
            yield s[i - 5:i + 6]

Presuming the data in your question is actually two lines from your file, like:

s = """>3fm8|A|A0JLQ2CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTPQKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP
>2ht9|A|A0JLT0LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDALYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCY"""

Running:

for line in s.splitlines():
    print(list(find("C" ,line)))

would output:

['0JLQ2CFLVNL', 'QDIQLCGMGIL', 'ILPEHCIIDIT']
['TISDNCVVIFS', 'FSKTSCSYCTM', 'TSCSYCTMAKK']

That gives six matches, not the four your expected output suggests, so I presume you did not include all possible matches.
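Note that the first match above, '0JLQ2CFLVNL', borrows characters from the header text. If the file is really laid out as in the question (a ">" header line followed by one or more sequence lines), a rough sketch like the following keeps headers and sequences apart. The names read_fasta and windows are hypothetical helpers, and the layout is an assumption based on the sample in the question:

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-style lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []  # start a new record
        else:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def windows(ch, s, flank=5):
    """Yield each slice of s with ch in the middle and flank chars either side."""
    for i, char in enumerate(s):
        if char == ch and flank <= i <= len(s) - flank - 1:
            yield s[i - flank:i + flank + 1]


# Sample data copied from the question.
sample = """>3fm8|A|A0JLQ2
CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTP
QKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP

>2ht9|A|A0JLT0
LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDA
LYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCYL
"""

for header, seq in read_fasta(sample.splitlines()):
    print(header, list(windows("C", seq)))
```

For a real file you could pass the open file object instead of sample.splitlines(). With headers excluded, the sample yields five matches: the four from the question plus 'TSCSYCTMAKK'.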

You can also speed up the code using str.find, starting at the last match index + 1 for each subsequent match:

def find(ch, s):
    # start searching at index 5 so an early C (index < 5) does not end the loop
    ln, i = len(s) - 6, s.find(ch, 5)
    while 5 <= i <= ln:
        yield s[i - 5:i + 6]
        i = s.find(ch, i + 1)

This gives the same output. Of course, if the matches cannot overlap, you can start looking for the next match much further along in the string each time.
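As a sketch of that non-overlapping variant (find_non_overlapping and the flank parameter are names made up here, and it assumes "cannot overlap" means two reported windows may not share any character): once a window centred at i is reported, the next centre must be at least i + 2 * flank + 1.

```python
def find_non_overlapping(ch, s, flank=5):
    """Yield windows around ch that do not share characters with each other."""
    last = len(s) - flank - 1          # last index that still has flank chars after it
    i = s.find(ch, flank)              # first candidate with flank chars before it
    while flank <= i <= last:
        yield s[i - flank:i + flank + 1]
        # The reported window ends at i + flank, so the next window must
        # start at i + flank + 1 and its centre is at least i + 2*flank + 1.
        i = s.find(ch, i + 2 * flank + 1)


print(list(find_non_overlapping("C", "ETISDNCVVIFSKTSCSYCTMAKK")))
```

This takes matches greedily from the left, so in the fragment above the C in FSKTSC is skipped because its window would overlap the one already reported for TISDNC.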

My solution is based on regex: it finds all possible matches, including overlapping ones, by running the regex repeatedly in a while loop. Thanks to @Smac89 for improving it by transforming it into a generator:

import re

string = """CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTPQKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP

LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDA LYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCYL"""

# Generator
def find_cysteine2(string):

    # Create a loop that will utilize regex multiple times
    # in order to capture matches within groups
    while True:
        # Find a match
        data = re.search(r'(\w{5}C\w{5})',string)

        # If match exists, let's collect the data
        if data:
            # Collect the string
            yield data.group(1)

            # Shrink the string to not include 
            # the previous result
            location = data.start() + 1
            string = string[location:]

        # If there are no matches, stop the loop
        else:
            break

print(list(find_cysteine2(string)))
# ['QDIQLCGMGIL', 'ILPEHCIIDIT', 'TISDNCVVIFS', 'FSKTSCSYCTM', 'TSCSYCTMAKK']
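As an alternative sketch (not part of the original answer), overlapping matches can also be collected in one pass by wrapping the pattern in a lookahead, so each match consumes no characters and re.finditer tries every position. The function name find_cysteine_lookahead is hypothetical, and [A-Z] is an assumption that sequences are upper-case letters only (the original uses \w, which also accepts digits and underscores):

```python
import re

# Zero-width lookahead: the 11-character window is captured in group 1
# without being consumed, so overlapping windows are all reported.
PATTERN = re.compile(r"(?=([A-Z]{5}C[A-Z]{5}))")


def find_cysteine_lookahead(text):
    """Return every 11-char window with C in the middle, overlaps included."""
    return [m.group(1) for m in PATTERN.finditer(text)]


text = (
    "CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTP"
    "QKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP\n"
    "LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDA"
    "LYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCYL"
)

print(find_cysteine_lookahead(text))
```

This avoids re-slicing the string after every match. Because a newline is not matched by [A-Z], windows never span two sequences.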
