How to divide a string into substrings in Python 3

I need help finding a more effective way of separating a string into all possible substrings. The program needs to take a desired string sequence (e.g. GTCCAGCTTAAG) with a maximum length of 12 and a minimum of 6. The string needs to be separated by decreasing length (so ideally it would stop at 6). All the reversed counterparts need to be included as well (e.g. for the 10-character length it would show GTCCAGCTTA and CAGGTCGAAT), presumably by a loop. Finally, everything would be put together in a list.

This is what I have come up with so far after trying many different, yet unsuccessful, combinations. This one gets me the closest, but it is still a mess:

tar = "GTCCAGCTTAAG"

def subs(tar):
    substring = []
    for i in range(len(tar)):
        string_portion = tar[:i + 1]  
        string_portion1 = tar[i:]     
        substring.append(string_portion)
        print(substring)
    return 

subs(tar)

I want to separate a string into all substrings of lengths 6-12. I would also like to take the original string, swap G/C and A/T, and find all substrings of that as well.

input: ACTGACTG --> TGACTGAC

output: [ACTGAC, ACTGACT, ACTGACTG, CTGACT..., TGACTG, TGACTGA, TGACTGAC, GACTGA...]

Then that output list would be sorted in decreasing length.

You could make a recursive generator using islice from itertools to zip through your string in parallel from offset starting positions, going down in size to a minimum of 6:

from itertools import islice

pairing = str.maketrans("GCAT","CGTA")

def getSubs(A,subLen=11,minLen=6,inverse=None):
    subLen = min(subLen,len(A)-1)
    if subLen<minLen: return
    for s in map("".join,zip(*(islice(A,i,None) for i in range(subLen)))):
        if inverse is not True:  yield s       # substring of length subLen
        if inverse is not False: yield s.translate(pairing)  # its inverse (G<->C, A<->T)
    yield from getSubs(A,subLen-1,minLen,inverse) # shorter substring lengths, same inverse setting

Here islice(A,i,None) is an iterator over the whole string starting at position i. The *(... for i in range(subLen)) part creates subLen such iterators and feeds them to the zip() function. With each iterator starting one position farther along than the previous one, zip() produces tuples corresponding to all substrings of length subLen. These are mapped through "".join to turn the tuples back into strings.
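To see the islice/zip trick in isolation, here is a minimal sketch for a single substring length. The helper name windows is made up for this illustration and is not part of the answer's code.

```python
from itertools import islice

def windows(text, size):
    """Yield every contiguous substring of `text` with exactly `size` characters.

    `windows` is a hypothetical helper name used only for this illustration.
    """
    # One iterator per offset 0..size-1; zip() stops as soon as the
    # shortest iterator (the one with the largest offset) is exhausted.
    offsets = (islice(text, i, None) for i in range(size))
    for chars in zip(*offsets):
        yield "".join(chars)

print(list(windows("GTCCAGCT", 6)))
# ['GTCCAG', 'TCCAGC', 'CCAGCT']
```

Because zip() stops at the shortest iterator, no bounds checking is needed: a string shorter than size simply produces no windows.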

The substrings s are output using yield (this is a generator function), and the translate method is used to output the inverse of each substring as well.
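As a quick check of the translation table on its own, the sketch below applies it to the 10-character example from the question. The function name complement is an assumption added for readability; the table is the same G/C, A/T pairing used above.

```python
# Same pairing as in the answer: G<->C and A<->T.
pairing = str.maketrans("GCAT", "CGTA")

def complement(seq):
    """Return `seq` with every base swapped for its partner (hypothetical helper)."""
    return seq.translate(pairing)

print(complement("GTCCAGCTTA"))
# CAGGTCGAAT
```

Applying the table twice gives back the original string, since each base maps to its partner and back.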

Once all substrings of length subLen are out, we recurse to the next shorter length until the minimum length (6 by default) is reached.

The advantage of using a generator function is that, if you apply this to very large strings and are merely searching for the first occurrence of a given pattern, you can scan through the substrings without creating a huge list of them in memory. And, if needed, you can easily place the result in a list ( L = list(getSubs(tar)) ).
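The lazy-search idea can be sketched as follows. To keep the example self-contained it uses a simplified stand-in generator built on plain slicing (substrings_by_length is a hypothetical name, not the getSubs function above), and the prefix "CCAG" is just an arbitrary search condition.

```python
def substrings_by_length(text, max_len, min_len):
    """Lazily yield substrings from longest to shortest (simplified stand-in)."""
    for size in range(min(max_len, len(text)), min_len - 1, -1):
        for start in range(len(text) - size + 1):
            yield text[start:start + size]

tar = "GTCCAGCTTAAG"

# next() pulls substrings one at a time and stops at the first match,
# so the remaining substrings are never generated or stored.
first = next((s for s in substrings_by_length(tar, 11, 6) if s.startswith("CCAG")), None)
print(first)
# CCAGCTTAAG
```

The second argument to next() is the value returned when nothing matches, which avoids a StopIteration exception.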

Output:

tar = "GTCCAGCTTAAG"
for s in getSubs(tar):print(s)

# you can call getSubs(tar,inverse=True) to get only the inverted substrings
# or call getSubs(tar,inverse=False) to get only the non-inverted substrings
# when the inverse parameter is not provided, you get both

GTCCAGCTTAA
CAGGTCGAATT
TCCAGCTTAAG
AGGTCGAATTC
GTCCAGCTTA
CAGGTCGAAT
TCCAGCTTAA
AGGTCGAATT
CCAGCTTAAG
GGTCGAATTC
GTCCAGCTT
CAGGTCGAA
TCCAGCTTA
AGGTCGAAT
CCAGCTTAA
GGTCGAATT
CAGCTTAAG
GTCGAATTC
GTCCAGCT
CAGGTCGA
TCCAGCTT
AGGTCGAA
CCAGCTTA
GGTCGAAT
CAGCTTAA
GTCGAATT
AGCTTAAG
TCGAATTC
GTCCAGC
CAGGTCG
TCCAGCT
AGGTCGA
CCAGCTT
GGTCGAA
CAGCTTA
GTCGAAT
AGCTTAA
TCGAATT
GCTTAAG
CGAATTC
GTCCAG
CAGGTC
TCCAGC
AGGTCG
CCAGCT
GGTCGA
CAGCTT
GTCGAA
AGCTTA
TCGAAT
GCTTAA
CGAATT
CTTAAG
GAATTC

Using a Pythonic way:

>>> tar = "GTCCAGCTTAAG"

>>> k = 6
>>> K = 12

>>> res = [tar[i: j] for i in range(len(tar)) for j in range(i + 1, len(tar) + 1) if len(tar[i:j]) >= k and len(tar[i:j]) <= K]


>>> res
['GTCCAG', 'GTCCAGC', 'GTCCAGCT', 'GTCCAGCTT', 'GTCCAGCTTA', 'GTCCAGCTTAA', 'GTCCAGCTTAAG', 'TCCAGC', 'TCCAGCT', 'TCCAGCTT', 'TCCAGCTTA', 'TCCAGCTTAA', 'TCCAGCTTAAG', 'CCAGCT', 'CCAGCTT', 'CCAGCTTA', 'CCAGCTTAA', 'CCAGCTTAAG', 'CAGCTT', 'CAGCTTA', 'CAGCTTAA', 'CAGCTTAAG', 'AGCTTA', 'AGCTTAA', 'AGCTTAAG', 'GCTTAA', 'GCTTAAG', 'CTTAAG']
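The list above covers only the original string, in order of starting position. One possible way to also include the G/C, A/T swapped string and sort by decreasing length, as the question asks, is sketched below. The helper all_substrings and the pairing table are assumptions added for this illustration.

```python
tar = "GTCCAGCTTAAG"
k, K = 6, 12

# Assumed pairing, taken from the question's example: G<->C and A<->T.
pairing = str.maketrans("GCAT", "CGTA")

def all_substrings(seq, k, K):
    """Return every substring of `seq` with length between k and K (hypothetical helper)."""
    return [seq[i:i + n]
            for n in range(min(K, len(seq)), k - 1, -1)
            for i in range(len(seq) - n + 1)]

# Substrings of the original string, then of the swapped string.
res = all_substrings(tar, k, K) + all_substrings(tar.translate(pairing), k, K)

# list.sort() is stable, so substrings of equal length keep their relative order.
res.sort(key=len, reverse=True)

print(len(res))   # 56
print(res[:2])    # ['GTCCAGCTTAAG', 'CAGGTCGAATTC']
```

Iterating over the length n directly avoids building and then discarding the substrings that are too short or too long.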

I am not sure I fully understand your problem, but I think I figured out a solution:

tar = "GTCCAGCTTAAG"

def subs(tar, min_len=6):
  # Build a fresh list on every call; mutable default arguments
  # would keep their contents between calls.
  substring = []
  for i in range(len(tar)):
    # n is the exclusive end index, so it has to reach len(tar)
    # for substrings ending at the last character to be included.
    # Starting at i + min_len skips everything shorter than min_len.
    for n in range(i + min_len, len(tar) + 1):
      substring.append(tar[i:n])

  # Sort by decreasing length; sorted() is stable, so substrings
  # of equal length keep their original order.
  return sorted(substring, key=len, reverse=True)
  

a = subs(tar)
