
Python: Comparing all list items to substring of string

Every item in the list should be compared to every 50-character substring of a string. The code I have written works for shorter strings, but not when the string is very large (e.g. 8800 characters). Can anyone suggest a better way, or help debug the code?

Code:

import re

a_str = 'CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA'
a = 0
b = 5
c = 50
leng = len(a_str)
lengb = leng - b + 1
list1 = []
list2 = []
list3 = []
list4 = []
for i in a_str[a:lengb]:
    findstr = a_str[a:b]
    if findstr not in list2:
        count = a_str.count(findstr)
        list1 = [m.start() for m in re.finditer(findstr, a_str)]
        last = list1[-1]
        first = list1[0]
        diff = last - first
        if diff > 45:
            count = count - 1
        if count > 3:
            list2.append(findstr)
            list3.append(list1)
    a += 1
    b += 1

a = 0
dictionary = dict(zip(list2, list3))
for j in list2:
    for k in a_str[a:c]:
        if c < leng:
            str1 = a_str[a:c]
            if str1.count(j) == 4:
                list4.append(j)
    a += 1
    c += 1

print(list4)

For a string of length 8800, with b=10, count1=17, and c=588, c only takes values up to 1161 during looping.
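A possible explanation, sketched under the assumption that the larger run used the same loop structure as the code above: `a` and `c` are incremented once per candidate in `list2`, not once per window position, so `c` ends at its starting value plus `len(list2)`. The helper name `final_window_end` and the candidate count of 573 are illustrative assumptions (573 is simply 1161 - 588), not values taken from the actual run.

```python
def final_window_end(c_start, candidates):
    """Mimic the second loop above: the window end advances once per candidate."""
    c = c_start
    for _ in candidates:
        # The inner loop would test one fixed window a_str[a:c] here.
        c += 1  # runs once per candidate, not once per window position
    return c

# Hypothetical figures: window length 588 and 573 candidates from the first loop.
print(final_window_end(588, range(573)))  # 1161
```

If that reading is right, most of the string is never examined, because the window stops moving as soon as the candidates run out.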

I need the substrings of length 5 that are repeated 4 times within a window of length 50 (i.e., within every 50 characters of the main string).
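One way to read this requirement, written as a deliberately simple brute-force sketch: slide a 50-character window over the string one position at a time and count every 5-character substring inside it. The function name `repeated_in_window` is made up for illustration, and two choices are assumptions rather than part of the question: "repeated 4 times" is taken as "at least 4 times", and overlapping occurrences are counted.

```python
from collections import Counter

def repeated_in_window(seq, sub_len=5, window=50, occurs=4):
    """Return substrings of length sub_len that start at least `occurs`
    times inside some run of `window` consecutive characters."""
    found = set()
    # Slide the window one character at a time over the whole string.
    for start in range(max(1, len(seq) - window + 1)):
        chunk = seq[start:start + window]
        # Count every substring of length sub_len that fits in this window.
        counts = Counter(chunk[i:i + sub_len]
                         for i in range(len(chunk) - sub_len + 1))
        found.update(s for s, n in counts.items() if n >= occurs)
    return found

a_str = 'CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA'
print(sorted(repeated_in_window(a_str)))  # ['CGACA', 'GAAGA']
```

The work grows with the string length times the window length, which should still be manageable for a few thousand characters.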

I used comprehensions and sets to create a more understandable function.

def find_four_substrings(a_str, sub_len=5, window=50, occurs=4):
    '''
    Given a string of any length, return the set of substrings of
    length sub_len (default 5) that occur exactly `occurs` (default 4)
    times, without overlapping, in the window of `window` (default 50)
    characters that starts at one of their occurrences.
    '''
    return set(a_str[i:i+sub_len] for i in range(len(a_str) - sub_len + 1)
               # the window slides with i: it covers a_str[i:i+window]
               if a_str.count(a_str[i:i+sub_len], i, i + window) == occurs)

and

a_str = 'CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA'
print(sorted(find_four_substrings(a_str)))

prints

['CGACA', 'GAAGA']
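A detail worth keeping in mind when using `str.count` for windowed counting: its optional `start` and `end` arguments are absolute positions in the string, interpreted as in slice notation, and matches are counted without overlap. For a sliding window, `end` therefore has to move along with `start`. A short check using only the built-in method (the string `s` is an arbitrary example):

```python
s = 'ABABABAB'

# start and end are absolute indices, exactly like s[start:end].
print(s.count('AB', 2, 6))                        # 2
print(s.count('AB', 2, 6) == s[2:6].count('AB'))  # True

# Matches do not overlap: 'ABA' is counted at 0 and 4,
# and the overlapping match at index 2 is skipped.
print(s.count('ABA'))                             # 2
```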

This finds all substrings of length 5 that are repeated at least 4 times (without overlapping) within 50 characters. The resulting list has no duplicates.

a_str = 'CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA'
b = 5      #length of substring
c = 50     #length of window
repeat = 4 #minimum number of repetitions

substrings = list({
    a_str[i:i+b]
    for i in range(len(a_str) - b)
    if a_str.count(a_str[i:i+b], i+b, i+c) >= repeat - 1
})
print(substrings)

I believe this is what you want. Let me know if it is not.

['CGACA', 'GAAGA']
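For much longer strings (such as the 8800-character case in the question), calling `str.count` once per position rescans the window every time. A possible alternative, sketched here as an assumption rather than as part of the answer above, is to record the start positions of every substring in one pass and then check whether any `occurs` consecutive positions fit inside one window. The name `repeats_by_position` is hypothetical, and unlike `str.count` this version counts overlapping occurrences.

```python
from collections import defaultdict

def repeats_by_position(seq, sub_len=5, window=50, occurs=4):
    """Substrings of length sub_len with at least `occurs` occurrences
    (overlaps allowed) that all fit inside `window` characters."""
    positions = defaultdict(list)
    # One pass: record every start index of every substring.
    for i in range(len(seq) - sub_len + 1):
        positions[seq[i:i + sub_len]].append(i)

    result = []
    for sub, starts in positions.items():
        # Pair each start with the one `occurs - 1` hits later; the group
        # fits if the last hit ends within `window` characters of the first.
        for first, last in zip(starts, starts[occurs - 1:]):
            if last + sub_len - first <= window:
                result.append(sub)
                break
    return sorted(result)

a_str = 'CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA'
print(repeats_by_position(a_str))  # ['CGACA', 'GAAGA']
```

Each position is visited a constant number of times, so the cost should grow roughly linearly with the length of the string.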
