Every item in the list should be compared against every 50-character substring of a string. The code I have written works for shorter strings, but it does not work when the string is very long (e.g. 8800 characters). Can anyone suggest a better way or help debug the code?
Code:
import re

a_str = 'CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA'
a = 0
b = 5
c = 50
leng = len(a_str)
lengb = leng - b + 1
list1 = []
list2 = []
list3 = []
list4 = []
# First pass: collect substrings of length b that occur more than 3 times
for i in a_str[a:lengb]:
    findstr = a_str[a:b]
    if findstr not in list2:
        count = a_str.count(findstr)
        list1 = [m.start() for m in re.finditer(findstr, a_str)]
        last = list1[-1]
        first = list1[0]
        diff = last - first
        if diff > 45:
            count = count - 1
        if count > 3:
            list2.append(findstr)
            list3.append(list1)
    a += 1
    b += 1
a = 0
dictionary = dict(zip(list2, list3))
# Second pass: check each candidate against 50-character windows
for j in list2:
    for k in a_str[a:c]:
        if c < leng:
            str1 = a_str[a:c]
            if str1.count(j) == 4:
                list4.append(j)
        a += 1
        c += 1
print(list4)
For a string of length 8800, with b=10, count1=17, and c=588, c only reaches 1161 during the loop.
I need the substrings of length 5 that are repeated 4 times within a window of length 50 (i.e. within every 50 characters of the main string).
I used comprehensions and sets to create a more understandable function.
def find_four_substrings(a_str, sub_len=5, window=50, occurs=4):
    '''
    Given a string of any length, return the set of substrings
    of length sub_len (default 5) that occur exactly occurs
    (default 4) times within the window (default 50) characters
    that starts at the substring's own position.
    '''
    # str.count takes an absolute end index, so the window ends at i + window
    return set(a_str[i:i+sub_len] for i in range(len(a_str) - sub_len + 1)
               if a_str.count(a_str[i:i+sub_len], i, i + window) == occurs)
and
a_str = 'CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA'
print(find_four_substrings(a_str))
returns (the order of set elements may vary)
{'CGACA', 'GAAGA'}
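When adapting this to other window sizes, it may help to recall how the optional arguments of `str.count` behave: `start` and `end` are absolute positions in the string, as in a slice, and matches are counted without overlap. A minimal illustration, using a made-up string `s` that is not from the question:

```python
s = "ABCABCABCABC"  # hypothetical example string

# start/end behave like the slice s[start:end]; end is a position, not a length.
assert s.count("ABC", 3, 9) == s[3:9].count("ABC") == 2

# If start >= end, the searched slice is empty and the count is 0.
assert s.count("ABC", 6, 5) == 0

# A window of `width` characters beginning at `start` therefore ends at start + width.
start, width = 3, 6
count_in_window = s.count("ABC", start, start + width)

# Matches are counted without overlap.
non_overlapping = "AAAA".count("AA")
```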
This finds all substrings of length 5 that are repeated at least 4 times (not overlapping) within 50 characters. The resulting list does not have duplicates.
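a_str = 'CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA'
b = 5  # length of substring
c = 50  # length of window
repeat = 4  # minimum number of repetitions
# The set comprehension removes duplicates; sorted() gives a stable order
substrings = sorted({
    a_str[i:i+b]
    for i in range(len(a_str) - b + 1)
    # count the occurrences that follow the one at i and still fit in the window
    if a_str.count(a_str[i:i+b], i+b, i+c) >= repeat - 1
})
print(substrings)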
I believe this is what you want; let me know if it is not. The output is:
['CGACA', 'GAAGA']
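If the positions of the repeats are also needed (the question builds a dictionary of them), one possible alternative is to record every start position in a single pass and then look for `repeat` non-overlapping occurrences that fit inside one window. This is only a sketch and not either answerer's method; the function name `repeats_in_window` and its parameter names are made up for illustration.

```python
from collections import defaultdict


def repeats_in_window(text, sub_len=5, window=50, repeat=4):
    """Map each qualifying substring to the start positions of the first
    group of `repeat` non-overlapping occurrences that fit inside
    `window` consecutive characters. (Hypothetical helper.)"""
    # One pass over the string: remember where every substring starts.
    positions = defaultdict(list)
    for i in range(len(text) - sub_len + 1):
        positions[text[i:i + sub_len]].append(i)

    found = {}
    for sub, starts in positions.items():
        if len(starts) < repeat:
            continue  # cannot possibly repeat often enough
        for idx, first in enumerate(starts):
            # Window begins at `first`; pick later occurrences greedily.
            group = [first]
            for pos in starts[idx + 1:]:
                if pos + sub_len > first + window:
                    break  # this occurrence no longer fits in the window
                if pos >= group[-1] + sub_len:  # skip overlapping matches
                    group.append(pos)
                    if len(group) == repeat:
                        break
            if len(group) == repeat:
                found[sub] = group
                break
    return found


a_str = 'CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA'
result = repeats_in_window(a_str)
for sub in sorted(result):
    print(sub, result[sub])
```

Because the substrings are indexed once, the work in the second loop depends on how many occurrences each substring has rather than on rescanning the whole string, which should help with inputs in the range of several thousand characters.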