I have adapted the sliding window generator function here (https://scipher.wordpress.com/2010/12/02/simple-sliding-window-iterator-in-python/) for my needs. It is my first experience with generator functions, so I did a lot of background reading. Given my (still) limited experience, I'm soliciting advice for the following problem:
The code below does this: I use the sliding-window function to iterate over a 5,500-character string (DNA sequence with ~5,500 bp) in roughly 250-char windows with a step size of 1. For each chunk, I compare its GC content to a line in a 750-line file. (GC content is the percentage of the string elements that equal G or C).
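To make the GC calculation concrete, here is a minimal sketch of it as a helper. The name gc_percent is hypothetical, not part of the linked code, and it assumes the sequence is upper-case. It uses integer arithmetic rather than float division so the truncated percentage is not affected by floating-point rounding.

```python
def gc_percent(window):
    """Return the GC content of a DNA string as a whole-number percentage."""
    if not window:
        raise ValueError("window must not be empty")
    # Count G and C characters (assumes an upper-case sequence)
    gc_count = window.count("G") + window.count("C")
    # Integer division truncates, like int(...) in the float version
    return gc_count * 100 // len(window)


print(gc_percent("ATGC"))    # 2 of 4 bases are G or C
print(gc_percent("ATGCGC"))  # 4 of 6 bases are G or C
```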
However, for my downstream use I would really like to loop over these chunks randomly. From my Stack Overflow searching, I understand that it is not possible to shuffle a generator object, and that I cannot shuffle the windows inside the function because it actually searches the windows one at a time, returning to the function for the next chunk because of that "yield". (Please correct me if I've misunderstood).
Currently, my code looks something like this (using the generator function in the link above, of course):
with open('/pathtofile/file.txt') as f:
    for line in f:
        line = line.rstrip()
        # For each target, grab target length (7), gc content (8)
        targ_length = int(line.split("\t")[8])
        gc = int(line.split("\t")[7])
        # Window size = amplicon length minus length of fwd and rev primers
        # Use a sliding window function to go along "my_seq" (5,500bp sequence). Check GC content for each window.
        chunks = slidingWindow(my_seq, targ_length, step=1)
        found = 0
        for i in chunks:
            # When GC content = same as file, save this window as the pos ctrl fragment & add primers to it
            dna_list = list(i)
            gc_count = dna_list.count("G") + dna_list.count("C")
            gc_frac = int((gc_count / len(dna_list)) * 100)
            # if (gc - 5) < gc_frac < (gc + 5):
            if gc_frac == gc:
                found = 1
                # Store this piece
                break
        if found == 0:
            # Store some info to look up later
            pass
Anyone have ideas for the best approach? To me the most obvious (also based on Stack Overflow searches) is to re-write it without a generator function. I'm concerned about looping 750 times over a list containing roughly 5,251 elements. Should I be? Generators seem like an elegant solution to what I want to do, except now that I've decided I want to randomize the chunk order. It seems clear I need to sacrifice efficiency to do this, but I'm wondering whether more experienced coders have some clever solutions. Thanks!
I'm not a very experienced coder (though I do work in the biological sciences), but I have a question first:
Do you still want to iterate over your sequence the same way you are doing it now? In other words, is the only thing you want to change the order in which the generator yields its windows? If so, you could do something like this:
import random

# One window per valid start position, so every chunk has the full length
chunks = [my_seq[i:i + targ_length] for i in range(len(my_seq) - targ_length + 1)]
random.shuffle(chunks)
I'm not sure I'm answering your question correctly, because I'm not 100% sure what it's asking.
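If you would rather keep a generator, one option is to shuffle the start positions instead of the windows themselves and slice lazily. This is only a sketch using the standard library: shuffled_windows and its parameters are made-up names, and it assumes the sequence supports len() and slicing (a str does).

```python
import random


def shuffled_windows(sequence, win_size, step=1, seed=None):
    """Yield every window of `sequence` exactly once, in random order."""
    if win_size > len(sequence):
        raise ValueError("win_size must not be larger than the sequence")
    # Only the start positions are stored and shuffled, not the windows
    starts = list(range(0, len(sequence) - win_size + 1, step))
    # A private Random instance leaves the global random state untouched
    random.Random(seed).shuffle(starts)
    for start in starts:
        yield sequence[start:start + win_size]


# Toy example: a 10-character sequence with 4-character windows gives 7 windows
for window in shuffled_windows("ACGTACGTAC", 4, seed=1):
    print(window)
```

For a 5,500 bp sequence this keeps a list of a few thousand integers in memory, which is small, and each window is still built only when the loop asks for it.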
I believe that you're correct that you can't shuffle the output of a generator, but it would be relatively easy to randomize how it actually generates its output. Here's a modified version of the slidingWindow generator function that uses the numpy module to randomize (and set an optional seed):
import numpy as np

def slidingWindow(sequence, winSize, step=1, seed=987):
    """Returns a generator that will iterate through
    the defined chunks of input sequence in random order.
    Input sequence must be iterable."""
    # Verify the inputs
    try:
        it = iter(sequence)
    except TypeError:
        raise Exception("**ERROR** sequence must be iterable.")
    if not ((type(winSize) == type(0)) and (type(step) == type(0))):
        raise Exception("**ERROR** type(winSize) and type(step) must be int.")
    if step > winSize:
        raise Exception("**ERROR** step must not be larger than winSize.")
    if winSize > len(sequence):
        raise Exception("**ERROR** winSize must not be larger than sequence length.")
    # Set the seed for the pseudo-random number generator
    np.random.seed(seed)
    # Pre-compute number of chunks to emit
    numOfChunks = (len(sequence) - winSize) // step + 1
    # Create a shuffled index of chunk numbers
    idx = np.arange(numOfChunks)
    np.random.shuffle(idx)
    # Do the work: chunk number i starts at position i * step
    for i in idx:
        start_idx = i * step
        stop_idx = start_idx + winSize
        yield sequence[start_idx:stop_idx]
Then you can either keep your main code as is, or modify how you create the chunks to set a different seed:
chunks = slidingWindow(my_seq, targ_length, step=1, seed=987)
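For what it's worth, the whole search (random order, stop at the first match) can also be sketched in one small function. Everything here is an assumption for illustration: find_window, tol and seed are invented names, it uses the standard random module rather than numpy, and the tolerance check is inclusive, so tol=0 means an exact match and tol=4 corresponds to the commented-out "within 5" test in the question.

```python
import random


def find_window(sequence, win_size, target_gc, tol=0, seed=None):
    """Return (start, window) for the first window, visited in random order,
    whose GC percentage is within `tol` of `target_gc`; None if there is none."""
    starts = list(range(len(sequence) - win_size + 1))
    random.Random(seed).shuffle(starts)
    for start in starts:
        window = sequence[start:start + win_size]
        # Whole-number GC percentage of this window
        gc = (window.count("G") + window.count("C")) * 100 // win_size
        if abs(gc - target_gc) <= tol:
            return start, window  # stop at the first hit, like the break above
    return None


# Toy example: only one 4-base window of this sequence is 100% GC
print(find_window("AAAAGGGG", 4, 100))
```

Returning None plays the role of the found == 0 branch in the question's loop.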