
Adapting a sliding-window Python generator function to shuffle the window

I have adapted the sliding window generator function here ( https://scipher.wordpress.com/2010/12/02/simple-sliding-window-iterator-in-python/ ) for my needs. It is my first experience with generator functions so I did a lot of background reading. Given my (still) limited experience, I'm soliciting advice for the following problem:

The code below does this: I use the sliding-window function to iterate over a 5,500-character string (DNA sequence with ~5,500 bp) in roughly 250-char windows with a step size of 1. For each chunk, I compare its GC content to a line in a 750-line file. (GC content is the percentage of the string elements that equal G or C).
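For reference, the GC-content calculation described above can be sketched as a small helper. The name `gc_percent` is hypothetical, and the upper-casing is an assumption (the original code counts only uppercase "G" and "C"):

```python
def gc_percent(window):
    """Return the GC content of a DNA string as a percentage (0-100)."""
    if not window:
        raise ValueError("window must not be empty")
    window = window.upper()  # assumption: treat lowercase bases the same way
    gc_count = window.count("G") + window.count("C")
    return 100.0 * gc_count / len(window)

print(gc_percent("ATGC"))  # 2 of the 4 bases are G or C
```

Calling `str.count` directly on the window avoids converting each chunk to a list first.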

However, for my downstream use I would really like to loop over these chunks in random order. From my Stack Overflow searching, I understand that it is not possible to shuffle a generator object, and that I cannot shuffle the windows inside the function because it produces the windows one at a time, returning to the function for the next chunk because of that "yield". (Please correct me if I've misunderstood.)
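A minimal sketch of that limitation, using a simplified stand-in generator (the `windows` function here is hypothetical, not the linked `slidingWindow`): `random.shuffle` needs a sequence with a length and item assignment, so it raises `TypeError` on a generator, but works once the windows are collected into a list.

```python
import random

def windows(seq, size):
    """Yield every window of length `size`, left to right."""
    for start in range(len(seq) - size + 1):
        yield seq[start:start + size]

gen = windows("ACGTACGT", 4)
try:
    random.shuffle(gen)      # generators have no len() or item assignment
    shuffle_failed = False
except TypeError:
    shuffle_failed = True

chunks = list(windows("ACGTACGT", 4))  # materialize all 5 windows
random.shuffle(chunks)                 # shuffling a list works, in place
```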

Currently, my code looks something like this (using the generator function in the link above, of course):

with open('/pathtofile/file.txt') as f:
    for line in f:
        line = line.rstrip()
        # For each target, grab target length (7), gc content (8)
        fields = line.split("\t")
        targ_length = int(fields[8])  # slidingWindow requires an int window size
        gc = int(fields[7])
        # Window size = amplicon length minus length of fwd and rev primers
        # Use a sliding window function to go along "my_seq" (5,500bp sequence). Check GC content for each window.
        chunks = slidingWindow(my_seq, targ_length, step=1)
        found = 0
        for i in chunks:
            # When GC content = same as file, save this window as the pos ctrl fragment & add primers to it
            dna_list = list(i)
            gc_count = dna_list.count("G") + dna_list.count("C")
            gc_frac = int((gc_count / len(dna_list)) * 100)
            # if (gc - 5) < gc_frac < (gc + 5):
            if gc_frac == gc:
                found = 1
                # Store this piece
                break
        if found == 0:
            # Store some info to look up later
            pass

Anyone have ideas for the best approach? To me the most obvious (also based on Stack Overflow searches) is to re-write it without a generator function. I'm concerned about looping 750 times over a list containing roughly 5,251 elements. Should I be? Generators seem like an elegant solution to what I want to do, except now that I've decided I want to randomize the chunk order. It seems clear I need to sacrifice efficiency to do this, but I'm wondering whether more experienced coders have some clever solutions. Thanks!
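On the efficiency concern: 750 targets times 5,251 windows is about 3.9 million window checks, which is usually manageable in plain Python. One possible way to keep each check cheap, sketched below under assumptions, is to precompute cumulative G/C counts once and shuffle the window start positions rather than the windows themselves. The names `gc_prefix_sums` and `find_window` and the toy sequence are hypothetical.

```python
import random
from itertools import accumulate

def gc_prefix_sums(seq):
    """prefix[i] = number of G/C bases in seq[:i]."""
    return [0] + list(accumulate(1 if base in "GC" else 0 for base in seq))

def find_window(seq, prefix, win_size, target_gc, rng):
    """Return the start of a randomly chosen window whose integer GC
    percentage equals target_gc, or None if no window matches."""
    starts = list(range(len(seq) - win_size + 1))
    rng.shuffle(starts)  # randomize the order in which windows are visited
    for start in starts:
        gc_count = prefix[start + win_size] - prefix[start]
        # Integer floor of the percentage, computed without floats
        if 100 * gc_count // win_size == target_gc:
            return start
    return None

my_seq = "ATGCGCATTAGCCGTA"      # toy stand-in for the 5,500 bp sequence
prefix = gc_prefix_sums(my_seq)  # computed once, reused for every target
start = find_window(my_seq, prefix, 4, 50, random.Random(0))
```

With the prefix sums in place, each window check is a subtraction instead of a scan over roughly 250 characters.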

I'm not an extremely experienced coder (though I am in the biological sciences), but I have a few questions:

  1. Will the GC percent you are comparing your sliding window to always be the same?
  2. Do you still want to iterate over your sequence the same way you are currently doing it? In other words, is the only thing you want to change the order in which the generator yields its windows? If so, you could do something like this:

    import random

    # One window per valid start position, so every window has full length
    chunks = [my_seq[i:i + targ_length] for i in range(len(my_seq) - targ_length + 1)]
    random.shuffle(chunks)

I'm not sure I'm answering your question correctly, because I'm not 100% sure what it's asking.

I believe that you're correct that you can't shuffle the output of a generator, but it would be relatively easy to randomize how it actually generates its output. Here's a modified version of the slidingWindow generator function that uses the numpy module to randomize (and set an optional seed):

import numpy as np
def slidingWindow(sequence,winSize,step=1, seed=987):
    """Returns a generator that will iterate through
    the defined chunks of input sequence.  Input sequence
    must be iterable."""

    # Verify the inputs
    try:
        it = iter(sequence)
    except TypeError:
        raise Exception("**ERROR** sequence must be iterable.")
    if not ((type(winSize) == type(0)) and (type(step) == type(0))):
        raise Exception("**ERROR** type(winSize) and type(step) must be int.")
    if step > winSize:
        raise Exception("**ERROR** step must not be larger than winSize.")
    if winSize > len(sequence):
        raise Exception("**ERROR** winSize must not be larger than sequence length.")

    # set the seed for the pseudo-random number generator
    np.random.seed(seed)

    # Pre-compute number of chunks to emit
    numOfChunks = int(((len(sequence)-winSize)/step)+1)

    # Create a shuffled array of window start positions
    idx = np.arange(numOfChunks) * step
    np.random.shuffle(idx)

    # Do the work: yield one window per shuffled start position
    for start_idx in idx:
        yield sequence[start_idx:start_idx + winSize]

Then you can either keep your main code as is, or modify how you create the chunks to set a different seed:

chunks = slidingWindow(my_seq, targ_length, step=1, seed=987)
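Note that `np.random.seed` resets NumPy's global random state on every call. As an alternative sketch using only the standard library, a private `random.Random(seed)` instance gives the same reproducibility without touching global state. The name `shuffled_windows` is hypothetical, and the input validation from the function above is omitted for brevity.

```python
import random

def shuffled_windows(sequence, win_size, step=1, seed=None):
    """Yield every window of `sequence` exactly once, in random order."""
    starts = list(range(0, len(sequence) - win_size + 1, step))
    random.Random(seed).shuffle(starts)  # private RNG, global state untouched
    for start in starts:
        yield sequence[start:start + win_size]

# The same seed reproduces the same window order on every run
run_a = list(shuffled_windows("ACGTTGCA", 3, seed=987))
run_b = list(shuffled_windows("ACGTTGCA", 3, seed=987))
```

Passing `seed=None` (the default here) gives a different order each time the generator is created.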
