
Adapting a sliding-window Python generator function to shuffle the window

I have adapted the sliding window generator function here ( https://scipher.wordpress.com/2010/12/02/simple-sliding-window-iterator-in-python/ ) for my needs. It is my first experience with generator functions, so I did a lot of background reading. Given my (still) limited experience, I'm soliciting advice for the following problem:

The code below does this: I use the sliding-window function to iterate over a 5,500-character string (a DNA sequence of ~5,500 bp) in roughly 250-character windows with a step size of 1. For each chunk, I compare its GC content to a line in a 750-line file. (GC content is the percentage of the string's elements that equal G or C.)
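
As a small illustration of the GC-content definition above, here is a minimal sketch. The helper name `gc_percent` is hypothetical (it does not appear in the original code), and it uses integer floor division instead of the float expression in the loop below, because float rounding can give surprises such as `int(29 / 100 * 100)` evaluating to 28.

```python
def gc_percent(window: str) -> int:
    """Return the GC content of `window` as a whole-number percentage.

    Hypothetical helper for illustration; counts only uppercase G and C,
    as the question's code does.
    """
    if not window:
        raise ValueError("window must not be empty")
    gc_count = window.count("G") + window.count("C")
    # Integer arithmetic: floor of (gc_count / len(window)) * 100,
    # without going through a float.
    return gc_count * 100 // len(window)


print(gc_percent("ATGCGC"))  # 4 of the 6 bases are G or C
```

Calling `count` on the string directly gives the same counts as converting the window to a list first.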

However, for my downstream use I would really like to loop over these chunks in random order. From my Stack Overflow searching, I understand that it is not possible to shuffle a generator object, and that I cannot shuffle the windows inside the function because it produces the windows one at a time, returning to the function for the next chunk each time because of that "yield". (Please correct me if I've misunderstood.)

Currently, my code looks something like this (using the generator function in the link above, of course):

with open('/pathtofile/file.txt') as f:
    for line in f:
        fields = line.rstrip().split("\t")
        # For each target, grab target length (7), gc content (8)
        # slidingWindow requires an int window size, so convert here
        targ_length = int(fields[8])
        gc = int(fields[7])
        # Window size = amplicon length minus length of fwd and rev primers
        # Use a sliding window function to go along "my_seq" (5,500bp sequence). Check GC content for each window.
        chunks = slidingWindow(my_seq, targ_length, step=1)
        found = 0
        for i in chunks:
            # When GC content = same as file, save this window as the pos ctrl fragment & add primers to it
            dna_list = list(i)
            gc_count = dna_list.count("G") + dna_list.count("C")
            gc_frac = int((gc_count / len(dna_list)) * 100)
            # if (gc - 5) < gc_frac < (gc + 5):
            if gc_frac == gc:
                found = 1
                # Store this piece
                break
        if found == 0:
            # Store some info to look up later
            pass

Does anyone have ideas for the best approach? To me the most obvious one (also based on Stack Overflow searches) is to rewrite it without a generator function. I'm concerned about looping 750 times over a list containing roughly 5,251 elements. Should I be? Generators seemed like an elegant solution to what I want to do, except that I've now decided I want to randomize the chunk order. It seems clear I need to sacrifice some efficiency to do this, but I'm wondering whether more experienced coders have some clever solutions. Thanks!

I'm not an extremely experienced coder (though I am in the biological sciences), but I have a few questions:

  1. Will the GC percent you are comparing your sliding window to always be the same?
  2. Do you still want to iterate over your sequence the same way you are currently doing it? In other words, is the only thing you want to change the order in which the generator yields its windows? If so, you could do something like this:

    import random
    # One full-length window per start position (no short windows at the end)
    chunks = [my_seq[i:i+targ_length] for i in range(len(my_seq) - targ_length + 1)]
    random.shuffle(chunks)

I'm not sure I'm answering your question correctly, because I'm not 100% sure what it's asking.
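
For what it's worth, the shuffled-list idea can also be sketched by shuffling the start positions rather than the substrings, and stopping at the first window whose GC percent matches. This is only a sketch under assumptions: the function name `find_random_window`, its parameters, and the toy sequence are all made up for illustration, and an exact GC match is assumed, as in the question's `if gc_frac == gc` test.

```python
import random


def find_random_window(seq, win_size, target_gc, rng=random):
    """Visit every full-length window once, in random order.

    Returns (start, window) for the first window whose whole-number GC
    percent equals target_gc, or None if no window matches.
    Hypothetical helper; not part of the original code.
    """
    n_windows = len(seq) - win_size + 1
    if win_size <= 0 or n_windows <= 0:
        return None
    # Shuffle the start positions (plain ints) instead of the substrings.
    starts = list(range(n_windows))
    rng.shuffle(starts)
    for start in starts:
        window = seq[start:start + win_size]
        gc_count = window.count("G") + window.count("C")
        if gc_count * 100 // win_size == target_gc:
            return start, window
    return None


my_seq = "ATGCGCATTAGCCGTA"  # toy stand-in for the ~5,500 bp sequence
hit = find_random_window(my_seq, win_size=4, target_gc=50, rng=random.Random(0))
print(hit)
```

At the sizes in the question, each of the 750 lines shuffles a list of about 5,251 integers, so building the list on every pass should not be a practical concern.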

I believe you're correct that you can't shuffle the output of a generator, but it would be relatively easy to randomize how it actually generates its output. Here's a modified version of the slidingWindow generator function that uses the numpy module to randomize (and set an optional seed):

import numpy as np
def slidingWindow(sequence,winSize,step=1, seed=987):
    """Returns a generator that will iterate through
    the defined chunks of input sequence.  Input sequence
    must be iterable."""

    # Verify the inputs
    try:
        it = iter(sequence)
    except TypeError:
        raise Exception("**ERROR** sequence must be iterable.")
    if not ((type(winSize) == type(0)) and (type(step) == type(0))):
        raise Exception("**ERROR** type(winSize) and type(step) must be int.")
    if step > winSize:
        raise Exception("**ERROR** step must not be larger than winSize.")
    if winSize > len(sequence):
        raise Exception("**ERROR** winSize must not be larger than sequence length.")

    # set the seed for the pseudo-random number generator
    np.random.seed(seed)

    # Pre-compute number of chunks to emit
    numOfChunks = int(((len(sequence)-winSize)/step)+1)

    # Create a shuffled index of start points
    idx = np.arange(numOfChunks)
    np.random.shuffle(idx)

    # Do the work: visit the chunk numbers in shuffled order and
    # convert each chunk number to its start position
    for i in idx:
        start_idx = i * step
        stop_idx = start_idx + winSize
        yield sequence[start_idx:stop_idx]

Then you can either keep your main code as is, or modify how you create the chunks to set a different seed:

chunks = slidingWindow(my_seq, targ_length, step=1, seed=987)
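
One thing to be aware of is that `np.random.seed` sets NumPy's global random state, so seeding inside the function also affects any other code that uses `np.random`. A possible alternative, sketched here with only the standard library, keeps a private `random.Random` instance instead. The name `shuffled_windows` is hypothetical, and this is a swapped-in variant rather than the function above.

```python
import random


def shuffled_windows(sequence, win_size, step=1, seed=None):
    """Yield every full-length window of `sequence` once, in random order.

    Hypothetical stdlib-only variant. Because this is a generator, the
    checks below run when the first window is requested, not at call time.
    """
    if win_size <= 0 or step <= 0:
        raise ValueError("win_size and step must be positive integers")
    if win_size > len(sequence):
        raise ValueError("win_size must not be larger than sequence length")
    rng = random.Random(seed)  # private RNG; global random state untouched
    starts = list(range(0, len(sequence) - win_size + 1, step))
    rng.shuffle(starts)
    for start in starts:
        yield sequence[start:start + win_size]


seq = "ACGTACGTAC"
a = list(shuffled_windows(seq, 4, seed=987))
b = list(shuffled_windows(seq, 4, seed=987))
print(a == b)  # the same seed reproduces the same order
```

Passing `seed=None` (the default here) gives a different order on each call, which may be preferable if every line of the file should see the windows in a fresh random order.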
