Iterating over lines and comparing them in Python

Question

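I am analyzing genome sequencing data and have run into a problem that I cannot pin down. My input is a fastq file containing roughly 5 million sequence reads, laid out as follows: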

1    Unique Header    #Read 1
2    AAAAAA.....AAAAAA    #Sequence Read 1
3    +
4    ??AA@F    #Quality of Read 1
5    Unique Header    #Read 2
6    ATTAAA.....AAAAAA
7    +
8    >>AA?B
9    Unique Header    #Read 3
10   ATAAAA.....AAAAAA
11   +
12   >>AA?B
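
Each read in the sample above occupies four lines (header, sequence, `+`, quality). A minimal sketch of reading such a file one record at a time is given here; the function name `read_fastq_records` and the `@read1`/`@read2` headers are made up for illustration, and the sketch assumes every record spans exactly four lines.

```python
import io
from itertools import islice


def read_fastq_records(handle):
    """Yield (header, sequence, separator, quality) tuples from a FASTQ handle.

    Assumes every record spans exactly four lines, as in the sample above.
    """
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return  # clean end of file
        if len(record) < 4:
            raise ValueError("truncated FASTQ record: %r" % record)
        yield tuple(record)


# Small in-memory stand-in for a real file, for illustration only.
sample = io.StringIO(
    "@read1\nAAAAAA\n+\n??AA@F\n"
    "@read2\nATTAAA\n+\n>>AA?B\n"
)
records = list(read_fastq_records(sample))
print(len(records))    # number of four-line records
print(records[1][1])   # sequence line of the second record
```

Grouping the lines up front means the comparison logic never has to call `next()` on the file by hand, so the header, sequence and quality lines cannot drift out of step.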

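The idea is then to iterate over this file and compare the sequence read lines (lines 2 and 6 above). If the first six characters of a sequence are unique enough (a Levenshtein distance of at least 2), the whole sequence and its three corresponding lines are written to the output file. Otherwise the read is ignored.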
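
To make the threshold concrete, here is a small pure-Python sketch of the Levenshtein distance applied to the first six characters of the three sample reads. The question's own code calls a `distance` function whose origin is not shown, so the `levenshtein` function and the `MIN_DISTANCE` name below are stand-ins assumed for illustration.

```python
def levenshtein(a, b):
    """Return the Levenshtein (edit) distance between strings a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


MIN_DISTANCE = 2  # threshold taken from the question

first = "AAAAAA"   # first six characters of read 1
second = "ATTAAA"  # first six characters of read 2
third = "ATAAAA"   # first six characters of read 3

print(levenshtein(first, second) >= MIN_DISTANCE)  # read 2 compared with read 1
print(levenshtein(first, third) >= MIN_DISTANCE)   # read 3 compared with read 1
```

Read 2 differs from read 1 by two substitutions, while read 3 differs by only one, which is why read 3 is the one expected to be dropped.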

My code seems to do this on a small test file, but when I run it on a full fastq file, it seems that two sequences are written to the output file.

My code is below; any help would be greatly appreciated. Thanks.

Code:

def outputFastqSimilar():
    target = open(output_file, 'w') #Final output file that will contain only matching acceptable reads/corresponding data

    with open(current_file, 'r') as f: #This is the input fastq
        lineCharsList = [] #Contains unique concatenated strings of first and last 6 chars of read line

        headerLine = next(f)  #Stores the header information for each line
        counter = 1

        for line in f:
            if counter == 1: 
                lineChars = line[0:6]+line[145:151] #Identify first concatenated string of first/last 6 chars
                lineCharsList.append(lineChars)

                #Write first read/info to output
                target.write(headerLine)
                target.write(line)
                nextLine = next(f)
                target.write(nextLine)
                nextLine = next(f)
                target.write(nextLine)
                headerLine = next(f)    #Move to next header
                counter+=1

            elif counter > 1:
                lineChars = line[0:6]+line[145:151] #Get first/last six chars from next read

                different_enough = True
                for i in lineCharsList: #Iterate through list and compare with current read
                    if distance(lineChars, i) < 2: #Levenshtein distance
                        different_enough = False
                        for skip in range(3): #If read too similar, skip over it
                            try:
                                check = line #Check for additional lines in file
                                headerLine = next(f) #Move to next header
                            except StopIteration:
                                break

                    elif distance(lineChars, i) >= dist_stringency & different_enough == True: #If read is unique enough, write to output
                        lineCharsList.append(lineChars)
                        target.write(headerLine)
                        target.write(line)
                        nextLine = next(f)
                        target.write(nextLine)
                        nextLine = next(f)
                        target.write(nextLine)
                        try:
                            check = line
                            headerLine = next(f)
                        except StopIteration:
                            break

    target.close()

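The desired output for the test file is shown below. All of the reads are unique, but the read on line 10 has a Levenshtein distance < 2 from the read on line 2, so it is not included in the output: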

1    Unique Header    #Read 1
2    AAAAAA.....AAAAAA    #Sequence Read 1
3    +
4    ??AA@F    #Quality of Read 1
5    Unique Header    #Read 2
6    ATTAAA.....AAAAAA
7    +
8    >>AA?B

Answer 1

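It looks like you are testing whether each read is different enough from a single previous read, but what you really want is the set of reads that are different from all previous reads.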

You can set a flag, different_enough = True, before entering this loop: for i in lineCharsList:

Then, when the test distance(lineChars, i) < 2 succeeds, set different_enough = False.

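Do not print anything inside the loop; wait until it has finished and then check the state of different_enough. If the read passed every comparison, the flag is still True, so print the read. If it was too similar to even one earlier read, the flag will be False.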

This way, a read is printed only if it passes every comparison.
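
The flag approach described in this answer could be sketched roughly as follows. This is not the asker's final code: `filter_fastq`, `accepted_keys` and the compact `levenshtein` helper are names assumed for illustration, `sequence[-6:]` stands in for the hard-coded `line[145:151]` slice, and the early `break` is an optional shortcut the answer does not mention.

```python
import io


def levenshtein(a, b):
    """Compact edit distance; a stand-in for the question's distance()."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def filter_fastq(source, target, min_distance=2):
    """Copy only reads whose key differs enough from every kept read."""
    accepted_keys = []
    while True:
        record = [source.readline() for _ in range(4)]
        if not record[0]:
            break  # end of input
        sequence = record[1].rstrip("\n")
        key = sequence[:6] + sequence[-6:]  # first and last six characters

        # Set the flag before the comparison loop, as the answer suggests.
        different_enough = True
        for kept in accepted_keys:
            if levenshtein(key, kept) < min_distance:
                different_enough = False
                break  # one close match is enough to reject the read

        # Decide only after every comparison has been made.
        if different_enough:
            accepted_keys.append(key)
            target.writelines(record)


# Made-up reads that mirror the three-read test file in the question.
source = io.StringIO(
    "@read1\nAAAAAACCCCAAAAAA\n+\n??AA@F??AA@F??AA\n"
    "@read2\nATTAAACCCCAAAAAA\n+\n>>AA?B>>AA?B>>AA\n"
    "@read3\nATAAAACCCCAAAAAA\n+\n>>AA?B>>AA?B>>AA\n"
)
target = io.StringIO()
filter_fastq(source, target)
print(target.getvalue())
```

With real files, `source` and `target` would be the handles returned by `open(current_file)` and `open(output_file, 'w')`. Because the write happens once per read, after the loop, a read can no longer be written several times or skipped halfway through the comparisons.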


Solution 1
1 2015-07-20 17:17:16
