
Iterate through lines and compare in Python

I am analyzing genome sequencing data and have a problem that I can't identify. I am using an input fastq file containing about 5 million sequence reads, as shown here:

1    Unique Header    #Read 1
2    AAAAAA.....AAAAAA    #Sequence Read 1
3    +
4    ??AA@F    #Quality of Read 1
5    Unique Header    #Read 2
6    ATTAAA.....AAAAAA
7    +
8    >>AA?B
9    Unique Header    #Read 3
10   ATAAAA.....AAAAAA
11   +
12   >>AA?B

The idea is to iterate through this file and compare the sequence read lines (lines 2, 6 and 10 above). If the first and last six chars of a sequence are unique enough (Levenshtein distance of at least 2), then the full sequence and its corresponding three lines are written to an output file. Otherwise, it is ignored.

My code appears to do this when I use a small test file, but when I then analyze a full fastq file, it appears that too many sequences are written to the output file.

My code is below; any help would be appreciated. Thanks.

Code:

def outputFastqSimilar():
    target = open(output_file, 'w') #Final output file that will contain only matching acceptable reads/corresponding data

    with open(current_file, 'r') as f: #This is the input fastq
        lineCharsList = [] #Contains unique concatenated strings of first and last 6 chars of read line

        headerLine = next(f)  #Stores the header information for each line
        counter = 1

        for line in f:
            if counter == 1: 
                lineChars = line[0:6]+line[145:151] #Identify first concatenated string of first/last 6 chars
                lineCharsList.append(lineChars)

                #Write first read/info to output
                target.write(headerLine)
                target.write(line)
                nextLine = next(f)
                target.write(nextLine)
                nextLine = next(f)
                target.write(nextLine)
                headerLine = next(f)    #Move to next header
                counter+=1

            elif counter > 1:
                lineChars = line[0:6]+line[145:151] #Get first/last six chars from next read

                different_enough = True
                for i in lineCharsList: #Iterate through list and compare with current read
                    if distance(lineChars, i) < 2: #Levenshtein distance
                        different_enough = False
                        for skip in range(3): #If read too similar, skip over it
                            try:
                                check = line #Check for additional lines in file
                                headerLine = next(f) #Move to next header
                            except StopIteration:
                                break

                    elif distance(lineChars, i) >= dist_stringency & different_enough == True: #If read is unique enough, write to output
                        lineCharsList.append(lineChars)
                        target.write(headerLine)
                        target.write(line)
                        nextLine = next(f)
                        target.write(nextLine)
                        nextLine = next(f)
                        target.write(nextLine)
                        try:
                            check = line
                            headerLine = next(f)
                        except StopIteration:
                            break

    target.close()

The desired output for the test file is shown below. All reads are unique, but the read on line 10 has a Levenshtein distance < 2 to the read on line 2, so it is not included in the output:

1    Unique Header    #Read 1
2    AAAAAA.....AAAAAA    #Sequence Read 1
3    +
4    ??AA@F    #Quality of Read 1
5    Unique Header    #Read 2
6    ATTAAA.....AAAAAA
7    +
8    >>AA?B

It looks like you're testing whether each read is different enough from any one previous read, but what you really want is the set of reads that are different from all previous reads.
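To make the "any" versus "all" distinction concrete, here is a simplified model, not the code from the question. It assumes the distances from the current read to each stored read have already been computed, and it ignores the extra next(f) calls in the original loop. The function names and the example distances are hypothetical.

```python
MIN_DISTANCE = 2  # threshold used in the question


def writes_inside_loop(distances):
    """Times a read is written when the write happens once per passing comparison."""
    return sum(1 for d in distances if d >= MIN_DISTANCE)


def writes_after_loop(distances):
    """Times a read is written when it must pass every comparison first."""
    return 1 if all(d >= MIN_DISTANCE for d in distances) else 0


# Hypothetical distances from the current read to three stored reads.
distances = [7, 1, 5]

print(writes_inside_loop(distances))  # 2
print(writes_after_loop(distances))   # 0
```

In this model a read that is too close to one stored read is still written twice when the write sits inside the loop, which is one plausible way for too many sequences to reach the output.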

You could set a flag, different_enough = True, before you enter this loop: for i in lineCharsList:

Then, when the test distance(lineChars, i) < 2 succeeds, set different_enough = False.

Don't write anything inside the loop; wait until it has completed and then check the value of different_enough. If your read passed every comparison, it will still be True, so write out the read. If even one stored read was too similar, it will be False.

That way you'll only write the read if it passed every comparison.
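The flag approach could be sketched roughly as follows. This is one possible arrangement under stated assumptions, not a drop-in replacement for the code in the question: levenshtein is a plain pure-Python stand-in for the question's distance function, edge_key and filter_fastq are hypothetical names, the last six characters are taken with a negative slice rather than the fixed indices 145:151, and the sample reads are made up. It tests the flag with a plain if rather than combining conditions with &, which binds more tightly than comparison operators in Python.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance (stand-in for the question's distance())."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def edge_key(sequence, edge=6):
    """First and last `edge` characters of a read, with the newline stripped."""
    seq = sequence.strip()
    return seq[:edge] + seq[-edge:]


def filter_fastq(lines, min_distance=2, edge=6):
    """Yield 4-line records whose edge key is far enough from every kept key."""
    kept_keys = []
    it = iter(lines)
    # zip over the same iterator groups lines into (header, seq, plus, quality);
    # an incomplete trailing record is silently dropped.
    for record in zip(it, it, it, it):
        key = edge_key(record[1], edge)
        different_enough = True
        for other in kept_keys:
            if levenshtein(key, other) < min_distance:
                different_enough = False
                break  # one near match is enough to reject the read
        # Decide only after every comparison has been made.
        if different_enough:
            kept_keys.append(key)
            yield record


# Made-up reads mirroring the example in the question.
sample = [
    "@read1\n", "AAAAAAGGGGAAAAAA\n", "+\n", "??AA@F??AA@F??AA\n",
    "@read2\n", "ATTAAAGGGGAAAAAA\n", "+\n", ">>AA?B>>AA?B>>AA\n",
    "@read3\n", "ATAAAAGGGGAAAAAA\n", "+\n", ">>AA?B>>AA?B>>AA\n",
]

kept = list(filter_fastq(sample))
print([record[0].strip() for record in kept])  # ['@read1', '@read2']
```

Because filter_fastq accepts any iterable of lines, an open file object can be passed in directly and each yielded record written with writelines. Note that every read is compared against all kept keys, so on a file with millions of reads the runtime grows quickly with the number of kept reads.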
