
Iterate through lines and compare in Python

I am analyzing genome sequencing data and having a problem that I can't identify. I am using an input fastq file containing about 5 million sequence reads as shown here:

1    Unique Header    #Read 1
2    AAAAAA.....AAAAAA    #Sequence Read 1
3    +
4    ??AA@F    #Quality of Read 1
5    Unique Header    #Read 2
6    ATTAAA.....AAAAAA
7    +
8    >>AA?B
9    Unique Header    #Read 3
10   ATAAAA.....AAAAAA
11   +
12   >>AA?B

The idea is to iterate through this file and compare the sequence read lines (lines 2, 6, and 10 above). If the first and last six characters of a sequence are unique enough (a Levenshtein distance of at least 2 from the reads already kept), the sequence and the other three lines of its record are written to an output file. Otherwise, the read is ignored.

My code appears to do this when I use a small test file, but when I analyze a full fastq file, it appears that too many sequences are written to the output file.

My code is below; any help would be appreciated. Thanks.

Code:

def outputFastqSimilar():
    target = open(output_file, 'w') #Final output file that will contain only matching acceptable reads/corresponding data

    with open(current_file, 'r') as f: #This is the input fastq
        lineCharsList = [] #Contains unique concatenated strings of first and last 6 chars of read line

        headerLine = next(f)  #Stores the header information for each line
        counter = 1

        for line in f:
            if counter == 1: 
                lineChars = line[0:6]+line[145:151] #Identify first concatenated string of first/last 6 chars
                lineCharsList.append(lineChars)

                #Write first read/info to output
                target.write(headerLine)
                target.write(line)
                nextLine = next(f)
                target.write(nextLine)
                nextLine = next(f)
                target.write(nextLine)
                headerLine = next(f)    #Move to next header
                counter+=1

            elif counter > 1:
                lineChars = line[0:6]+line[145:151] #Get first/last six chars from next read

                different_enough = True
                for i in lineCharsList: #Iterate through list and compare with current read
                    if distance(lineChars, i) < 2: #Levenshtein distance
                        different_enough = False
                        for skip in range(3): #If read too similar, skip over it
                            try:
                                check = line #Check for additional lines in file
                                headerLine = next(f) #Move to next header
                            except StopIteration:
                                break

                    elif distance(lineChars, i) >= dist_stringency & different_enough == True: #If read is unique enough, write to output
                        lineCharsList.append(lineChars)
                        target.write(headerLine)
                        target.write(line)
                        nextLine = next(f)
                        target.write(nextLine)
                        nextLine = next(f)
                        target.write(nextLine)
                        try:
                            check = line
                            headerLine = next(f)
                        except StopIteration:
                            break

    target.close()

The desired output for the test file is shown below. All three reads are unique, but the read on line 10 has a Levenshtein distance < 2 to the read on line 2, so it would not be included in the output:

1    Unique Header    #Read 1
2    AAAAAA.....AAAAAA    #Sequence Read 1
3    +
4    ??AA@F    #Quality of Read 1
5    Unique Header    #Read 2
6    ATTAAA.....AAAAAA
7    +
8    >>AA?B

It looks like you're testing whether each read is different enough from any previous read, but what you really want is the set of reads that are different from all previous reads.
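As a toy illustration of that difference, the sketch below uses made-up distances (the numbers and the names `distances` and `threshold` are hypothetical, not taken from the question) and shows how "far from at least one earlier read" and "far from every earlier read" can disagree:

```python
# Hypothetical distances from the current read to three reads already kept.
distances = [5, 1, 4]
threshold = 2

# Roughly what writing inside the loop amounts to:
# the read is far enough from at least one earlier read.
passes_any = any(d >= threshold for d in distances)

# What is actually wanted: the read is far enough from every earlier read.
passes_all = all(d >= threshold for d in distances)

print(passes_any, passes_all)  # True False
```

The read here is too close to the second kept read, so it should be rejected even though it passes the other two comparisons.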

You could set a flag different_enough = True before you enter this loop: for i in lineCharsList:

Then, when you test if distance(lineChars, i) < 2, set different_enough = False.

Don't print out anything inside the loop; wait until it has completed and then check the status of different_enough. If your read passed every comparison, it will still be True, so print out the read. If even one read was too similar, it will be False.

That way you'll only print the read if it passed every comparison.
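Putting the steps above together, a minimal sketch might look like the following. It is not the asker's code: `filter_fastq`, `levenshtein`, `min_dist`, and `end_len` are names invented here, the pure-Python `levenshtein` is only a stand-in for whatever `distance` function the question imports, and the key is taken as the first and last `end_len` characters of the sequence rather than the fixed slice `line[145:151]`. It also assumes the input consists of complete four-line records.

```python
import io

def levenshtein(a, b):
    """Plain dynamic-programming edit distance (stand-in for the question's `distance`)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def filter_fastq(infile, outfile, min_dist=2, end_len=6):
    """Copy 4-line records whose end key is >= min_dist from every kept key."""
    kept_keys = []
    while True:
        record = [infile.readline() for _ in range(4)]
        if not record[3]:  # missing or incomplete record: end of input
            break
        seq = record[1].rstrip("\n")
        key = seq[:end_len] + seq[-end_len:]

        different_enough = True
        for k in kept_keys:
            if levenshtein(key, k) < min_dist:
                different_enough = False
                break  # one close match is enough to reject the read

        # Decide only after every comparison has been made.
        if different_enough:
            kept_keys.append(key)
            outfile.writelines(record)
    return len(kept_keys)

# Small in-memory example shaped like the test file in the question.
demo = io.StringIO(
    "@read1\nAAAAAACCCCAAAAAA\n+\n??AA@F\n"
    "@read2\nATTAAACCCCAAAAAA\n+\n>>AA?B\n"
    "@read3\nATAAAACCCCAAAAAA\n+\n>>AA?B\n"
)
out = io.StringIO()
kept = filter_fastq(demo, out)
print(kept)  # 2
```

Reading the file one whole record at a time keeps the header, sequence, and quality lines together, so nothing needs to be skipped by hand with `next(f)` when a read is rejected.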
