I am analyzing genome sequencing data and having a problem that I can't identify. I am using an input fastq file containing about 5 million sequence reads as shown here:
1 Unique Header #Read 1
2 AAAAAA.....AAAAAA #Sequence Read 1
3 +
4 ??AA@F #Quality of Read 1
5 Unique Header #Read 2
6 ATTAAA.....AAAAAA
7 +
8 >>AA?B
9 Unique Header #Read 3
10 ATAAAA.....AAAAAA
11 +
12 >>AA?B
The idea is to then iterate through this file and compare the sequence read lines (lines 2, 6 and 10 above). If the first and last six chars of a sequence are unique enough (a Levenshtein distance of at least 2 from every read already written), then the full sequence and its corresponding three lines are written to an output file. Otherwise, the read is ignored.
My code appears to do this when I use a small test file, but when I analyze a full fastq file, it appears that too many sequences are written to the output file.
My code is below. Any help would be appreciated. Thanks.
Code:
def outputFastqSimilar():
    target = open(output_file, 'w') #Final output file that will contain only matching acceptable reads/corresponding data
    with open(current_file, 'r') as f: #This is the input fastq
        lineCharsList = [] #Contains unique concatenated strings of first and last 6 chars of read line
        headerLine = next(f) #Stores the header information for each line
        counter = 1
        for line in f:
            if counter == 1:
                lineChars = line[0:6]+line[145:151] #Identify first concatenated string of first/last 6 chars
                lineCharsList.append(lineChars)
                #Write first read/info to output
                target.write(headerLine)
                target.write(line)
                nextLine = next(f)
                target.write(nextLine)
                nextLine = next(f)
                target.write(nextLine)
                headerLine = next(f) #Move to next header
                counter+=1
            elif counter > 1:
                lineChars = line[0:6]+line[145:151] #Get first/last six chars from next read
                different_enough = True
                for i in lineCharsList: #Iterate through list and compare with current read
                    if distance(lineChars, i) < 2: #Levenshtein distance
                        different_enough = False
                        for skip in range(3): #If read too similar, skip over it
                            try:
                                check = line #Check for additional lines in file
                                headerLine = next(f) #Move to next header
                            except StopIteration:
                                break
                    elif distance(lineChars, i) >= dist_stringency & different_enough == True: #If read is unique enough, write to output
                        lineCharsList.append(lineChars)
                        target.write(headerLine)
                        target.write(line)
                        nextLine = next(f)
                        target.write(nextLine)
                        nextLine = next(f)
                        target.write(nextLine)
                        try:
                            check = line
                            headerLine = next(f)
                        except StopIteration:
                            break
    target.close()
Desired output of test file would be the following, where all reads are unique, but the read on line 10 has a Levenshtein distance < 2 to the read on line 2 so would not be included in the output:
1 Unique Header #Read 1
2 AAAAAA.....AAAAAA #Sequence Read 1
3 +
4 ??AA@F #Quality of Read 1
5 Unique Header #Read 2
6 ATTAAA.....AAAAAA
7 +
8 >>AA?B
It looks like you're testing whether each read is different enough from any previous read, but what you really want is the set of reads that are different from all previous reads.
You could set a flag, different_enough = True, before you enter the loop for i in lineCharsList:. Then, whenever the test distance(lineChars, i) < 2 succeeds, set different_enough = False.
Don't write anything out inside the loop. Wait until it has completed, then check the status of different_enough. If your read passed every comparison it will still be True, so write out the read. If even one previous read was too similar it will be False.
That way you'll only write the read if it passed every comparison.
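Here is a minimal sketch of that approach, not a drop-in replacement for your function. It makes several assumptions: the names levenshtein, filter_fastq and output_fastq_similar are made up for illustration, the pure-Python levenshtein is only a stand-in for whatever distance() you import, every record is exactly four lines, and the "last six chars" are taken from the end of the sequence after stripping the newline rather than from the fixed slice [145:151]. The decision to keep a read is made once, after it has been compared with all previously kept reads.

```python
from itertools import islice


def levenshtein(a, b):
    """Plain dynamic-programming edit distance (stand-in for distance())."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def filter_fastq(lines, min_distance=2, end_len=6):
    """Yield 4-line records whose end key differs from every kept key."""
    kept_keys = []
    it = iter(lines)
    while True:
        record = list(islice(it, 4))   # header, sequence, '+', quality
        if len(record) < 4:            # end of input or truncated record
            break
        seq = record[1].rstrip("\n")
        key = seq[:end_len] + seq[-end_len:]
        # Decide only after comparing against ALL previously kept keys.
        if all(levenshtein(key, k) >= min_distance for k in kept_keys):
            kept_keys.append(key)
            yield record


def output_fastq_similar(in_path, out_path):
    """Hypothetical file wrapper: write only the kept records."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for record in filter_fastq(src):
            dst.writelines(record)
```

Reading four lines at a time also removes the need for the manual next(f) bookkeeping, which makes it harder to get out of step with the headers. Separately, note that in your elif condition & is the bitwise operator and binds more tightly than >= and ==, so and is probably what you intended there. With about 5 million reads, comparing every read against every kept key may be slow, so this sketch is about correctness rather than speed.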