So I wrote this code for finding a sub-string (of length k) in a string. I expect it to check through 466 strings (from a file opened through pc) and, if a sub-string is present, add 1 to the sub-string dictionary proteinCDict, thus basically counting in how many sequences each sub-string occurs. Apparently it is not working:
import operator
proteinCDict = {}
for i in range(0, 466):
    record = []
    pc.readline()
    sequence = pc.readline()
    for j in range(0, len(sequence)-k):
        if((sequence[j:j+k] in proteinCDict) and\
        (sequence[j:j+k] not in record)):
            record.append(sequence[j:j+k])
            proteinCDict[sequence[j:j+k]] += 1
        else:
            record.append(sequence[j:j+k])
            proteinCDict[sequence[j:j+k]] = 1
proteinCDict = sorted(proteinCDict.items(), key=operator.itemgetter(1))
print(proteinCDict)
The problem I'm facing shows up in a particular case: when k=7, the highest frequency of occurrence of any sub-string is lower than when k=8. This should not be the case, since the most frequent sub-string for k=8 contains two sub-strings of length 7. So where am I going wrong?
EDIT: Every alternate line is blank, hence I'm calling readline() twice.
First, a few comments on your code:
The main issue I see is that by looping over range(0, len(sequence)-k), you are skipping the subsequence sequence[len(sequence)-k:].
If you are to open a file, you should use a with statement.
Instead of using a range, you can directly iterate over your file object to get its lines.
For anything related to counting, a collections.Counter is probably better suited.
To track which subsequences have been seen on a single line, a set is a better suited data structure than a list, as it allows constant-time lookup.
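To make the range comment concrete, here is a minimal sketch comparing the two loop bounds on a short string. The helper name kmers and the sample string are only illustrative and do not appear in the original code. It assumes the sequence has no trailing newline; a line returned by readline() normally still carries one, which makes len(sequence) one larger.

```python
def kmers(sequence, k, include_last=True):
    """Return every length-k substring of sequence, in order of position."""
    # The last valid start index is len(sequence) - k, so the range must
    # stop at len(sequence) - k + 1 to include it.
    stop = len(sequence) - k + (1 if include_last else 0)
    return [sequence[j:j + k] for j in range(stop)]

# Bound as in the question: the final substring is missing.
print(kmers("ABCABC", 3, include_last=False))  # ['ABC', 'BCA', 'CAB']

# Bound with the + 1: all four substrings are produced.
print(kmers("ABCABC", 3))  # ['ABC', 'BCA', 'CAB', 'ABC']
```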
The following solution uses a Counter; you can then use Counter.most_common to sort the subsequences by number of appearances.
import collections

def count_in_file(filename, k):
    # Maps each length-k subsequence to the number of lines containing it.
    counter = collections.Counter()
    with open(filename, 'r') as f:
        for line in f:
            line = line.strip()
            # A set keeps each subsequence at most once per line.
            line_sequences = set(line[i:i+k] for i in range(len(line) + 1 - k))
            for seq in line_sequences:
                counter[seq] += 1
    return counter

counter = count_in_file('test_file.txt', 3)
print(counter.most_common())
Example input (test_file.txt):
ABCABC
BCA
Output (the order of tied entries may vary):
[('BCA', 2), ('CAB', 1), ('ABC', 1)]
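As a sanity check on the expectation stated in the question (the top count for a given k should be at least the top count for k+1), here is a small sketch of the same counting idea applied to in-memory lines. The function name count_in_lines and the sample data are hypothetical; the blank lines only mimic the file layout described in the question.

```python
import collections

def count_in_lines(lines, k):
    """Count, for each length-k substring, how many lines contain it."""
    counter = collections.Counter()
    for line in lines:
        line = line.strip()
        # Updating with a set adds 1 per distinct substring on this line;
        # blank or too-short lines yield an empty set and add nothing.
        counter.update({line[i:i + k] for i in range(len(line) - k + 1)})
    return counter

# Hypothetical sample data with blank lines interleaved.
lines = ["ABCABCAB\n", "\n", "BCABCABC\n", "\n", "CABXAB\n"]

# Highest per-line count for each k.
best = {k: count_in_lines(lines, k).most_common(1)[0][1] for k in (3, 4)}
print(best)  # {3: 3, 4: 2}
```

On this data the top count for k=3 is not lower than for k=4, which is the behaviour the question expects.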
Just looking at the logic for now: Python slices exclude the end index, so sequence[j:j+k] already has length k (j:j+k-1 would give only k-1 characters).
I would still suggest setting a variable for the end position, such as endpos = j+k, and using that instead.
Also, if the substring you seek is already in proteinCDict, you shouldn't append it again. You just need to look it up, as you have done.
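On that last point: in the question's loop, the else branch also runs when a substring is already in proteinCDict and already in record (a repeat within the same sequence), which sets its count back to 1. Below is a minimal sketch that handles that case separately. The function name count_sequences and the sample input are illustrative and not part of the original code.

```python
def count_sequences(sequences, k):
    """Count how many sequences contain each length-k substring."""
    counts = {}
    for sequence in sequences:
        sequence = sequence.strip()
        seen = set()  # substrings already counted for this sequence
        for j in range(len(sequence) - k + 1):
            sub = sequence[j:j + k]
            if sub in seen:
                # Repeat within the same sequence: leave the count alone.
                continue
            seen.add(sub)
            # First sighting in this sequence: start at 1 or add 1.
            counts[sub] = counts.get(sub, 0) + 1
    return counts

counts = count_sequences(["ABCABC", "BCA"], 3)
print(counts)  # {'ABC': 1, 'BCA': 2, 'CAB': 1}
```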