Counting lines containing each substring present in a file

Question

I wrote this code to find every sub-string of length k in a string. It should go through 466 strings (read from a file opened as pc) and, whenever a sub-string is present in a string, add 1 to that sub-string's entry in the dictionary proteinCDict, so that it counts how many sequences each sub-string occurs in. It does not seem to work:

import operator

proteinCDict = {}
for i in range(0, 466):
    record = []
    pc.readline()
    sequence = pc.readline()
    for j in range(0, len(sequence)-k):
        if((sequence[j:j+k] in proteinCDict) and\
           (sequence[j:j+k] not in record)):
            record.append(sequence[j:j+k])
            proteinCDict[sequence[j:j+k]] += 1
        else:
            record.append(sequence[j:j+k])
            proteinCDict[sequence[j:j+k]] = 1

proteinCDict =  sorted(proteinCDict.items(), key=operator.itemgetter(1))
print(proteinCDict)

The problem shows up in a particular case: when k=7, the highest frequency of any sub-string is lower than when k=8. This should not happen, since the most frequent sub-string for k=8 contains two sub-strings of length 7, each of which must occur in at least as many sequences. So where am I going wrong?

EDIT: Every alternate line in the file is blank, which is why I call readline() twice.

Answer 1

First, a few comments on your code:

The main issue I see is that by looping over range(0, len(sequence)-k), you are skipping the last subsequence, sequence[len(sequence)-k:].
If you open a file, you should use a with statement.
Instead of using a range, you can iterate directly over your file object to get its lines.
For anything related to counting, a collections.Counter is probably better suited.
To track which subsequences have been seen on a single line, a set is a better suited data structure than a list, as it allows constant-time lookup.
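As a minimal sketch of the first point, the two ranges can be compared on a made-up string. The sample value and k are assumptions for illustration, and the string is assumed to have already had its trailing newline stripped:

```python
sequence = "ABCDE"  # hypothetical sample, newline already stripped
k = 3

# Range used in the question: stops one window short.
skipping = [sequence[j:j + k] for j in range(0, len(sequence) - k)]

# Range with the +1: includes the final window sequence[len(sequence)-k:].
complete = [sequence[j:j + k] for j in range(0, len(sequence) - k + 1)]

print(skipping)  # ['ABC', 'BCD']
print(complete)  # ['ABC', 'BCD', 'CDE']
```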

The following solution uses a Counter. You can then use Counter.most_common to sort the subsequences by number of appearances.

Code

import collections

def count_in_file(filename, k):
    counter = collections.Counter()

    with open(filename, 'r') as f:
        for line in f:
            line = line.strip()

            line_sequences = set(line[i:i+k] for i in range(len(line) + 1 - k))

            for seq in line_sequences:
                counter[seq] += 1

    return counter

counter = count_in_file('test_file.txt', 3)

print(counter.most_common())

Test file

ABCABC

BCA

Output

[('BCA', 2), ('CAB', 1), ('ABC', 1)]
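The property mentioned in the question (the highest per-line count cannot increase as k grows) can be used as a quick sanity check. The sketch below applies the same counting approach to an in-memory list of lines instead of a file; the helper name count_in_lines and the sample lines are made up for illustration:

```python
import collections

def count_in_lines(lines, k):
    """Count, for each substring of length k, how many lines contain it."""
    counter = collections.Counter()
    for line in lines:
        line = line.strip()
        # A set ensures each substring is counted at most once per line.
        counter.update({line[i:i + k] for i in range(len(line) + 1 - k)})
    return counter

lines = ["ABCABC", "", "BCA", "XBCAB"]  # hypothetical data
top = {k: max(count_in_lines(lines, k).values()) for k in (2, 3, 4)}

print(top)  # {2: 3, 3: 3, 4: 2}
```

If the top count ever goes up when k increases, the counting logic is suspect.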

Answer 2

Just looking at the logic for now,

you should do something like j:j+k-1 since j's first position is always 0.

I would suggest setting a variable for the new end position, such as endpos = j+k-1, and using that instead.

Also, if the substring you seek is already in proteinCDict, you shouldn't append it anymore. You just need to look it up, as you have done.
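One way to check both points is a small sketch of the question's loop. It is an illustration under assumptions rather than the asker's exact code: the function name and sample data are made up, and a per-line set replaces the record list. Note that the end index of a Python slice is exclusive, so sequence[j:j+k] already has k characters, while sequence[j:j+k-1] would have k-1:

```python
def count_lines_containing(sequences, k):
    """Return {substring: number of sequences that contain it}."""
    counts = {}
    for sequence in sequences:
        seen = set()  # substrings already counted for this sequence
        for j in range(len(sequence) - k + 1):
            sub = sequence[j:j + k]  # end index is exclusive: len(sub) == k
            if sub in seen:
                continue  # already counted for this sequence, leave the total alone
            seen.add(sub)
            counts[sub] = counts.get(sub, 0) + 1
    return counts

counts = count_lines_containing(["ABCABC", "BCA"], 3)  # hypothetical data
print(counts)  # {'ABC': 1, 'BCA': 2, 'CAB': 1}
```

The key difference from the question's else branch is that a repeated substring never resets an existing count back to 1.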
