
Counting lines containing each substring present in a file

I wrote this code to find substrings of length k in a string. It should go through 466 sequences (read from a file opened as pc ) and, whenever a substring is present in a sequence, add 1 to that substring's entry in the dictionary proteinCDict , so that the result counts how many sequences each substring occurs in. However, it is not working:

import operator

proteinCDict = {}
for i in range(0, 466):
    record = []
    pc.readline()
    sequence = pc.readline()
    for j in range(0, len(sequence)-k):
        if((sequence[j:j+k] in proteinCDict) and\
           (sequence[j:j+k] not in record)):
            record.append(sequence[j:j+k])
            proteinCDict[sequence[j:j+k]] += 1
        else:
            record.append(sequence[j:j+k])
            proteinCDict[sequence[j:j+k]] = 1

proteinCDict =  sorted(proteinCDict.items(), key=operator.itemgetter(1))
print(proteinCDict)

The problem shows up in a particular case: the highest count I get for k=7 is lower than the highest count I get for k=8. This should not be possible, since the most frequent substring of length 8 contains two substrings of length 7, each of which must occur in at least as many sequences. So where am I going wrong?

EDIT: Every other line in the file is blank, which is why I call readline() twice per iteration.

First, a few comments on your code:

  • The main issue I see is that by looping over range(0, len(sequence)-k) , you are skipping the subsequence sequence[len(sequence)-k:] .

  • If you are to open a file, you should use a with statement.

  • Instead of using a range , you can directly iterate over your file object to get its lines.

  • For anything related to counting, a collections.Counter is probably better suited.

  • To track which subsequences have been seen on a single line, a set is a better-suited data structure than a list, as it allows constant-time lookup.
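Applied to the loop in the question, these points might look like the sketch below. It also addresses one further detail: in the original, the else branch runs whenever a substring was already recorded for the current line, which resets its count to 1. The function name count_lines_with_kmers and its lines parameter are assumptions made for illustration; any iterable of strings, such as an open file, should work.

```python
def count_lines_with_kmers(lines, k):
    """Count, for each length-k substring, how many lines contain it."""
    counts = {}
    for line in lines:
        sequence = line.strip()  # drop the trailing newline; blank lines become ''
        seen = set()             # substrings already counted for this line
        # The + 1 includes the final window, sequence[len(sequence)-k:]
        for j in range(len(sequence) - k + 1):
            sub = sequence[j:j + k]
            if sub in seen:
                continue         # already counted for this line; leave the total alone
            seen.add(sub)
            counts[sub] = counts.get(sub, 0) + 1
    return counts
```

Because blank lines yield no substrings, the extra readline() call is not needed with this approach.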

The following solution uses a Counter . You can then call Counter.most_common to sort the subsequences by number of appearances.

Code

import collections

def count_in_file(filename, k):
    counter = collections.Counter()

    with open(filename, 'r') as f:
        for line in f:
            line = line.strip()

            line_sequences = set(line[i:i+k] for i in range(len(line) + 1 - k))

            for seq in line_sequences:
                counter[seq] += 1

    return counter

counter = count_in_file('test_file.txt', 3)

print(counter.most_common())

Test file

ABCABC

BCA

Output

[('BCA', 2), ('CAB', 1), ('ABC', 1)]
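The expectation stated in the question, that the highest count cannot increase as k grows, can serve as a sanity check on any implementation. Here is a minimal sketch, assuming the lines are already in memory; max_line_count is a hypothetical helper introduced only for this check, and the sample lines are made up.

```python
import collections

def max_line_count(lines, k):
    """Return the largest number of lines sharing any single length-k substring."""
    counter = collections.Counter()
    for line in lines:
        line = line.strip()
        # A set ensures each substring is counted at most once per line.
        counter.update({line[i:i + k] for i in range(len(line) + 1 - k)})
    return max(counter.values(), default=0)

lines = ["ABCABC", "", "BCA", "XBCAY"]  # made-up example data
maxima = [max_line_count(lines, k) for k in range(1, 7)]

# Every line containing a length-(k+1) substring also contains its
# length-k prefix, so the maxima must never increase with k.
assert all(a >= b for a, b in zip(maxima, maxima[1:]))
```

If the assertion fails on real data, the counting logic is at fault rather than the data.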

Just looking at the logic for now, you should do something like j:j+k-1 since j's first position is always 0.

I would suggest setting a variable for the new end position, such as endpos = j+k-1 , and using that instead.

Also, if the substring you seek is in proteinCDict , you shouldn't append it anymore. You just need to seek it as you have done.
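When weighing the slicing suggestion above, it may help to recall that Python slices exclude the end index, so sequence[j:j+k] already has length k, while sequence[j:j+k-1] has length k-1. A quick check, using a made-up sequence:

```python
sequence = "ABCDEFG"  # made-up example data
k = 3
j = 0

full = sequence[j:j + k]       # end index is exclusive: characters j .. j+k-1
short = sequence[j:j + k - 1]  # one character shorter than k

print(full, len(full))
print(short, len(short))
```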
