
Python: count words from one file in the lines of another file

I have a file of words that I load into Python with pandas. I want to count how often each of these words occurs in other files and output the count per word per file. I loop over multiple files, which is why I use glob. That part works fine; the problem is the counting.

My input file looks like this:

>1
GTCTTCCGGCGAGCGGGCTTTTCACCCGCTTTATCGTTACTTATGTCAGCATTCGCACTT
CTGATACCTCCAGCAACCCTCACAGGCCACCTTCGCAGGCTTACAGAACGCTCCCCTACC
>2
AAAGAAAGCGTAATAGCTCACTGGTCGAGTCGGCCTGCGCGGAAGATGTAACGGGGCTAA
ACCATGCACCGAAGCTGCGGCAGCGACACTCAGGTGTTGTTGGGTAGGGGAGCGTTCTGT     

The file word.txt contains the words that I am looking for. To keep the example simple, say it contains the words "GTCTT", "CCCGC" and "AACGG".

I want to find these words and count them with the following code:

import pandas as pd
import glob 
from itertools import groupby

word = pd.read_csv("word.txt", delim_whitespace=True,header=None)

for file in glob.glob('input.txt'):
    with open(file) as f:
        for k, g in groupby(f, lambda x: x.startswith('>')):
            if k:
                sequence = next(g).strip('>\n')
            else:
                d1 = list(''.join(line.strip() for line in g))
                counts = Counter()

                if d1 == word:
                    counts[d1] += 1
                    print(counts)

My output should tell me how many times each word is found in each record:

>1
GTCTT 1
CCCGC 1
AACGG 0
>2 
GTCTT 0
CCCGC 0
AACGG 1

Can someone please help me to change the code for the counting? I do not know how to do it.

I changed your code a bit:

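#!/usr/bin/env python3

# Read the sequence file and the whitespace-separated word list.
with open('file.txt', 'r') as f:
    l = f.read().splitlines()
with open('word.txt', 'r') as f:
    words = f.read().split()

# Flatten into [header, sequence, header, sequence, ...].
# Assumes every '>' header is followed by exactly two sequence lines.
nl = [i for s in [[j, l[i+1] + l[i+2]] for i, j in enumerate(l) if '>' in j] for i in s]

for i in nl:
    if '>' in i:
        print(i)
    else:
        # str.count counts non-overlapping occurrences of each word.
        counts = {w: i.count(w) for w in words}
        for k, v in counts.items():
            print('{} {}'.format(k, v))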

In the above code snippet:

  • "word.txt" contains the words as GTCTT CCCGC AACGG (space separated) and
  • "file.txt" the lines with the sequences as described in the post.

The above code gives the following result (the order of the words within each record can vary with the Python version):

>1
AACGG 0
GTCTT 1
CCCGC 1
>2
AACGG 1
GTCTT 0
CCCGC 0
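
If a record can span more or fewer than two sequence lines, grouping the lines with itertools.groupby (as the question attempted) may be more robust. The following is only a sketch under stated assumptions: the helper names read_records and count_words and the overlapping flag are hypothetical, and the sample lines are copied from the question instead of being read from disk.

```python
import re
from itertools import groupby


def read_records(lines):
    """Yield (header, sequence) pairs from FASTA-like lines (hypothetical helper)."""
    header = None
    for is_header, group in groupby(lines, lambda line: line.startswith('>')):
        if is_header:
            # Keep the header line itself, without the trailing newline.
            header = list(group)[-1].strip()
        else:
            # Join every sequence line that follows the header.
            yield header, ''.join(line.strip() for line in group)


def count_words(sequence, words, overlapping=False):
    """Return {word: number of occurrences} for one sequence (hypothetical helper)."""
    if overlapping:
        # A zero-width lookahead lets matches overlap, e.g. "AAA" in "AAAA".
        return {w: len(re.findall('(?=' + re.escape(w) + ')', sequence)) for w in words}
    # str.count only counts non-overlapping occurrences.
    return {w: sequence.count(w) for w in words}


# Sample data taken from the question; a real run would iterate over open files.
sample = [
    '>1\n',
    'GTCTTCCGGCGAGCGGGCTTTTCACCCGCTTTATCGTTACTTATGTCAGCATTCGCACTT\n',
    'CTGATACCTCCAGCAACCCTCACAGGCCACCTTCGCAGGCTTACAGAACGCTCCCCTACC\n',
    '>2\n',
    'AAAGAAAGCGTAATAGCTCACTGGTCGAGTCGGCCTGCGCGGAAGATGTAACGGGGCTAA\n',
    'ACCATGCACCGAAGCTGCGGCAGCGACACTCAGGTGTTGTTGGGTAGGGGAGCGTTCTGT\n',
]
words = ['GTCTT', 'CCCGC', 'AACGG']

results = {header: count_words(seq, words) for header, seq in read_records(sample)}
for header, counts in results.items():
    print(header)
    for word, n in counts.items():
        print(word, n)
```

To process several files, the same two helpers could be called inside a loop such as "for path in glob.glob(pattern)", passing the open file object to read_records in place of the sample list. Whether overlapping matches should be counted depends on what a "word" means for the data, so the flag defaults to the behaviour of the answer above.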


 