Python: count words from one file in the lines of another file

I have a file of words that I load into Python with pandas. I want to count how often each of these words occurs in other files and output the count per word per file. Because I am looping over multiple files, I use glob. That part works fine; the problem is the counting.

My input file looks like this:

>1
GTCTTCCGGCGAGCGGGCTTTTCACCCGCTTTATCGTTACTTATGTCAGCATTCGCACTT
CTGATACCTCCAGCAACCCTCACAGGCCACCTTCGCAGGCTTACAGAACGCTCCCCTACC
>2
AAAGAAAGCGTAATAGCTCACTGGTCGAGTCGGCCTGCGCGGAAGATGTAACGGGGCTAA
ACCATGCACCGAAGCTGCGGCAGCGACACTCAGGTGTTGTTGGGTAGGGGAGCGTTCTGT     

The file word.txt contains the words I am looking for. To keep it simple, suppose it contains the words "GTCTT", "CCCGC" and "AACGG".

I want to look for these words and count them with the following code:

import pandas as pd
import glob 
from itertools import groupby

word = pd.read_csv("word.txt", delim_whitespace=True,header=None)

for file in glob.glob('input.txt'):
    with open(file) as f:
        for k, g in groupby(f, lambda x: x.startswith('>')):
            if k:
                sequence = next(g).strip('>\n')
            else:
                d1 = list(''.join(line.strip() for line in g))
                counts = Counter()

                if d1 == word:
                    counts[d1] += 1
                    print(counts)

My output should tell me how many times each word is found:

>1
GTCTT 1
CCCGC 1
AACGG 0
>2 
GTCTT 0
CCCGC 0
AACGG 1

Can someone please help me change the counting code? I do not know how to do it.

I changed your code a bit:

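#!/usr/bin/env python3

# Read the sequence file and the space-separated word list.
with open('file.txt') as f:
    l = f.read().splitlines()
with open('word.txt') as f:
    words = f.read().split()

# Flatten into [header, sequence, header, sequence, ...].
# Note: this assumes every '>' header is followed by exactly two sequence lines.
nl = [i for s in [[j, l[i+1] + l[i+2]] for i, j in enumerate(l) if '>' in j] for i in s]

for i in nl:
    if '>' in i:
        print(i)
    else:
        # str.count counts non-overlapping occurrences of each word.
        counts = {w: i.count(w) for w in words}
        for k, v in counts.items():
            print('{} {}'.format(k, v))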

In the above code snippet:

  • "word.txt" contains the words GTCTT CCCGC AACGG (space separated), and
  • "file.txt" contains the lines with the sequences as described in the post.

The above code gives the following result (the order of the words within each record can vary with the Python version):

>1
AACGG 0
GTCTT 1
CCCGC 1
>2
AACGG 1
GTCTT 0
CCCGC 0
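
If a record can span more or fewer than two sequence lines, a variant along the following lines might be more robust. It is only a sketch under a few assumptions: `count_words_per_record` is a made-up helper name, every record starts with a line beginning with `>`, and overlapping matches do not need to be counted separately (`str.count` counts non-overlapping occurrences).

```python
from itertools import groupby


def count_words_per_record(lines, words):
    """Return {header: {word: count}} for '>'-delimited records.

    Hypothetical helper for illustration; it accepts any iterable of
    lines (for example an open file object).
    """
    results = {}
    header = None
    # groupby splits the input into alternating runs of header lines
    # and sequence lines.
    for is_header, group in groupby(lines, lambda line: line.startswith('>')):
        if is_header:
            # If several headers follow each other, keep the last one.
            header = list(group)[-1].strip()
        elif header is not None:
            # Join all lines of the record, however many there are.
            sequence = ''.join(line.strip() for line in group)
            results[header] = {w: sequence.count(w) for w in words}
    return results


# Sample data taken from the question.
sample_lines = [
    ">1\n",
    "GTCTTCCGGCGAGCGGGCTTTTCACCCGCTTTATCGTTACTTATGTCAGCATTCGCACTT\n",
    "CTGATACCTCCAGCAACCCTCACAGGCCACCTTCGCAGGCTTACAGAACGCTCCCCTACC\n",
    ">2\n",
    "AAAGAAAGCGTAATAGCTCACTGGTCGAGTCGGCCTGCGCGGAAGATGTAACGGGGCTAA\n",
    "ACCATGCACCGAAGCTGCGGCAGCGACACTCAGGTGTTGTTGGGTAGGGGAGCGTTCTGT\n",
]
sample_words = ["GTCTT", "CCCGC", "AACGG"]

for header, counts in count_words_per_record(sample_lines, sample_words).items():
    print(header)
    for word, n in counts.items():
        print(word, n)
```

To process several files, as in the question, the same function could be called on each open file inside a `glob.glob(...)` loop.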
