
Counting gene segments in Python and printing them in columns

I need to convert a text file into a table of species and their gene segment counts. For this I want to create a dictionary whose keys are the species names I find with a pattern. Every key should hold three items (counters), each starting at 0. With further patterns I want to look for the gene segments, and whenever one is found, increase the corresponding count.

I'm searching for three different gene segments, which is why I only want to increase item1, item2 or item3. Is there a way to do this with Python?

That's the code I've written so far, but I don't know how to continue.

matrix = {}
pattern = re.compile(r"[A-Za-z ]*")
pattern_v = re.compile(r";[A_Z]+V[0-9]?;")
pattern_d = re.compile(r";[A_Z]+D[0-9]?;")
pattern_j = re.compile(r";[A_Z]+J[0-9]?;")
for i in file.readlines():
    name = pattern.search(i)
    if pattern_v.search:
        if name.group() not in matrix:
            matrix.update(name.group(), (1,0,0))
        else:
            matrix[(name.group()[0]] = matrix[(name.group()[0]]+1
...

As you can see, if pattern_v is found, I want to increase the item at position zero. I know that the last statement doesn't work; I only wrote it to explain what I want to do.

EDIT: I got the algorithm working, but now I have the problem that I can't print it the way I want.

{'Mus cookii': [0, 0, 0], 'Ovis aries': [0, 7, 9], 'Camelus dromedarius': [2, 0, 0], 'Danio rerio': [1, 1, 5], 'Mus saxicola': [0, 0, 0], 'Homo sapiens': [21, 6, 33], 'Rattus norvegicus': [0, 1, 12], 'Sus scrofa': [0, 5, 13], 'Vicugna pacos': [0, 9, 7], 'Macaca nemestrina': [0, 0, 0], 'Mus spretus': [4, 0, 2], 'Mus musculus': [30, 5, 28], 'Mus minutoides': [0, 0, 0], 'Oncorhynchus mykiss': [0, 11, 16], 'Canis lupus familiaris': [4, 2, 0], 'Bos taurus': [2, 5, 12], 'Cercocebus atys': [0, 0, 0], 'Oryctolagus cuniculus': [0, 0, 10], 'Rattus rattus': [0, 0, 0], 'Ornithorhynchus anatinus': [0, 4, 9], 'Macaca mulatta': [1, 3, 16], 'Papio anubis anubis': [0, 0, 0], 'Macaca fascicularis': [0, 0, 0], 'Mus pahari': [0, 0, 0]}

is the output, but I need to make it easier to read. The idea is to produce output with columns (name, V, D, J). I tried:

def printStatistics(dict):
    for i in range(0,len(dict)):
        print(" {0:30s}{1:30d}{2:30d}{3:30d}".format(dict[i],dict[i]    [0],dict[i][1],dict[i][2]), sep = "")

but I get

"TypeError: non-empty format string passed to object.__format__"

You can make your algorithm work with collections.defaultdict:

input data

import re
from collections import defaultdict
import numpy as np

data= '''Bos taurus;TRGV8-1;F;Bos taurus T cell receptor gamma variable 8-1;1;4;4q3.1;AY644517;-;
Bos taurus;TRGV8-2;(F) F;Bos taurus T cell receptor gamma variable 8-2;2;4;4q3.1;AY644517;-;
Camelus dromedarius;TRDV1S3;F;Camelus dromedarius T cell receptor delta variable 1S3;1;-;-;FN298223;-;
Camelus dromedarius;TRDV1S4;F;Camelus dromedarius T cell receptor delta variable 1S4;2;-;-;FN298224;-;
Canis lupus familiaris;TRBD2;F;Canis lupus familiaris T cell receptor beta diversity 2;1;16;-;HE653929;-;'''
patterns = [
    re.compile(r"TR.V"),
    re.compile(r"TR.D"),
    re.compile(r"TR.J")
]
result = defaultdict(lambda:np.array([0,0,0]))

script

for line in data.splitlines():
    result[line.split(';')[0]]+=np.array([len(pattern.findall(line)) for pattern in patterns])
print(result)

output

defaultdict(<function <lambda> at 0x7f622f81c140>, {'Camelus dromedarius': array([2, 0, 0]), 'Canis lupus familiaris': array([0, 1, 0]), 'Bos taurus': array([2, 0, 0])})

defaultdict works like a dictionary, but a missing key is initialized on first access by calling a factory of your choice. A factory such as lambda: [0,0,0] (here, lambda: np.array([0,0,0])) lets you increment the counts right away instead of having to check for the key, update and then increment.
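
A minimal sketch of that behaviour, using plain lists instead of numpy arrays. The species names are taken from the sample data; the variable name `counts` is only illustrative.

```python
from collections import defaultdict

# The factory is called with no arguments the first time a missing key is accessed.
counts = defaultdict(lambda: [0, 0, 0])

# No membership check or update() is needed before incrementing.
counts["Bos taurus"][0] += 1              # a V segment
counts["Bos taurus"][0] += 1              # another V segment
counts["Canis lupus familiaris"][1] += 1  # a D segment

print(dict(counts))
# {'Bos taurus': [2, 0, 0], 'Canis lupus familiaris': [0, 1, 0]}
```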

I decided to work with numpy arrays because they support element-wise (vector-like) addition, which makes the algorithm prettier; you could also do it without numpy.
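
For comparison, here is a rough sketch of the numpy-free variant, assuming the same semicolon-separated layout as the sample data (only two of its lines are repeated here). The element-wise addition is done by hand with zip; the names `species` and `hits` are illustrative.

```python
import re
from collections import defaultdict

# Two lines in the same layout as the sample data.
data = """Bos taurus;TRGV8-1;F;Bos taurus T cell receptor gamma variable 8-1;1;4;4q3.1;AY644517;-;
Canis lupus familiaris;TRBD2;F;Canis lupus familiaris T cell receptor beta diversity 2;1;16;-;HE653929;-;"""

patterns = [re.compile(r"TR.V"), re.compile(r"TR.D"), re.compile(r"TR.J")]
result = defaultdict(lambda: [0, 0, 0])

for line in data.splitlines():
    species = line.split(";")[0]
    hits = [len(p.findall(line)) for p in patterns]
    # Add the new hits to the running counts, position by position.
    result[species] = [old + new for old, new in zip(result[species], hits)]

print(dict(result))
# {'Bos taurus': [1, 0, 0], 'Canis lupus familiaris': [0, 1, 0]}
```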

Found a solution now with defaultdict:

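import re
from collections import defaultdict

def find_name(file):
    # Map each species name to its [V, D, J] segment counts.
    gene_count = defaultdict(lambda: [0, 0, 0])
    # The species name is the leading run of letters and spaces on each line.
    pattern = re.compile(r"[A-Za-z ]*")
    # Gene names are delimited by semicolons; ";" needs no escaping in a regex.
    pattern_v = re.compile(r";[A-Z]+V[0-9]?;")
    pattern_d = re.compile(r";[A-Z]+D[0-9]?;")
    pattern_j = re.compile(r";[A-Z]+J[0-9]?;")
    for line in file:
        name = pattern.search(line).group()
        # Register every species (except the header row) even if nothing matches.
        if name not in gene_count and name != "Species":
            gene_count[name] = [0, 0, 0]
        if pattern_v.search(line):
            gene_count[name][0] += 1
        elif pattern_d.search(line):
            gene_count[name][1] += 1
        elif pattern_j.search(line):
            gene_count[name][2] += 1
    return gene_count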

PRINTING:

def printStatistics(counts):
    # Header row, then one left-aligned row per species.
    print(" {0:<30s}{1:<15s}{2:<15s}{3:<15s}".format("Species", "V Count", "D Count", "J Count"))
    for species, (v, d, j) in counts.items():
        print(" {0:<30s}{1:<15d}{2:<15d}{3:<15d}".format(species, v, d, j))
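
If the table should come out in a stable order, one possible variant builds the rows as strings and sorts by species name. This is only a sketch: `format_statistics` is a hypothetical helper, and the three sample entries are copied from the dictionary output shown in the question.

```python
def format_statistics(counts):
    """Return the table as a list of lines (hypothetical helper, not from the post)."""
    row = "{:<30s}{:<15}{:<15}{:<15}"
    lines = [row.format("Species", "V Count", "D Count", "J Count")]
    # Sort by species name so the output order does not depend on insertion order.
    for species in sorted(counts):
        v, d, j = counts[species]
        lines.append(row.format(species, v, d, j))
    return lines

sample = {"Homo sapiens": [21, 6, 33], "Bos taurus": [2, 5, 12], "Mus spretus": [4, 0, 2]}
print("\n".join(format_statistics(sample)))
```

Returning the lines instead of printing them directly also makes it easy to write the same table to a file.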

Thx 4 help!
