How to change values from a file using a dictionary in Python

I'm doing a biology degree and feel like I've been thrown in at the deep end with Python, as I've never coded before and the 'teaching' was pretty much non-existent. Anyway, they've given us a file of gene sequences, which looks something like:

En123, ATGCCGAATA

En124, ATGCCAGTAT

but much longer, with many more genes. They want it converted into protein sequences. So far, I've got...

with open('DNA_sequences.csv', 'r') as f:
    for line in f:
        columns = line.rstrip("\n").split(",")  # remove the end-of-line character and split at commas to produce a list
        ensemblID = columns[0]  # the Ensembl ID is the first element in the list
        gene_sequence = columns[1].strip()  # the gene sequence is the second element; strip the space after the comma

I wasn't sure if I needed the columns or not.

I've also made a dictionary for the protein sequence, mapping each codon to its corresponding amino acid.

protein_sequence = {'TTT': 'F', 'CTT': 'L', 'GAT': 'D'}  # etc.

So I'm wondering how to split the gene sequence in my file into codons, then pass them through the dictionary so that I get the sequence of amino acids.

i.e. gene_sequence = TTTCTTGAT to protein_sequence = FLD

(Sorry for being so incompetent!)

To load the CSV, I'd use the csv module like so:

import csv

with open(filepath) as csvFile:
    reader = csv.reader(csvFile)
    data = [row for row in reader]
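
As a rough sketch of how the loaded rows could be turned into something easier to work with: the sample lines in the question have a space after the comma, so passing `skipinitialspace=True` to `csv.reader` (a real option of the csv module) keeps that space out of the sequence. The helper name `read_sequences` and the in-memory `sample` standing in for the real file are assumptions made purely for illustration.

```python
import csv
import io

# Stand-in for the real file; the two rows are the sample lines from the question.
sample = io.StringIO("En123, ATGCCGAATA\nEn124, ATGCCAGTAT\n")


def read_sequences(handle):
    """Return {ensembl_id: gene_sequence} from a two-column CSV file handle.

    Hypothetical helper, not part of the csv module.
    """
    # skipinitialspace=True drops the space that follows each comma
    reader = csv.reader(handle, skipinitialspace=True)
    sequences = {}
    for row in reader:
        if len(row) < 2:  # skip blank or malformed lines
            continue
        sequences[row[0].strip()] = row[1].strip().upper()
    return sequences


sequences = read_sequences(sample)
print(sequences)
```

With a real file, the same function should work on the handle from `open('DNA_sequences.csv', newline='')` in place of `sample`.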

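Then, to convert the gene sequence:

geneSeq = "TTTCTTGAT"  # length must be a multiple of 3

# split the sequence into 3-letter codons
codons = [geneSeq[i:i+3] for i in range(0, len(geneSeq), 3)]

# look up each codon in the protein_sequence dictionary from the question
proteinSequenceString = ""
for codon in codons:
    proteinSequenceString += protein_sequence[codon]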

You can iterate over gene_sequence in chunks of 3 and look up the codons in your dictionary:

>>> gene_sequence = 'TTTCTTGAT'
>>> protein_sequence = {'TTT': 'F', 'CTT': 'L', 'GAT': 'D'}
>>> ''.join(protein_sequence[gene_sequence[i:i+3]] for i in range(0, len(gene_sequence), 3))
'FLD'
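
One caveat with a plain dictionary lookup is that it raises a KeyError if the sequence length is not a multiple of 3 (the sample sequences in the question are 10 bases long) or if a codon is missing from the dictionary. A minimal sketch of a more forgiving version follows; the function name `translate`, the choice to ignore leftover bases, and the use of 'X' for codons not in the table are all assumptions for illustration, not something the assignment specifies.

```python
def translate(gene_sequence, codon_table, unknown="X"):
    """Translate a DNA string into one-letter amino acid codes.

    Hypothetical helper: leftover bases at the end (fewer than 3) are ignored,
    and codons missing from codon_table become `unknown`.
    """
    # Only walk over complete codons
    usable_length = len(gene_sequence) - len(gene_sequence) % 3
    protein = []
    for i in range(0, usable_length, 3):
        codon = gene_sequence[i:i + 3]
        protein.append(codon_table.get(codon, unknown))
    return "".join(protein)


# Partial table from the question; a full table would have 64 entries.
codon_table = {'TTT': 'F', 'CTT': 'L', 'GAT': 'D'}

print(translate("TTTCTTGAT", codon_table))   # FLD
print(translate("TTTCTTGATA", codon_table))  # FLD (the trailing 'A' is ignored)
```

Whether silently dropping leftover bases is acceptable depends on the data; raising an error instead may be the safer choice if every sequence is expected to be a whole number of codons.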
