Basic DNA Coding Exercise

I recently failed an interview in which I was given a Python coding question out of the blue. I'm currently learning Python, and if I come across the same or a similar question again, I want to be able to answer it.

The question was as follows:

Write a function which takes as its input a string containing the letters [A, C, G, T] and outputs all the 3-letter subsequences found in the input and the frequency with which they occur. For example, if the input string was "ACTACTTAC", the output would be something like:

ACT: 2
CTA: 1
TAC: 2
CTT: 1
TTA: 1

I came up with some ideas after the fact. Does a solution like this work, or is there a better way of doing it?

def Determine_DNA(dna_list):
    n = len(dna_list[0])
    A = [0]*n
    T = [0]*n
    G = [0]*n
    C = [0]*n
    for dna in dna_list:
        for index, base in enumerate(dna):
            if base == 'A':
                A[index] += 1
            elif base == 'C':
                C[index] += 1
            elif base == 'G':
                G[index] += 1
            elif base == 'T':
                T[index] += 1
    return A, C, G, T

@mousetail mentioned using collections.Counter in the comments. Here is an example of that:

import collections

def dna_freq(dnaseq):
    seq_list = []
    for i in range(2, len(dnaseq)):
        seq_list.append(dnaseq[i-2:i+1])
    return dict(collections.Counter(seq_list))

print(dna_freq("ACTACTTAC"))

{'ACT': 2, 'CTA': 1, 'TAC': 2, 'CTT': 1, 'TTA': 1}
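
The same idea extends to any window length. Below is a minimal sketch, assuming a hypothetical helper name `kmer_freq` and a hypothetical parameter `k` (neither appears in the original answer). It feeds the windows straight into `Counter` with a generator expression instead of building a list first.

```python
from collections import Counter

def kmer_freq(dnaseq, k=3):
    """Count every overlapping k-letter window in dnaseq (hypothetical helper)."""
    # A string of length n has n - k + 1 windows of length k;
    # range() is empty when the string is shorter than k.
    return Counter(dnaseq[i:i + k] for i in range(len(dnaseq) - k + 1))

# Print in the "ACT: 2" style used in the question
for kmer, count in kmer_freq("ACTACTTAC").items():
    print(f"{kmer}: {count}")
```

An input shorter than `k` simply yields an empty `Counter`, so no special case is needed for very short strings.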

That could be code-golfed, if hard-to-read code is your thing:

def dna_freq(dnaseq):
    return dict(collections.Counter([dnaseq[i-2:i+1] for i in range(2, len(dnaseq))]))

Here is an example using zip, from the comments, which feels more approachable than a list comprehension. It gives slightly different but totally usable output: the keys are tuples such as ('A', 'C', 'T') rather than strings.

def dna_freq(dnaseq):
    return dict(collections.Counter(zip(dnaseq, dnaseq[1:], dnaseq[2:])))
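
If string keys are preferred, the tuples produced by zip can be joined back together. This is a small sketch only; the name `dna_freq_zip` is made up here to keep it separate from the versions above.

```python
from collections import Counter

def dna_freq_zip(dnaseq):
    # zip stops at the shortest input, so it yields one tuple
    # per complete 3-letter window and nothing for the tail
    triples = zip(dnaseq, dnaseq[1:], dnaseq[2:])
    # "".join turns ('A', 'C', 'T') back into 'ACT'
    return dict(Counter("".join(t) for t in triples))

print(dna_freq_zip("ACTACTTAC"))
# {'ACT': 2, 'CTA': 1, 'TAC': 2, 'CTT': 1, 'TTA': 1}
```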

This works for your case:

dna = "ACTACTTAC"
LEN = 3
d = set()  # subsequences already printed

# + 1 so the final window (the last LEN letters) is included
for i in range(len(dna) - LEN + 1):
    k = dna[i:i+LEN]
    if k not in d:
        print(f'{k}: {dna.count(k)}')
        d.add(k)

Output:

ACT: 2
CTA: 1
TAC: 2
CTT: 1
TTA: 1
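
One caveat with this approach: `str.count` counts non-overlapping matches, so it can under-report when a 3-letter sequence overlaps itself. That does not happen in "ACTACTTAC", but it does in an input such as "AAAA". The sketch below uses that made-up input to show the difference, comparing `str.count` with a count over every window position.

```python
from collections import Counter

dna = "AAAA"  # made-up input chosen so that "AAA" overlaps itself
LEN = 3

# str.count resumes searching after the end of each match,
# so it finds only one "AAA" here
print(dna.count("AAA"))  # 1

# Counting every window position finds both overlapping occurrences
windows = Counter(dna[i:i + LEN] for i in range(len(dna) - LEN + 1))
print(windows["AAA"])  # 2
```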
