
Repeated string in a DNA Sequence

I am trying to solve the following problem.

Problem: All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

My solution is below.

def repeated_dna_sequence(seq):
  # Collect every length-10 substring; report those seen more than once
  repeated_sub_str = set()
  seen = set()

  i = 0
  # <= (not <) so the final window, seq[len(seq)-10:], is also checked
  while i + 10 <= len(seq):
    sub_str = seq[i:i+10]
    if sub_str not in seen:
      seen.add(sub_str)
    else:
      repeated_sub_str.add(sub_str)
    i += 1
  return repeated_sub_str

For the input sequence input_str = "AAAAACCCCCAAAAACCCCCAAAAAGGGTTT", my code returns ['AAAAACCCCC', 'AAAACCCCCA', 'AAACCCCCAA', 'AACCCCCAAA', 'ACCCCCAAAA', 'CCCCCAAAAA']. However, for the same question with the same input string in the LeetCode problem, the output is given as ['AAAAACCCCC', 'CCCCCAAAAA']. If anybody can shed some light on this issue, that would be very helpful.

Thank you.

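Your input and the LeetCode input are not the same string: the second run of Cs has five Cs in yours and six in LeetCode's. In your string the first 25 characters repeat with period 10, so more 10-letter windows occur twice, which gives you more hits.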

Your Input: AAAAACCCCCAAAAACCCCCAAAAAGGGTTT
Leet Input: AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT
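
To check this yourself, here is a minimal sketch that counts every 10-letter window in both strings. The helper name repeated_windows and the use of collections.Counter are my own choices for illustration, not part of the original solution or of LeetCode's reference answer.

```python
from collections import Counter


def repeated_windows(seq, k=10):
    """Return, in sorted order, the k-letter substrings occurring more than once."""
    # range(len(seq) - k + 1) covers every window, including the last one;
    # it is empty when seq is shorter than k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return sorted(s for s, n in counts.items() if n > 1)


your_input = "AAAAACCCCCAAAAACCCCCAAAAAGGGTTT"    # five Cs in the second run
leet_input = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"   # six Cs in the second run

print(len(your_input), repeated_windows(your_input))
print(len(leet_input), repeated_windows(leet_input))
```

The two strings have lengths 31 and 32. The first yields the six substrings from the question and the second yields only 'AAAAACCCCC' and 'CCCCCAAAAA', so the discrepancy comes from the input, not from the algorithm.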
