Repeated strings in a DNA sequence

Question

I am trying to solve the following problem.

Problem: All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

My solution is below.

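def repeated_dna_sequence(seq):
  # Slide a length-10 window over seq; remember each substring the first
  # time it appears and record it as repeated on any later appearance.
  repeated_sub_str = set()
  seen = dict()

  i = 0
  # "<=" (not "<") so the final window seq[len(seq)-10:] is also examined
  while i + 10 <= len(seq):
    sub_str = seq[i:i+10]
    if sub_str not in seen:
      seen[sub_str] = 1
    else:
      repeated_sub_str.add(sub_str)
    i += 1
  return repeated_sub_str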

For the input sequence input_str = "AAAAACCCCCAAAAACCCCCAAAAAGGGTTT", my code returns ['AAAAACCCCC', 'AAAACCCCCA', 'AAACCCCCAA', 'AACCCCCAAA', 'ACCCCCAAAA', 'CCCCCAAAAA']. However, for the same question in the LeetCode problem, with what I believe is the same input string, the expected output is given as ['AAAAACCCCC', 'CCCCCAAAAA']. If anybody could shed some light on this, that would be very helpful.

Thank you.

Answer 1

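Your input is not the same as the LeetCode input: the LeetCode string has one extra C in the second run of Cs (six instead of five), so it is 32 characters long rather than 31. Without that extra C, your string repeats with period 10 across its first 25 characters, so six different windows occur twice instead of two. That is why you get more hits.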

Your Input: AAAAACCCCCAAAAACCCCCAAAAAGGGTTT
Leet Input: AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT
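
One way to check this is to run the same window scan on both strings and compare the results. The following is only a minimal sketch: `repeated_windows` and its `k` parameter are names invented for this example and are not part of the original post or of LeetCode.

```python
def repeated_windows(seq, k=10):
    """Return the set of length-k substrings that occur more than once in seq."""
    seen = set()
    repeated = set()
    # range(len(seq) - k + 1) visits every window, including the last one
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in seen:
            repeated.add(window)
        else:
            seen.add(window)
    return repeated


your_input = "AAAAACCCCCAAAAACCCCCAAAAAGGGTTT"
leet_input = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"

print(len(your_input), len(leet_input))      # 31 32
print(sorted(repeated_windows(your_input)))  # the six sequences from the question
print(sorted(repeated_windows(leet_input)))  # ['AAAAACCCCC', 'CCCCCAAAAA']
```

Comparing the two lengths first is a cheap sanity check when two long strings look identical at a glance.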

Answer 1 posted 2020-01-17 06:57:39
