Finding (L, t)-clumps

Question

I'm trying to find clumps in Clojure. Basically, I need to find all k-length substrings (k-mers) that occur t times within a window of length L in a genome. I've implemented what I think is the solution, but the system I'm using to check it (beta.stepic.org) tells me it's wrong. Can you spot where I'm messing up?

My solution goes as follows: find all top-ranking k-mers and their starting indices, which are in ascending order. Then I partition the indices into sliding groups of t, since t is the number of occurrences required, and for each group I take the difference between the last and first index plus an offset of k. The offset accounts for the full length of the last k-mer, since all the k-mers in a group have to fit inside the L-window. Where's the bug?

Clump Finding Problem: Find patterns forming clumps in a string.

 Input: A string Genome, and integers k, L, and t.
 Output: All distinct k-mers forming (L, t)-clumps in Genome.

Sample Input:

genome: CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA

 k: 5 
 L: 50 
 t: 4

Sample Output:

CGACA GAAGA

(defn get-indices [source target]
  "Returns the indices for the substring target
   found in source in ascending order. This includes overlaps."
  (let
    [search   (java.util.regex.Pattern/compile (str "(?=(" target "))"))
     matcher  (re-matcher search source)
     not-nil? (complement nil?)]

    (defn inner [matcher]
      (if (not-nil? (re-find matcher))
        (cons (.start matcher) (inner matcher))))
    (inner matcher)))

(defn get-frequent-kmer [source k]
  "Gets the most frequent k-mers of size k from source"
  (let [max-val (val (apply max-key val (frequencies (partition k 1 source))))]
    (map first (filter #(= (val %) max-val)
      (frequencies (map (partial apply str) (partition k 1 source)))))))


(defn find-clumps [genome k L t]
  (for [k-mer (get-frequent-kmer genome k)]
    (let [indices (get-indices genome k-mer)]
      (if (some true? (map #(<= (+ k (- (last %) (first %))) L)
                           (partition t 1 indices)))
        k-mer))))

Answer 1

Besides code style, which has a couple of things that could be improved, the main problem I see is that you're filtering k-mers with max-key val and not considering t at all in the initial filtering.

When you look for the most frequent k-mers, you keep only the ones with the highest count:

  (apply max-key val (frequencies (partition k 1 source)))

Since you filter by max-val:

  (filter #(= (val %) max-val)

And you're only analyzing those:

  (for [k-mer (get-frequent-kmer genome k)]

The problem is that if t is 4 but some 5-mers repeat more than 4 times, you're leaving the ones that repeat exactly 4 times out of the equation.
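One way to address this, as a minimal sketch rather than a definitive fix, is to select every k-mer whose count is at least t instead of only those with the maximum count. The name candidate-kmers is hypothetical and not part of the question's code.

```clojure
(defn candidate-kmers
  "Returns every distinct k-mer that occurs at least t times in source.
   Hypothetical helper, meant as a stand-in for get-frequent-kmer."
  [source k t]
  (->> (partition k 1 source)      ; all overlapping windows of length k
       (map #(apply str %))        ; turn each seq of chars back into a string
       frequencies                 ; map of k-mer -> occurrence count
       (filter #(>= (val %) t))    ; keep counts of t or more, not just the max
       (map key)))
```

If this replaces get-frequent-kmer, find-clumps has to pass t through as well; the window check on the indices can stay as it is.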

Answer 2

Here is some working code:

(require 'clojure.set) ; needed for clojure.set/union in find-clump

(defn k-mers
  "Returns a seq of all k-mers in text."
  [k text]
  (map #(apply str %) (partition k 1 text)))

(defn most-frequent-k-mers 
  "Returns a seq of k-mers in text appearing at least t times."
  [k t text]
  (->> (k-mers k text)
       (frequencies)
       (filter #(<= t (second %)))
       (map first)))

(defn find-clump
  "Finds k-mers forming (L, t) clumps in text."
  [k L t text]
  (let [windows (partition L 1 text)]
    (->> windows 
         (map #(most-frequent-k-mers k t %))
         (map set)
         (apply clojure.set/union))))

I think you should start from here.
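For comparison, here is a hedged sketch of an index-based variant that keeps the question's window check on start indices but drops the max-count filter. It assumes Clojure 1.7 or later (for update) and a genome passed as a string; the names kmer-positions and clumps are made up for this illustration.

```clojure
(defn kmer-positions
  "Maps each distinct k-mer in genome to a vector of its start indices,
   in ascending order. Hypothetical helper."
  [genome k]
  (reduce (fn [acc i]
            ;; append index i to the vector stored under the k-mer starting at i
            (update acc (subs genome i (+ i k)) (fnil conj []) i))
          {}
          (range (inc (- (count genome) k)))))

(defn clumps
  "Returns the set of k-mers that have t occurrences fitting inside
   some window of length L."
  [genome k L t]
  (set
    (for [[kmer idxs] (kmer-positions genome k)
          ;; slide over every run of t consecutive occurrences; the run fits
          ;; in a window when (last start + k - first start) is at most L
          :when (some (fn [group]
                        (<= (+ k (- (last group) (first group))) L))
                      (partition t 1 idxs))]
      kmer)))

;; Sample input from the question:
(clumps "CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA"
        5 50 4)
;; => a set containing "CGACA" and "GAAGA"
```

In the sample, CGACA starts at indices 6, 22, 38 and 43, so its four occurrences span 43 + 5 - 6 = 42 characters, which fits in a window of 50. Working on indices avoids re-counting k-mers for every window, at the cost of holding all positions in memory.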

2 answers

Answer 1
1, accepted, 2013-11-20 16:38:09

Answer 2
0, 2013-11-20 21:07:41