Find (L,t)-clump

I'm trying to find clumps in Clojure. Basically, I need to find all k-length substrings (k-mers) that occur t times within a window of size L in a genome. I've implemented what I think is the solution, but the system I'm using to check it (beta.stepic.org) tells me it is wrong, so I believe there are bugs in it. Can you spot where I'm messing up?

My solution goes as follows: find all top-ranking k-mers and their starting indices, which are in ascending order. Afterwards, I partition the indices into groups of t (the number of occurrences required) and, for each group, take the difference between the last and first index plus an offset of k. The offset is there because every k-mer has to fit inside the L-window, so the span is extended to cover the last k-mer. Where's the bug?

Clump Finding Problem: Find patterns forming clumps in a string.

 Input: A string Genome, and integers k, L, and t.
 Output: All distinct k-mers forming (L, t)-clumps in Genome.

Sample Input:

genome: CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA

 k: 5 
 L: 50 
 t: 4

Sample Output:

CGACA GAAGA

(defn get-indices [source target]
  "Returns the indices for the substring target
   found in source in ascending order. This includes overlaps."
  (let
    [search   (java.util.regex.Pattern/compile (str "(?=(" target "))"))
     matcher  (re-matcher search source)
     not-nil? (complement nil?)]

    (defn inner [matcher]
      (if (not-nil? (re-find matcher))
        (cons (.start matcher) (inner matcher))))
          (inner matcher)))

(defn get-frequent-kmer [source k]
  "Gets the most frequent k-mers of size k from source"
  (let [max-val (val (apply max-key val (frequencies (partition k 1 source))))]
    (map first (filter #(= (val %) max-val)
      (frequencies (map (partial apply str) (partition k 1 source)))))))


(defn find-clumps [genome k L t]
  (for [k-mer (get-frequent-kmer genome k)]
    (let [indices (get-indices genome k-mer)]
      (if (some true? (map #(<= (+ k (- (last %) (first %))) L)
        (partition t 1 indices))) k-mer))))

Besides code style, which has a couple of things that could be improved, the main problem I see is that you're filtering k-mers on max-key val and not considering t at all in the initial filtering.

When you find the most frequent k-mers, you keep only the ones with the highest count:

  (apply max-key val (frequencies (partition k 1 source)))

Since you filter by max-val:

  (filter #(= (val %) max-val)

And you're only analyzing those:

  (for [k-mer (get-frequent-kmer genome k)]

The problem is that if t is 4 but some 5-mers repeat more than 4 times, you're leaving the ones repeated exactly 4 times out of the equation.
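To make the difference concrete, here is a small sketch on a made-up string with k = 2 and t = 2. The helper names kmer-counts, max-only, and at-least are hypothetical and not part of the code in the question; they only contrast "keep the maximum count" with "keep every count of at least t".

```clojure
;; Hypothetical helper: counts every k-mer (as a string) in text.
(defn kmer-counts [k text]
  (frequencies (map #(apply str %) (partition k 1 text))))

;; Keeps only the k-mers whose count equals the maximum count,
;; which is what filtering on max-val does.
(defn max-only [counts]
  (let [m (apply max (vals counts))]
    (set (for [[kmer n] counts :when (= n m)] kmer))))

;; Keeps every k-mer that occurs at least t times.
(defn at-least [t counts]
  (set (for [[kmer n] counts :when (>= n t)] kmer)))

;; 2-mers of "AAAATCTC": AA x3, AT x1, TC x2, CT x1
(def counts (kmer-counts 2 "AAAATCTC"))

(max-only counts)   ; => #{"AA"}
(at-least 2 counts) ; => #{"AA" "TC"}
```

With t = 2, "TC" qualifies, but the max-based filter drops it because "AA" occurs more often.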

Here is some working code:

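;; clojure.set is not loaded by default; find-clump below calls clojure.set/union.
(require 'clojure.set)

(defn k-mers 
  "Returns a seq of all k-mers in text."
  [k text]
  (map #(apply str %) (partition k 1 text)))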

(defn most-frequent-k-mers 
  "Returns a seq of k-mers in text appearing at least t times."
  [k t text]
  (->> (k-mers k text)
       (frequencies)
       (filter #(<= t (second %)))
       (map first)))

(defn find-clump
  "Finds k-mers forming (L, t) clumps in text."
  [k L t text]
  (let [windows (partition L 1 text)]
    (->> windows 
         (map #(most-frequent-k-mers k t %))
         (map set)
         (apply clojure.set/union))))

I think you should start from here.
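The window-by-window version above recounts the k-mers for every window, which can get slow on a long genome. One possible alternative, given only as a sketch, keeps the index-based check from the question but applies it to every k-mer instead of only the most frequent ones. The names kmer-positions and clumps-by-index are hypothetical, and the code assumes Clojure 1.7 or later for update.

```clojure
;; Hypothetical helper: maps each k-mer string to a vector of its
;; start indices in text, in ascending order (overlaps included).
(defn kmer-positions [k text]
  (reduce (fn [acc [i kmer]] (update acc kmer (fnil conj []) i))
          {}
          (map-indexed (fn [i cs] [i (apply str cs)])
                       (partition k 1 text))))

;; A k-mer forms an (L, t)-clump if some run of t consecutive
;; occurrences, including the full last k-mer, spans at most L characters.
(defn clumps-by-index [k L t text]
  (set (for [[kmer idxs] (kmer-positions k text)
             :when (some (fn [group]
                           (<= (+ k (- (last group) (first group))) L))
                         (partition t 1 idxs))]
         kmer)))

(def sample-genome
  "CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA")

(clumps-by-index 5 50 4 sample-genome) ; => #{"CGACA" "GAAGA"}
```

On the sample input, CGACA starts at 0-based indices 6, 22, 38, and 43, so its four occurrences span 43 + 5 - 6 = 42 characters, which fits in a window of 50. A k-mer with fewer than t occurrences yields no groups from partition and is skipped.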

