Find (L,t)-clump
I'm trying to find clumps in Clojure. Basically, I need to find all k-length substrings (k-mers) that occur at least t times within some window of size L in a genome. I've implemented what I think is the solution, but I believe there are bugs in it, since the system I'm using to check it (beta.stepic.org) tells me so. Can you spot where I'm messing up?
My solution goes as follows: find all top-ranking k-mers and their starting indices, which are in ascending order. Then I partition each k-mer's indices into sliding groups of t, since t is the required number of occurrences, and take the difference between the last and first index in each group plus an offset of k (all t k-mers must fit inside the L-window, and adding k accounts for the length of the last k-mer).
Where's the bug?
Input: A string Genome, and integers k, L, and t.
Output: All distinct k-mers forming (L, t)-clumps in Genome.
Sample Input:
genome: CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA
k: 5
L: 50
t: 4
Sample Output:
CGACA GAAGA
(defn get-indices [source target]
  "Returns the indices for the substring target
  found in source in ascending order. This includes overlaps."
  (let [search (java.util.regex.Pattern/compile (str "(?=(" target "))"))
        matcher (re-matcher search source)
        not-nil? (complement nil?)]
    (defn inner [matcher]
      (if (not-nil? (re-find matcher))
        (cons (.start matcher) (inner matcher))))
    (inner matcher)))

(defn get-frequent-kmer [source k]
  "Gets the most frequent k-mers of size k from source"
  (let [max-val (val (apply max-key val (frequencies (partition k 1 source))))]
    (map first (filter #(= (val %) max-val)
                       (frequencies (map (partial apply str) (partition k 1 source)))))))

(defn find-clumps [genome k L t]
  (for [k-mer (get-frequent-kmer genome k)]
    (let [indices (get-indices genome k-mer)]
      (if (some true? (map #(<= (+ k (- (last %) (first %))) L)
                           (partition t 1 indices)))
        k-mer))))
Besides code style, which has a couple of things that could be improved, the main problem I see is that you're filtering k-mers on max-key val and not considering t at all in the initial filtering.
When you find the most frequent k-mers of size k, you're keeping only the ones with the highest count:
(apply max-key val (frequencies (partition k 1 source)))
Since you filter by max-val:
(filter #(= (val %) max-val)
And you're only analyzing those:
(for [k-mer (get-frequent-kmer genome k)]
The problem is that if t is 4 but some 5-mers repeat more than 4 times, the 5-mers that repeat exactly 4 times are left out of the analysis.
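To make the difference concrete, here is a minimal sketch on a made-up string. The names toy-genome, k-mer-counts, max-only and at-least are hypothetical and appear in neither post; max-only only mimics the filter from the question, it is not the asker's exact code.

```clojure
;; Hypothetical toy input: with k = 2, "AA" occurs 3 times and "CG" twice.
(def toy-genome "AAAACGCG")

(defn k-mer-counts
  "Returns a map from each k-mer in text to its number of occurrences
  (overlapping occurrences included)."
  [k text]
  (frequencies (map #(apply str %) (partition k 1 text))))

(defn max-only
  "Keeps only the k-mers tied for the highest count, like the filter
  in the question."
  [k text]
  (let [counts  (k-mer-counts k text)
        max-val (apply max (vals counts))]
    (set (for [[k-mer n] counts :when (= n max-val)] k-mer))))

(defn at-least
  "Keeps every k-mer that occurs at least t times."
  [k t text]
  (set (for [[k-mer n] (k-mer-counts k text) :when (>= n t)] k-mer)))

(max-only 2 toy-genome)   ; "CG" is dropped because "AA" has a higher count
(at-least 2 2 toy-genome) ; "CG" is kept because it meets the threshold t = 2
```

With t = 2, the max-based filter returns only "AA", while the threshold filter returns both "AA" and "CG".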
Here is some working code:
(require 'clojure.set)

(defn k-mers
  "Returns a seq of all k-mers in text."
  [k text]
  (map #(apply str %) (partition k 1 text)))

(defn most-frequent-k-mers
  "Returns a seq of k-mers in text appearing at least t times."
  [k t text]
  (->> (k-mers k text)
       (frequencies)
       (filter #(<= t (second %)))
       (map first)))

(defn find-clump
  "Finds k-mers forming (L, t) clumps in text."
  [k L t text]
  (let [windows (partition L 1 text)]
    (->> windows
         (map #(most-frequent-k-mers k t %))
         (map set)
         (apply clojure.set/union))))
I think you should start from here.
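If you would rather keep the index-based idea from the question, a minimal sketch could look like the following. It is not the code from either post: the names k-mer-indices, forms-clump? and find-clumps-by-index are made up for illustration, and it assumes a clump means at least t occurrences whose span, including the k characters of the last k-mer, fits in L characters. No separate frequency filter is needed here, because a k-mer with fewer than t indices produces no groups of t.

```clojure
(defn k-mer-indices
  "Returns a map from each k-mer in genome to a vector of its start
  indices in ascending order (overlapping occurrences included)."
  [k genome]
  (reduce (fn [acc i]
            (update acc (subs genome i (+ i k)) (fnil conj []) i))
          {}
          (range (inc (- (count genome) k)))))

(defn forms-clump?
  "True if some t consecutive start indices span at most L characters,
  counting the k characters of the last k-mer."
  [k L t indices]
  (boolean (some (fn [group]
                   (<= (+ k (- (last group) (first group))) L))
                 (partition t 1 indices))))

(defn find-clumps-by-index
  "Returns the set of k-mers forming (L, t)-clumps in genome."
  [k L t genome]
  (set (for [[k-mer indices] (k-mer-indices k genome)
             :when (forms-clump? k L t indices)]
         k-mer)))

;; Sample input from the question.
(def sample-genome
  "CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA")

(find-clumps-by-index 5 50 4 sample-genome)
```

On the sample input, this returns the set containing CGACA and GAAGA, which matches the sample output. Scanning the genome once for indices avoids recounting k-mers in every window, which may matter on longer inputs.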