Need to optimize my Clojure code which is taking too long
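I have a 1.6 GB log file containing 2 million records. I read the contents of the log into a channel, perform some transformations, and write the results to another channel.
Finally, I write the contents of the second channel to a file.
My code works and the results are as expected. However, the whole operation takes about 45 seconds, which is too long.
I need to reduce the time it takes.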
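(require '[clojure.core.async :refer [chan go go-loop >! <!! close!]]
         'clojure.java.io)
;; Lazily created channel that is fed every line of the log, then closed.
(def reader-channel
  (delay
    (let [temp (chan)]
      (go
        (with-open [reader (clojure.java.io/reader "My_Big_Log")]
          (doseq [ln (line-seq reader)]
            (>! temp ln)))
        (close! temp))
      temp)))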
(def writer-channel (chan))
(defn make-collection []
  (loop [my-coll []]
    (let [item (<!! @reader-channel)]
      (if (nil? item)
        my-coll
        (let [temp (re-find #"[a-z]+\.[a-z]+\.[a-z]+" item)]
          (recur (conj my-coll temp)))))))
(def transformed-collection
  (delay (partition-by identity
                       (remove nil? (sort (make-collection))))))
(defn transform []
  (go-loop [counter 0]
    (if (>= counter (count @transformed-collection))
      (do (close! writer-channel)
          (println "Goodbye"))
      (do (let [item (str "Referrer " (+ counter 1) ": "
                          (first (nth @transformed-collection counter)))]
            (>! writer-channel item))
          (let [item (str "Number of entries associated with this referrer: "
                          (count (nth @transformed-collection counter)))]
            (>! writer-channel item))
          (recur (inc counter))))))
(defn write-to-file []
  (with-open [wrtr (clojure.java.io/writer "Result.txt" :append true)]
    (loop []
      (when-let [temp (<!! writer-channel)]
        (.write wrtr (str temp "\n"))
        (recur)))))
Apologies for any indentation or formatting errors.
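On every pass through the loop, transform performs several very expensive operations. count and nth on a lazy sequence each take O(n) time, so calling them once per referrer makes the loop as a whole quadratic in the number of referrer groups. Instead of using either of them, process the sequence lazily with first and next.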
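One way to apply this advice, as a minimal sketch rather than a drop-in replacement: walk the groups with first and next so that each step is constant time. The name `report-lines` is hypothetical, and the sketch returns the lines as a vector instead of putting them on `writer-channel`, so it can be tried without core.async.

```clojure
(defn report-lines
  "Hypothetical helper: turns a seq of groups (as produced by
  partition-by identity) into report lines, visiting each group once."
  [groups]
  (loop [remaining (seq groups)
         counter   1
         lines     []]
    (if-let [group (first remaining)]
      ;; first/next advance one group per iteration; no nth, no full count
      (recur (next remaining)
             (inc counter)
             (conj lines
                   (str "Referrer " counter ": " (first group))
                   (str "Number of entries associated with this referrer: "
                        (count group))))
      lines)))

(report-lines (partition-by identity (sort ["a.b.c" "x.y.z" "a.b.c"])))
```

Each group is still counted, but only once, so the total work stays linear in the number of matched lines.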
I don't like golfing code, but this one seems to reduce quite simply. We want to count referrer frequencies, so let's just do that:
(with-open [reader (clojure.java.io/reader "My_Big_Log")]
  (frequencies
    ;; % is the current line; lines with no match are counted under the nil key
    (map #(re-find #"[a-z]+\.[a-z]+\.[a-z]+" %)
         (line-seq reader))))
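Counting referrers by building a list of all 2 million referrers, then sorting and partitioning it, means carrying around a lot of unnecessary data. The frequencies version uses O(referrers) space rather than O(lines), which, depending on your logs, may be a large reduction.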
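To get from the counts to the result file, something along these lines might work. It is only a sketch: `referrer-counts` and `write-report!` are hypothetical names, `keep` is used instead of `map` so that non-matching lines are dropped (as the question's `remove nil?` does), and the referrers are sorted by name to mirror the question's sorted output.

```clojure
(require '[clojure.java.io :as io])

(def referrer-pattern #"[a-z]+\.[a-z]+\.[a-z]+")

(defn referrer-counts
  "Hypothetical helper: returns a map of referrer -> number of lines."
  [log-file]
  (with-open [reader (io/reader log-file)]
    ;; frequencies is eager, so the file is fully read before it is closed
    (frequencies (keep #(re-find referrer-pattern %) (line-seq reader)))))

(defn write-report!
  "Hypothetical helper: writes two lines per referrer, sorted by name."
  [counts out-file]
  (with-open [wtr (io/writer out-file)]
    (doseq [[i [referrer n]] (map-indexed vector (sort-by key counts))]
      (.write wtr (str "Referrer " (inc i) ": " referrer "\n"))
      (.write wtr (str "Number of entries associated with this referrer: " n "\n")))))

;; Usage, with the file names from the question:
;; (write-report! (referrer-counts "My_Big_Log") "Result.txt")
```

Only the map of counts is held in memory; the sort runs over the distinct referrers, not over the 2 million lines.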
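I'm also not clear on why you are using core.async. It adds very little to this simple count, and it makes it much harder to see what is going on in the code.
Finally: just profile. It will show you a lot of interesting things about your code that you may not have known.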
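sort on 2M entries is slow. On top of that, count and nth are also expensive on lazy sequences. You can avoid them (and all the intermediate sequences) with a transducer. On my MBP, 2M records took about 5 seconds.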
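The same transducer idea can also be tried without channels. The following is a minimal sketch, not the code that was timed above: `count-referrers` is a hypothetical name, and it only produces the counts, leaving the output formatting aside.

```clojure
(defn count-referrers
  "Hypothetical helper: counts referrers in a single pass over lines,
  without building intermediate sequences. Returns referrer -> count."
  [lines]
  (transduce
    ;; keep drops the nil that re-find returns for non-matching lines
    (keep #(re-find #"[a-z]+\.[a-z]+\.[a-z]+" %))
    ;; completing wraps the two-argument step into a full reducing function
    (completing (fn [counts referrer]
                  (update counts referrer (fnil inc 0))))
    {}
    lines))

;; Usage against the log file from the question:
;; (with-open [r (clojure.java.io/reader "My_Big_Log")]
;;   (count-referrers (line-seq r)))
```

Because transduce is eager, the reader can be closed as soon as the call returns.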
(require '[clojure.core.async :refer [chan pipe thread onto-chan <!!]]
         '[clojure.java.io :as io])
;; bag is defined further down; load it before calling transform.
(defn transform [input-f output-f]
  (let [read-ch  (chan 1 (comp (map (partial re-find #"[a-z]+\.[a-z]+\.[a-z]+"))
                               ;; remove other lines
                               (remove nil?)
                               ;; transducer bag is like a set but with counter. e.g. {"a.b.c" 1 "c.d.e" 3}
                               (bag)
                               ;; make each map entry as a sequence element (["a.b.c" 1] ["c.d.e" 3])
                               cat
                               ;; generate output lines
                               (map-indexed (fn [i [x cnt]]
                                              [(str "Referrer " i ": " x)
                                               (str "Number of entries associated with this referrer: " cnt)]))
                               ;; flatten the output lines (["l1" "l2"] ["l3" "l4"]) => ("l1" "l2" "l3" "l4")
                               cat))
        write-ch (chan)]
    ;; wire up read-ch to write-ch
    (pipe read-ch write-ch true)
    ;; spin up a thread to read all lines into read-ch
    ;; (newer core.async releases deprecate onto-chan in favor of onto-chan! and onto-chan!!)
    (thread
      (with-open [reader (io/reader input-f)]
        (<!! (onto-chan read-ch (line-seq reader) true))))
    ;; write the counted lines to output
    (with-open [wtr (io/writer output-f)]
      (loop []
        (when-let [temp (<!! write-ch)]
          (.write wtr (str temp "\n"))
          (recur))))))
(time
(transform "input.txt" "output.txt"))
;; => "Elapsed time: 5286.222668 msecs"
Here is the "one-shot" counting bag I used:
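(defn bag []
  (fn [rf]
    (let [state (volatile! nil)]
      (fn
        ([] (rf))
        ;; completion: emit the accumulated counts once, then complete downstream
        ([result]
         (if-let [counts @state]
           (do (vreset! state nil)
               (rf (unreduced (rf result counts))))
           (rf result)))
        ;; step: count the input and emit nothing yet
        ([result input]
         (vswap! state update input (fnil inc 0))
         result)))))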
Here is the sample output:
Referrer 0: h.i.j
Number of entries associated with this referrer: 399065
Referrer 1: k.l.m
Number of entries associated with this referrer: 400809
Referrer 2: a.b.c
Number of entries associated with this referrer: 400186
Referrer 3: c.d.e
Number of entries associated with this referrer: 399667
Referrer 4: m.n.o
Number of entries associated with this referrer: 400273