
Need to optimize my Clojure code which is taking too long

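I have a log file that is 1.6 GB in size and contains 2 million records. I read the contents of the log into a channel, perform some transformations, and write the results onto another channel.

Finally, I write the contents of the second channel to a file.

My code works fine and the results are as expected. However, the entire operation takes about 45 seconds, which is too long.

I need to reduce the time taken.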

(def reader-channel (delay (let [temp (chan)]
                         (go
                           (with-open [reader (clojure.java.io/reader "My_Big_Log")]
                             (doseq [ln (line-seq reader)]
                               (>! temp ln)))
                           (close! temp))
                         temp)))



(def writer-channel (chan))

(defn make-collection [] (loop [my-coll []] (let [item (<!! @reader-channel)]
  (if (nil? item)
    my-coll
    (do (let [temp (re-find #"[a-z]+\.[a-z]+\.[a-z]+" item)]
          (recur (conj my-coll temp))))))))

(def transformed-collection (delay (partition-by identity
                                             (remove nil? (sort (make-collection))))))

(defn transform [] (go-loop [counter 0]
(if (>= counter (count @transformed-collection))
  (do (close! writer-channel)
      (println "Goodbye"))
  (do (let [item (str "Referrer " (+ counter 1) ": "
                      (first (nth @transformed-collection counter)))]
        (>! writer-channel item))
      (let [item (str "Number of entries associated with this referrer: "
                      (count (nth @transformed-collection counter)))]
        (>! writer-channel item))
    (recur (inc counter))))))

(defn write-to-file [] (with-open [wrtr (clojure.java.io/writer "Result.txt" :append true)]
(loop []
  (when-let [temp (<!! writer-channel)]
    (.write wrtr (str temp "\n"))
    (recur)))))

My apologies for the indentation and formatting errors.

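transform performs several very expensive operations on every pass through the loop. count and nth on a lazy sequence each take O(n) time, so calling them once per iteration makes the whole loop quadratic in the number of groups. Instead of using either of them, process the sequence lazily with first and next.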
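A minimal sketch of that suggestion, assuming the groups come from the same `partition-by identity` step as in the question. The names `report-lines` and `sample-groups` are hypothetical and exist only for illustration. The loop walks the sequence once with `first` and `next` instead of indexing into it with `nth` on every iteration:

```clojure
(defn report-lines
  "Walks `groups` once with first/next and returns the output lines.
   Each group is a run of identical referrer strings."
  [groups]
  (loop [remaining (seq groups)
         counter   1
         lines     []]
    (if-let [group (first remaining)]
      (recur (next remaining)
             (inc counter)
             (conj lines
                   (str "Referrer " counter ": " (first group))
                   (str "Number of entries associated with this referrer: "
                        (count group))))
      lines)))

;; Hypothetical sample data standing in for the sorted referrers.
(def sample-groups
  (partition-by identity (sort ["c.d.e" "a.b.c" "c.d.e"])))

(report-lines sample-groups)
```

Each group is still counted, but only once, so the total work is linear in the number of referrer entries.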

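I'm not a fan of code golf, but this seems like it would reduce to pretty simple code. We want to count referrer frequencies, so let's just do that: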

  (with-open [reader (clojure.java.io/reader "My_Big_Log")]
    (frequencies
     (map #(re-find #"[a-z]+\.[a-z]+\.[a-z]+" %)
          (line-seq reader))))

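Counting the referrers by generating a list of all 2 million referrers and then sorting and partitioning it means you carry around a large amount of unnecessary data. The version above runs in O(referrers) space rather than O(lines), which, depending on your logs, may well be a substantial reduction.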
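As a small illustration of the same idea, here is a sketch that also skips lines with no referrer, so that nil does not end up as a key in the result. `referrer-pattern`, `referrer-counts`, and `sample-lines` are hypothetical names, and the log lines are made up:

```clojure
(def referrer-pattern #"[a-z]+\.[a-z]+\.[a-z]+")

(defn referrer-counts
  "Returns a map of referrer -> count for a sequence of log lines,
   skipping lines that contain no referrer."
  [lines]
  ;; keep drops the nil results that re-find returns for non-matching lines
  (frequencies (keep #(re-find referrer-pattern %) lines)))

;; Hypothetical log lines; a real run would pass (line-seq reader)
;; inside with-open so the file is consumed before the reader closes.
(def sample-lines
  ["GET /x a.b.c 200"
   "no referrer here"
   "GET /y c.d.e 200"
   "GET /z a.b.c 404"])

(referrer-counts sample-lines)
;; => {"a.b.c" 2, "c.d.e" 1}
```

Only the map of distinct referrers is held in memory; the individual lines can be garbage-collected as soon as they have been counted.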

I'm also unclear why you're using core.async. It adds very little to this simple count, and it makes it very hard to see what's going on in the code.

Finally, just profile. It will show you a lot of interesting things about your code that you might not have known.

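sort on 2M entries is slow. On top of that, count and nth are expensive on a lazy sequence. You can avoid them (along with all the intermediate sequences) by using a transducer. On my MBP, 2M records took about 5 seconds.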

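;; Assumes: (require '[clojure.core.async :refer [chan pipe thread onto-chan <!!]]
;;                   '[clojure.java.io :as io])
;; and that the bag transducer shown further down is already defined.
(defn transform [input-f output-f]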
  (let [read-ch  (chan 1 (comp (map (partial re-find #"[a-z]+\.[a-z]+\.[a-z]+"))
                               ;; remove other lines
                               (remove nil?)
                               ;; transducer bag is like a set but with counter. e.g. {"a.b.c" 1  "c.d.e" 3}
                               (bag)
                               ;; make each map entry as a sequence element (["a.b.c" 1] ["c.d.e" 3])
                               cat
                               ;; generate output lines
                               (map-indexed (fn [i [x cnt]]
                                              [(str "Referrer " i ": " x)
                                               (str "Number of entries associated with this referrer: " cnt)]))
                               ;; flatten the output lines  (["l1" "l2"] ["l3" "l4"]) => ("l1" "l2" "l3" "l4")
                               cat))
        write-ch (chan)]

    ;; wire up read-ch to write-ch
    (pipe read-ch write-ch true)

    ;; spin up a thread to read all lines into read-ch
    (thread
      (with-open [reader (io/reader input-f)]
        (<!! (onto-chan read-ch (line-seq reader) true))))

    ;; write the counted lines to output
    (with-open [wtr (io/writer output-f)]
      (loop []
        (when-let [temp (<!! write-ch)]
          (.write wtr (str temp "\n"))
          (recur))))))

(time
 (transform "input.txt" "output.txt"))
;; => "Elapsed time: 5286.222668 msecs"

Here is the "one-off" counting bag I used:

(defn bag []
  (fn [rf]
    (let [state (volatile! nil)]
      (fn
        ([] (rf))
        ([result] (if @state
                    (try
                      (rf result @state)
                      (finally
                        (vreset! state nil)))
                    (rf result)))
        ([result input]
         (vswap! state update input (fnil inc 0))
         result)))))

Here is the sample output:

Referrer 0: h.i.j
Number of entries associated with this referrer: 399065
Referrer 1: k.l.m
Number of entries associated with this referrer: 400809
Referrer 2: a.b.c
Number of entries associated with this referrer: 400186
Referrer 3: c.d.e
Number of entries associated with this referrer: 399667
Referrer 4: m.n.o
Number of entries associated with this referrer: 400273
