
Need to optimize my Clojure code which is taking too long

I have a log file that is 1.6 GB in size and contains 2 million records. I read the contents of the log into a channel, perform some transformation, and write the results onto another channel.

Finally, I write the contents of the second channel to a file.

My code works and the results are as expected. However, the entire operation takes ~45 seconds, which is too long.

I need to reduce the time taken.

(require '[clojure.core.async :refer [chan go go-loop >! <!! close!]])

(def reader-channel (delay (let [temp (chan)]
                         (go
                           (with-open [reader (clojure.java.io/reader "My_Big_Log")]
                             (doseq [ln (line-seq reader)]
                               (>! temp ln)))
                           (close! temp))
                         temp)))



(def writer-channel (chan))

(defn make-collection [] (loop [my-coll []] (let [item (<!! @reader-channel)]
  (if (nil? item)
    my-coll
    (do (let [temp (re-find #"[a-z]+\.[a-z]+\.[a-z]+" item)]
          (recur (conj my-coll temp))))))))

(def transformed-collection (delay (partition-by identity
                                             (remove nil? (sort (make-collection))))))

(defn transform [] (go-loop [counter 0]
(if (>= counter (count @transformed-collection))
  (do (close! writer-channel)
      (println "Goodbye"))
  (do (let [item (str "Referrer " (+ counter 1) ": "
                      (first (nth @transformed-collection counter)))]
        (>! writer-channel item))
      (let [item (str "Number of entries associated with this referrer: "
                      (count (nth @transformed-collection counter)))]
        (>! writer-channel item))
    (recur (inc counter))))))

(defn write-to-file [] (with-open [wrtr (clojure.java.io/writer "Result.txt" :append true)]
(loop []
  (when-let [temp (<!! writer-channel)]
    (.write wrtr (str temp "\n"))
    (recur)))))

I apologise for the bad indentation and formatting.

transform does several very expensive operations on every pass through the loop. count and nth on a lazy sequence each take O(n) time, so calling them once per iteration makes the loop quadratic in the number of groups. Instead of using either of these, process the sequence lazily with first and next.
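As a rough illustration of that advice, here is a minimal sketch that walks the grouped sequence once with first and next instead of indexing into it. The function name referrer-lines is made up for this example, and it returns the output lines as a vector rather than putting them on a channel, so it is not a drop-in replacement for the original transform.

```clojure
(defn referrer-lines
  "Hypothetical helper: turns a seq of groups (as produced by partition-by)
  into output lines, visiting each group exactly once."
  [groups]
  (loop [remaining (seq groups)   ; nil when there is nothing left
         counter   1
         lines     []]
    (if remaining
      (let [group (first remaining)]
        (recur (next remaining)   ; O(1) step, no nth
               (inc counter)
               (conj lines
                     (str "Referrer " counter ": " (first group))
                     (str "Number of entries associated with this referrer: "
                          (count group)))))
      lines)))
```

Each group is still counted, but only once, so the total work is linear in the number of matched lines.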

I don't like to code-golf, but this seems like it would reduce pretty simply. We want to count the referrer frequency, so let's just do that:

  (with-open [reader (clojure.java.io/reader "My_Big_Log")]
    (frequencies
     ;; % is the current line; lines without a match are counted under the nil key
     (map #(re-find #"[a-z]+\.[a-z]+\.[a-z]+" %)
          (line-seq reader))))

Counting the referrers by generating a list of all 2 million of them, then sorting and partitioning it, means that you carry around a large amount of unnecessary data. Counting with frequencies needs O(referrers) space rather than O(lines), which, depending on your logs, might well be a huge reduction.
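For completeness, here is one way the frequencies approach could be extended to write the same kind of report as the original code, without core.async. This is only a sketch: the names count-referrers, report-lines and summarize-log! are invented for illustration, sorting the referrers by name is an assumption, and keep is used in place of map so that non-matching lines are dropped rather than counted under nil.

```clojure
(require '[clojure.java.io :as io])

(def referrer-pattern #"[a-z]+\.[a-z]+\.[a-z]+")

(defn count-referrers
  "Returns a map of referrer -> number of lines containing it."
  [lines]
  (frequencies (keep #(re-find referrer-pattern %) lines)))

(defn report-lines
  "Formats the counts as two output lines per referrer, numbered from 1."
  [counts]
  (mapcat (fn [i [referrer n]]
            [(str "Referrer " i ": " referrer)
             (str "Number of entries associated with this referrer: " n)])
          (iterate inc 1)
          (sort-by key counts)))

(defn summarize-log!
  "Reads input-path line by line and writes the report to output-path."
  [input-path output-path]
  (with-open [reader (io/reader input-path)
              writer (io/writer output-path)]
    (doseq [line (report-lines (count-referrers (line-seq reader)))]
      (.write ^java.io.Writer writer (str line "\n")))))
```

A call would look like (summarize-log! "My_Big_Log" "Result.txt"). Only the map of counts is held in memory; the lines themselves are consumed as they are read.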

I'm also not clear why you are using core.async. It adds very little to this simple count and makes it very hard to see what's going on in the code.

Finally, just profile. It'll show you lots of interesting things about your code you might not have known.
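A real profiler will tell you far more, but as a crude first step you can time the stages separately to see where the seconds go. The sketch below assumes the lines are already in memory; timed and stage-timings are hypothetical helpers, not part of any library.

```clojure
(defn timed
  "Calls the zero-argument function f and returns [result elapsed-ms]."
  [f]
  (let [start  (System/nanoTime)
        result (f)
        ms     (/ (- (System/nanoTime) start) 1e6)]
    [result ms]))

(defn stage-timings
  "Times referrer extraction and counting separately for a seq of lines."
  [lines]
  (let [pattern #"[a-z]+\.[a-z]+\.[a-z]+"
        ;; doall forces the lazy seq so the regex work is measured here
        [matches extract-ms] (timed #(doall (keep (fn [l] (re-find pattern l)) lines)))
        [counts count-ms]    (timed #(frequencies matches))]
    {:counts counts :extract-ms extract-ms :count-ms count-ms}))
```

Without the doall, the extraction cost would be deferred into the counting stage and the numbers would be misleading.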

Sorting 2M entries is slow. In addition, count and nth are expensive on a lazy sequence. You can avoid them (together with all the intermediate sequences) with a transducer. On my MBP, 2M records took ~5 seconds.

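(require '[clojure.core.async :refer [chan pipe thread onto-chan <!!]]
         '[clojure.java.io :as io])

;; bag is the counting transducer shown further down; it must be defined before transform.
(defn transform [input-f output-f]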
  (let [read-ch  (chan 1 (comp (map (partial re-find #"[a-z]+\.[a-z]+\.[a-z]+"))
                               ;; remove other lines
                               (remove nil?)
                               ;; transducer bag is like a set but with counter. e.g. {"a.b.c" 1  "c.d.e" 3}
                               (bag)
                               ;; make each map entry as a sequence element (["a.b.c" 1] ["c.d.e" 3])
                               cat
                               ;; generate output lines
                               (map-indexed (fn [i [x cnt]]
                                              [(str "Referrer " i ": " x)
                                               (str "Number of entries associated with this referrer: " cnt)]))
                               ;; flatten the output lines  (["l1" "l2"] ["l3" "l4"]) => ("l1" "l2" "l3" "l4")
                               cat))
        write-ch (chan)]

    ;; wire up read-ch to write-ch
    (pipe read-ch write-ch true)

    ;; spin up a thread to read all lines into read-ch
    (thread
      (with-open [reader (io/reader input-f)]
        (<!! (onto-chan read-ch (line-seq reader) true))))

    ;; write the counted lines to output
    (with-open [wtr (io/writer output-f)]
      (loop []
        (when-let [temp (<!! write-ch)]
          (.write wtr (str temp "\n"))
          (recur))))))

(time
 (transform "input.txt" "output.txt"))
;; => "Elapsed time: 5286.222668 msecs"

And here is the 'one-off' counting bag I used:

(defn bag []
  (fn [rf]
    (let [state (volatile! nil)]
      (fn
        ([] (rf))
        ;; completion: flush the accumulated counts once, then complete downstream
        ([result]
         (let [counts @state]
           (vreset! state nil)
           (rf (if counts
                 (unreduced (rf result counts))
                 result))))
        ([result input]
         (vswap! state update input (fnil inc 0))
         result)))))
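To see what such a bag does in isolation, it can help to run it without a channel. The sketch below is a self-contained variant under the hypothetical name counting-bag; its completion step also calls the downstream completion, which functions like into rely on. It is meant as an illustration of the idea, not as the exact code used above.

```clojure
(defn counting-bag
  "Stateful transducer: swallows every input, then emits a single
  map of input -> count when the reduction completes."
  []
  (fn [rf]
    (let [state (volatile! nil)]
      (fn
        ([] (rf))
        ([result]
         (let [counts @state]
           (vreset! state nil)
           ;; emit the counts (if any), then let downstream finish
           (rf (if counts
                 (unreduced (rf result counts))
                 result))))
        ([result input]
         (vswap! state update input (fnil inc 0))
         result)))))

;; Usage: one map comes out at the end ...
(def counts-as-element (into [] (counting-bag) ["a.b.c" "c.d.e" "a.b.c"]))

;; ... and cat unpacks that map into its entries, as in the channel pipeline.
(def counts-as-map (into {} (comp (counting-bag) cat) ["a.b.c" "c.d.e" "a.b.c"]))
```

Because nothing is emitted until completion, this only makes sense for finite inputs, such as a file that is read to the end or a channel that is eventually closed.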

Here is the sample output:

Referrer 0: h.i.j
Number of entries associated with this referrer: 399065
Referrer 1: k.l.m
Number of entries associated with this referrer: 400809
Referrer 2: a.b.c
Number of entries associated with this referrer: 400186
Referrer 3: c.d.e
Number of entries associated with this referrer: 399667
Referrer 4: m.n.o
Number of entries associated with this referrer: 400273
