
Need to optimize my Clojure code which is taking too long

I have a 1.6 GB log file containing 2 million records. I am reading the contents of the log into a channel, performing some transformations, and writing the results onto another channel.

Finally, I am writing the contents of the second channel into a file.

My code is working fine, and the results are as expected. However, the entire operation is taking ~45 seconds, which is too long.

I need to reduce the time taken.

(require '[clojure.core.async :refer [chan go go-loop >! <!! close!]])

(def reader-channel
  (delay (let [temp (chan)]
           (go
             (with-open [reader (clojure.java.io/reader "My_Big_Log")]
               (doseq [ln (line-seq reader)]
                 (>! temp ln)))
             (close! temp))
           temp)))



(def writer-channel (chan))

(defn make-collection []
  (loop [my-coll []]
    (let [item (<!! @reader-channel)]
      (if (nil? item)
        my-coll
        (let [temp (re-find #"[a-z]+\.[a-z]+\.[a-z]+" item)]
          (recur (conj my-coll temp)))))))

(def transformed-collection
  (delay (partition-by identity
                       (remove nil? (sort (make-collection))))))

(defn transform []
  (go-loop [counter 0]
    (if (>= counter (count @transformed-collection))
      (do (close! writer-channel)
          (println "Goodbye"))
      (do (let [item (str "Referrer " (+ counter 1) ": "
                          (first (nth @transformed-collection counter)))]
            (>! writer-channel item))
          (let [item (str "Number of entries associated with this referrer: "
                          (count (nth @transformed-collection counter)))]
            (>! writer-channel item))
          (recur (inc counter))))))

(defn write-to-file []
  (with-open [wrtr (clojure.java.io/writer "Result.txt" :append true)]
    (loop []
      (when-let [temp (<!! writer-channel)]
        (.write wrtr (str temp "\n"))
        (recur)))))

I apologise for bad indentation and formatting.

transform does multiple tremendously expensive operations on every pass through the loop: count and nth on a lazy sequence each take O(n) time. Instead of using either of these, process the sequence lazily with first and next.
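For example, the indexed loop can be rewritten as a single pass with first/next (a sketch; emit-referrers is an illustrative name, not from the original):

```clojure
;; Single pass over the partitioned groups: first/next instead of repeated
;; nth/count on the whole sequence. Counting each small group is cheap.
(defn emit-referrers [groups]
  (loop [gs groups, i 1, out []]
    (if-let [g (first gs)]
      (recur (next gs) (inc i)
             (conj out
                   (str "Referrer " i ": " (first g))
                   (str "Number of entries associated with this referrer: "
                        (count g))))
      out)))
```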

I don't like to code-golf, but this seems like it would reduce pretty simply. We want to count the referrer frequency, so let's just do that:

  (with-open [reader (clojure.java.io/reader "My_Big_Log")]
    (frequencies
     (map #(re-find #"[a-z]+\.[a-z]+\.[a-z]+" %)
          (line-seq reader))))

Counting the referrers by generating a list of all 2 million of them, then sorting and partitioning it, means you carry around a large amount of unnecessary data. This approach runs in O(distinct referrers) space rather than O(lines), which, depending on your logs, may well be a huge reduction.
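Turning that frequencies map into the question's two-line-per-referrer output takes only one pass over the map entries (a sketch; write-frequencies is an illustrative name, and the nil key produced by non-matching lines is dropped first):

```clojure
(defn write-frequencies [freqs out-file]
  (with-open [wrtr (clojure.java.io/writer out-file)]
    ;; drop the nil key (lines where re-find matched nothing), then
    ;; number the remaining entries
    (doseq [[i [referrer cnt]] (map-indexed vector (dissoc freqs nil))]
      (.write wrtr (str "Referrer " (inc i) ": " referrer "\n"))
      (.write wrtr (str "Number of entries associated with this referrer: "
                        cnt "\n")))))
```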

I'm also not clear why you are using core.async. It's going to add very little to this simple count and makes it very hard to see what's going on in the code.

Finally - just profile. It'll show you lots of interesting things about your code you might not have known.

sort on 2M entries is slow. count and nth are also expensive on a lazy sequence. You can avoid them (together with all the intermediary sequences) with a transducer. On my MBP, 2M records took ~5 seconds.

(require '[clojure.core.async :refer [chan pipe thread <!! onto-chan]]
         '[clojure.java.io :as io])

(defn transform [input-f output-f]
  (let [read-ch  (chan 1 (comp (map (partial re-find #"[a-z]+\.[a-z]+\.[a-z]+"))
                               ;; remove other lines
                               (remove nil?)
                               ;; transducer bag is like a set but with counter. e.g. {"a.b.c" 1  "c.d.e" 3}
                               (bag)
                               ;; make each map entry as a sequence element (["a.b.c" 1] ["c.d.e" 3])
                               cat
                               ;; generate output lines
                               (map-indexed (fn [i [x cnt]]
                                              [(str "Referrer " i ": " x)
                                               (str "Number of entries associated with this referrer: " cnt)]))
                               ;; flatten the output lines  (["l1" "l2"] ["l3" "l4"]) => ("l1" "l2" "l3" "l4")
                               cat))
        write-ch (chan)]

    ;; wire up read-ch to write-ch
    (pipe read-ch write-ch true)

    ;; spin up a thread to read all lines into read-ch
    (thread
      (with-open [reader (io/reader input-f)]
        (<!! (onto-chan read-ch (line-seq reader) true))))

    ;; write the counted lines to output
    (with-open [wtr (io/writer output-f)]
      (loop []
        (when-let [temp (<!! write-ch)]
          (.write wtr (str temp "\n"))
          (recur))))))

(time
 (transform "input.txt" "output.txt"))
;; => "Elapsed time: 5286.222668 msecs"

And here is the 'one-off' counting bag I used:

(defn bag []
  (fn [rf]
    (let [state (volatile! nil)]
      (fn
        ([] (rf))
        ([result] (if @state
                    (try
                      (rf result @state)
                      (finally
                        (vreset! state nil)))
                    (rf result)))
        ([result input]
         (vswap! state update input (fnil inc 0))
         result)))))

Here is the sample output:

Referrer 0: h.i.j
Number of entries associated with this referrer: 399065
Referrer 1: k.l.m
Number of entries associated with this referrer: 400809
Referrer 2: a.b.c
Number of entries associated with this referrer: 400186
Referrer 3: c.d.e
Number of entries associated with this referrer: 399667
Referrer 4: m.n.o
Number of entries associated with this referrer: 400273
