
Split lines in Clojure while reading from a file

I am learning Clojure at school and have an exam coming up, so I have been working through a few exercises to make sure I get the hang of it.

I am trying to read a file line by line and, as I do, split each line wherever there is a ";".

Here is my code so far:

(defn readFile []
  (map (fn [line] (clojure.string/split line #";"))
  (with-open [rdr (reader "C:/Users/Rohil/Documents/work.txt.txt")]
    (doseq [line (line-seq rdr)]
      (clojure.string/split line #";")
        (println line)))))

When I do this, I still get the output:

"I;Am;A;String;"

Am I missing something?

I'm not sure if you need this at school, but since Gary already gave an excellent answer, consider this as a bonus.

You can do elegant transformations on lines of text with transducers. The ingredient you need is something that allows you to treat the lines as a reducible collection and which closes the reader when you're done reducing:

(defn lines-reducible [^java.io.BufferedReader rdr]
  (reify clojure.lang.IReduceInit
    (reduce [this f init]
      (try
        (loop [state init]
          (if (reduced? state)
            @state
            (if-let [line (.readLine rdr)]
              (recur (f state line))
              state)))
        (finally
          (.close rdr))))))

Now you're able to do the following, given the input file work.txt:

I;am;a;string
Next;line;please

Count the length of each 'split'

(require '[clojure.string :as str])
(require '[clojure.java.io :as io])

(into []
      (comp
       (mapcat #(str/split % #";"))
       (map count))
      (lines-reducible (io/reader "/tmp/work.txt")))
;;=> [1 2 1 6 4 4 6]

Sum the length of all 'splits'

(transduce
 (comp
  (mapcat #(str/split % #";"))
  (map count))
 +
 (lines-reducible (io/reader "/tmp/work.txt")))
;;=> 24

Sum the lengths of all words until we find a word that is longer than 5 characters

(transduce
 (comp
  (mapcat #(str/split % #";"))
  (map count))
 (fn
   ([] 0)
   ([sum] sum)
   ([sum l]
    (if (> l 5)
      (reduced sum)
      (+ sum l))))
 (lines-reducible (io/reader "/tmp/work.txt")))

Or with take-while:

(transduce
 (comp
  (mapcat #(str/split % #";"))
  (map count)
  (take-while #(<= % 5)))
 +
 (lines-reducible (io/reader "/tmp/work.txt")))
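Because transducers are independent of their input source, the pipelines above can be checked at the REPL without a file. The following is a minimal sketch under that assumption: sample-lines is a hypothetical in-memory stand-in for the contents of work.txt, and split-lengths is a name introduced here only for illustration.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for the two lines of work.txt
(def sample-lines ["I;am;a;string" "Next;line;please"])

;; Same transformation as above: split each line on ";" and count each piece
(def split-lengths
  (comp (mapcat #(str/split % #";"))
        (map count)))

;; Collect the length of each piece
(def lengths (into [] split-lengths sample-lines))
;;=> [1 2 1 6 4 4 6]

;; Sum all lengths
(def total (transduce split-lengths + sample-lines))
;;=> 24

;; Sum lengths, stopping at the first piece longer than 5 characters
(def total-until-long
  (transduce (comp split-lengths (take-while #(<= % 5))) + sample-lines))
;;=> 4
```

Swapping sample-lines for (lines-reducible (io/reader ...)) should give the same results while reading lazily from disk and closing the reader afterwards.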

Read https://tech.grammarly.com/blog/building-etl-pipelines-with-clojure for more details.

TL;DR: embrace the REPL and embrace immutability.

Your question was "what am I missing?" and to that I'd say you're missing one of the best features of Clojure, the REPL.

Edit: you might also be missing that Clojure uses immutable data structures. Consider this code snippet:

(doseq [x [1 2 3]]
   (inc x)
   (prn x))

This code does not print "2 3 4". It prints "1 2 3" because x isn't a mutable variable.

During the first iteration, (inc x) is called and returns 2, which is thrown away because it isn't passed to anything. Then (prn x) prints the value of x, which is still 1.

Now consider this code snippet:

(doseq [x [1 2 3]] (prn (inc x)))

During the first iteration, inc passes its return value to prn, so 2 is printed.
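The same reasoning applies to strings. As a small sketch (the names line, pieces, doseq-result and mapped are made up for illustration), calling split never changes the original string, doseq discards whatever its body returns, and map keeps the returned values:

```clojure
(require '[clojure.string :as str])

(def line "I;Am;A;String;")

;; split does not modify line; it returns a new vector
(def pieces (str/split line #";"))

;; doseq is meant for side effects: the result of split is discarded
;; and the doseq form itself evaluates to nil
(def doseq-result
  (doseq [l [line]]
    (str/split l #";")))

;; map keeps each return value and hands back a new sequence
(def mapped (map #(str/split % #";") [line]))
```

Here pieces is ["I" "Am" "A" "String"], line is still "I;Am;A;String;", and doseq-result is nil.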

Long example:

I don't want to rob you of the opportunity to solve the problem yourself, so I'll use a different problem as an example.

Given the file "birds.txt" containing the data "1chicken\n2duck\n3Larry", you want to write a function that takes a file and returns a sequence of bird names.

Let's break this problem down into smaller chunks.

First, let's read the file and split it into lines.

(slurp "birds.txt") gives us the whole file as a string.

clojure.string/split-lines gives us a collection with each line as an element.

(clojure.string/split-lines (slurp "birds.txt")) gets us ["1chicken" "2duck" "3Larry"].

At this point we could map a function over that collection to strip out the digits, for example (map #(clojure.string/replace % #"\d" "") birds-collection),

or we could move that step up the pipeline, to the point where the whole file is still one string.
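Both orderings give the same result for this data. A quick sketch to confirm it at the REPL, using an in-memory string in place of the file (birds-text, strip-after and strip-before are illustrative names, not part of the original example):

```clojure
(require '[clojure.string :as str])

;; Stand-in for the contents of birds.txt
(def birds-text "1chicken\n2duck\n3Larry")

;; Option 1: split into lines first, then strip digits from each line
(def strip-after
  (map #(str/replace % #"\d" "") (str/split-lines birds-text)))

;; Option 2: strip digits from the whole string, then split into lines
(def strip-before
  (str/split-lines (str/replace birds-text #"\d" "")))
```

Both strip-after and strip-before contain "chicken", "duck" and "Larry", in that order.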

Now that we have all of our pieces, we can put them together in a functional pipeline where the result of one piece feeds into the next.

Clojure has a nice macro that makes this more readable: the -> macro.

It takes the result of one computation and inserts it as the first argument of the next,

so our pipeline looks like this:

(-> "birds.txt"
    slurp
    (clojure.string/replace #"\d" "")
    clojure.string/split-lines)
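To see what the threading macro buys you, here is a hedged sketch comparing the threaded form with the equivalent nested calls. It writes the sample data to a temporary file first, so the temp file and the names birds-file, threaded and nested are assumptions made for the example:

```clojure
(require '[clojure.string :as str])

;; Create a throwaway file holding the sample data
(def birds-file (java.io.File/createTempFile "birds" ".txt"))
(spit birds-file "1chicken\n2duck\n3Larry")

;; Threaded form: reads top to bottom, each result becomes
;; the first argument of the next step
(def threaded
  (-> birds-file
      slurp
      (str/replace #"\d" "")
      str/split-lines))

;; Equivalent nested form: reads inside out
(def nested
  (str/split-lines (str/replace (slurp birds-file) #"\d" "")))
```

Both forms evaluate to ["chicken" "duck" "Larry"]; the threaded version simply lists the steps in the order they happen.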

A last note on style: Clojure function names conventionally use kebab-case, so readFile should be read-file.

I would keep it simple, and code it like this:

(ns tst.demo.core
  (:use tupelo.test)
  (:require [tupelo.core :as t]
            [clojure.string :as str] ))
(def text
 "I;am;a;line;
  This;is;another;one
  Followed;by;this;")

(def tmp-file-name "/tmp/lines.txt")

(dotest
  (spit tmp-file-name text) ; write it to a tmp file
  (let [lines       (str/split-lines (slurp tmp-file-name))
        result      (for [line lines]
                      (for [word (str/split line #";")]
                        (str/trim word)))
        result-flat (flatten result)]
(is= result
  [["I" "am" "a" "line"]
   ["This" "is" "another" "one"]
   ["Followed" "by" "this"]])

Notice that result is a doubly-nested (2D) matrix of words. The simplest way to undo this is to use the flatten function, which produces result-flat:

(is= result-flat
  ["I" "am" "a" "line" "This" "is" "another" "one" "Followed" "by" "this"])))

You could also use apply concat as in:

(is= (apply concat result) result-flat)
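The two approaches agree here because result is nested exactly one level deep. A minimal sketch of where they differ, using made-up in-memory data rather than the file above: flatten removes every level of nesting, while apply concat removes only one.

```clojure
;; Hypothetical sample data, not read from a file
(def rows [["I" "am"] ["a" "line"]])
(def deeper [[1 [2 3]] [4]])

;; One level of nesting: both give the same flat sequence
(def flat-rows (flatten rows))
(def concat-rows (apply concat rows))

;; Two levels of nesting: flatten goes all the way down,
;; apply concat leaves the inner vector intact
(def flat-deeper (flatten deeper))          ; (1 2 3 4)
(def concat-deeper (apply concat deeper))   ; (1 [2 3] 4)
```

If the words themselves could ever be collections, apply concat is the safer choice because it will not flatten them.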

If you want to avoid building up a 2D matrix in the first place, you can use a generator function (a la Python) via lazy-gen and yield from the Tupelo library:

(dotest
  (spit tmp-file-name text) ; write it to a tmp file
  (let [lines  (str/split-lines (slurp tmp-file-name))
        result (t/lazy-gen
                 (doseq [line lines]
                   (let [words (str/split line #";")]
                     (doseq [word words]
                       (t/yield (str/trim word))))))]

(is= result
  ["I" "am" "a" "line" "This" "is" "another" "one" "Followed" "by" "this"])))

In this case, lazy-gen creates the generator function. Notice that for has been replaced with doseq, and the yield function places each word into the output lazy sequence.
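If you would rather not depend on Tupelo, a similar flat lazy sequence can be sketched with plain clojure.core by giving for two bindings instead of nesting two for forms. This is an alternative to the technique above, not a reimplementation of lazy-gen, and the lines vector is an in-memory stand-in for the file contents:

```clojure
(require '[clojure.string :as str])

;; Same text as the example above, already split into lines
(def lines ["I;am;a;line;"
            "  This;is;another;one"
            "  Followed;by;this;"])

;; A single for with two bindings iterates over every word of every line
;; and yields one flat lazy sequence
(def words
  (for [line lines
        word (str/split line #";")]
    (str/trim word)))
```

Here words contains the same eleven trimmed words as result-flat, in the same order.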
