Counting Words VI: Clojure

(See Intro, from years ago.)

My articles about the previous solutions, in Emacs Lisp and Bash, include a lot of detail. This one will be shorter, in great part because the solutions themselves are short and straightforward, thanks to the expressiveness of Clojure’s standard libraries.

But not much shorter: there’s still a lot to comment on.

In Emacs Lisp, strictly following the constraints (as I did in my cw-constrained.el solution) was a bit painful. First because reading files by line isn’t natural to Emacs, which almost always puts the whole thing into a buffer; getting around this is a bit awkward. Second because using only the language’s native libraries meant verbose workarounds to deal with hash tables and with the necessary re-sorting by value (the keys’ counts) for printing.

So it looked like this.

;;; cw-constrained.el --- Observe all the original constraints  -*- lexical-binding: t -*-

;;; Commentary:
;; Observe all the constraints given by the original exercise.

;;; Code:

(defun cw-main ()
  "Print results of ‘cw-process-stream’ to stdout."
  (princ (cw-process-stream)))

(defun cw-process-stream ()
  "Given a stream, return sorted word-count pair."
  (let ((words (make-hash-table :test 'equal)))
    (cw-read-and-process-lines (lambda (str) (cw-process-line words str)))
    (cw-ht->sorted-alist words)))

(defun cw-read-and-process-lines (fun)
  "Given a stream, process each line with function FUN."
  (let (line)
    (while (setq line (ignore-errors (read-from-minibuffer "")))
      (funcall fun line))))

(defun cw-process-line (table str)
  "Update counter of hashtable TABLE for each word of string STR."
  (mapc (lambda (word)
          (unless (string= word "")
            (prog1 nil
              (puthash (downcase word)
                       (1+ (or (gethash (downcase word) table)
                               0))
                       table))))
        (split-string str "[ \n]+")))

(defun cw-ht->sorted-alist (table)
  "Convert hash-table TABLE to an alist, then to a sorted string.
The cdr of the alist is reverse-numerically sorted, then converted to a single string."
  (let (res)
    (maphash (lambda (key value)
               (push (cons key value) res))
             table)
    (setq res (sort res (lambda (it other) (string< (car it) (car other)))))
    (setq res (sort res (lambda (it other) (>       (cdr it) (cdr other)))))
    (format "%s\n" (mapconcat 'identity
                              (mapcar (lambda (it)
                                        (concat  (car it) " "
                                                 (format "%d" (cdr it))))
                                      res)
                              "\n"))))

(cw-main)
(kill-emacs 0)

;;; cw-constrained.el ends here

This was then to be compiled and called from the terminal with:

emacs --quick --script cw-constrained.elc

In Clojure, strictly following the constraints was painless. Reading files by line is simple, and highly-performant HAMT-backed map literals are a central piece of the language, so plenty of tools to deal with them are available straight out of clojure.core.
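
For instance, line-by-line reading is just a reader plus line-seq. A minimal sketch (the file path and contents are made up for illustration):

```clojure
(require '[clojure.java.io :as jio])

;; Hypothetical demo file, just to show the line-by-line shape:
(spit "/tmp/cw-demo.txt" "one line\nanother line\n")

(with-open [rdr (jio/reader "/tmp/cw-demo.txt")]
  (vec (line-seq rdr)))   ; vec realizes the lazy seq before the reader closes
;; => ["one line" "another line"]
```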

Solutions

Let’s build the solutions in three steps.

We start by temporarily relaxing these two constraints:

  1. “read from file”, and
  2. “don’t read the whole file into memory”.

The idea is to ignore file I/O for now, so that we can look only at pure string transformations.
Once this looks ok, we put file I/O back in.

Ready?

Common to all

Some pieces are common to all solutions.

Namespace

All solutions use string functions.

The first solutions don’t use clojure.java.io, but it won’t hurt to require it.

(ns cw.core
  (:require (clojure  [string :as s])
            (clojure.java [io :as jio])))

Re-sorting the map

Since we’ll eventually have to re-sort the map by descending count, with ties broken by ascending string, let’s write this comparator right away:

(defn cw-by-dn-count-up-str [[as ac] [bs bc]]
  (compare [bc as] [ac bs]))

That’s it. Given two key–value pairs, destructure them, reverse-compare the counts (values), and then compare the strings (keys). We’ll pass this to a sorting function.
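
For instance, on a made-up mini-map (the comparator is the one above, repeated so the snippet is self-contained):

```clojure
(defn cw-by-dn-count-up-str [[as ac] [bs bc]]
  (compare [bc as] [ac bs]))

;; Higher counts come first; equal counts fall back to alphabetical keys:
(sort cw-by-dn-count-up-str {"the" 2 "bar," 3 "12" 2})
;; => (["bar," 3] ["12" 2] ["the" 2])
```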

This is a bit like Bash’s:

# reverse numeric on 2nd field; resolve ties by alphabetic on 1st.
sort -k2,2rn -k1,1

I have never figured out how to do that in Elisp in one step.

From map to string

Once sorted, our map needs to become a single string in the required format. For example, from this:

{"bar,"   3
 "the"    2
 "12"     1
 "bar:"   1
 "bars'"  1
 "foo's"  1
 "foo."   1
 "foozle" 1}

To this:

bar, 3
the 2
12 1
bar: 1
bars' 1
foo's 1
foo. 1
foozle 1

There’re several ways to do that.

Way 1

We could do this:

(->> m
     (mapv #(s/join " " %))
     (s/join "\n")
     (#(str % "\n")))

or this:

(-> (mapv #(s/join " " %) m)
    (#(s/join "\n" %))
    (str "\n"))

but they look a bit confusing.

Way 2

We could do this:

(->> m
     (mapv (comp #(str % "\n")
                 #(s/join " " %)))
     (s/join))

which maps it with a composed lambda.

It means: join each inner vector’s items with a space, and add a trailing newline to each result. Then join these strings. Example:

(->> [["a" "b" "c"]
      ["d" "e" "f"]]
     (mapv (comp #(str % "\n")
                 #(s/join " " %)))
     (s/join))
=> "a b c\nd e f\n"

In Emacs Lisp, you could do that over a list of lists using Dash.

(->> '(("a" "b" "c")
       ("d" "e" "f"))
     (-map (-compose (-cut concat <> "\n")
                     (-cut s-join " " <>)))
     (s-join ""))
=> "a b c\nd e f\n"

Though using dash’s anaphorics is simpler and more common:

(->> '(("a" "b" "c")
       ("d" "e" "f"))
     (--map (concat (s-join " " it) "\n"))
     (s-join ""))
=> "a b c\nd e f\n"

or

(->> '(("a" "b" "c")
       ("d" "e" "f"))
     (--map (-> (s-join " " it)
                (concat "\n")))
     (s-join ""))
=> "a b c\nd e f\n"
Way 3

Yet I feel that this is simpler and more readable:

(->> m
     (mapv (fn [[k v]] (format "%s %s\n" k v)))
     s/join)

It means: destructure each pair as key and value, then format it as string using a space and trailing newline, then join all pairs.

So let’s go with that one. Or rather, let’s use a small variation — using str instead of format:

(str k " " v "\n")

Putting these together

So all our solutions will have this in common:

(ns cw.core
  (:require (clojure  [string :as s])
            (clojure.java [io :as jio])))

(defn cw-by-dn-count-up-str [[as ac] [bs bc]]
  (compare [bc as] [ac bs]))

(defn cw-map->str [m]
  (s/join (mapv (fn [[k v]] (str k " " v "\n"))
                m)))

(defn cw-make-string [m]
  (->> (dissoc m "")  (sort cw-by-dn-count-up-str)  (cw-map->str)))

That dissoc is a detail: remove the "" key from the map — don’t count it as a word.
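
Here’s a quick sanity check of the three pieces together, on a made-up map that has an empty-string key and a count tie:

```clojure
(require '[clojure.string :as s])

(defn cw-by-dn-count-up-str [[as ac] [bs bc]]
  (compare [bc as] [ac bs]))

(defn cw-map->str [m]
  (s/join (mapv (fn [[k v]] (str k " " v "\n"))
                m)))

(defn cw-make-string [m]
  (->> (dissoc m "")  (sort cw-by-dn-count-up-str)  (cw-map->str)))

;; "" is dropped, "bar," wins by count, and the 2–2 tie sorts alphabetically:
(cw-make-string {"" 4 "the" 2 "bar," 3 "12" 2})
;; => "bar, 3\n12 2\nthe 2\n"
```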

String in, string out

All into memory

S1

So here’s a first solution. It takes a string as its argument, so for now we’re reading it all into memory.

(defn cw-count-it-S1 [s]
  (->> (s/split s #"\s+")
       (mapv s/lower-case)
       (frequencies)
       (cw-make-string)))

That’s all.

  1. First, we split the string at whitespace. We now have a vector of individual words. They may still carry punctuation; removing it would be simple, but we’re just following the exercise’s tokenization rules.
  2. Then we lowercase everything.
  3. Then we run frequencies on it — and that function does all the heavy lifting of counting the words and putting them into a map. It’s part of the standard library.
  4. Then we run cw-make-string, which removes the "" key, sorts it using our comparator, and converts the map to a final string.
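
Traced on a toy string of my own (not the exercise’s input), the first three steps produce:

```clojure
(require '[clojure.string :as s])

(s/split "The 12 the" #"\s+")           ;; => ["The" "12" "the"]
(mapv s/lower-case ["The" "12" "the"])  ;; => ["the" "12" "the"]
(frequencies ["the" "12" "the"])        ;; => {"the" 2, "12" 1}
```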

A few notes:

  • We could have lowercased before splitting rather than after:

    (->> s s/lower-case
         (#(s/split % #"\s+")))
    
  • If you, Emacs Lisper, ever wondered what inspired Dash’s threading macros, that’s what. They’re idiomatic and widely used in Clojure, and as much of a joy to code with as they are in dash.el.
  • Newer versions of dash.el have -frequencies, in case you need this in Elisp. But unlike Clojure, this function isn’t part of Elisp’s core libraries, because dash isn’t.
  • The #(...%...) thing is a short and sweet way to express lambdas (which Clojure calls fns). In Clojure, these are native syntax. In Elisp, you can get something closely resembling it — but only by using a macro from a third-party library, because Emacs core didn’t accept #(...%...) as actual syntax.

    But even if you use it, you wouldn't be able to thread it directly in Elisp as you can in Clojure.
    (-> "Foo bar QUUX" downcase
        ((##s-split "[ \n\r\t]+" %)))
    !!> invalid-function
    

    You’d need to use the whole explicit lambda instead:

    (-> "Foo bar QUUX" downcase
        ((lambda (s) (s-split "[ \n\r\t]+" s))))
    => '("foo" "bar" "quux")
    

    But this syntax is discouraged, so if you try to compile it you’ll get a warning:

    Warning: Use of deprecated ((lambda (x) ...) ...) form
    

    “Ah!”, you think, “I’ll just use the newer lambda literal that came with Emacs 30” — so you try this and it works!

    (-> "Foo bar QUUX" downcase
        (#[(s) ((s-split "[ \n\r\t]+" s)) (t)]))
    => '("foo" "bar" "quux")
    

    But apart from the fact that the new literal is just as long as the old lambda was, and noisy, you’ll still get a compiler warning — a different one:

    Warning: Use ‘funcall’ instead of ‘#[(s) ((s-split "[ \n\r\t]+" s)) (t)]’ in the function position
    

    by which they mean that although this sort of thing works:

    ((lambda (x) (+ x 7)) 35) => 42
    

    it’s been deprecated (I don’t know when or why), so you have to add a funcall:

    (funcall (lambda (x) (+ x 7)) 35) => 42
    

    which, besides being long, wouldn’t work for our case because we’re first-threading.

    You could last-thread it to get rid of the lambda:

    (->> "Foo bar QUUX" downcase
         (s-split "[ \n\r\t]+"))
    => '("foo" "bar" "quux")
    

    but mixed threading positions is the reason why we’re using a lambda here in the first place: it’d fix this one but break others.

    So how do you mix-thread in Elisp? You could either thread anaphorically, with -->:

    (--> "Foo bar QUUX" downcase
         (s-split "[ \n\r\t]+" it))
    => '("foo" "bar" "quux")
    

    or use -as->, which is the same as Clojure’s as->:

    (-as-> "Foo bar QUUX" myvar
           (downcase myvar)
           (s-split "[ \n\r\t]+" myvar))
    => '("foo" "bar" "quux")
    

    whereas in Clojure you can either use as->:

    (as-> "Foo bar QUUX" myvar
      (s/lower-case myvar)
      (s/split myvar #"[ \n\r\t]+"))
    => '["foo" "bar" "quux"]
    

    (which has the downside of having to repeat the variable name at every step), or use “fn-sexp-as-first” (“lambda-sexp-as-car”) for the exceptions:

    (->> "Foo bar QUUX"
         (s/lower-case)
         (#(s/split % #"[ \n\r\t]+")))
    

    which is doing this:

    (#(s/split % #"[ \n\r\t]+") "foo bar quux") => '["foo" "bar" "quux"]
    

    which, were we not last-threading, would be more simply written as this:

    (s/split "foo bar quux" #"[ \n\r\t]+") => '["foo" "bar" "quux"]
    

    but we are last-threading, and that puts the last result in the first position.

    Using “fn-sexp-as-first” gives you the flexibility to put the variable anywhere inside the fn: since the whole thing receives only one arg, that arg can go first, last, or anywhere else.

    Ok, enough threading-related digressions.

Testing it

Here’re the contents of the test input in/foo.txt:

"  The 12 foozle bar: the
Bars' bar, bar, foo's bar, foo."

and here’re its expected strict results using the exercise’s tokenization rules:

bar, 3
the 2
12 1
bar: 1
bars' 1
foo's 1
foo. 1
foozle 1

It works:

(def foo "  The 12 foozle bar: the
Bars' bar, bar, foo's bar, foo.")

(cw-count-it-S1 foo)
=> "bar, 3\nthe 2\n12 1\nbar: 1\nbars' 1\nfoo's 1\nfoo. 1\nfoozle 1\n"

File in, file out

All into memory

F1

Time to put back one of the constraints: we’ll now read the string from a file — but still read it all into memory.

We’ll also write the result to a file.

(defn cw-count-it-F1 [file-in file-out]
  (->> (slurp file-in)
       (#(s/split % #"\s+"))
       (mapv s/lower-case)
       (frequencies)
       (cw-make-string)
       (spit file-out)))

And then:

(def fooin "/tmp/count/foo.txt")
(def fooF1 "/tmp/count/fooF1.txt")
(cw-count-it-F1 fooin fooF1)

It works:

cat /tmp/count/fooF1.txt
bar, 3
the 2
12 1
bar: 1
bars' 1
foo's 1
foo. 1
foozle 1

The only changes were:

  • Instead of a string, the function now takes two args: filename of input, filename of output.
  • s changed to (slurp file-in), and we added a (spit file-out) at the end.

Just slurp and spit.
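
A minimal round trip, with a made-up path:

```clojure
;; Hypothetical example: spit writes a string to a file,
;; slurp reads the whole file back in as a string.
(spit "/tmp/cw-roundtrip.txt" "bar, 3\nthe 2\n")
(slurp "/tmp/cw-roundtrip.txt")
;; => "bar, 3\nthe 2\n"
```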

In Emacs Lisp, I'd often use f-read and f-write.
(-> (f-read file-in)
    do-stuff-with-string
    (f-write 'utf-8 file-out))

where do-stuff-with-string is the series of operations you want to do with the string, not an actual function.

But f.el is also not in Emacs core — so if you want to stick to the standard libraries, you’ll have to do something like this instead:

(let ((s (with-temp-buffer
           (insert-file-contents file-in)
           (buffer-string))))
  (setq s (do-stuff-with-string s))
  (write-region s nil file-out))

or

(with-temp-buffer
  (insert-file-contents file-in)
  (do-stuff-in-buffer)
  (write-region (point-min) (point-max) file-out))

where do-stuff-in-buffer is also not an actual function, but stands here for imperative buffer operations such as:

(while (re-search-forward "some-REGEXP" nil t)
  (push (match-string 0) ...))

Not all into memory — line by line

L1

So now let’s re-introduce the last remaining constraint: read the file line by line instead of slurping it all into memory.

(defn cw-count-it-L1 [file-in file-out]
  (with-open [rdr (jio/reader file-in)]
    (binding [*in* rdr]
      (loop [ln (read-line), m {}]
        (if ln
          (recur (read-line) (->> (s/split (s/lower-case ln) #"\s+")
                                  (frequencies)  (merge-with + m)))
          (->> m (cw-make-string) (spit file-out)))))))

I read this as follows:

Bind standard input to a reader of the input file. Create an empty map, and while there are lines to read, read one of them, lowercase it, and split it at whitespace. We’ll have words (with possible punctuation). Get their frequencies into a map. Merge this map with our accumulating map (which started empty). When the merge involves keys that show up in both maps, their values are added.

When we run out of lines to read, pick the accumulated map and run cw-make-string — which removes the "" key, sorts by count descending and alphabetically when there’s a tie, then replaces every pair with a string: key space value newline, and joins these strings. Write the result to file-out.
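
The merging step is worth seeing in isolation. With made-up per-line counts, merge-with + adds the values of keys that appear in both maps:

```clojure
;; Counts from two separate lines, folded into one running total:
(merge-with + {"the" 1 "bar," 2} {"bar," 1 "foo." 1})
;; => {"the" 1, "bar," 3, "foo." 1}
```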

If we test it:

(def fooin "/tmp/count/foo.txt")
(def fooL1 "/tmp/count/fooL1.txt")
(cw-count-it-L1 fooin fooL1)

we get the same result as the previous F1:

cat /tmp/count/fooL1.txt
bar, 3
the 2
12 1
bar: 1
bars' 1
foo's 1
foo. 1
foozle 1

the only difference being that we’re now reading the file line by line.

Running this over the actual large input data file produces the expected result: the hashes match.

Not all into memory — parallel

P0 and P1

Now, I think the L1 above is absolutely fine, and simple, and good enough for the problem at hand.

Yet you could say “But flandrew: isn’t this one of those pleasingly parallel problems? Couldn’t we, you know, do all that counting in parallel?”.

And you’d have a point. For example, we could do something like this:

(defn cw-count-it-P0 [file-in file-out]
  (with-open [rdr (jio/reader file-in)]
    (->> rdr  line-seq  (partition-all 10000)
         (pmap (fn [part]
                 (reduce #(merge-with
                           + %1 (frequencies
                                 (s/split (s/lower-case %2) #"\s+")))
                         {} part)))
         (apply merge-with +)
         (cw-make-string)
         (spit file-out))))

or this:

(defn cw-count-it-P1 [file-in file-out]
  (with-open [rdr (jio/reader file-in)]
    (->> rdr  line-seq  (partition-all 10000)
         (pmap #(loop [lns % m {}]
                  (if-let [ln (first lns)]
                    (recur (rest lns)
                           (->> (s/split (s/lower-case ln) #"\s+")
                                (frequencies)  (merge-with + m)))
                    m)))
         (apply merge-with +)
         (cw-make-string)
         (spit file-out))))

which I read as follows:

Open a reader on the input file. Create a lazy sequence of all its lines. Partition this sequence into chunks of 10000 lines.

Now process these lazy 10000-line chunks in parallel. For each of them, loop over its lines sequentially, and keep accumulating the word counts in a map.

After all chunks have been processed, take their maps and merge them, then run cw-make-string — which removes the "" key, sorts by count descending and alphabetically when there’s a tie, then replaces every pair with a string: key space value newline, and joins these strings. Finally, write the result to file-out.
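
The chunk-count-merge shape can be sketched in miniature (toy words, chunk size 2 instead of 10000, and a sequential map where the real solutions use pmap):

```clojure
(->> ["the" "bar," "the" "foo." "bar,"]
     (partition-all 2)      ; => (("the" "bar,") ("the" "foo.") ("bar,"))
     (map frequencies)      ; count each chunk independently
     (apply merge-with +))  ; then merge the per-chunk maps
;; => {"the" 2, "bar," 2, "foo." 1}
```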

And if we test it:

(def fooin "/tmp/count/foo.txt")
(def fooP1 "/tmp/count/fooP1.txt")
(cw-count-it-P1 fooin fooP1)

we get the same results.

Likewise with the actual huge input file provided by the exercise.

Whether this will be faster or not, and by how much, depends on your machine — but also on the chunking size, because parallelizing itself adds an overhead. Frankly, for this particular exercise and its input size (not that huge), I’d stick with the simpler L1.

📆 2026-W15-4 📆 2026-04-09