Counting Words III: Many solutions in Emacs Lisp

(See Intro, Bash)

If we haven't already, we first need to create the other input files.
This is more easily done with the script from the Bash post:

./countwords.sh -f

Requirements

We'll use these extra libraries:

(require 'dash)
(require 'ht)
(require 'f)
(require 's)

Auxiliary functions

Let's start from the end. If we have the word-count pairs in an alist, we can get from there to the desired output with this:

(defun cw-sort-alist (alist)
  "Reverse-numerically sort cdr of ALIST, then convert it to a single string."
  (->> alist
       (--sort (string< (car it) (car other))) ; alphabetic when same number
       (--sort (>       (cdr it) (cdr other)))
       (--map  (concat  (car it) " " (format "%d" (cdr it))))
       (s-join   "\n")
       (s-append "\n")))

So given something like this:

'(("the"    . 2)
  ("foozle" . 1)
  ("bar"    . 5)
  ("foo"    . 2))

it returns this:

bar 5
foo 2
the 2
foozle 1

So it normalizes the sorting order of the results: reverse-numeric by count, then alphabetic when counts are equal.

Strict alphabetic sorting of words with the same count wasn't required, but still. The first --sort line sorts the alist alphabetically by the car; the second then sorts it reverse-numerically by the cdr, and since --sort is stable, the alphabetic order survives among entries with equal counts.

(This solution works, but it sorts the whole thing twice, and there should be a better way. Could --sort be used to emulate Bash's sort -k2rn,1 — which, as far as I can tell, sorts the first field alphabetically only to break ties in the reverse-numeric sort of the second?)
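
One possible single-pass emulation, just a sketch that I haven't benchmarked against the two-pass version above:

;; Sketch: a single --sort whose predicate compares counts first
;; and falls back to alphabetic order on ties.
(--sort (or (> (cdr it) (cdr other))
            (and (= (cdr it) (cdr other))
                 (string< (car it) (car other))))
        alist)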

I'll also use this:

(defun cw-split-words (str)
  "Split STR into list of words."
  (s-split "[^[:alpha:]]+" str t))

It's simplified from s-split-words, with three differences:

  • [:alpha:] instead of [:word:]: the latter, depending on the mode's syntax table, may or may not include single quotes, whereas the former never does.
  • digits not included as words.
  • no split of possible mixedCaseWords

With that, we force a split on single quotes regardless.
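
A quick illustration (the input string is made up):

(cw-split-words "Don't count 2 mixedCaseWords!")
;; => ("Don" "t" "count" "mixedCaseWords")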

Solutions

Bad solution #0

How do we get to the alist? Let's start with a bad solution.

(defun cw-count-it-0 (file)
  "Given a FILE, count words."
  (let ((words (->> file  f-read  downcase  cw-split-words
                    (--remove (not (string-match-p "^[[:alpha:]]+$" it))))))
    (->> words  -uniq
         (-map (lambda (uniqw)
                 `(,uniqw . ,(--count (equal uniqw it) words))))
         cw-sort-alist)))

It reads the input file, downcases everything, splits it into a list of words, and removes numbers. So far so good. But then it does this: it creates a second list with the unique words and, for each of them, goes over the whole list of (repeated) words to count its occurrences.

I expected this to be horribly slow. For foo.txt, it took 2 ms. For in/kjv÷10.txt, it took... 67 seconds. So this is clearly out, yes? I won't re-run it.

Solutions #1 and #2: temporary buffers

To avoid this repetition, it's much better to go through the complete word list only once, incrementing that word's counter at each step. In Emacs Lisp, such key-value pairs are usually stored in an alist, a plist, or a hash table. The last is the fastest.
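
The core update is a one-liner either way; with a built-in hash table it looks something like this (a minimal sketch):

;; Counting idiom with a built-in hash table; gethash's third argument
;; is the default returned when the key is absent.  (The ht library used
;; below essentially wraps these same built-ins.)
(let ((table (make-hash-table :test #'equal)))
  (dolist (word '("foo" "bar" "foo"))
    (puthash word (1+ (gethash word table 0)) table))
  (gethash "foo" table)) ;=> 2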

Hash table operations in elisp feel much more intuitive with Wilfred Hughes' ht library, which is what I use here. Here's a first solution with it:

(defun cw-count-it-1 (file)
  "Given a FILE, count words."
  (let ((words (ht-create)))
    (with-temp-buffer
      (insert-file-contents file)
      (while (not (eobp))
        (--> (buffer-substring-no-properties
              (point) (progn (forward-word-strictly) (point)))
          downcase
          (replace-regexp-in-string "[^[:alpha:]]+" "" it)
          (unless (s-blank-str? it)
            (ht-set! words it (1+ (or (ht-get words it) 0)))))))
    (->> words  ht->alist  cw-sort-alist)))

It inserts the whole file into a temp buffer, goes over it "word" by "word", removes non-alpha characters from each "word", and keeps updating the counters in the hash table. Then it converts the table to an alist and uses the previous alist-sorting function.

Any variations on that? We could instead read each line and split it into words:

(defun cw-count-it-2 (file)
  "Given a FILE, count words."
  (let ((words (ht-create)))
    (with-temp-buffer
      (insert-file-contents file)
      (while (search-forward-regexp ".*\n\\|.+" nil t)
        (let ((line (match-string 0)))
          (->> line  downcase  cw-split-words
               (mapc (lambda (word)
                       (ht-set! words word
                                (1+ (or (ht-get words word) 0)))))))))
    (->> words  ht->alist  cw-sort-alist)))

Solution #3: just slurp the whole text

We could also not use temp buffers at all. Instead, we slurp the whole file into memory as a string, downcase it, split it into a list of "words", and update a hash table for every item on the list.

(defun cw-count-it-3 (file)
  "Given a FILE, count words."
  (let ((words (ht-create)))
    (->> file  f-read  downcase  cw-split-words
         (mapc (lambda (word)
                 (ht-set! words word (1+ (or (ht-get words word) 0))))))
    (->> words  ht->alist  cw-sort-alist)))

Solution #4: workaround to read it by line

Ok, enough of this approach. What else could we try? Could we now pretend we're dealing with giant files and follow the constraint of not storing the whole thing in memory?

Let's combine:

  1. pieces of our previous functions,
  2. the use of read-from-minibuffer under Emacs in script mode, as seen here.

So we prepare a cw-byline.el file, whose contents are:

;;; cw-byline.el  -*- lexical-binding: t -*-

(defun cw-main ()
  "Print results of cw-process-stream to stdout."
  (cw-load-paths-and-requires)
  (princ (cw-process-stream)))

(defun cw-load-paths-and-requires ()
  "Load paths and requires."
  (let ((default-directory  "~/.emacs.d/lisp"))
    (normal-top-level-add-to-load-path '("."))
    (normal-top-level-add-subdirs-to-load-path)
    (add-to-list 'load-path default-directory))
  (require 'dash)
  (require 's)
  (require 'ht))

(defun cw-process-stream ()
  "Given a stream, return sorted word-count pair."
  (let ((words (ht-create)))
    (cw-read-and-process-lines (lambda (str) (cw-process-line words str)))
    (->> words  ht->alist  cw-sort-alist)))

(defun cw-read-and-process-lines (fun)
  "Given a stream, process each line with function FUN"
  (let (line)
    (while (setq line (ignore-errors (read-from-minibuffer "")))
      (funcall fun line))))

(defun cw-process-line (hs string)
  "For each word of STRING, update counter of hashtable HS"
  (->> (s-match-strings-all "\\b[[:alpha:]]+\\b" string)
    -flatten
    (mapc (lambda (word)
            (ht-set! hs (downcase word)
                     (1+ (or (ht-get hs (downcase word)) 0)))))))

(defun cw-sort-alist (alist)
  "Reverse-numerically sort cdr of ALIST, then convert it to a single string."
  (->> alist
    (--sort (string< (car it) (car other))) ; alphabetic when the same number
    (--sort (>       (cdr it) (cdr other)))
    (--map  (concat  (car it) " " (format "%d" (cdr it))))
    (s-join   "\n")
    (s-append "\n")))

(cw-main)
(kill-emacs 0)

;;; cw-byline.el ends here

byte-compile it:

(emacs-lisp-byte-compile "cw-byline.el")

and call it from a wrapper:

(defun cw-count-it-4 (file)
  "Given a FILE, count words."
  (shell-command-to-string
   (format "<%s  emacs --quick --script %s/%s  2>/dev/null"
           file default-directory "cw-byline.elc")))

And... it works. And we obey the constraint of not loading the whole input file into memory.

Solution #5: workaround to read it in chunks of ~64 kB

What else?

So a person follows the odd route of using an elisp script that camouflages itself as a shell script. We could safely assume that said person has some shell there, no? Could there be, I don't know... wc? sed?

Let's then cheat a tiny little bit. Let's use:

  • wc to see how many lines our input has and a line's maximum length, and
  • sed to give us ranges of lines.

With that we could follow the constraint of no more than about 64 kB in memory at a time, and we don't need to call sed -n ${line}p for every single line, which would be unnecessarily costly.

A quick test:

< in/kjv.txt  wc -l
< in/kjv.txt  wc -L

outputs:

99817
77

That's almost 100k lines, none longer than 77 characters (+1 for the '\n'). This is an English text, so at 1 B/char we'd fetch no more than about

(/ 64000 78) ;=> 820 lines at a time

And kjv×10.txt, the largest input, has 43325060 bytes, so roughly

(/ 43325060 64000) ;=> 676 shell-command-to-string calls to sed. Ouch.

(And about a tenth of that for kjv.txt itself. Why do I suspect this will be slow?)

Let's see.

(defun cw-count-it-5 (file)
  "Given a FILE, count words."
  (let* ((words      (ht-create))
         (chunk-size 64000)
         (num-lines  (->> (format "wc -l <<< \"$(<'%s')\"" file)
                        ;; The here-string here ^ makes sure the string has a
                       ;; trailing '\n', since -l is a newline count.
                       shell-command-to-string  string-to-number))
         (max-len    (->> (format "wc -L < '%s'" file)
                       shell-command-to-string  string-to-number))
         (lns/chunk  (/ chunk-size (1+ max-len))) ; 1+ for '\n'
         (num-chunks (ceiling (/ num-lines (float lns/chunk))))
         beg end chunk-txt)
    (dotimes (i num-chunks)
      (setq beg       (1+ (* i  lns/chunk))
            end       (* (1+ i) lns/chunk)
            chunk-txt (shell-command-to-string
                       (format "sed -n '%d,%d p' < '%s'" beg end file)))
      ;; From here, we simply use #3 over the chunk of text:
      (->> chunk-txt  downcase  cw-split-words
           (mapc (lambda (word)
                   (ht-set! words word (1+ (or (ht-get words word) 0)))))))
    (->> words  ht->alist  cw-sort-alist)))

And... it works.

What else? To avoid the repeated calls to sed through shell-command-to-string, we could call the shell a single time and have it chop the file into many chunks, stored in temporary files for later retrieval by Elisp. But this would only make sense if the input file were indeed large, and in that case the time spent writing it all to disk might easily exceed the time of calling sed repeatedly, not to mention requiring extra disk space as large as the original file. So let's drop this idea and have a look at how the functions we have so far compare in speed.

Speeds (and new solutions #6–11)

So let's byte-compile all the functions, load them, run over the inputs, and compare speeds.

Evaluation times, in seconds:

|               |   foo | kjv÷10 |    kjv |  kjv×10 |
|---------------+-------+--------+--------+---------|
| cw-count-it-1 | 0.084 |  3.053 | 28.355 | 279.815 |
| cw-count-it-2 | 0.044 |  0.764 |  6.836 |  64.942 |
| cw-count-it-3 | 0.041 |  0.612 |  3.976 |  30.259 |
| cw-count-it-4 | 0.389 |  1.769 | 13.714 | 135.399 |
| cw-count-it-5 | 0.187 |  1.610 | 12.857 | 329.911 |

This isn't very fast, is it? My Bash one was doing kjv.txt in 0.9s (well, "Bash" as in "glued calls to compiled command line utilities such as tr and grep"...). None of the above gets close to it. Why the slowness?

Profiling the fastest function:

M-x profiler-start
(cw-count-it-3 "in/kjv÷10.txt")
M-x profiler-report

it seems that about 40% of it comes from split-string (from s-split (from cw-split-words)).
On the second fastest (#2), about 60% is likewise due to split-string.

But even if we eliminated split-string, it would still be slower than the piped shell version. So let's move on anyway.

Speeds will of course vary depending on the machine you run them on, but it's useful to see the relative speeds. Interesting that the fastest is the solution that manipulates the whole string (#3) rather than the arguably more idiomatic ones using with-temp-buffer (#1 and #2). Also interesting that, between those two, the one that splits matched lines (#2) is about 4× as fast as the one that jumps word by word (#1).

But then... I'm using different functions to do this, so the comparison is not that direct. What if I changed the regexp passed to search-forward-regexp in #2 so that it matches words directly, eliminating some of the subsequent processing? I mean this:

(defun cw-count-it-6 (file)
  "Given a FILE, count words."
  (let ((words (ht-create)))
    (with-temp-buffer
      (insert-file-contents file)
      (while (search-forward-regexp "\\b[[:alpha:]]+\\b" nil t)
        (--> (match-string 0)  downcase
             (ht-set! words it
                      (1+ (or (ht-get words it) 0))))))
    (->> words  ht->alist  cw-sort-alist)))

Speeds?

|               |   foo | kjv÷10 |   kjv | kjv×10 |
|---------------+-------+--------+-------+--------|
| cw-count-it-6 | 0.065 |  0.688 | 5.150 | 48.524 |

Better.

Profiler says 63% of it comes from search-forward-regexp, and 60% of this comes from... ad-Advice-search-forward-regexp, which says "Around-advice ‘fancy-narrow-around-advice’."

Ah, although I much like Malabarba's fancy-narrow, we could certainly do without it here.

But I'll not go down the path of disabling modes and removing advice. The improvement doesn't seem to be so substantial as to make these functions superfast. If you're an Emacs Lisper and see something here, let me know: I'm curious whether I'm missing something significant that could fix these speeds. After all, in another challenge Elisp vastly outperformed Bash.


Ok, what else could we try? What about not splitting the string at all? Do we really need to transform it into a list? We could instead walk the string and, for each character, see whether it's alphabetic; if it is, add it to the current word. Once it isn't, the word is added to the hash table and then emptied.

So we could do this:

(defun cw-count-it-7 (file)
  "Given a FILE, count words."
  (let* ((words   (ht-create))
         (addword (lambda (word)
                    (ht-set! words word (1+ (or (ht-get words word) 0)))))
         (string  (->> file  f-read  downcase))
         (length  (length string))
         ;; Initialize variables
         (pos     length)
         (word    "")
         letter)
    (while (> pos 0)
      (setq letter (substring string (1- pos) pos))
      (pcase (string-match "[[:alpha:]]" letter)
        (0 (setq word (concat letter word)))
        (_ (funcall addword word)
           (setq word "")))
      (setq pos (1- pos)))
    (funcall addword word)
    (ht-remove! words "")
    (->> words  ht->alist  cw-sort-alist)))

Better yet, instead of treating the letter as a length-one string, we could deal with it as a character and have the word be a list of characters, converting it to a string only when adding it to the hash table. This:

(defun cw-count-it-8 (file)
  "Given a FILE, count words."
  (let* ((words   (ht-create))
         (addword (lambda (charlist)
                    (let ((word (concat charlist)))
                      (ht-set! words word (1+ (or (ht-get words word) 0))))))
         (string  (->> file  f-read  downcase))
         (length  (length string))
         ;; Initialize variables
         (pos     length)
         word char)
    (while (> pos 0)
      (setq char (aref string (1- pos)))
      (if (and (>= char ?a)
               (<= char ?z))
          (setq word (cons char word))
        (funcall addword word)
        (setq word nil))
      (setq pos (1- pos)))
    (funcall addword word)
    (ht-remove! words "")
    (->> words  ht->alist  cw-sort-alist)))

By the way:

  1. when I tried, in #7, to use (format "%s%s" letter word) instead of concat... it made the function 15% slower.
  2. in both #7 and #8, downcasing the words or characters individually didn't seem to make much of a difference, so I left the downcasing applied at the beginning, to the whole string. If anything, individual downcasing made things slightly worse: about 2.5% slower with #8, for example. (Such micro-differences can be checked with the timing sketch below.)
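
One straightforward way to check micro-differences like these is to time the variants over the same input with benchmark-elapse; a sketch, with an illustrative file path:

;; Sketch: compare two variants on the same input file.
(require 'benchmark)
(let ((file "in/kjv.txt"))
  (list (benchmark-elapse (cw-count-it-7 file))
        (benchmark-elapse (cw-count-it-8 file))))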

Speeds:

|               |   foo | kjv÷10 |    kjv |  kjv×10 |
|---------------+-------+--------+--------+---------|
| cw-count-it-7 | 0.150 |  1.829 | 16.359 | 121.155 |
| cw-count-it-8 | 0.036 |  0.624 |  5.335 |  40.810 |

Wow! What an enormous difference it made here to keep characters as characters. Is it because aref is much faster than substring or something like that? Both are written in C. And aref surely treats args as arrays. Doesn't substring also? No idea. But as a quick test:

(let* ((file       "~/path/to/countwords/in/kjv.txt")
       (string     (f-read file))
       (len        (length string))
       (aref-time  (->> (dotimes (i len) (aref      string i))
                     benchmark-elapse (format "%.3f") read))
       (subs-time  (->> (dotimes (i len) (substring string i (1+ i)))
                     benchmark-elapse (format "%.3f") read)))
  `(( aref       substring)   hline
    (,aref-time ,subs-time)))
|  aref | substring |
|-------+-----------|
| 2.293 |      8.68 |

So, yeah, aref is much faster.

One more: what if we don't use hash tables at all? What about this:

  1. Insert the file in a temp buffer
  2. Go over it, replacing every run of 1+ non-alpha characters with '\n'
  3. Downcase the whole thing
  4. Sort lines in reverse alphabetical order
  5. Go to the beginning
  6. Loop: keep going down line by line; if the word is repeated, increase counter; repeat until word isn't repeated — then remove the repetitions and print the counter after the word. Reset counter.
  7. Sort by second field, numerically
  8. Reverse the lines.
  9. Return them.

Let's try it:

(defun cw-count-it-9 (file)
  "Given a FILE, count words."
  (with-temp-buffer
    (insert-file-contents file)
    (goto-char (point-min))
    (while (re-search-forward "[^[:alpha:]]+" nil t)
      (replace-match "\n"))
    (downcase-region (point-min) (point-max))
    (sort-lines t    (point-min) (point-max))
    (goto-char (point-max))
    (while (bolp)
      (delete-char -1))
    (insert "\n")
    (goto-char (point-min))
    (let ((start (point))
          (ct    1))
      (while (not (eobp))
        (pcase (looking-at "\\(.*\n\\)\\1")
          ('t   (forward-line)
                (setq ct (1+ ct)))
          ('nil (delete-region start (point))
                (end-of-line)
                (insert " " (number-to-string ct))
                (forward-char)
                (setq start (point)
                      ct    1)))))
    (sort-numeric-fields 2 (point-min) (point-max))
    (reverse-region        (point-min) (point-max))
    (buffer-string)))

Speeds:

|               |   foo | kjv÷10 |    kjv | kjv×10 |
|---------------+-------+--------+--------+--------|
| cw-count-it-9 | 0.041 |  2.886 | 30.824 |     ?! |

Not good, is it? I didn't wait to see how long it'd take for the last one: I got bored with seeing "Reordering buffer..." taking forever in the echo area. Sorting the whole 8.2M words is a no-go, especially with the sorting functions above, which are written in Elisp rather than C.

But what if it were in C? What if we used plain sort? It needs a sequence, so we could go back to using cw-split-words to create a list, convert it to a vector (whose positions can be accessed in constant time), then walk the vector to produce an alist directly, then sort it with cw-sort-alist. Let's see.

(defun cw-count-it-10 (file)
  "Given a FILE, count words."
  (with-temp-buffer
    (let* ((sorted (--> file  f-read  downcase  cw-split-words  vconcat
                        (sort it #'string>)))
           (idx    0)
           (max    (1- (length sorted)))
           (pre    (aref sorted idx))
           (ct     1)
           alist cur)
      (while (< idx max)
        (setq idx (1+ idx)
              cur (aref sorted idx))
        (if (equal pre cur)
            (setq ct (1+ ct))
          (push (cons pre ct) alist)
          (setq pre cur
                ct  1)))
      (push (cons pre ct) alist)
      (cw-sort-alist alist))))

Speeds:

|                |   foo | kjv÷10 |   kjv | kjv×10 |
|----------------+-------+--------+-------+--------|
| cw-count-it-10 | 0.033 |  0.671 | 6.900 | 67.778 |

Not too bad. I don't like all that consing-in-a-loop, though. Is it perhaps the case that ht->alist does a more efficient conversion? No idea. If it does, we could go back to using hash tables, but now adding each key-value pair only once, with no updating (the counting is done by ct, which is possible because of the sorting).

Instead of looking at ht's code, let's first try and run it to see if it's better.

(defun cw-count-it-11 (file)
  "Given a FILE, count words."
  (with-temp-buffer
    (let* ((sorted (--> file  f-read  downcase  cw-split-words  vconcat
                        (sort it #'string>)))
           (idx    0)
           (max    (1- (length sorted)))
           (pre    (aref sorted idx))
           (ct     1)
           (words  (ht-create))
           cur)
      (while (< idx max)
        (setq idx (1+ idx)
              cur (aref sorted idx))
        (if (equal pre cur)
            (setq ct (1+ ct))
          (ht-set! words pre ct)
          (setq pre cur
                ct  1)))
      (ht-set! words pre ct)
      (->> words  ht->alist  cw-sort-alist))))

Speeds:

|                |   foo | kjv÷10 |   kjv | kjv×10 |
|----------------+-------+--------+-------+--------|
| cw-count-it-11 | 0.043 |  0.498 | 6.959 | 67.911 |

Same thing, then.

By the way, for any input file given, the written outputs of the Bash solution and of all Elisp functions up to here are exactly the same. So, for example, md5sum kjv--* returns:

c53368282b3dbcbd2f34d5007e1d3928  kjv--bash
c53368282b3dbcbd2f34d5007e1d3928  kjv--cw-count-it-1
c53368282b3dbcbd2f34d5007e1d3928  kjv--cw-count-it-2
c53368282b3dbcbd2f34d5007e1d3928  kjv--cw-count-it-3
c53368282b3dbcbd2f34d5007e1d3928  kjv--cw-count-it-4
c53368282b3dbcbd2f34d5007e1d3928  kjv--cw-count-it-5
c53368282b3dbcbd2f34d5007e1d3928  kjv--cw-count-it-6
c53368282b3dbcbd2f34d5007e1d3928  kjv--cw-count-it-7
c53368282b3dbcbd2f34d5007e1d3928  kjv--cw-count-it-8
c53368282b3dbcbd2f34d5007e1d3928  kjv--cw-count-it-9
c53368282b3dbcbd2f34d5007e1d3928  kjv--cw-count-it-10
c53368282b3dbcbd2f34d5007e1d3928  kjv--cw-count-it-11

Back to the constrained exercise: a solution

Let's go back to the original exercise. So now we reintroduce these constraints:

  1. memory: don't read whole file into memory
  2. stdlib: only use the language's standard libraries
  3. words: anything separated by whitespace — ignore punctuation

Very well. So we'll need to adapt solution #4 — the only one that reads the file piecemeal without (as #5 does) taking the liberty of calling sed and wc through shell-command-to-string.

The output will of course be different, given that we're switching back to the exercise's definition of "word", which now also includes things like: you!), zorobabel;, (whoso, 1:1, etc.

The adaptation will simplify the splitting, so that it happens only on whitespace. It will also get rid of the four libraries the previous functions depend on. Here we go:

;;; cw-constrained.el --- Observe all the original constraints  -*- lexical-binding: t -*-

;;; Commentary:
;; Observe all the constraints given by the original exercise.

;;; Code:

(defun cw-main ()
  "Print results of ‘cw-process-stream’ to stdout."
  (princ (cw-process-stream)))

(defun cw-process-stream ()
  "Given a stream, return sorted word-count pair."
  (let ((words (make-hash-table :test 'equal)))
    (cw-read-and-process-lines (lambda (str) (cw-process-line words str)))
    (cw-ht->sorted-alist words)))

(defun cw-read-and-process-lines (fun)
  "Given a stream, process each line with function FUN."
  (let (line)
    (while (setq line (ignore-errors (read-from-minibuffer "")))
      (funcall fun line))))

(defun cw-process-line (table str)
  "Update counter of hashtable TABLE for each word of string STR."
  (mapc (lambda (word)
          (unless (string= word "")
            (prog1 nil
              (puthash (downcase word)
                       (1+ (or (gethash (downcase word) table)
                               0))
                       table))))
        (split-string str "[ \n]+")))

(defun cw-ht->sorted-alist (table)
  "Convert hash-table TABLE to an alist, then to a sorted string.
The cdr of the alist is reverse-numerically sorted, then converted to a single string."
  (let (res)
    (maphash (lambda (key value)
               (push (cons key value) res))
             table)
    (setq res (sort res (lambda (it other) (string< (car it) (car other)))))
    (setq res (sort res (lambda (it other) (>       (cdr it) (cdr other)))))
    (format "%s\n" (mapconcat 'identity
                              (mapcar (lambda (it)
                                        (concat  (car it) " "
                                                 (format "%d" (cdr it))))
                                      res)
                              "\n"))))

(cw-main)
(kill-emacs 0)

;;; cw-constrained.el ends here

We byte-compile it, then wrap the script in a function:

(defun cw-count-it-A (file)
  "Given a FILE, count words."
  (shell-command-to-string
   (format "<%s  emacs --quick --script %s/%s  2>/dev/null"
           file default-directory "cw-constrained.elc")))

in order to more easily check the times:

|               |   foo | kjv÷10 |   kjv | kjv×10 |
|---------------+-------+--------+-------+--------|
| cw-count-it-A | 0.221 |  0.978 | 7.875 | 75.822 |

At the terminal, the command would be:

(cd ./in && ln -nfs kjv{×,bible_x}10.txt)     # the exercise's original filename
ln -nfs cw-constrained.el   cw-count-it-A.el  # a symlink
ln -nfs cw-constrained.elc  cw-count-it-A.elc # a symlink
time emacs --quick --script cw-count-it-A.elc < in/kjvbible_x10.txt >/dev/null

with output:

real 1m15.086s

So there we go. In a "mere 75 seconds" this piece of Emacs Lisp counts the word frequencies of an 8.21M-word text file. Let's by all means try and improve the speed — but if we stop for a minute to think that not too many decades ago our only choice would have been to do it by hand... 9 μs/word starts looking pretty good.
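
A quick back-of-the-envelope check of that per-word figure:

;; ~75 s spent over ~8.21 million words, in microseconds per word:
(/ 75.086e6 8.21e6) ;=> ~9.15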

A faster solution for the expected original output

I've tested it, and it seems that the definition of "word" (split at whitespace vs. split at non-alpha) has minimal impact on speed. If in #3 I change cw-split-words to (s-split "[ \n\r\t\v\f]+"), we get this:

(defun cw-count-it-B (file)
  "Given a FILE, count words."
  (let ((words (ht-create)))
    (->> file  f-read  downcase  (s-split "[ \n\r\t\v\f]+")
         (mapc (lambda (word)
                 (ht-set! words word (1+ (or (ht-get words word) 0))))))
    (ht-remove! words "")
    (->> words  ht->alist  cw-sort-alist)))

with these speeds:

|               |   foo | kjv÷10 |   kjv | kjv×10 |
|---------------+-------+--------+-------+--------|
| cw-count-it-B | 0.039 |  0.333 | 4.863 | 32.198 |

So while #A is the only solution that follows all the original constraints (including not using other libraries and not slurping the whole file into memory), #B is a second solution that outputs the expected sorted list of "words". At that, it's the fastest one here.

Their outputs match Ben Hoyt's output.txt's MD5:

786f665f0b045ebb9ed20aea82fc1772  kjv×10--cw-count-it-A
786f665f0b045ebb9ed20aea82fc1772  kjv×10--cw-count-it-B

Updated benchmarking table of the Elisp solutions

Here's our consolidated table:

|                |   foo | kjv÷10 |    kjv |  kjv×10 |
|----------------+-------+--------+--------+---------|
| cw-count-it-1  | 0.084 |  3.053 | 28.355 | 279.815 |
| cw-count-it-2  | 0.044 |  0.764 |  6.836 |  64.942 |
| cw-count-it-3  | 0.041 |  0.612 |  3.976 |  30.259 |
| cw-count-it-4  | 0.389 |  1.769 | 13.714 | 135.399 |
| cw-count-it-5  | 0.187 |  1.610 | 12.857 | 329.911 |
| cw-count-it-6  | 0.065 |  0.688 |  5.150 |  48.524 |
| cw-count-it-7  | 0.150 |  1.829 | 16.359 | 121.155 |
| cw-count-it-8  | 0.036 |  0.624 |  5.335 |  40.810 |
| cw-count-it-9  | 0.041 |  2.886 | 30.824 |      ?! |
| cw-count-it-10 | 0.033 |  0.671 |  6.900 |  67.778 |
| cw-count-it-11 | 0.043 |  0.498 |  6.959 |  67.911 |
| cw-count-it-A  | 0.221 |  0.978 |  7.875 |  75.822 |
| cw-count-it-B  | 0.043 |  0.337 |  4.022 |  33.524 |

To create your own benchmarking table and test speeds on your machine, see part IV of this series here. You can then write your own functions and compare.

And if you write Emacs Lisp solutions whose approaches are different from the ones here, show us!

Comparing to other languages using benchmark.py

So maybe my speeds are so low compared to other languages because of the machine I ran them on? Easy to find out. Let's run my best solutions using an adapted version of Ben Hoyt's benchmark.py and see.

So I simplified and modified his test.sh:

#!/usr/bin/env bash

# SPDX-FileCopyrightText:  © 2021 Ben Hoyt
# SPDX-FileCopyrightText:  © flandrew
#
# SPDX-License-Identifier: MIT
#
#
# Benchmark for Ben Hoyt's count-words exercise.
#
#   Modified by me from Ben Hoyt's version:
#   - it has only a few languages
#   - it includes my solutions
#   - some extra auxiliary functions, plus much alignment
#
#   For context and background, see his full article and repo here:
#      <https://benhoyt.com/writings/count-words/>
#      <https://github.com/benhoyt/countwords/>

#------------------------------------------------------------------------------

## Variables
k10="kjvbible_x10.txt"
out="output.txt"

## Functions
normalize()  { python3 normalize.py >"$out" ;}       # Piped to
awkp-norm()  { awk '{ print $2, $1 }' | normalize ;} # Piped to

#------------------------------------------------------------------------------

## Main

cat kjvbible.txt{,,,,,,,,,} >"$k10"

# AWK
echo AWK simple                   ; gawk -f     simple.awk <"$k10" | normalize
if command -v mawk > /dev/null; then
    echo AWK optimized            ; mawk -f  optimized.awk <"$k10" | normalize
fi

# C, C++
echo C simple    ; gcc -O2 "$_".c   -o "$_"-c   ; ./"$_"   <"$k10" | normalize
echo C optimized ; gcc -O2 "$_".c   -o "$_"-c   ; ./"$_"   <"$k10" | normalize
echo C++ simple  ; g++ -O2 "$_".cpp -o "$_"-cpp ; ./"$_"   <"$k10" | normalize

# Other
echo Perl simple                  ; perl         simple.pl <"$k10" | normalize
echo Python simple                ; python3      simple.py <"$k10" | normalize
echo Python optimized             ; python3   optimized.py <"$k10" | normalize
echo Ruby simple                  ; ruby         simple.rb <"$k10" | normalize
echo Unix shell simple            ; bash         simple.sh <"$k10" | awkp-norm
echo Unix shell optimized         ; bash      optimized.sh <"$k10" | awkp-norm

# These have been normalized already (MD5-verified)
echo Unix shell — mine            ; bash     countwords.sh <"$k10"
echo Emacs Lisp — simple — mine   ; bash cw-constrained.sh <"$k10"
echo Emacs Lisp — optimized — mine; bash      cw-byline.sh <"$k10"

And then I simplified and modified his benchmark.py, whose output now goes to an Org table:

#!/usr/bin/env python3
#
# SPDX-FileCopyrightText:  © 2021 Ben Hoyt
# SPDX-FileCopyrightText:  © flandrew
#
# SPDX-License-Identifier: MIT
#
#
# Benchmark for Ben Hoyt's count-words exercise.
#
#   Modified by me from Ben Hoyt's version, so that:
#   - it has only a few languages
#   - it includes my solutions
#   - the output goes to an Org table
#
#   For context and background, see his full article and repo here:
#      <https://benhoyt.com/writings/count-words/>
#      <https://github.com/benhoyt/countwords/>

# Benchmark different versions and output results as Table
# NOTE: run after ./test.sh (which compiles the binaries)

import subprocess
import time

NUM_RUNS = 5
INPUT_FILENAME = 'kjvbible_x10.txt'

def time_run(cmdline):
    cmdline = cmdline + ' <{} >/dev/null'.format(INPUT_FILENAME)
    times = []
    for _ in range(NUM_RUNS):
        start = time.time()
        subprocess.run(cmdline, shell=True)
        elapsed = time.time() - start
        times.append(elapsed)
    return min(times)

programs = [
    ('Emacs Lisp',
     'emacs --quick --script cw-count-it-A.elc',
     'emacs --quick --script cw-count-it-B.elc',    'by me <-----'),
    ('Shell',  'bash countwords.sh',  None,         'by me <-----'),
    ('Shell',  'bash simple.sh',     'bash optimized.sh', ''),
    ('C',      './simple-c',         './optimized-c', ''),
    ('C++',    './simple-cpp',        None, ''),
    ('Perl',   'perl simple.pl',      None, 'by Charles Randall'),
    ('Python', 'python3 simple.py',  'python3 optimized.py', ''),
    ('Ruby',   'ruby simple.rb',      None, ''),
    ('AWK',    'gawk -f simple.awk', 'mawk -f optimized.awk',
     'optimized uses `mawk`'),
    ('grep',   'grep foobar', 'LC_ALL=C grep foobar', '`grep` baseline'),
    ('wc -w',  'wc -w',       'LC_ALL=C wc -w',       '`wc` baseline'),
]

times = []
for program in programs:
    lang, simple, optimized, _ = program
    print('Timing', lang, end=' ', flush=True)
    simple_time = time_run(simple)
    optimized_time = time_run(optimized) if optimized else None
    print('{:.2f} {:.2f}'.format(simple_time, optimized_time or 0))
    times.append((program, simple_time, optimized_time))
times.sort(key=lambda x: x[1])  # sort by simple_time

print()
print('| Language      | Simple | Optimized | Notes |')
print('|---------------+--------+-----------+-------|')
for program, simple_time, optimized_time in times:
    lang, _, _, notes = program
    if optimized_time is not None:
        print('| {:13} | {:6.2f} | {:9.2f} | {} |'.format(lang, simple_time, optimized_time, notes))
    else:
        print('| {:13} | {:6.2f} | {:9} | {} |'.format(lang, simple_time, '', notes))

Then I created and byte-compiled a stand-alone elisp file to be called by these scripts:

;;; cw-count-it-B.el --- Changing the split -*- lexical-binding: t -*-

;;; Commentary:
;; This one changes the definition of word.

;;; Code:
(defun cw-load-paths-and-requires ()
  "Load paths and requires."
  (let ((default-directory  "~/.emacs.d/lisp"))
    (normal-top-level-add-to-load-path '("."))
    (normal-top-level-add-subdirs-to-load-path)
    (add-to-list 'load-path default-directory)
    (add-to-list 'load-path "~/path/to/countwords/"))
  (require 'countwords))

(cw-load-paths-and-requires)
(f-write (cw-count-it-B "kjvbible_x10.txt")
         'utf-8
         "cw-count-it-B--out.txt")
(kill-emacs 0)

;;; cw-count-it-B.el ends here

(The original cw-count-it-B function requires a filename. So I'm hardcoding it here, which will ignore STDIN.)

Then ran it. Results:

| Language   | Simple | Optimized | Notes                 |
|------------+--------+-----------+-----------------------|
| `grep`     |   0.06 |      0.06 | grep baseline         |
| `wc -w`    |   2.55 |      0.80 | wc baseline           |
| C          |   2.61 |      0.81 |                       |
| Perl       |   6.16 |           | by Charles Randall    |
| C++        |   6.78 |           |                       |
| Shell      |   9.57 |           | by me <-----          |
| Ruby       |  12.34 |           |                       |
| AWK        |  14.94 |      3.65 | optimized uses `mawk` |
| Python     |  21.49 |      4.55 |                       |
| Shell      |  44.67 |      7.79 |                       |
| Emacs Lisp |  80.46 |     30.88 | by me <----- :O!      |

(where Emacs Lisp's simple = #A, optimized = #B)
(don't know whether my countwords.sh here should count as simple or optimized)

So. My elapsed times above for the other languages are about 3 to 4 times those in Ben Hoyt's post, confirming my suspicion that part of the reason my Emacs Lisp times were not great was that I ran everything on a slower machine.

But that certainly doesn't explain it all: see, the speeds of my best Emacs Lisp byte-compiled solutions still suck big time in comparison to all languages I'm benchmarking against here!

Why?

  • Weak algorithm? (Even after a dozen different approaches? Unlikely.)
  • Some intrinsic slowness of Elisp? (As in "4× slower than Bash"? Unlikely.)

So it puzzled me. I don't know why.
If you have clues, I'm all ears.