Counting Words III: Many solutions in Emacs Lisp
If we haven't already, we first need to create the other input files. This is most easily done with the script from the Bash post:

```bash
./countwords.sh -f
```
Requirements
We'll use these extra libraries:
```elisp
(require 'dash) (require 'ht) (require 'f) (require 's)
```
Auxiliary functions
Let's start by the end. If we have the word-count pairs in an alist, we can go from there to the desired output with this:
```elisp
(defun cw-sort-alist (alist)
  "Reverse-numerically sort cdr of ALIST, then convert it to a single string."
  (->> alist
       (--sort (string< (car it) (car other))) ; alphabetic when same number
       (--sort (> (cdr it) (cdr other)))
       (--map (concat (car it) " " (format "%d" (cdr it))))
       (s-join "\n")
       (s-append "\n")))
```
So given something like this:
'(("the" . 2) ("foozle" . 1) ("bar" . 5) ("foo" . 2))
it returns this:
```
bar 5
foo 2
the 2
foozle 1
```
So it normalizes the sorting order of the results: reverse-numeric, then alphabetic when the counts are equal. Strict alphabetic sorting of words with the same count wasn't required, but still. The first `--sort` line sorts the alist alphabetically by the `car`; the second `--sort` then sorts it reverse-numerically by the `cdr`.

(This solution works, but it sorts the whole thing twice, and there should be a better way. Could `--sort` be used to emulate Bash's `sort -k2rn,1` — which, as far as I can tell, only sorts the first field alphabetically when there's a tie sorting the second field reverse-numerically?)
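In the meantime, here's a single-pass variant I haven't benchmarked (the name `cw-sort-alist-1pass` is just for illustration): sort once with a compound predicate, reverse-numeric on the `cdr` and alphabetic on the `car` for ties.

```elisp
(defun cw-sort-alist-1pass (alist)
  "Like `cw-sort-alist', but sorting only once."
  (->> alist
       (--sort (or (> (cdr it) (cdr other))
                   (and (= (cdr it) (cdr other))
                        (string< (car it) (car other)))))
       (--map (concat (car it) " " (format "%d" (cdr it))))
       (s-join "\n")
       (s-append "\n")))
```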
I'll also use this:
```elisp
(defun cw-split-words (str)
  "Split STR into list of words."
  (s-split "[^[:alpha:]]+" str t))
```
It's simplified from `s-split-words`, with three differences:

- `[:alpha:]` instead of `[:word:]`: the latter, depending on the mode's syntax table, may or may not include single quotes, whereas the former never does.
- digits not included as words.
- no split of possible mixedCaseWords.

With that, we force a split on single quotes regardless.
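A quick illustration of those differences (the input string is made up):

```elisp
(cw-split-words "Don't count 42 mixedCaseWords twice!")
;; => ("Don" "t" "count" "mixedCaseWords" "twice")
```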
Solutions
Bad solution #0
How do we get to the alist? Let's start with a bad solution.
```elisp
(defun cw-count-it-0 (file)
  "Given a FILE, count words."
  (let ((words (->> file
                    f-read
                    downcase
                    cw-split-words
                    (--remove (not (string-match-p "^[[:alpha:]]+$" it))))))
    (->> words
         -uniq
         (-map (lambda (uniqw)
                 `(,uniqw . ,(--count (equal uniqw it) words))))
         cw-sort-alist)))
```
It reads the input file, downcases everything, splits it into a list of words, and removes numbers. So far so good. But then it does this: it creates a second list with the unique words and, for each of them, goes over the whole list of (repeated) words to count it.
I expected this to be horribly slow. For `foo.txt`, it took 2 ms. For `in/kjv÷10.txt`, it took... 67 seconds. So this is clearly out, yes? I won't re-run it.
Solutions #1 and #2: temporary buffers
To avoid this repetition, it's much better to go through the complete word list only once, incrementing a per-word counter as we go. In Emacs Lisp, such key-value pairs are usually stored in an alist, a plist, or a hash table. The last one is the fastest.
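The counting idiom, in miniature, with the built-in hash-table functions (just an illustration; the solutions below use the ht wrappers for the same thing):

```elisp
;; Count occurrences of each element of a small word list.
(let ((counts (make-hash-table :test 'equal)))
  (dolist (w '("foo" "bar" "foo"))
    (puthash w (1+ (gethash w counts 0)) counts))
  (gethash "foo" counts)) ;=> 2
```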
Hash table operations in elisp feel much more intuitive with Wilfred Hughes' ht library, which is what I use here. Here's a first solution with it:
```elisp
(defun cw-count-it-1 (file)
  "Given a FILE, count words."
  (let ((words (ht-create)))
    (with-temp-buffer
      (insert-file-contents file)
      (while (not (eobp))
        (--> (buffer-substring-no-properties
              (point) (progn (forward-word-strictly) (point)))
             downcase
             (replace-regexp-in-string "[^[:alpha:]]+" "" it)
             (unless (s-blank-str? it)
               (ht-set! words it (1+ (or (ht-get words it) 0)))))))
    (->> words ht->alist cw-sort-alist)))
```
It inserts the whole file in a temp buffer, goes over it "word" by "word", removes non-alpha characters from each "word", and keeps updating the counters in the hash table. Then it converts the table to an alist and uses the previous alist-sorting function.
Any variations on that? We could instead read each line and split it into words:
```elisp
(defun cw-count-it-2 (file)
  "Given a FILE, count words."
  (let ((words (ht-create)))
    (with-temp-buffer
      (insert-file-contents file)
      (while (search-forward-regexp ".*\n\\|.+" nil t)
        (let ((line (match-string 0)))
          (->> line
               downcase
               cw-split-words
               (mapc (lambda (word)
                       (ht-set! words word (1+ (or (ht-get words word) 0)))))))))
    (->> words ht->alist cw-sort-alist)))
```
Solution #3: just slurp the whole text
We could also not use temp buffers at all. Instead, we slurp the whole string into memory, downcase it, split it into a list of "words", and update a hash table for every item on the list.
```elisp
(defun cw-count-it-3 (file)
  "Given a FILE, count words."
  (let ((words (ht-create)))
    (->> file
         f-read
         downcase
         cw-split-words
         (mapc (lambda (word)
                 (ht-set! words word (1+ (or (ht-get words word) 0))))))
    (->> words ht->alist cw-sort-alist)))
```
Solution #4: workaround to read it by line
Ok, enough of this approach. What else could we try? Could we now pretend we're dealing with giant files and follow the constraint of not storing the whole thing in memory?
Let's combine:

- pieces of our previous functions, and
- the use of `read-from-minibuffer` under Emacs in script mode, as seen here (a minimal illustration of the trick follows below).
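As a minimal illustration of that trick (file name hypothetical): when Emacs runs with `--script` and its input is redirected, `read-from-minibuffer` reads one line of stdin per call and signals an error at EOF, which `ignore-errors` turns into nil.

```elisp
;; echo-lines.el --- print each stdin line back, prefixed
;; Run with:  printf 'a\nb\n' | emacs --quick --script echo-lines.el
(let (line)
  (while (setq line (ignore-errors (read-from-minibuffer "")))
    (princ (format "got: %s\n" line))))
```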
So we prepare a `cw-byline.el` file, whose contents are:
```elisp
;;; cw-byline.el -*- lexical-binding: t -*-

(defun cw-main ()
  "Print results of cw-process-stream to stdout."
  (cw-load-paths-and-requires)
  (princ (cw-process-stream)))

(defun cw-load-paths-and-requires ()
  "Load paths and requires."
  (let ((default-directory "~/.emacs.d/lisp"))
    (normal-top-level-add-to-load-path '("."))
    (normal-top-level-add-subdirs-to-load-path)
    (add-to-list 'load-path default-directory))
  (require 'dash)
  (require 's)
  (require 'ht))

(defun cw-process-stream ()
  "Given a stream, return sorted word-count pair."
  (let ((words (ht-create)))
    (cw-read-and-process-lines
     (lambda (str) (cw-process-line words str)))
    (->> words ht->alist cw-sort-alist)))

(defun cw-read-and-process-lines (fun)
  "Given a stream, process each line with function FUN"
  (let (line)
    (while (setq line (ignore-errors (read-from-minibuffer "")))
      (funcall fun line))))

(defun cw-process-line (hs string)
  "For each word of STRING, update counter of hashtable HS"
  (->> (s-match-strings-all "\\b[[:alpha:]]+\\b" string)
       -flatten
       (mapc (lambda (word)
               (ht-set! hs (downcase word)
                        (1+ (or (ht-get hs (downcase word)) 0)))))))

(defun cw-sort-alist (alist)
  "Reverse-numerically sort cdr of ALIST, then convert it to a single string."
  (->> alist
       (--sort (string< (car it) (car other))) ; alphabetic when the same number
       (--sort (> (cdr it) (cdr other)))
       (--map (concat (car it) " " (format "%d" (cdr it))))
       (s-join "\n")
       (s-append "\n")))

(cw-main)
(kill-emacs 0)
;;; cw-byline.el ends here
```
byte-compile it:
```elisp
(emacs-lisp-byte-compile "cw-byline.el")
```
and call it from a wrapper:
```elisp
(defun cw-count-it-4 (file)
  "Given a FILE, count words."
  (shell-command-to-string
   (format "<%s emacs --quick --script %s/%s 2>/dev/null"
           file default-directory "cw-byline.elc")))
```
And... it works. And we obey the constraint of not loading the whole input file into memory.
Solution #5: workaround to read it in chunks of ~64 kB
What else?
So a person follows the odd route of using an elisp script that camouflages itself as a shell script. We could safely assume that said person has some shell there, no? Could there be, I don't know... `wc`? `sed`?
Let's then cheat a tiny little bit. Let's use:

- `wc` to see how many lines our input has and a line's maximum length, and
- `sed` to give us ranges of lines.

With that we could follow the constraint of no more than about 64 kB in memory at a time, and we don't need to call `sed -n ${line}p` for every single line, which would be unnecessarily costly.
A quick test:
```bash
< in/kjv.txt wc -l
< in/kjv.txt wc -L
```
outputs:
```
99817
77
```
That's almost 100k lines, none of them longer than 77 characters (+1 for '\n'). This is an English text, so at 1 B/char we'd fetch no more than about
```elisp
(/ 64000 78) ;=> 820 lines at a time
```
And kjv.txt has 43325060 bytes, so roughly
```elisp
(/ 43325060 64000) ;=> 676 shell-command-to-string calls to sed. Ouch.
```
And 10 times that for the larger file. (Why do I suspect this will be slow?)
Let's see.
```elisp
(defun cw-count-it-5 (file)
  "Given a FILE, count words."
  (let* ((words (ht-create))
         (chunk-size 64000)
         (num-lines (->> (format "wc -l <<< \"$(<'%s')\"" file)
                         ;; The here-string here ^ makes sure the string has
                         ;; a trailing '\n', since -l is a newline count.
                         shell-command-to-string
                         string-to-number))
         (max-len (->> (format "wc -L < '%s'" file)
                       shell-command-to-string
                       string-to-number))
         (lns/chunk (/ chunk-size (1+ max-len))) ; 1+ for '\n'
         (num-chunks (ceiling (/ num-lines (float lns/chunk))))
         beg end chunk-txt)
    (dotimes (i num-chunks)
      (setq beg (1+ (* i lns/chunk))
            end (* (1+ i) lns/chunk)
            chunk-txt (shell-command-to-string
                       (format "sed -n '%d,%d p' < '%s'" beg end file)))
      ;; From here, we simply use #3 over the chunk of text:
      (->> chunk-txt
           downcase
           cw-split-words
           (mapc (lambda (word)
                   (ht-set! words word (1+ (or (ht-get words word) 0)))))))
    (->> words ht->alist cw-sort-alist)))
```
And... it works.
What else? To avoid the repeated calls to `sed` via `shell-command-to-string`, we could call the shell a single time and have it chop the file into many chunks, stored in temporary files for retrieval by Elisp. But this would only make sense if the input file were indeed large, and in this case the time spent writing all of that to disk might easily exceed the time of calling `sed` repeatedly; we'd also need extra disk space as large as the original file. So let's drop this idea and have a look at how the functions we have so far compare in speed.
Speeds (and new solutions #6–9)
So let's byte-compile all the functions, load them, run over the inputs, and compare speeds.
Evaluation times, in seconds:
| Function      | foo   | kjv÷10 | kjv    | kjv×10  |
|---------------|-------|--------|--------|---------|
| cw-count-it-1 | 0.084 | 3.053  | 28.355 | 279.815 |
| cw-count-it-2 | 0.044 | 0.764  | 6.836  | 64.942  |
| cw-count-it-3 | 0.041 | 0.612  | 3.976  | 30.259  |
| cw-count-it-4 | 0.389 | 1.769  | 13.714 | 135.399 |
| cw-count-it-5 | 0.187 | 1.610  | 12.857 | 329.911 |
This isn't very fast, is it? My Bash one was doing `kjv.txt` in 0.9 s (well, "Bash" as in "glued calls to compiled command-line utilities such as `tr` and `grep`"...). None of the above gets close to it. Why the slowness?
Profiling the fastest function:
```
M-x profiler-start
(cw-count-it-3 "in/kjv÷10.txt")
M-x profiler-report
```
it seems that about 40% of it comes from `split-string` (from `s-split`, from `cw-split-words`).

On the second fastest (#2), about 60% is also due to `split-string`.

But even if we eliminated the `split-string`, it would still be slower than piped Shell. So let's move on anyway.
Speeds will of course vary depending on the machine you run them on, but it's useful to see the relative speeds between them. Interesting that the fastest is the solution that manipulates the whole string (#3) rather than the arguably more idiomatic ones using `with-temp-buffer` (#1 and #2). Also interesting that, among those two, the one that splits matched lines is 4× as fast as the one that jumps by words.
But then... I'm using different functions to do this, so the comparison is not as direct. What if I changed the regexp passed to `search-forward-regexp` in #2 and had it directly match words, eliminating some of the subsequent processing? I mean this:
```elisp
(defun cw-count-it-6 (file)
  "Given a FILE, count words."
  (let ((words (ht-create)))
    (with-temp-buffer
      (insert-file-contents file)
      (while (search-forward-regexp "\\b[[:alpha:]]+\\b" nil t)
        (--> (match-string 0)
             downcase
             (ht-set! words it (1+ (or (ht-get words it) 0))))))
    (->> words ht->alist cw-sort-alist)))
```
Speeds?
| Function      | foo   | kjv÷10 | kjv   | kjv×10 |
|---------------|-------|--------|-------|--------|
| cw-count-it-6 | 0.065 | 0.688  | 5.150 | 48.524 |
Better.
Profiler says 63% of it comes from `search-forward-regexp`, and 60% of this comes from... `ad-Advice-search-forward-regexp`, which says "Around-advice ‘fancy-narrow-around-advice’."
Ah, although I much like Malabarba's fancy-narrow, we could certainly do without it here.
But I'll not go down the path of disabling modes and removing advice. The improvement doesn't seem to be so substantial as to make these functions superfast. If you're an Emacs Lisper and see something here, let me know: I'm curious whether I'm missing something significant that could fix these speeds. After all, in another challenge Elisp vastly outperformed Bash.
Ok, what else could we try? What about not splitting the string at all? Do we really need to transform this into a list? We could instead walk the string, then for each character see if it's alphabetic, and, if it is, store it in a word. Once it isn't, the word is added to the hashtable, then emptied.
So we could do this:
```elisp
(defun cw-count-it-7 (file)
  "Given a FILE, count words."
  (let* ((words (ht-create))
         (addword (lambda (word)
                    (ht-set! words word (1+ (or (ht-get words word) 0)))))
         (string (->> file f-read downcase))
         (length (length string))
         ;; Initialize variables
         (pos length)
         (word "")
         letter)
    (while (> pos 0)
      (setq letter (substring string (1- pos) pos))
      (pcase (string-match "[[:alpha:]]" letter)
        (0 (setq word (concat letter word)))
        (_ (funcall addword word)
           (setq word "")))
      (setq pos (1- pos)))
    (funcall addword word)
    (ht-remove! words "")
    (->> words ht->alist cw-sort-alist)))
```
Better yet, instead of treating the letter as a length-one string, we could deal with it as a character and have the word be a list. It'd then be converted to a string before being added to the hash table. This:
```elisp
(defun cw-count-it-8 (file)
  "Given a FILE, count words."
  (let* ((words (ht-create))
         (addword (lambda (charlist)
                    (let ((word (concat charlist)))
                      (ht-set! words word (1+ (or (ht-get words word) 0))))))
         (string (->> file f-read downcase))
         (length (length string))
         ;; Initialize variables
         (pos length)
         word char)
    (while (> pos 0)
      (setq char (aref string (1- pos)))
      (if (and (>= char ?a) (<= char ?z))
          (setq word (cons char word))
        (funcall addword word)
        (setq word nil))
      (setq pos (1- pos)))
    (funcall addword word)
    (ht-remove! words "")
    (->> words ht->alist cw-sort-alist)))
```
By the way:

- when I tried in #7 to use `(format "%s%s" letter word)` instead of `concat`... it made the function 15% slower (a sketch of how such micro-differences could be measured is below).
- in both #7 and #8, downcasing the words or characters individually didn't seem to make much of a difference, so I left it applied at the beginning, to the whole string. If anything, individual downcasing made it slightly worse: about 2.5% slower with #8, for example.
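For the record, here's roughly how such a micro-comparison could be made (a sketch, numbers will vary; `benchmark-run` returns elapsed seconds, GC runs, and GC seconds):

```elisp
(let ((letter "a") (word "bcd"))
  (list (benchmark-run 1000000 (concat letter word))
        (benchmark-run 1000000 (format "%s%s" letter word))))
```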
Speeds:
| Function      | foo   | kjv÷10 | kjv    | kjv×10  |
|---------------|-------|--------|--------|---------|
| cw-count-it-7 | 0.150 | 1.829  | 16.359 | 121.155 |
| cw-count-it-8 | 0.036 | 0.624  | 5.335  | 40.810  |
Wow! What an enormous difference it made here to keep characters as characters. Is it because `aref` is much faster than `substring` or something like that? Both are written in C. And `aref` surely treats args as arrays. Doesn't `substring` also? No idea. But as a quick test:
```elisp
(let* ((file "!/path/to/countwords/in/kjv.txt")
       (string (f-read file))
       (len (length string))
       (aref-time (->> (dotimes (i len) (aref string i))
                       benchmark-elapse
                       (format "%.3f")
                       read))
       (subs-time (->> (dotimes (i len) (substring string i (1+ i)))
                       benchmark-elapse
                       (format "%.3f")
                       read)))
  `((aref substring)
    hline
    (,aref-time ,subs-time)))
```
| aref  | substring |
|-------|-----------|
| 2.293 | 8.68      |
So, yeah, `aref` is much faster.
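My guess (unverified) is allocation: `substring` conses a fresh one-character string on every call, whereas `aref` just returns a character. The GC columns of `benchmark-run` should hint at that:

```elisp
;; A sketch: compare garbage generated, not just elapsed time.
;; benchmark-run returns (ELAPSED-SECONDS GC-RUNS GC-SECONDS).
(let ((s (make-string 1000000 ?a)))
  (list (benchmark-run 1000000 (aref s 0))
        (benchmark-run 1000000 (substring s 0 1))))
```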
One more: what if we don't use hash tables at all? What about this:
- Insert the file in a temp buffer
- Go over it to replace any run of 1+ non-alpha characters with '\n'
- Downcase the whole thing
- Sort the lines in reverse alphabetical order
- Go to the beginning
- Loop: keep going down line by line; if the word is repeated, increase counter; repeat until word isn't repeated — then remove the repetitions and print the counter after the word. Reset counter.
- Sort by second field, numerically
- Reverse the lines.
- Return them.
Let's try it:
```elisp
(defun cw-count-it-9 (file)
  "Given a FILE, count words."
  (with-temp-buffer
    (insert-file-contents file)
    (goto-char (point-min))
    (while (re-search-forward "[^[:alpha:]]+" nil t)
      (replace-match "\n"))
    (downcase-region (point-min) (point-max))
    (sort-lines t (point-min) (point-max))
    (goto-char (point-max))
    (while (bolp) (delete-char -1))
    (insert "\n")
    (goto-char (point-min))
    (let ((start (point))
          (ct 1))
      (while (not (eobp))
        (pcase (looking-at "\\(.*\n\\)\\1")
          ('t   (forward-line)
                (setq ct (1+ ct)))
          ('nil (delete-region start (point))
                (end-of-line)
                (insert " " (number-to-string ct))
                (forward-char)
                (setq start (point)
                      ct 1)))))
    (sort-numeric-fields 2 (point-min) (point-max))
    (reverse-region (point-min) (point-max))
    (buffer-string)))
```
Speeds:
| Function      | foo   | kjv÷10 | kjv    | kjv×10 |
|---------------|-------|--------|--------|--------|
| cw-count-it-9 | 0.041 | 2.886  | 30.824 | ?!     |
Not good, is it? I didn't wait to see how long it'd take for the last one: I got bored with seeing "Reordering buffer..." taking forever in the echo area. Sorting the whole 8.2M words is a no-go, especially with the sorting functions above, which are written in Elisp rather than C.
But what if it were in C? What if we used plain `sort`? It needs a sequence, so we could go back to using `cw-split-words` to create a list, convert it to a vector (whose positions can be accessed in constant time), then walk the vector to produce an alist directly, then sort it with `cw-sort-alist`. Let's see.
```elisp
(defun cw-count-it-10 (file)
  "Given a FILE, count words."
  (with-temp-buffer
    (let* ((sorted (--> file f-read downcase cw-split-words vconcat
                        (sort it #'string>)))
           (idx 0)
           (max (1- (length sorted)))
           (pre (aref sorted idx))
           (ct 1)
           alist cur)
      (while (< idx max)
        (setq idx (1+ idx)
              cur (aref sorted idx))
        (if (equal pre cur)
            (setq ct (1+ ct))
          (push (cons pre ct) alist)
          (setq pre cur
                ct 1)))
      (push (cons pre ct) alist)
      (cw-sort-alist alist))))
```
Speeds:
| Function       | foo   | kjv÷10 | kjv   | kjv×10 |
|----------------|-------|--------|-------|--------|
| cw-count-it-10 | 0.033 | 0.671  | 6.900 | 67.778 |
Not too bad. I don't like all that consing-in-a-loop, though. Is it perhaps the case that `ht->alist` does a more efficient conversion? No idea. If it does, we could go back to using hash tables, but now adding each key-value pair only once, with no updating (the counting is done by the counter `ct`, which is possible because of the sorting).
Instead of looking at ht's code, let's first try and run it to see if it's better.
```elisp
(defun cw-count-it-11 (file)
  "Given a FILE, count words."
  (with-temp-buffer
    (let* ((sorted (--> file f-read downcase cw-split-words vconcat
                        (sort it #'string>)))
           (idx 0)
           (max (1- (length sorted)))
           (pre (aref sorted idx))
           (ct 1)
           (words (ht-create))
           cur)
      (while (< idx max)
        (setq idx (1+ idx)
              cur (aref sorted idx))
        (if (equal pre cur)
            (setq ct (1+ ct))
          (ht-set! words pre ct)
          (setq pre cur
                ct 1)))
      (ht-set! words pre ct)
      (->> words ht->alist cw-sort-alist))))
```
Speeds:
| Function       | foo   | kjv÷10 | kjv   | kjv×10 |
|----------------|-------|--------|-------|--------|
| cw-count-it-11 | 0.043 | 0.498  | 6.959 | 67.911 |
Same thing, then.
By the way, for any input file given, the written outputs of the Bash solution and of all Elisp functions up to here are exactly the same. So, for example, `md5sum kjv--*` returns:
```
c53368282b3dbcbd2f34d5007e1d3928 kjv--bash
c53368282b3dbcbd2f34d5007e1d3928 kjv--cw-count-it-1
c53368282b3dbcbd2f34d5007e1d3928 kjv--cw-count-it-2
c53368282b3dbcbd2f34d5007e1d3928 kjv--cw-count-it-3
c53368282b3dbcbd2f34d5007e1d3928 kjv--cw-count-it-4
c53368282b3dbcbd2f34d5007e1d3928 kjv--cw-count-it-5
c53368282b3dbcbd2f34d5007e1d3928 kjv--cw-count-it-6
c53368282b3dbcbd2f34d5007e1d3928 kjv--cw-count-it-7
c53368282b3dbcbd2f34d5007e1d3928 kjv--cw-count-it-8
c53368282b3dbcbd2f34d5007e1d3928 kjv--cw-count-it-9
c53368282b3dbcbd2f34d5007e1d3928 kjv--cw-count-it-10
c53368282b3dbcbd2f34d5007e1d3928 kjv--cw-count-it-11
```
Back to the constrained exercise: a solution
Let's go back to the original exercise. So now we reintroduce these constraints:
- memory: don't read whole file into memory
- stdlib: only use the language's standard libraries
- words: anything separated by whitespace — ignore punctuation
Very well. So we'll need to adapt solution #4 — the only one that reads the file in chunks without (as is the case of #5) the liberties of calling `sed` and `wc` with `shell-command-to-string`.
The output will of course be different, given that we're changing back to the original definition of "word", which will now also include things like `you!)`, `zorobabel;`, `(whoso`, `1:1`, etc.
The adaptation will simplify the splitting, so that it happens only on whitespace. It will also get rid of the four libraries the previous functions depend on. Here we go:
```elisp
;;; cw-constrained.el --- Observe all the original constraints -*- lexical-binding: t -*-

;;; Commentary:
;; Observe all the constraints given by the original exercise.

;;; Code:
(defun cw-main ()
  "Print results of ‘cw-process-stream’ to stdout."
  (princ (cw-process-stream)))

(defun cw-process-stream ()
  "Given a stream, return sorted word-count pair."
  (let ((words (make-hash-table :test 'equal)))
    (cw-read-and-process-lines
     (lambda (str) (cw-process-line words str)))
    (cw-ht->sorted-alist words)))

(defun cw-read-and-process-lines (fun)
  "Given a stream, process each line with function FUN."
  (let (line)
    (while (setq line (ignore-errors (read-from-minibuffer "")))
      (funcall fun line))))

(defun cw-process-line (table str)
  "Update counter of hashtable TABLE for each word of string STR."
  (mapc (lambda (word)
          (unless (string= word "")
            (prog1 nil
              (puthash (downcase word)
                       (1+ (or (gethash (downcase word) table) 0))
                       table))))
        (split-string str "[ \n]+")))

(defun cw-ht->sorted-alist (table)
  "Convert hash-table TABLE to an alist, then to a sorted string.
The cdr of the alist is reverse-numerically sorted, then
converted to a single string."
  (let (res)
    (maphash (lambda (key value) (push (cons key value) res)) table)
    (setq res (sort res (lambda (it other) (string< (car it) (car other)))))
    (setq res (sort res (lambda (it other) (> (cdr it) (cdr other)))))
    (format "%s\n" (mapconcat 'identity
                              (mapcar (lambda (it)
                                        (concat (car it) " "
                                                (format "%d" (cdr it))))
                                      res)
                              "\n"))))

(cw-main)
(kill-emacs 0)
;;; cw-constrained.el ends here
```
We byte-compile it, then wrap the script in a function:
```elisp
(defun cw-count-it-A (file)
  "Given a FILE, count words."
  (shell-command-to-string
   (format "<%s emacs --quick --script %s/%s 2>/dev/null"
           file default-directory "cw-constrained.elc")))
```
in order to more easily check the times:
| Function      | foo   | kjv÷10 | kjv   | kjv×10 |
|---------------|-------|--------|-------|--------|
| cw-count-it-A | 0.221 | 0.978  | 7.875 | 75.822 |
At the terminal, the command would be:
```bash
(cd ./in && ln -nfs kjv{×,bible_x}10.txt)     # the exercise's original filename
ln -nfs cw-constrained.el  cw-count-it-A.el   # a symlink
ln -nfs cw-constrained.elc cw-count-it-A.elc  # a symlink
time emacs --quick --script cw-count-it-A.elc < in/kjvbible_x10.txt >/dev/null
```
with output:
```
real    1m15.086s
```
So there we go. In a "mere 75 seconds" this piece of Emacs Lisp counts the word frequencies of an 8.21M-word text file. Let's by all means try and improve the speed — but if we stop for a minute to think that not too many decades ago our only choice would have been to do it by hand... 9 μs/word starts looking pretty good.
A faster solution for the expected original output
I've tested it, and it seems that the definition of word (split at spaces vs. split at non-alpha) has minimal impact on speed. If in #3 I change from `cw-split-words` to `(s-split "[ \n\r\t\v\f]+")`, we have this:
```elisp
(defun cw-count-it-B (file)
  "Given a FILE, count words."
  (let ((words (ht-create)))
    (->> file
         f-read
         downcase
         (s-split "[ \n\r\t\v\f]+")
         (mapc (lambda (word)
                 (ht-set! words word (1+ (or (ht-get words word) 0))))))
    (ht-remove! words "")
    (->> words ht->alist cw-sort-alist)))
```
with these speeds:
| Function      | foo   | kjv÷10 | kjv   | kjv×10 |
|---------------|-------|--------|-------|--------|
| cw-count-it-B | 0.039 | 0.333  | 4.863 | 32.198 |
So while #A is the only solution that follows all the original constraints (including not using other libraries and not slurping the whole file into memory), #B is a second solution that outputs the expected sorted list of "words". At that, it's the fastest one here.
Their outputs match the MD5 of Ben Hoyt's `output.txt`:
```
786f665f0b045ebb9ed20aea82fc1772 kjv×10--cw-count-it-A
786f665f0b045ebb9ed20aea82fc1772 kjv×10--cw-count-it-B
```
Updated benchmarking table of the Elisp solutions
Here's our consolidated table:
| Function       | foo   | kjv÷10 | kjv    | kjv×10  |
|----------------|-------|--------|--------|---------|
| cw-count-it-1  | 0.084 | 3.053  | 28.355 | 279.815 |
| cw-count-it-2  | 0.044 | 0.764  | 6.836  | 64.942  |
| cw-count-it-3  | 0.041 | 0.612  | 3.976  | 30.259  |
| cw-count-it-4  | 0.389 | 1.769  | 13.714 | 135.399 |
| cw-count-it-5  | 0.187 | 1.610  | 12.857 | 329.911 |
| cw-count-it-6  | 0.065 | 0.688  | 5.150  | 48.524  |
| cw-count-it-7  | 0.150 | 1.829  | 16.359 | 121.155 |
| cw-count-it-8  | 0.036 | 0.624  | 5.335  | 40.810  |
| cw-count-it-9  | 0.041 | 2.886  | 30.824 | ?!      |
| cw-count-it-10 | 0.033 | 0.671  | 6.900  | 67.778  |
| cw-count-it-11 | 0.043 | 0.498  | 6.959  | 67.911  |
| cw-count-it-A  | 0.221 | 0.978  | 7.875  | 75.822  |
| cw-count-it-B  | 0.043 | 0.337  | 4.022  | 33.524  |
To create your own benchmarking table and test speeds on your machine, see part IV of this series here. You can then write your own functions and compare.
And if you write Emacs Lisp solutions whose approaches are different from the ones here, show us!
Comparing to other languages using benchmark.py
So maybe my speeds are so low compared to other languages because of the machine I ran them on? Easy to find out. Let's run my best solutions using an adapted version of Ben Hoyt's `benchmark.py` and see.
So I simplified and modified his `test.sh`:
```bash
#!/usr/bin/env bash
# SPDX-FileCopyrightText: © 2021 Ben Hoyt
# SPDX-FileCopyrightText: © flandrew
#
# SPDX-License-Identifier: MIT
#
#
# Benchmark for Ben Hoyt's count-words exercise.
#
# Modified by me from Ben Hoyt's version:
# - it has only a few languages
# - it includes my solutions
# - some extra auxiliary functions, plus much alignment
#
# For context and background, see his full article and repo here:
# <https://benhoyt.com/writings/count-words/>
# <https://github.com/benhoyt/countwords/>
#------------------------------------------------------------------------------
## Variables
k10="kjvbible_x10.txt"
out="output.txt"

## Functions
normalize() { python3 normalize.py >"$out"        ;}  # Piped to
awkp-norm() { awk '{ print $2, $1 }' | normalize  ;}  # Piped to
#------------------------------------------------------------------------------
## Main
cat kjvbible.txt{,,,,,,,,,} >"$k10"

# AWK
echo AWK simple    ; gawk -f simple.awk    <"$k10" | normalize
if command -v mawk > /dev/null; then
    echo AWK optimized ; mawk -f optimized.awk <"$k10" | normalize
fi

# C, C++
echo C simple      ; gcc -O2 "$_".c   -o "$_"-c   ; ./"$_" <"$k10" | normalize
echo C optimized   ; gcc -O2 "$_".c   -o "$_"-c   ; ./"$_" <"$k10" | normalize
echo C++ simple    ; g++ -O2 "$_".cpp -o "$_"-cpp ; ./"$_" <"$k10" | normalize

# Other
echo Perl simple          ; perl    simple.pl    <"$k10" | normalize
echo Python simple        ; python3 simple.py    <"$k10" | normalize
echo Python optimized     ; python3 optimized.py <"$k10" | normalize
echo Ruby simple          ; ruby    simple.rb    <"$k10" | normalize
echo Unix shell simple    ; bash    simple.sh    <"$k10" | awkp-norm
echo Unix shell optimized ; bash    optimized.sh <"$k10" | awkp-norm

# These have been normalized already (MD5-verified)
echo Unix shell — mine            ; bash countwords.sh     <"$k10"
echo Emacs Lisp — simple — mine   ; bash cw-constrained.sh <"$k10"
echo Emacs Lisp — optimized — mine; bash cw-byline.sh      <"$k10"
```
And then I simplified and modified his `benchmark.py`, which now outputs an Org table:
```python
#!/usr/bin/env python3
#
# SPDX-FileCopyrightText: © 2021 Ben Hoyt
# SPDX-FileCopyrightText: © flandrew
#
# SPDX-License-Identifier: MIT
#
#
# Benchmark for Ben Hoyt's count-words exercise.
#
# Modified by me from Ben Hoyt's version, so that:
# - it has only a few languages
# - it includes my solutions
# - the output goes to an Org table
#
# For context and background, see his full article and repo here:
# <https://benhoyt.com/writings/count-words/>
# <https://github.com/benhoyt/countwords/>

# Benchmark different versions and output results as Table
# NOTE: run after ./test.sh (which compiles the binaries)

import subprocess
import time

NUM_RUNS = 5
INPUT_FILENAME = 'kjvbible_x10.txt'


def time_run(cmdline):
    cmdline = cmdline + ' <{} >/dev/null'.format(INPUT_FILENAME)
    times = []
    for _ in range(NUM_RUNS):
        start = time.time()
        subprocess.run(cmdline, shell=True)
        elapsed = time.time() - start
        times.append(elapsed)
    return min(times)


programs = [
    ('Emacs Lisp', 'emacs --quick --script cw-count-it-A.elc',
     'emacs --quick --script cw-count-it-B.elc', 'by me <-----'),
    ('Shell', 'bash countwords.sh', None, 'by me <-----'),
    ('Shell', 'bash simple.sh', 'bash optimized.sh', ''),
    ('C', './simple-c', './optimized-c', ''),
    ('C++', './simple-cpp', None, ''),
    ('Perl', 'perl simple.pl', None, 'by Charles Randall'),
    ('Python', 'python3 simple.py', 'python3 optimized.py', ''),
    ('Ruby', 'ruby simple.rb', None, ''),
    ('AWK', 'gawk -f simple.awk', 'mawk -f optimized.awk',
     'optimized uses `mawk`'),
    ('grep', 'grep foobar', 'LC_ALL=C grep foobar', '`grep` baseline'),
    ('wc -w', 'wc -w', 'LC_ALL=C wc -w', '`wc` baseline'),
]

times = []
for program in programs:
    lang, simple, optimized, _ = program
    print('Timing', lang, end=' ', flush=True)
    simple_time = time_run(simple)
    optimized_time = time_run(optimized) if optimized else None
    print('{:.2f} {:.2f}'.format(simple_time, optimized_time or 0))
    times.append((program, simple_time, optimized_time))

times.sort(key=lambda x: x[1])  # sort by simple_time

print()
print('| Language | Simple | Optimized | Notes |')
print('|---------------+--------+-----------+-------|')
for program, simple_time, optimized_time in times:
    lang, _, _, notes = program
    if optimized_time is not None:
        print('| {:13} | {:6.2f} | {:9.2f} | {} |'.format(
            lang, simple_time, optimized_time, notes))
    else:
        print('| {:13} | {:6.2f} | {:9} | {} |'.format(
            lang, simple_time, '', notes))
```
Then I created and byte-compiled a stand-alone elisp file to be called by these scripts:
```elisp
;;; cw-count-it-B.el --- Changing the split -*- lexical-binding: t -*-

;;; Commentary:
;; This one changes the definition of word.

;;; Code:
(defun cw-load-paths-and-requires ()
  "Load paths and requires."
  (let ((default-directory "~/.emacs.d/lisp"))
    (normal-top-level-add-to-load-path '("."))
    (normal-top-level-add-subdirs-to-load-path)
    (add-to-list 'load-path default-directory)
    (add-to-list 'load-path "~/path/to/countwords/"))
  (require 'countwords))

(cw-load-paths-and-requires)
(f-write (cw-count-it-B "kjvbible_x10.txt") 'utf-8 "cw-count-it-B--out.txt")
(kill-emacs 0)
;;; cw-count-it-B.el ends here
```
(The original `cw-count-it-B` function requires a filename. So I'm hardcoding it here, which will ignore STDIN.)
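A possible way to avoid the hardcoding, untested here: under `--script`, leftover command-line arguments end up in `argv` (an alias for `command-line-args-left`), so the call could become something like this:

```elisp
;; Untested sketch: invoke as
;;   emacs --quick --script cw-count-it-B.el kjvbible_x10.txt
;; and take the file name from the remaining arguments.
(f-write (cw-count-it-B (car argv)) 'utf-8 "cw-count-it-B--out.txt")
```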
Then I ran it. Results:
| Language   | Simple | Optimized | Notes                 |
|------------|--------|-----------|-----------------------|
| `grep`     | 0.06   | 0.06      | grep baseline         |
| `wc -w`    | 2.55   | 0.80      | wc baseline           |
| C          | 2.61   | 0.81      |                       |
| Perl       | 6.16   |           | by Charles Randall    |
| C++        | 6.78   |           |                       |
| Shell      | 9.57   |           | by me <-----          |
| Ruby       | 12.34  |           |                       |
| AWK        | 14.94  | 3.65      | optimized uses `mawk` |
| Python     | 21.49  | 4.55      |                       |
| Shell      | 44.67  | 7.79      |                       |
| Emacs Lisp | 80.46  | 30.88     | by me <----- :O!      |
(where Emacs Lisp's simple = #A, optimized = #B)
(I don't know whether my `countwords.sh` here should count as simple or optimized.)
So. My elapsed times above for the other languages are about 3 to 4 times those in Ben Hoyt's post, confirming my suspicion that part of the reason my Emacs Lisp speeds were not great was that I ran them all on a slower machine.
But that certainly doesn't explain it all: see, the speeds of my best Emacs Lisp byte-compiled solutions still suck big time in comparison to all languages I'm benchmarking against here!
Why?
- Weak algorithm? (Even after a dozen different approaches? Unlikely.)
- Some intrinsic slowness of Elisp? (As in "4× slower than Bash"? Unlikely.)
So it puzzled me. I don't know why.
If you have clues, I'm all ears.