Counting Words I: Introduction

I decided to play a bit, in both Bash and Emacs Lisp, with Ben Hoyt's word-counting exercise.
Here's his full article. And here's his repository, which includes the input text files used.

Constraints

I'll relax or change some of the problem's constraints, namely these:

  1. memory: don't read whole file into memory
  2. stdlib: only use the language's standard library functions
  3. words: anything separated by whitespace — ignore punctuation

Why relax the first? Because while in Common Lisp you can simply do this:

(with-open-file (stream filename)
  (loop for line = (read-line stream nil)
        while line
        do (stuff-with-line line)))

it doesn't come naturally to read a file line by line in Emacs Lisp. Since our 44 MB file isn't prohibitively large, we'll start with the more idiomatic Elisp approaches: either slurp the whole file into a string and work on the string, or insert it into a temporary buffer and work on the buffer. Then we'll see how we could deal with a much larger file (say, 4 GB).

Why relax the second? Because doing this without dash, f, s, and ht would feel a bit tedious and counterproductive. I'm used to them, and they're common (or "pretty standard") libraries. So it's not as if I'm pulling in some word-counting library from somewhere to "solve" the problem with:

(require 'count-words)  ; Nope, apparently not available
(count-words "kjv×10.txt")

And the third? For both Bash and Elisp I'll use this instead:

words
maximal runs of consecutive alphabetic characters.

Otherwise we'd have "said" and "said," and "said." counted as three different words. I understand the original simplification was to avoid it becoming a "tokenization battle". Arguably, it's simpler to split the string on whitespace, but I'll instead interpret "ignore punctuation" as "strip it away". One side effect is that "we're" gets split into "we" and "re".
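To make the difference concrete, here's a small comparison on a made-up string (GNU grep assumed for -o):

```shell
# Splitting on whitespace keeps punctuation attached to the words:
printf '%s\n' "said, said. we're" | tr -s '[:space:]' '\n'

# Extracting maximal alphabetic runs strips it away, so "said," and
# "said." collapse into "said", and "we're" splits into "we" and "re":
printf '%s\n' "said, said. we're" | grep -oE '[[:alpha:]]+'
```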

Input text files

Directory structure I ended up using:

.
├── in
│   ├── foo.txt
│   ├── kjv.txt
│   ├── kjv×10.txt
│   └── kjv÷10.txt
├── out
│   └── [outputs of countwords.*]
├── countwords.el
├── countwords.elc
├── countwords.sh
├── cw-byline.sh         ;;<--- Elisp, actually
└── cw-constrained.sh    ;;<--- Elisp, actually

First, I created a shorter filename for the standard file:

ln -s kjvbible.txt kjv.txt

Then I saved to foo.txt a modified version (I added punctuation) of his minimal input text:

cat > in/foo.txt <<EOF
  The 12 foozle bar: the
Bars' bar, bar, foo's bar, foo.
EOF

This two-line sample is good for testing whether outputs are correct, but too small for comparing times. And the standard file may take a few seconds too many with some of the functions.
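For reference, this is the token stream that maximal-alphabetic-run extraction yields from the sample (note that "12" disappears and "foo's" splits into "foo" and "s"; GNU grep assumed):

```shell
grep -oE '[[:alpha:]]+' in/foo.txt | tr '\n' ' '
# → The foozle bar the Bars bar bar foo s bar foo
```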

So let's create a smaller file for faster feedback on speeds. Say, the first ~10% of lines of the non-concatenated one.

: "$(wc -l < kjv.txt)"
sed "$((_/10))q" < kjv.txt > kjv÷10.txt

# Could have equally been either of:
# sed  -n "1,$((_/10))p"
# head -n    "$((_/10))"

(I do the above mostly for the Elisp functions. More on them later.)
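In case the colon line looks cryptic: ":" is the shell's no-op builtin, and the special parameter $_ expands to the last argument of the previous command, so the line count ends up in $_ without naming a variable. A minimal sketch of the mechanism (GNU coreutils assumed, so wc prints the bare count):

```shell
seq 10 > ten-lines.txt          # a 10-line scratch file
: "$(wc -l < ten-lines.txt)"    # ':' does nothing; its argument becomes $_
echo "$_"                       # prints 10, the line count
echo "$((_/10))"                # prints 1; $_ works in arithmetic too
```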

Then, no need to download the 10× concatenated file. You can concatenate it yourself. Here's one way:

cat kjv.txt{,,,,,,,,,} > kjv×10.txt

That's Bash's brace expansion: it repeats the filename 10 times.
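In case brace expansion is unfamiliar: each empty alternative between the commas appends nothing to the prefix, so the word is simply duplicated once per alternative. A quick check:

```shell
echo x{,,}                        # three alternatives, all empty: x x x
echo kjv.txt{,,,,,,,,,} | wc -w   # ten copies of the filename
```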

But what if it were 100 instead of 10? Two options:

First: M-9 M-9 , (a digit argument of 99, then typing the comma once). And this works not only in Emacs, but also at the terminal, since many of Bash's (Readline's) default keybindings are borrowed from Emacs. Try it.

But that's inelegant, long, and most people can't glance at 99 commas and know the count without counting.
Since 100 = 2²×5², we could do this instead:

cat kjv.txt{,}{,}{,,,,}{,,,,} > kjv×100.txt

Since that would be around 440MB, test it with a small file instead, say:

echo a b c > abc.txt
cat abc.txt{,}{,}{,,,,}{,,,,} > abc×100.txt
wc -l < abc×100.txt   # => 100

You could also do it with Bash's undocumented C-style for-loop-with-braces syntax:

echo a b c > abc.txt
for ((i=1;i<=100;i++)) { cat abc.txt ;} > abc×100.txt
wc -l < abc×100.txt

which takes about 24× as long as the brace expansion, probably because of the 100 separate invocations of cat. So you could, instead, create an array holding the string "abc.txt" 100 times as its elements, then pass it to cat:

echo a b c > abc.txt
params=($(seq 100 | sed "s/.*/abc.txt/"))
cat "${params[@]}" > abc×100.txt
wc -l < abc×100.txt

which is almost as fast as the brace expansion.
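Before timing the three approaches, it's worth a sanity check that they generate byte-identical output (the scratch filenames here are mine):

```shell
echo a b c > abc.txt
cat abc.txt{,}{,}{,,,,}{,,,,}               > brace.txt
for ((i=1;i<=100;i++)) { cat abc.txt ;}     > loop.txt
params=($(seq 100 | sed "s/.*/abc.txt/"))
cat "${params[@]}"                          > array.txt
cmp brace.txt loop.txt && cmp loop.txt array.txt && echo identical
```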

But I digress.

So let's see solutions.

First, in Bash.