Counting Words II: Bash (piped Unix tools, actually)

With the challenge introduced, here's my Bash solution.

#!/bin/bash
#
# SPDX-FileCopyrightText:  © flandrew
# SPDX-License-Identifier: GPL-3.0-or-later
#
# Bash solution to Ben Hoyt's count-words exercise.
#
#   See his full article for context and background:
#      <https://benhoyt.com/writings/count-words/>
#      <https://github.com/benhoyt/countwords/>


ctwords()   { tr -cs '[:alpha:]' '\n' | grep . |
                tr   '[:upper:]' '[:lower:]'   |
                #  ^ I tried sed "s/.*/\L&/" here, but... much slower
                LC_ALL=C sort | uniq -c        |
                LC_ALL=C sort -k1rn,2          |
                awk '{print $2,$1}'            ;}

setvars()   {    ori_in="in/kjvbible.txt"

                 foo_in="in/foo.txt"
               small_in="in/kjv÷10.txt"
                 med_in="in/kjv.txt"
                huge_in="in/kjv×10.txt"

                foo_out="out/foo--bash"
              small_out="out/kjv÷10--bash"
                med_out="out/kjv--bash"
               huge_out="out/kjv×10--bash" ;}

makefoo()   { : "  The 12 foozle bar: the"
              : "$_\nBars' bar, bar, foo's bar, foo."
              echo -e "$_" > "$foo_in" ;}

makesmall() { : "$(wc -l < "$med_in")"
              < "$med_in" sed "$((_/10))q" > "$small_in" ;}

makemed()   { if     [[ -f "$ori_in" ]]
              then ln -nfs "$ori_in" "$med_in"
              elif ! [[ -f "$med_in" ]]
              then echo "You have neither $ori_in nor $med_in" >&2
                   exit 1
              fi ;}

makehuge()  { cat "$med_in"{,,,,,,,,,}  > "$huge_in" ;}

makeins()   { [[ -f "$foo_in"   ]] || makefoo
              [[ -f "$med_in"   ]] || makemed
              [[ -f "$small_in" ]] || makesmall
              [[ -f "$huge_in"  ]] || makehuge  ;}

bench()     {
    mkdir -p out
    TIMEFORMAT="%R   ${foo_in##*/}"; time ctwords <   "$foo_in" >   "$foo_out"
    TIMEFORMAT="%R ${small_in##*/}"; time ctwords < "$small_in" > "$small_out"
    TIMEFORMAT="%R   ${med_in##*/}"; time ctwords <   "$med_in" >   "$med_out"
    TIMEFORMAT="%R  ${huge_in##*/}"; time ctwords <  "$huge_in" >  "$huge_out";}

main()      { cd "$(dirname "$0")" || exit
              case "$1" in
                  # Make standard input files
                  -f) setvars && makeins          ;;
                  # Make standard input files and run benchmarks on them
                  -b) setvars && makeins && bench ;;
                  # Count the words from STDIN
                  "") ctwords                     ;;
                  # Count the words from arbitrary input filename "$1"
                  *)  ctwords < "$1"              ;;
              esac ;}

main "$@"
exit 0

To run it:

chmod +x countwords.sh
./countwords.sh -b

Does it work?

cat out/foo--bash

Yes:

bar 4
foo 2
the 2
bars 1
foozle 1
s 1

(Remember that:

cat  in/foo.txt

has

 The 12 foozle bar: the
Bars' bar, bar, foo's bar, foo.

and that by design I decided to also split words at ['].)
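
To see where that stray s comes from, the splitting stages can be run on their own. This is only a sketch: splitwords is a name I'm making up here for the first three stages of ctwords(); it is not part of the script above.

```bash
#!/bin/bash
# splitwords is a hypothetical helper: the first three stages of ctwords(),
# isolated so the tokenization can be inspected by itself.
splitwords() { tr -cs '[:alpha:]' '\n' |        # each run of non-letters becomes one newline
               grep . |                         # drop the empty line left by leading non-letters
               tr '[:upper:]' '[:lower:]' ;}    # normalize case

printf "Bars' bar, foo's bar.\n" | splitwords
```

Run on that line, it should print bars, bar, foo, s and bar, one per line. The apostrophe in foo's is treated like any other non-letter, hence the lone s.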

Runtimes in seconds, on an old machine:

foo.txt      0.013
kjv÷10.txt   0.099
kjv.txt      0.897
kjv×10.txt   9.704

The core function is impure through and through, since it makes (let me count) seven calls to five different Unix utilities, all of which are actually written in C.
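
Of those seven calls, the last four do the counting and ordering, and they can be tried in isolation on words that are already split and lowercased. A minimal sketch, where countsorted is a hypothetical name of mine for that tail end of ctwords():

```bash
#!/bin/bash
# countsorted is a hypothetical name for the last four stages of ctwords().
# It expects one lowercase word per line on STDIN.
countsorted() { LC_ALL=C sort |             # bring identical words together
                uniq -c |                   # collapse each group into "count word"
                LC_ALL=C sort -k1rn,2 |     # highest count first
                awk '{print $2,$1}' ;}      # swap columns into "word count"

printf '%s\n' the bar foo bar the bar | countsorted
```

For that input I'd expect bar 3, the 2 and foo 1, in that order.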

Why didn't I try it in pure Bash?
Because it would be longer, slower, less readable, and unpleasant.
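
For a taste of what I mean, here is a rough sketch of just the counting part in pure Bash. It is an illustration under assumptions, not something I benchmarked: purecount is a made-up name, it needs Bash 4 or later (associative arrays and ${var,,}), and it doesn't even attempt the sorting.

```bash
#!/bin/bash
# purecount is a hypothetical, illustrative function: counting only, no sorting.
# Assumes Bash 4+ for associative arrays and the ${var,,} expansion.
purecount() {
    local line word
    local -A count=()
    while IFS= read -r line || [[ -n $line ]]; do
        line=${line,,}                      # lowercase the whole line
        line=${line//[^[:alpha:]]/ }        # turn every non-letter into a space
        for word in $line; do               # unquoted on purpose: split on spaces
            count[$word]=$(( ${count[$word]:-0} + 1 ))
        done
    done
    for word in "${!count[@]}"; do          # iteration order is unspecified
        printf '%s %d\n' "$word" "${count[$word]}"
    done
}

printf "The foo, the bar.\n" | purecount
```

That is already longer than ctwords(), and the output still has to be ordered by count somehow.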

One-lining

My ctwords() ended up similar to Doug McIlroy's one-liner, which I had seen before. Did that influence mine? Possibly, although some time had passed between seeing it and writing this. I also can't think of how I'd have approached it much differently, other than usually being more inclined to reach for sed than tr. (Here I tested both, and tr was faster.)
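
For reference, the two lowercasing variants I compared look roughly like this. The function names are hypothetical, and \L is a GNU sed extension, so the second one assumes GNU sed.

```bash
#!/bin/bash
# Two ways to lowercase STDIN; lower_tr and lower_sed are hypothetical names.
lower_tr()  { tr '[:upper:]' '[:lower:]' ;}
lower_sed() { sed 's/.*/\L&/' ;}            # \L is GNU sed only

printf 'Foo\nBAR\n' | lower_tr
printf 'Foo\nBAR\n' | lower_sed

# To compare their speed on a real file, something like this could be used
# (in/kjv.txt is the input path the script above expects):
#   time lower_tr  < in/kjv.txt > /dev/null
#   time lower_sed < in/kjv.txt > /dev/null
```

Both should produce the same lowercase text; only the timing differs.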

But the form of my ctwords() above doesn't make it a one-liner. So let me fix that by M-j'ing away with Fancy Joiner:

< "$1" tr -cs '[:alpha:]' '\n' | grep . | tr '[:upper:]' '[:lower:]' | LC_ALL=C sort | uniq -c | LC_ALL=C sort -k1rn,2 | awk '{print $2,$1}'

(this will leak to the right of the code box)

Of course, what qualifies as a one-liner is open to debate. Is 800 kB of minified JavaScript a one-liner?

Anyway, the story behind this problem is said to be about the battle between Knuth's literate programming and McIlroy's Unix one-liners. I suggest we settle this debate: just use Emacs' org-mode to both literately insert one-liners and write paragraphs commenting on them. If you come to a fork in the road, take it.

I guess that's it. I suspect this isn't as fertile a terrain for Bash as, say, FizzBuzz.

So let's move on and see Emacs Lisp.