Phones-to-Words Challenge II: Bash as a horribly-slow-but-surprisingly-concise alternative to Lisp and Java
"Bash has no native libraries."
"It's hard to read."
"Too slow!"
I could go on.
But boy, how concise it can be!
After having done in Emacs Lisp the exercise from this old Lisp as an alternative to Java study, I decided to tackle it in Bash. I wondered what it would look like. Would it be too slow? How would I approach these recursive tree structures and table look-ups?
Here's what I found out.
Brace yourself
Excluding commented and empty lines, the size of Lispers' versions ranged from 50 to 181 lines of code, with median 134. My first Bash version ended up with 42 LoC — a bit over half the 72 LoC of my Emacs Lisp solution. I ended up producing some seven "workable" versions of it, ranging from 38 to 48 LoC, median 40.
Bash has no tree-like data types. There're strings, there're arrays, and that's it.
But! Nested braces are isomorphic to tree structures. So my approach was to abuse Bash's brace expansion and recursiveness to emulate branching and Cartesian products.
This sounds too abstract, so let me give you an example. Here's a Lisp tree:
```elisp
'("a" ("bc" "e" "f") ("bd" "e" "f"))
;; Structure:
;; a
;; ├─bc
;; │ ├─e
;; │ └─f
;; └─bd
;;   ├─e
;;   └─f
```
So if these were solutions to a phone number 1234 with translation 1:a, 2:b, 3:[cd], 4:[ef], we'd output this:
```
1234: a bc e
1234: a bc f
1234: a bd e
1234: a bd f
```
Good. Now here's the exact same representation in Bash using braces — and a one-liner that expands it to the exact same output as above:
```shell
phone="1234"
tree="a%{bc%{e,f},bd%{e,f}}"  # % = separator
eval echo "$tree" | sed "y/ %/\n /" | sed "s/^/$phone: /"
```
Now, in our phone problem, chunks of matches are independent of each other. So we could, alternatively, represent things as a list of flat lists, and the actual output would be the Cartesian product of the internal lists.
In Elisp, this:
```elisp
(require 'dash)
(let* ((output "")
       (phone "1234")
       (levels '(("a") ("bc" "bd") ("e" "f")))
       (solutions (apply '-table-flat 'list levels)))
  (dolist (sol solutions)
    (setq output (concat output phone ": " (string-join sol " ") "\n")))
  output)
```
And in Bash, this:
```shell
phone="1234"                      # % = separator
levels="a%{bc,bd}%{e,f}"          # this is flat, for Cartesian product; same expansion
eval echo "$levels" | sed "y/ %/\n /" | sed "s/^/$phone: /"
```
In practice it's actually a bit more complicated than that, because of different-sized chunkings. For example, three digits could become either 3, 2+1 or 1+2 words.
So we'll have a mix. Could end up looking somewhat like this:
```shell
phone="12345678"
# 1:x 2:y 3:[qz] 4:[mi] 5:[ky] 6:[cn] 7:[ao] 8:[tw]
echo x%{yz,yq}%{my%cat,I%know} | sed "y/ %/\n /" | sed "s/^/$phone: /"
```
```
12345678: x yz my cat
12345678: x yz I know
12345678: x yq my cat
12345678: x yq I know
```
You get the idea.
From here, the problem sort of reduces itself to generating a $solutions string for expansion. But first, we still need to decide how to do the look-ups.
In Emacs Lisp this was a no-brainer: it has hash tables (and the good ht library), and that's it. I didn't even think about other options. Whereas Bash... well, doesn't.
This left me curious about how the different worse-but-that's-what-we-got solutions would stack up against each other. So there we go.
The solutions
Solution 1: Associative arrays
My first approach was to use an associative array. This is an option since Bash 4.3. I chose to restore umlauts to the German words, so that there's a one-to-one correspondence of digits to characters. This also helps bypass some quoting and escaping problems with the umlauts-as-double-quotes of the source dictionary file. I keep a flag option for returning umlauted results if desired, but to strictly follow the given instructions, I restore the original form by default.
So the values will be the umlauted words. The keys will be the digits after the specified conversion table.
I run it with LC_ALL=C for speed improvements. This forces me to restore UTF-8 in one place to avoid multibyte problems.
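The multibyte problem is easy to see with byte versus character counts (a quick sketch):

```shell
#!/usr/bin/env bash
# Under UTF-8, "ä" is one character encoded as two bytes.  With LC_ALL=C,
# tools see the raw bytes, so lengths (and regexes like ".{n}") go wrong.
printf 'ä' | wc -c                      # 2: byte count
LC_ALL=C bash -c 's=ä; echo "${#s}"'    # 2: C locale counts bytes as chars
```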
Here's an almost comment-free version of the solution — in case you prefer it cleaner or want to try to understand the code without explanations. A highly-commented one follows it.
Uncommented:
```shell
umlautify()   { sed 's/A"/Ä/g;s/a"/ä/g;s/O"/Ö/g;s/o"/ö/g;s/u"/ü/g;s/U"/Ü/g';}
deumlautify() { sed 's/Ä/A"/g;s/ä/a"/g;s/Ö/O"/g;s/ö/o"/g;s/ü/u"/g;s/Ü/U"/g';}

2dig() { sed "s/ä/5/g;s/ü/7/g;s/ö/8/g;s/Ä/5/g;s/Ü/7/g;s/Ö/8/g" |
             tr A-Z a-z |
             tr jnqrwxdsyftamcivbkulopghze 11122233344556667778889990 ;}

makeaa() {
    local dic kvp; declare -gA AA  # kvp: key–value pair
    read -rd'\0' dic < <(< "$1" tr -cd 'a-zA-Z"\n' | umlautify)
    read -rd'\0' kvp < <(paste -d: <(2dig <<<"$dic") <(cat <<<"$dic"))
    : "$(sed -E 's|(.*):(.*)|["\1"]+="\2 "|' <<<"$kvp")"
    . <(printf 'AA=(%s)' "$_") ;}

bracify() {
    : "$(tr '\n' ,)"
    [[ "$_" =~ ^,*$ ]] && : "" || {
        [[ "$_" =~ ^[^,]*,$ ]] && : "${_%%,}" || : "{${_%%,}}"
    } && printf "%s" "$_" ;}

_match-one-phone() {
    phone="$(tr -cd '0-9'<<<"$1")" szp="${#phone}";: "${npd:=true}"
    read -rd'\0' res < <(for ((i=szp;i>0;i--))
                         do printf "%s " "${AA[${phone::$i}]}"
                         done | grep -Eo "[^ ]+")
    (("${#res}">0)) && npd=true || {
        "$npd" && { npd=false; res="${phone::1}" ;} || return ;}
    for ((i=szp;i>0;i--))
    do  car="$(LC_ALL="en_US.UTF-8" grep -Pix ".{$i}" <<< "$res")"
        [[ "$car" ]] && {
            bracify <<< "$car"
            ((i==szp)) || {
                printf "%s" "%"  # % here is just a separator.
                cdr="$("${FUNCNAME[0]}" "${phone:$i}")"
                [[ "$cdr" ]] && bracify <<< "$cdr" ;}; echo ;}
    done | sed -E '/%$/d' ;}

match-one-phone() {
    while read -r x
    do eval echo "$x"
    done < <(_match-one-phone "$@") | sed 'y/ %/\n /' | tr -d {} ;}

main() {
    [[ "$1" == "-u" ]] && { filter=cat; shift ;} || filter=deumlautify
    makeaa "${2:-in/sample-dict}"
    while read -r aphone
    do match-one-phone "$aphone" | sed -E "/./ s|^|$aphone: |"
    done < "${1:-in/sample-input}" | "$filter" ;}

LC_ALL=C main "$@"
```
Commented version:
```shell
#!/usr/bin/env bash
#
# SPDX-FileCopyrightText: © flandrew
# SPDX-License-Identifier: GPL-3.0-or-later
#
## p2w-aarray.sh
## Convert phone numbers to words

# Replace [AOUaou]\" with properly-umlauted single characters, and vice-versa.
# (A bit too hardcoded for my taste, but not worth generalizing it here.)
umlautify()   { sed 's/A"/Ä/g;s/a"/ä/g;s/O"/Ö/g;s/o"/ö/g;s/u"/ü/g;s/U"/Ü/g';}
deumlautify() { sed 's/Ä/A"/g;s/ä/a"/g;s/Ö/O"/g;s/ö/o"/g;s/ü/u"/g;s/Ü/U"/g';}

# Convert letters to digits according to the given rules.
# Because of LC_ALL=C, umlauted characters are multibyte.
2dig() { sed "s/ä/5/g;s/ü/7/g;s/ö/8/g;s/Ä/5/g;s/Ü/7/g;s/Ö/8/g" |
             tr A-Z a-z |
             tr jnqrwxdsyftamcivbkulopghze 11122233344556667778889990 ;}

# Make associative array, declare it as a global variable: key=digits of
# original dictionary, value=words. Though not required, I sanitize the
# dictionary for safety. Instead of looping through lines or through indices
# of a pair of indexed arrays and setting each pair of the associative array
# one by one, I generate the string that would set all key–value pairs at once
# — then evaluate that.
makeaa() {
    local dic kvp; declare -gA AA  # kvp: key–value pair
    read -rd'\0' dic < <(< "$1" tr -cd 'a-zA-Z"\n' | umlautify)
    read -rd'\0' kvp < <(paste -d: <(2dig <<<"$dic") <(cat <<<"$dic"))
    : "$(sed -E 's|(.*):(.*)|["\1"]+="\2 "|' <<<"$kvp")"
    . <(printf 'AA=(%s)' "$_") ;}

# If empty, returns nothing; if one entry, return it; if more than one entry,
# commatize it and brace it. This way, we can "abuse" Bash's brace expansion.
bracify() {
    : "$(tr '\n' ,)"
    [[ "$_" =~ ^,*$ ]] && : "" || {
        [[ "$_" =~ ^[^,]*,$ ]] && : "${_%%,}" || : "{${_%%,}}"
    } && printf "%s" "$_" ;}

# Solve the problem for a single phone number.
_match-one-phone() {  # npd = not previously a digit
    phone="$(tr -cd '0-9'<<<"$1")" szp="${#phone}";: "${npd:=true}"
    # List words matching all substrings of phone starting with its 1st digit.
    read -rd'\0' res < <(for ((i=szp;i>0;i--))
                         do printf "%s " "${AA[${phone::$i}]}"
                         done | grep -Eo "[^ ]+")
    # No matches? Accept 1st digit, remember that this option has been used.
    (("${#res}">0)) && npd=true || {
        "$npd" && { npd=false; res="${phone::1}" ;} || return ;}
    # UTF-8 here, or umlauted characters would not count as one.
    for ((i=szp;i>0;i--))
    do  car="$(LC_ALL="en_US.UTF-8" grep -Pix ".{$i}" <<< "$res")"
        [[ "$car" ]] && {
            bracify <<< "$car"
            # If equal to phone size, there's no right substring.
            ((i==szp)) || {
                printf "%s" "%"  # % here is just a separator.
                # The below recursively calls this very
                # function on the remaining digits of the string.
                cdr="$("${FUNCNAME[0]}" "${phone:$i}")"
                [[ "$cdr" ]] && bracify <<< "$cdr" ;}; echo ;}
    done | sed -E '/%$/d' ;}

# Brace-expand it.
match-one-phone() {
    while read -r x
    # No, I don't like this. But our input is assumed safe and
    # has been cleaned. That said, I'd be happy to replace it.
    do eval echo "$x"
    done < <(_match-one-phone "$@") | sed 'y/ %/\n /' | tr -d {} ;}

# Go over the whole phone list.
main() {
    [[ "$1" == "-u" ]] && { filter=cat; shift ;} || filter=deumlautify
    makeaa "${2:-in/sample-dict}"
    while read -r aphone
    do match-one-phone "$aphone" | sed -E "/./ s|^|$aphone: |"
    done < "${1:-in/sample-input}" | "$filter" ;}

# Run it.
LC_ALL=C main "$@"
```
(In all the following, I'll show only the commented versions.)
Why not simply indexed arrays instead?
I quickly tried indexed arrays for comparison, but gave up. Problems:
- Assignments get slower as n goes up. This was also the case with associative arrays, but here much, much slower: after about 10k entries it choked, whereas the associative version handled 73k fine.
- Numbers generated from the conversion aren't ready to be used as indices, because:
  - Longer words generate longer integers, which exceed the integer limit. Judging by the error messages, somewhere around 20 letters was the maximum, but by specification the dictionary could have words of up to 50.
  - Integer indices require managing decodings with leading zeros (simple to solve, though).
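That ~20-letter ceiling is consistent with Bash's 64-bit signed arithmetic, whose largest value has only 19 digits; a sketch of the limit:

```shell
#!/usr/bin/env bash
# Bash arithmetic is 64-bit signed, so the largest representable index
# is 2^63-1, a 19-digit number; anything longer overflows.
echo "$((2**63 - 1))"                  # → 9223372036854775807
echo "$((9223372036854775807 + 1))"    # silently wraps to -9223372036854775808
```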
I did it because I was curious to test Bash's limitations. But then... I caught myself engaged in a hackish attempt to work around it by manually excluding long words with something like this:
```shell
sed "/Abänderungsanträge/d
     /Abänderungsvorschlag/d
     /Abbaufortschrittsmessung/d
     /Abfertigungsschalter/d
     (...)
     /Entbindungsabteilung/d
     /Integrationsfähigkeit/d
     (...)" <<<"$DICT"
```
upon which I realized that the right thing to do was to drop it and try something else. After all, it wouldn't be fair to the rightful owners of phone numbers 0-Vortriebsgeschwindigkeit and neu-5-achtundsiebzigjährig...
Solution 2: Associative arrays — generalized
I wondered if having more arrays would improve speed. Since lookup is in linear time, it should be faster to first find out which of 10 arrays the answer is in and then search a 7.3k-word array than to search a single 73k-word array. Or, better yet, which of 100 730-word arrays.
So I coded a general solution that takes as input n, the base-10 exponent of the number of arrays. With n=0 it should roughly match the speed of the previous solution.
It makes no sense to use n=3 here, of course: 1000 73-word arrays shouldn't help much, and beyond that even less, so I didn't try.
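The dispatch itself hinges on a nameref (`declare -n`, Bash 4.3+) picking an array by the key's first digits. A stripped-down sketch of the idea, with n=1 and toy data:

```shell
#!/usr/bin/env bash
# Shard a lookup table into arrays named by first digit (n=1); a nameref
# then picks the right shard.  The keys and words here are toy data.
declare -A AA3 AA4
AA3["30883"]="sells"
AA4["490"]="the"

lookup() {                            # lookup KEY
    local key="$1"
    declare -n shard="AA${key::1}"    # nameref: requires Bash 4.3+
    echo "${shard[$key]}"
}

lookup 30883    # → sells
lookup 490      # → the
```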
```shell
#!/usr/bin/env bash
#
# SPDX-FileCopyrightText: © flandrew
# SPDX-License-Identifier: GPL-3.0-or-later
#
## p2w-aarray-many.sh
## Convert phone numbers to words

# Replace [AOUaou]\" with properly-umlauted single characters, and vice-versa.
# (A bit too hardcoded for my taste, but not worth generalizing it here.)
umlautify()   { sed 's/A"/Ä/g;s/a"/ä/g;s/O"/Ö/g;s/o"/ö/g;s/u"/ü/g;s/U"/Ü/g';}
deumlautify() { sed 's/Ä/A"/g;s/ä/a"/g;s/Ö/O"/g;s/ö/o"/g;s/ü/u"/g;s/Ü/U"/g';}

# Convert letters to digits according to the given rules.
# Because of LC_ALL=C, umlauted characters are multibyte.
2dig() { sed "s/ä/5/g;s/ü/7/g;s/ö/8/g;s/Ä/5/g;s/Ü/7/g;s/Ö/8/g" |
             tr A-Z a-z |
             tr jnqrwxdsyftamcivbkulopghze 11122233344556667778889990 ;}

# Make 10ⁿ associative arrays, declare them as global variables: key=digits of
# original dictionary, value=words. Though not required, I sanitize the
# dictionary for safety. Instead of looping through lines or through indices
# of a pair of indexed arrays and setting each pair of the associative array
# one by one, I generate the string that would set all key–value pairs at once
# — then evaluate that. $1=dictionary file; $2=exponent number of arrays
makeaa() {
    local dic kvp  # kvp: key–value pair
    read -rd'\0' dic < <(< "$1" tr -cd 'a-zA-Z"\n' | umlautify)
    read -rd'\0' kvp < <(paste -d: <(2dig <<<"$dic") <(cat <<<"$dic"))
    case "$2" in
        0) : "$(<<<"$kvp" sed -E 's|(.*):(.*)|["\1"]+="\2 "|')"
           . <(printf 'declare -gA AA; AA=(%s)' "$_") ;;
        *) while read -r i
           do : "$(<<<"$kvp" grep "^$i")"
              : "$(<<<"$_" sed -E 's|(.*):(.*)|["\1"]+="\2 "|')"
              . <(printf 'declare -gA AA%s; AA%s=(%s)' "$i" "$i" "$_")
           done < <(seq -w 0 "$((10**$2-1))")
    esac ;}

# If empty, returns nothing; if one entry, return it; if more than one entry,
# commatize it and brace it. This way, we can "abuse" Bash's brace expansion.
bracify() {
    : "$(tr '\n' ,)"
    [[ "$_" =~ ^,*$ ]] && : "" || {
        [[ "$_" =~ ^[^,]*,$ ]] && : "${_%%,}" || : "{${_%%,}}"
    } && printf "%s" "$_" ;}

# Solve the problem for a single phone number.
_match-one-phone() {  # npd = not previously a digit
    phone="$(tr -cd '0-9'<<<"$1")" szp="${#phone}" pre="$3";: "${npd:=true}"
    declare -n ARRAYNAME="AA${phone::$pre}"
    # List words matching all substrings of phone starting with its 1st digit.
    read -rd'\0' res < <(for ((i=szp;i>0;i--))
                         do printf "%s " "${ARRAYNAME[${phone::$i}]}"
                         done | grep -Eo "[^ ]+")
    # No matches? Accept 1st digit, remember that this option has been used.
    (("${#res}">0)) && npd=true || {
        "$npd" && { npd=false; res="${phone::1}" ;} || return ;}
    # UTF-8 here, or umlauted characters would not count as one.
    for ((i=szp;i>0;i--))
    do  car="$(LC_ALL="en_US.UTF-8" grep -Pix ".{$i}" <<< "$res")"
        [[ "$car" ]] && {
            bracify <<< "$car"
            # If equal to phone size, there's no right substring.
            ((i==szp)) || {
                printf "%s" "%"  # % here is just a separator.
                # The below recursively calls this very
                # function on the remaining digits of the string.
                cdr="$("${FUNCNAME[0]}" "${phone:$i}" _ "$pre")"
                [[ "$cdr" ]] && bracify <<< "$cdr" ;}; echo ;}
    done | sed -E '/%$/d' ;}

# Brace-expand it.
match-one-phone() {
    while read -r x
    # No, I don't like this. But our input is assumed safe and
    # has been cleaned. That said, I'd be happy to replace it.
    do eval echo "$x"
    done < <(_match-one-phone "$@") | sed 'y/ %/\n /' | tr -d {} ;}

# Go over the whole phone list.
# Usage: main [-u] [input_file] [dictionary_file] [n]
# Where 10ⁿ arrays (default: n=1)
main() {
    [[ "$1" == "-u" ]] && { filter=cat; shift ;} || filter=deumlautify
    makeaa "${2:-in/sample-dict}" "${3:-1}"
    while read -r aphone
    do match-one-phone "$aphone" _ "${3:-1}" | sed -E "/./ s|^|$aphone: |"
    done < "${1:-in/sample-input}" | "$filter" ;}

# Run it.
LC_ALL=C main "$@"
```
Solution 3: Simple grep
What if instead of fancier data structures we simply grepped the result directly, from the one big dictionary text file? That's what I do here.
```shell
#!/usr/bin/env bash
#
# SPDX-FileCopyrightText: © flandrew
# SPDX-License-Identifier: GPL-3.0-or-later
#
## p2w-grep.sh
## Convert phone numbers to words

# Replace [AOUaou]\" with properly-umlauted single characters, and vice-versa.
# (A bit too hardcoded for my taste, but not worth generalizing it here.)
umlautify()   { sed 's/A"/Ä/g;s/a"/ä/g;s/O"/Ö/g;s/o"/ö/g;s/u"/ü/g;s/U"/Ü/g';}
deumlautify() { sed 's/Ä/A"/g;s/ä/a"/g;s/Ö/O"/g;s/ö/o"/g;s/ü/u"/g;s/Ü/U"/g';}

# Convert letters to digits according to the given rules.
# Because of LC_ALL=C, umlauted characters are multibyte.
2dig() { sed "s/ä/5/g;s/ü/7/g;s/ö/8/g;s/Ä/5/g;s/Ü/7/g;s/Ö/8/g" |
             tr A-Z a-z |
             tr jnqrwxdsyftamcivbkulopghze 11122233344556667778889990 ;}

# Create a colon-separated lookup table from the digits words to the original.
makelook() {
    read -rd'\0' dic < <(<"$1" tr -cd 'a-zA-Z"\n' | umlautify)
    paste -d: <(2dig <<< "$dic") <(cat <<< "$dic") | sort -d ;}

# If empty, returns nothing; if one entry, return it; if more than one entry,
# commatize it and brace it. This way, we can "abuse" Bash's brace expansion.
bracify() {
    : "$(tr '\n' ,)"
    [[ "$_" =~ ^,*$ ]] && : "" || {
        [[ "$_" =~ ^[^,]*,$ ]] && : "${_%%,}" || : "{${_%%,}}"
    } && printf "%s" "$_" ;}

# Solve the problem for a single phone number.
_match-one-phone() {  # npd = not previously a digit
    phone="$(tr -cd 0-9 <<< "$1")" szp="${#phone}";: "${npd:=true}"
    # List words matching all substrings of phone starting with its 1st digit.
    read -rd'\0' res < <(for ((i=szp;i>0;i--))
                         do grep "^${phone::$i}:" <<< "$DLOOK"
                         #  do sed -n "/^${phone::$i}:/p" <<< "$DLOOK"
                         done | cut -d: -f2)
    # No matches? Accept 1st digit, remember that this option has been used.
    (("${#res}">0)) && npd=true || {
        "$npd" && { npd=false; res="${phone::1}" ;} || return ;}
    # UTF-8 here, or umlauted characters would not count as one.
    for ((i=szp;i>0;i--))
    do  car="$(LC_ALL="en_US.UTF-8" grep -Pix ".{$i}" <<< "$res")"
        [[ "$car" ]] && {
            bracify <<< "$car"
            ((i==szp)) || {
                printf "%s" "%"  # % here is just a separator.
                # The below recursively calls this very
                # function on the remaining digits of the string.
                cdr="$("${FUNCNAME[0]}" "${phone:$i}")"
                [[ "$cdr" ]] && bracify <<< "$cdr" ;}; echo ;}
    done | sed -E '/%$/d' ;}

# Brace-expand it.
match-one-phone() {
    while read -r x
    # No, I don't like this. But our input is assumed safe and
    # has been cleaned. That said, I'd be happy to replace it.
    do eval echo "$x"
    done < <(_match-one-phone "$@") | sed 'y/ %/\n /' | tr -d {} ;}

# Go over the whole phone list.
main() {
    [[ "$1" == "-u" ]] && { filter=cat; shift ;} || filter=deumlautify
    phones="${1:-in/sample-input}" dic="${2:-in/sample-dict}"
    declare -g DLOOK; DLOOK="$(makelook "$dic")"
    while read -r aphone
    do match-one-phone "$aphone" | sed -E "/./ s|^|$aphone: |"
    done < "$phones" | "$filter" ;}

# Run it.
LC_ALL=C main "$@"
```
Solution 4: Abusing the filesystem
So Bash doesn't really have trees. What a shame.
But do you know what does have trees? The filesystem. Access to filenames is "instantaneous". So what if we store each word as a directory named after itself, whose parent is its digits' translation?
This doesn't conform to the specified instruction of loading the whole dictionary into memory. But hey, we'll do it anyway to see what happens.
Let me illustrate. Suppose that our dictionary is very small and has only six words. Here they are, preceded by their computed digits according to the given translation:
```
3962803   Shirley
30883     sells
305390883 seashells
81        on
490       the
30539820  seashore
```
A phone number of "3962803" could thus be translated to "Shirley", so we want to create this empty directory:
./PMHT/3962803/Shirley
where PMHT stands for "Poor Man's Hash Table". Of course.
The beauty of this idea is that creating this "hash table" is simplicity itself. Given the pairs above, in space-separated fields, stored as a variable:
```shell
space_table="\
3962803 Shirley
30883 sells
305390883 seashells
81 on
490 the
30539820 seashore"
slash_table="$(sed -E "s| +|/|" <<< "$space_table")"
```
a robust way to create your filesystem-based table is this:
```shell
mkdir -p ./PMHT && cd ./PMHT || exit
set --
while IFS='' read -r -d '' pair
do set -- "$@" "$pair"
done < <(tr '\n' '\0' <<< "$slash_table")
# mkdir -p "3962803/Shirley" "30883/sells" . . .
mkdir -p "$@"
```
This will pass all pairs, properly quoted, to a single call of mkdir. Then tree gives us:
```
.
├── 305390883
│   └── seashells
├── 30539820
│   └── seashore
├── 30883
│   └── sells
├── 3962803
│   └── Shirley
├── 490
│   └── the
└── 81
    └── on

12 directories, 0 files
```
Passing one argument at a time to mkdir will be slow. We could manually chunk it:
```shell
# Space-separated chunks of 10000.  Assumes sanitized
# input with no strange characters, etc.
while read -r chunk
do set -- $chunk
   mkdir -p "$@"
done < <(chunksize=10000; xargs -n"$chunksize" <<< "$slash_table")
```
For 73k words, the string generation itself took less than 2 seconds on an old machine I used for testing, whereas passing it to mkdir and creating the directories took 2 minutes. Chunking speeds it up and helps avoid exceeding argument-list limits.
We can do better yet with GNU parallel, which will handle quoting and do it faster:
```shell
parallel -m mkdir -p {} <<< "$slash_table"
```
This took less than a minute to create.
Speaking of shells, here is the code:
```shell
#!/usr/bin/env bash
#
# SPDX-FileCopyrightText: © flandrew
# SPDX-License-Identifier: GPL-3.0-or-later
#
## p2w-fsystem-ascii.sh
## Convert phone numbers to words

# Replace [AOUaou]\" with properly-umlauted single characters.
# (A bit too hardcoded for my taste, but not worth generalizing it here.)
umlautify() { sed 's/A"/Ä/g;s/a"/ä/g;s/O"/Ö/g;s/o"/ö/g;s/u"/ü/g;s/U"/Ü/g';}

# Convert letters to digits according to the given rules.
# Because of LC_ALL=C, umlauted characters are multibyte.
2dig() { tr A-Z a-z | tr -d \" |
             tr jnqrwxdsyftamcivbkulopghze 11122233344556667778889990 ;}

# Create a filesystem-based hash table.
makefshash() {
    local dic; HTDIR="$2"
    read -rd'\0' dic < <(<"$1" tr -cd 'a-zA-Z"\n')
    mkdir -p "$HTDIR" &&
        (cd "$HTDIR" || { echo "Oops" >&2; exit;}
         paste <(2dig <<< "$dic" | sed -E "s|^.|&/&|") -d/ \
               <(tr \" = <<< "$dic") | parallel -m mkdir -p {}) ;}

# If empty, returns nothing; if one entry, return it; if more than one entry,
# commatize it and brace it. This way, we can "abuse" Bash's brace expansion.
bracify() {
    : "$(tr '\n' ,)"
    [[ "$_" =~ ^,*$ ]] && : "" || {
        [[ "$_" =~ ^[^,]*,$ ]] && : "${_%%,}" || : "{${_%%,}}"
    } && printf "%s" "$_" ;}

# Solve the problem for a single phone number.
_match-one-phone() {  # npd = not previously a digit
    phone="$(tr -cd 0-9 <<< "$1")" szp="${#phone}";: "${npd:=true}"
    # List words matching all substrings of phone starting with its 1st digit.
    read -rd'\0' res < <(for ((i=szp;i>0;i--))
                         do cd "$HTDIR/${phone::1}/${phone::$i}/" 2>/dev/null \
                                && printf "%s\n" *  # glob!
                         done)
    # No matches? Accept 1st digit, remember that this option has been used.
    (("${#res}">0)) && npd=true || {
        "$npd" && { npd=false; res="${phone::1}" ;} || return ;}
    for ((i=szp;i>0;i--))
    do  car="$(grep -Pix "([^=]=*){$i}" <<< "$res")"
        [[ "$car" ]] && {
            bracify <<< "$car"
            ((i==szp)) || {
                printf "%s" "%"  # % here is just a separator.
                # The below recursively calls this very
                # function on the remaining digits of the string.
                cdr="$("${FUNCNAME[0]}" "${phone:$i}")"
                [[ "$cdr" ]] && bracify <<< "$cdr" ;}; echo ;}
    done | sed -E '/%$/d' ;}

# Brace-expand it.
match-one-phone() {
    while read -r x
    # No, I don't like this. But our input is assumed safe and
    # has been cleaned. That said, I'd be happy to replace it.
    do eval echo "${x//=/\\\"}"
    done < <(_match-one-phone "$@") | sed 'y/ %/\n /' | tr -d {} ;}

# Go over the whole phone list.
# Usage: main [-u] [phones_file] [dictionary_file] [hashtable_path]
# All arguments are optional. Order above is strict.
main() {
    [[ "$1" == "-u" ]] && { filter=umlautify; shift ;} || filter=cat
    phones="${1:-in/sample-input}" dic="${2:-in/sample-dict}"
    [[ -f "$dic"    ]] || { echo "File $dic missing"    >&2; exit ;}
    [[ -f "$phones" ]] || { echo "File $phones missing" >&2; exit ;}
    # NOTE: 1. "Hash table"'s root path is hardcoded here if not chosen.
    #       2. It only creates paths if directory doesn't exist.
    #       3. Intelligently detects new dictionaries for hashing.
    # PMHT = "Poor Man's Hash Table"...
    htroot="${3:-$HOME/tmp/PMHT}"; declare -g HTDIR
    HTDIR="$htroot/wc_$(wc -l <"$dic")/$(sha256sum "$dic"| cut -c1-20)"
    # Directory would then end up being something like this:
    # $HOME/tmp/PMHT/wc_73113/291b4e05c0638d586e4a
    [[ -d "$HTDIR" ]] || { echo "Creating hash path: $HTDIR" >&2
                           makefshash "$dic" "$HTDIR" && echo "Done">&2;}
    while read -r aphone
    do match-one-phone "$aphone" | sed -E "/./ s|^|$aphone: |"
    done < "$phones" | "$filter" ;}

# Run it.
LC_ALL=C main "$@"
```
Solution 5: Binary search
There're no hash tables, but we could have binary search. For that, I use BSD's look, which is made for dictionary search. The phone string is searched against a file with two fields (key:value), generated from the dictionary. I create an actual file because there's no way I know of to make look use a dictionary stored in memory.
So this works:
```shell
look -ft: key dictdatafile   # Some actual dictionary file,
                             # colon-separated (key:value)
```
whereas none of these work:
```shell
dictdata="\
aba:122
key:123
key:456
def:777
"
look -ft: key < <(echo -e "$dictdata")          # Process substitution
look -ft: key <(echo -e "$dictdata")            # Process substitution
look -ft: key <<< "$dictdata"                   # Here-strings
echo -e "$dictdata" | look -ft: key             # STDIN implicit
echo -e "$dictdata" | look -ft: key -           # STDIN explicit
echo -e "$dictdata" | look -ft: key /dev/stdin  # STDIN explicit (fake file)
mkfifo dictdatafifo                             # Named pipes
echo -e "$dictdata" > dictdatafifo
look -ft: key dictdatafifo
```
This makes sense: look does its binary search by seeking around a regular file, and pipes and here-strings aren't seekable. But the instructions say the dictionary must be loaded into memory, so here I had to deviate from them.
Couldn't we have a function implementing binary search in Bash itself and receiving input from memory instead of from a file? We could, and I later ended up coding one such function, and it's very slow, as I suspected it would be.
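For the curious, here's a sketch of what such a function can look like: a binary search over a sorted Bash array of key:value lines. This is an illustration of the idea, not the exact function I wrote.

```shell
#!/usr/bin/env bash
# Binary search over a sorted array of "key:value" lines, in pure Bash.
# Prints the value for an exact key match; returns 1 if not found.
bsearch() {                       # bsearch KEY line...
    local key="$1"; shift
    local -a tbl=("$@")
    local lo=0 hi=$(( ${#tbl[@]} - 1 )) mid k
    while (( lo <= hi )); do
        mid=$(( (lo + hi) / 2 ))
        k="${tbl[mid]%%:*}"       # the key part of the middle line
        if   [[ "$k" < "$key" ]]; then lo=$((mid + 1))
        elif [[ "$k" > "$key" ]]; then hi=$((mid - 1))
        else printf '%s\n' "${tbl[mid]#*:}"; return 0
        fi
    done
    return 1
}

table=(122:aba 123:key 456:kez 777:def)   # must be sorted by key
bsearch 456 "${table[@]}"    # → kez
```

Even so, every probe is string slicing plus a comparison in an interpreted loop, which is exactly why it crawls next to look.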
I also didn't want to reach for less usual utilities. So look it was.
```shell
#!/usr/bin/env bash
#
# SPDX-FileCopyrightText: © flandrew
# SPDX-License-Identifier: GPL-3.0-or-later
#
## p2w-look.sh
## Convert phone numbers to words

# Replace [AOUaou]\" with properly-umlauted single characters, and vice-versa.
# (A bit too hardcoded for my taste, but not worth generalizing it here.)
umlautify()   { sed 's/A"/Ä/g;s/a"/ä/g;s/O"/Ö/g;s/o"/ö/g;s/u"/ü/g;s/U"/Ü/g';}
deumlautify() { sed 's/Ä/A"/g;s/ä/a"/g;s/Ö/O"/g;s/ö/o"/g;s/ü/u"/g;s/Ü/U"/g';}

# Convert letters to digits according to the given rules.
# Because of LC_ALL=C, umlauted characters are multibyte.
2dig() { sed "s/ä/5/g;s/ü/7/g;s/ö/8/g;s/Ä/5/g;s/Ü/7/g;s/Ö/8/g" |
             tr A-Z a-z |
             tr jnqrwxdsyftamcivbkulopghze 11122233344556667778889990 ;}

# Create a colon-separated lookup table from the digits words to the original.
makelook() {
    read -rd'\0' dic < <(<"$1" tr -cd 'a-zA-Z"\n' | umlautify)
    paste -d: <(2dig <<< "$dic") <(cat <<< "$dic") | sort -d ;}

# If empty, returns nothing; if one entry, return it; if more than one entry,
# commatize it and brace it. This way, we can "abuse" Bash's brace expansion.
bracify() {
    : "$(tr '\n' ,)"
    [[ "$_" =~ ^,*$ ]] && : "" || {
        [[ "$_" =~ ^[^,]*,$ ]] && : "${_%%,}" || : "{${_%%,}}"
    } && printf "%s" "$_" ;}

# Solve the problem for a single phone number.
_match-one-phone() {  # npd = not previously a digit
    dict="$1" phone="$(tr -cd 0-9 <<< "$2")" szp="${#phone}";: "${npd:=true}"
    # List words matching all substrings of phone starting with its 1st digit.
    read -rd'\0' res < <(for ((i=szp;i>0;i--))
                         do look -ft: "${phone::$i}:" "$dict"
                         done | cut -d: -f2)
    # No matches? Accept 1st digit, remember that this option has been used.
    (("${#res}">0)) && npd=true || {
        "$npd" && { npd=false; res="${phone::1}" ;} || return ;}
    # UTF-8 here, or umlauted characters would not count as one.
    for ((i=szp;i>0;i--))
    do  car="$(LC_ALL="en_US.UTF-8" grep -Pix ".{$i}" <<< "$res")"
        [[ "$car" ]] && {
            bracify <<< "$car"
            ((i==szp)) || {
                printf "%s" "%"  # % here is just a separator.
                # The below recursively calls this very
                # function on the remaining digits of the string.
                cdr="$("${FUNCNAME[0]}" "$1" "${phone:$i}")"
                [[ "$cdr" ]] && bracify <<< "$cdr" ;}; echo ;}
    done | sed -E '/%$/d' ;}

# Brace-expand it.
match-one-phone() {
    while read -r x
    # No, I don't like this. But our input is assumed safe and
    # has been cleaned. That said, I'd be happy to replace it.
    do eval echo "$x"
    done < <(_match-one-phone "$@") | sed 'y/ %/\n /' | tr -d {} ;}

# Go over the whole phone list.
main() {
    [[ "$1" == "-u" ]] && { filter=cat; shift ;} || filter=deumlautify
    phones="${1:-in/sample-input}" dic="${2:-in/sample-dict}"
    dlook="${dic}-$$"; makelook "$dic" > "$dlook"
    while read -r aphone
    do match-one-phone "$dlook" "$aphone" | sed -E "/./ s|^|$aphone: |"
    done < "$phones" | "$filter"; rm "$dlook" ;}

# Run it.
LC_ALL=C main "$@"
```
Solution 6: Binary search — parallelized
Since my version above processes the phones sequentially, one simple way to improve speed would be to process them in parallel. For that, we'll use GNU parallel.
The only modifications needed to the code above are at the end, after match-one-phone():
```shell
# Auxiliary
_mop() { match-one-phone "$1" "$2" | sed -E "/./ s|^|$2: |" ;}

# Exports
export -f bracify {_,}match-one-phone _mop

# Go over the whole phone list.
main() {
    [[ "$1" == "-u" ]] && { filter=cat; shift ;} || filter=deumlautify
    phones="${1:-in/sample-input}" dic="${2:-in/sample-dict}"
    dlook="${dic}-$$"; makelook "$dic" > "$dlook"
    < "$phones" parallel -k _mop "$dlook" {} | "$filter"
    rm "$dlook" ;}

# Run it.
LC_ALL=C main "$@"
```
This should divide runtime roughly by the number of cores of your machine. Nice, but still a far cry from byte-compiled Elisp.
Meta-script?
Given that the solutions above have many parts in common, I considered creating a unifying script with all the common parts and their small variations, plus a pick-your-algorithm selection menu that mixes and matches the functions. Bash's select (a shell keyword; try help select) is great for this sort of thing. I used it in my Designing a deck of cards in Bash.
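For reference, a minimal sketch of such a menu (the option names here are made up): select prints a numbered menu, reads a choice from stdin, and loops until you break.

```shell
#!/usr/bin/env bash
# A pick-your-algorithm menu with the select keyword.  The menu and the
# prompt go to stderr; the choice number is read from stdin.
PS3="Algorithm? "
select algo in aarray grep look fsystem
do  [[ "$algo" ]] && { echo "running: $algo"; break; }
done
```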
But I had already spent the previous night dreaming about the unfathomable expansions of nested braces, and other types of delimiters were becoming jealous and pleading for my attention. So I thought wrapping it up and going back to parentheses was the sensible thing to do.
Worth storing?
I pondered whether the algorithms could be improved. I wasn't storing intermediate results — namely, the results of the digits-to-string conversions of all the digits-substrings of every phone number. So for example, if there were two numbers "789237" and "789123393789", the "789" conversion to alpha might end up being calculated three times from all the splits.
Maybe this was too costly? If so, it would have been better to store all of them as we go, and look them up to avoid recalculations. Worth the trouble?
Let's check.
```shell
#!/usr/bin/env bash
#
# SPDX-FileCopyrightText: © flandrew
# SPDX-License-Identifier: GPL-3.0-or-later
#
## worth-storing.sh
## Check whether it's worth storing intermediate results.

just-dig() { tr -cd '[0-9]\n' ;}

all-subs() { while read -r w; do l="${#w}"
                 for ((k=l;k>0;k--)); do j="$((l-k))"
                     echo -e "${w::k}\n${w:k:j}"
                 done | grep .
             done ;}

get-stats() { while read -r sub
              do echo "${#sub}"
              done | datamash min 1 median 1 max 1 ;}  # GNU datamash

check-it() {
    local I S U D
    I="$(just-dig < "$1")"
    S="$(all-subs <<< "$I" | sort)"
    U="$(uniq <<< "$S")"
    D="$(comm -23 <(echo "$S") <(echo "$U"))"
    (echo "s count min med max"
     for set in I S U D
     do echo "$set"
        wc -l <<< "${!set}"
        get-stats <<< "${!set}"
     done | xargs -n5) | column -ts$' ' ;}

check-it "in/input.txt"
```
Which returns:
| s | count | min | med | max |
|---|-------|-----|-----|-----|
| I | 1000  | 1   | 22  | 46  |
| S | 41878 | 1   | 13  | 46  |
| U | 36830 | 1   | 14  | 46  |
| D | 5048  | 1   | 2   | 6   |
So from a total of 41878 substrings generated by our 1000 phone numbers, 5048 (12%) are repetitions. As a sanity check, the median size of all the substrings we are calculating is 13, whereas the median size of the repetitions is 2.
It makes sense. Since these "phone numbers" seem to be randomly generated, I'd expect that for every added digit the probability of repetition would fall roughly by 90%. In other words, shorter substrings are much more likely to be repeated. But length would hardly matter for the hash lookups, so our savings would be ~12%, minus the cost of looking up (and, if not found, storing) all the ~42k substrings as we go along.
But wait, this is not quite right. Sometimes the left-side substring will match no words (even after inserting one digit), so the right side won't be tested. Above we calculated with all left- and right-side tests. Here it is with only the left-side tests:
```shell
# Note: to do only the left-side tests,
# replace the previous function with this one:
all-subs() { while read -r w; do l="${#w}"
                 for ((k=l;k>0;k--)); do echo "${w::k}"
                 done
             done ;}
```
and:
| s | count | min | med | max |
|---|-------|-----|-----|-----|
| I | 1000  | 1   | 22  | 46  |
| S | 21439 | 1   | 13  | 46  |
| U | 19143 | 1   | 14  | 46  |
| D | 2296  | 1   | 2   | 6   |
so that 2296÷21439 ≈ 11% are repetitions. The actual ratio lies somewhere between 11% and 12%.
Not worth it.
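Had it been worth it, the cache would be just one more associative array in front of the lookup. A sketch with a stand-in lookup function (CACHE, slow_lookup and cached_lookup are my illustrative names):

```shell
#!/usr/bin/env bash
# Sketch: memoize digits->words conversions in an associative array.
declare -A CACHE
calls=0
slow_lookup() {               # stand-in for the real dictionary search
    (( ++calls ))
    REPLY="words-for-$1"      # result goes in REPLY to avoid a subshell
}
cached_lookup() {
    local key="$1"
    if [[ ! -v CACHE[$key] ]]; then
        slow_lookup "$key"
        CACHE[$key]="$REPLY"
    fi
    echo "${CACHE[$key]}"
}

cached_lookup 789 >/dev/null
cached_lookup 789 >/dev/null  # same substring: served from the cache
echo "$calls"                 # → 1
```

Note the REPLY convention: capturing slow_lookup with `$( )` would fill the cache in a subshell and throw it away.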
Perhaps with other datasets it would be. Say, if it were a city phone-book with oft-repeated prefixes, and a high number of smaller-sized phone numbers, and the code needed to be run often for the whole bulk — instead of just once and then for a few entries once in a while.
Ok, so that was about relative speed improvements. But I haven't said much so far about absolute speeds. How do the solutions compare with each other?
And how does Bash's speed compare to Elisp's for this problem? We'll see this next.