Phones-to-Words Challenge II: Bash as a horribly-slow-but-surprisingly-concise alternative to Lisp and Java
"Bash has no native libraries."
"It's hard to read."
"Too slow!"
I could go on.
But boy, how concise it can be!
After having done in Emacs Lisp the exercise from this old Lisp as an alternative to Java study, I decided to tackle it in Bash. I wondered what it would look like. Would it be too slow? How would I approach these recursive tree structures and table look-ups?
Here's what I found out.
Brace yourself
Excluding commented and empty lines, the size of Lispers' versions ranged from 50 to 181 lines of code, with median 134. My first Bash version ended up with 42 LoC — a bit over half the 72 LoC of my Emacs Lisp solution. I ended up producing some seven "workable" versions of it, ranging from 38 to 48 LoC, median 40.
Bash has no tree-like data types. There're strings, there're arrays, and that's it.
But! Nested braces are isomorphic to tree structures. So my approach was to abuse Bash's brace expansion and recursiveness to emulate branching and Cartesian products.
This sounds too abstract, so let me give you an example. Here's a Lisp tree:
```elisp
'("a" ("bc" "e" "f") ("bd" "e" "f"))
;; Structure:
;; a
;; ├─bc
;; │ ├─e
;; │ └─f
;; └─bd
;;   ├─e
;;   └─f
```
So if these were solutions to a phone number 1234 with translation 1:a, 2:b, 3:[cd], 4:[ef], we'd output this:
```
1234: a bc e
1234: a bc f
1234: a bd e
1234: a bd f
```
Good. Now here's the exact same representation in Bash using braces — and a one-liner that expands it to the exact same output as above:
```shell
phone="1234"
tree="a%{bc%{e,f},bd%{e,f}}"  # % = separator
eval echo "$tree" | sed "y/ %/\n /" | sed "s/^/$phone: /"
```
Now, in our phone problem, chunks of matches are independent of each other. So we could, alternatively, represent things as a list of flat lists, and the actual output would be the Cartesian product of the internal lists.
In Elisp, this:
```elisp
(require 'dash)
(let* ((output "")
       (phone "1234")
       (levels '(("a") ("bc" "bd") ("e" "f")))
       (solutions (apply '-table-flat 'list levels)))
  (dolist (sol solutions)
    (setq output (concat output phone ": " (string-join sol " ") "\n")))
  output)
```
And in Bash, this:
```shell
phone="1234"                      # % = separator
levels="a%{bc,bd}%{e,f}"          # this is flat, for Cartesian product; same expansion
eval echo "$levels" | sed "y/ %/\n /" | sed "s/^/$phone: /"
```
In practice it's actually a bit more complicated than that, because of different-sized chunkings. For example, three digits could become either 3, 2+1 or 1+2 words.
So we'll have a mix. Could end up looking somewhat like this:
```shell
phone="12345678"
# 1:x 2:y 3:[qz] 4:[mi] 5:[ky] 6:[cn] 7:[ao] 8:[tw]
echo x%{yz,yq}%{my%cat,I%know} | sed "y/ %/\n /" | sed "s/^/$phone: /"
```
```
12345678: x yz my cat
12345678: x yz I know
12345678: x yq my cat
12345678: x yq I know
```
You get the idea.
From here, the problem sort of reduces itself to generating a $solutions string for expansion. But first, we still need to decide how to do the look-ups.
In Emacs Lisp this was a no-brainer: it has hash tables (and the good ht library), and that's it. I didn't even think about other options. Whereas Bash... well, doesn't.
This left me curious about how the different worse-but-that's-what-we-got solutions would stack up against each other. So there we go.
The solutions
Solution 1: Associative arrays
My first approach was to use an associative array. This is an option since Bash 4.3. I chose to restore umlauts to the German words, so that there's a one-to-one correspondence of digits to characters. This also helps bypass some quoting and escaping problems with the umlauts-as-double-quotes of the source dictionary file. I keep a flag option for returning umlauted results if desired, but to strictly follow the given instructions, I restore the original form by default.
So the values will be the umlauted words. The keys will be the digits after the specified conversion table.
I run it with LC_ALL=C for speed improvements. This forces me to restore UTF-8 in one place to avoid multibyte problems.
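The multibyte problem is easy to see with byte versus character counts (a quick sketch):

```shell
#!/usr/bin/env bash
# Under UTF-8, "ä" is one character encoded as two bytes.  With LC_ALL=C,
# tools see the raw bytes, so lengths (and regexes like ".{n}") go wrong.
printf 'ä' | wc -c                      # 2: byte count
LC_ALL=C bash -c 's=ä; echo "${#s}"'    # 2: C locale counts bytes as chars
```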
Here's an almost comment-free version of the solution — in case you prefer it cleaner or want to try to understand the code without explanations. A highly-commented one follows it.
Uncommented:
```shell
umlautify()   { sed 's/A"/Ä/g;s/a"/ä/g;s/O"/Ö/g;s/o"/ö/g;s/u"/ü/g;s/U"/Ü/g';}
deumlautify() { sed 's/Ä/A"/g;s/ä/a"/g;s/Ö/O"/g;s/ö/o"/g;s/ü/u"/g;s/Ü/U"/g';}

2dig() { sed "s/ä/5/g;s/ü/7/g;s/ö/8/g;s/Ä/5/g;s/Ü/7/g;s/Ö/8/g" |
             tr A-Z a-z |
             tr jnqrwxdsyftamcivbkulopghze 11122233344556667778889990 ;}

makeaa() {
    local dic kvp; declare -gA AA  # kvp: key–value pair
    read -rd'\0' dic < <(< "$1" tr -cd 'a-zA-Z"\n' | umlautify)
    read -rd'\0' kvp < <(paste -d: <(2dig <<<"$dic") <(cat <<<"$dic"))
    : "$(sed -E 's|(.*):(.*)|["\1"]+="\2 "|' <<<"$kvp")"
    . <(printf 'AA=(%s)' "$_") ;}

bracify() {
    : "$(tr '\n' ,)"
    [[ "$_" =~ ^,*$ ]] && : "" || {
        [[ "$_" =~ ^[^,]*,$ ]] && : "${_%%,}" || : "{${_%%,}}"
    } && printf "%s" "$_" ;}

_match-one-phone() {
    phone="$(tr -cd '0-9'<<<"$1")" szp="${#phone}";: "${npd:=true}"
    read -rd'\0' res < <(for ((i=szp;i>0;i--))
                         do printf "%s " "${AA[${phone::$i}]}"
                         done | grep -Eo "[^ ]+")
    (("${#res}">0)) && npd=true || {
        "$npd" && { npd=false; res="${phone::1}" ;} || return ;}
    for ((i=szp;i>0;i--))
    do  car="$(LC_ALL="en_US.UTF-8" grep -Pix ".{$i}" <<< "$res")"
        [[ "$car" ]] && {
            bracify <<< "$car"
            ((i==szp)) || {
                printf "%s" "%"  # % here is just a separator.
                cdr="$("${FUNCNAME[0]}" "${phone:$i}")"
                [[ "$cdr" ]] && bracify <<< "$cdr" ;}; echo ;}
    done | sed -E '/%$/d' ;}

match-one-phone() {
    while read -r x
    do eval echo "$x"
    done < <(_match-one-phone "$@") | sed 'y/ %/\n /' | tr -d {} ;}

main() {
    [[ "$1" == "-u" ]] && { filter=cat; shift ;} || filter=deumlautify
    makeaa "${2:-in/sample-dict}"
    while read -r aphone
    do match-one-phone "$aphone" | sed -E "/./ s|^|$aphone: |"
    done < "${1:-in/sample-input}" | "$filter" ;}

LC_ALL=C main "$@"
```
Commented version:
```shell
#!/usr/bin/env bash
#
# SPDX-FileCopyrightText: © flandrew
# SPDX-License-Identifier: GPL-3.0-or-later
#
## p2w-aarray.sh
## Convert phone numbers to words

# Replace [AOUaou]\" with properly-umlauted single characters, and vice-versa.
# (A bit too hardcoded for my taste, but not worth generalizing it here.)
umlautify()   { sed 's/A"/Ä/g;s/a"/ä/g;s/O"/Ö/g;s/o"/ö/g;s/u"/ü/g;s/U"/Ü/g';}
deumlautify() { sed 's/Ä/A"/g;s/ä/a"/g;s/Ö/O"/g;s/ö/o"/g;s/ü/u"/g;s/Ü/U"/g';}

# Convert letters to digits according to the given rules.
# Because of LC_ALL=C, umlauted characters are multibyte.
2dig() { sed "s/ä/5/g;s/ü/7/g;s/ö/8/g;s/Ä/5/g;s/Ü/7/g;s/Ö/8/g" |
             tr A-Z a-z |
             tr jnqrwxdsyftamcivbkulopghze 11122233344556667778889990 ;}

# Make associative array, declare it as a global variable: key=digits of
# original dictionary, value=words. Though not required, I sanitize the
# dictionary for safety. Instead of looping through lines or through indices
# of a pair of indexed arrays and setting each pair of the associative array
# one by one, I generate the string that would set all key–value pairs at once
# — then evaluate that.
makeaa() {
    local dic kvp; declare -gA AA  # kvp: key–value pair
    read -rd'\0' dic < <(< "$1" tr -cd 'a-zA-Z"\n' | umlautify)
    read -rd'\0' kvp < <(paste -d: <(2dig <<<"$dic") <(cat <<<"$dic"))
    : "$(sed -E 's|(.*):(.*)|["\1"]+="\2 "|' <<<"$kvp")"
    . <(printf 'AA=(%s)' "$_") ;}

# If empty, returns nothing; if one entry, return it; if more than one entry,
# commatize it and brace it. This way, we can "abuse" Bash's brace expansion.
bracify() {
    : "$(tr '\n' ,)"
    [[ "$_" =~ ^,*$ ]] && : "" || {
        [[ "$_" =~ ^[^,]*,$ ]] && : "${_%%,}" || : "{${_%%,}}"
    } && printf "%s" "$_" ;}

# Solve the problem for a single phone number.
_match-one-phone() {  # npd = not previously a digit
    phone="$(tr -cd '0-9'<<<"$1")" szp="${#phone}";: "${npd:=true}"
    # List words matching all substrings of phone starting with its 1st digit.
    read -rd'\0' res < <(for ((i=szp;i>0;i--))
                         do printf "%s " "${AA[${phone::$i}]}"
                         done | grep -Eo "[^ ]+")
    # No matches? Accept 1st digit, remember that this option has been used.
    (("${#res}">0)) && npd=true || {
        "$npd" && { npd=false; res="${phone::1}" ;} || return ;}
    # UTF-8 here, or umlauted characters would not count as one.
    for ((i=szp;i>0;i--))
    do  car="$(LC_ALL="en_US.UTF-8" grep -Pix ".{$i}" <<< "$res")"
        [[ "$car" ]] && {
            bracify <<< "$car"
            # If equal to phone size, there's no right substring.
            ((i==szp)) || {
                printf "%s" "%"  # % here is just a separator.
                # The below recursively calls this very
                # function on the remaining digits of the string.
                cdr="$("${FUNCNAME[0]}" "${phone:$i}")"
                [[ "$cdr" ]] && bracify <<< "$cdr" ;}; echo ;}
    done | sed -E '/%$/d' ;}

# Brace-expand it.
match-one-phone() {
    while read -r x
    # No, I don't like this. But our input is assumed safe and
    # has been cleaned. That said, I'd be happy to replace it.
    do eval echo "$x"
    done < <(_match-one-phone "$@") | sed 'y/ %/\n /' | tr -d {} ;}

# Go over the whole phone list.
main() {
    [[ "$1" == "-u" ]] && { filter=cat; shift ;} || filter=deumlautify
    makeaa "${2:-in/sample-dict}"
    while read -r aphone
    do match-one-phone "$aphone" | sed -E "/./ s|^|$aphone: |"
    done < "${1:-in/sample-input}" | "$filter" ;}

# Run it.
LC_ALL=C main "$@"
```
(In all the following, I'll show only the commented versions.)
Why not simply indexed arrays instead?
I quickly tried indexed arrays for comparison, but gave up. Problems:
- Assignments get slower as n goes up. This was also the case with associative arrays, but here much, much slower: after about 10k entries it choked, whereas the associative version handled 73k fine.
- Numbers generated from the conversion aren't ready to be used as indices, because:
  - Longer words generate longer integers, which exceed the integer limit. Judging by the error messages, somewhere around 20 letters was the maximum, but by specification the dictionary could have words of up to 50.
  - Integer indices require managing decodings with leading zeros (simple to solve, though).
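That ~20-letter ceiling is consistent with Bash's 64-bit signed arithmetic, whose largest value has only 19 digits; a sketch of the limit:

```shell
#!/usr/bin/env bash
# Bash arithmetic is 64-bit signed, so the largest representable index
# is 2^63-1, a 19-digit number; anything longer overflows.
echo "$((2**63 - 1))"                  # → 9223372036854775807
echo "$((9223372036854775807 + 1))"    # silently wraps to -9223372036854775808
```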
I did it because I was curious to test Bash's limitations. But then... I caught myself engaged in a hackish attempt to work around it by manually excluding long words with something like this:
```shell
sed "/Abänderungsanträge/d
     /Abänderungsvorschlag/d
     /Abbaufortschrittsmessung/d
     /Abfertigungsschalter/d
     (...)
     /Entbindungsabteilung/d
     /Integrationsfähigkeit/d
     (...)" <<<"$DICT"
```
upon which I realized that the right thing to do was to drop it and try something else. After all, it wouldn't be fair to the rightful owners of phone numbers 0-Vortriebsgeschwindigkeit and neu-5-achtundsiebzigjährig...
Solution 2: Associative arrays — generalized
I wondered if having more arrays would improve speed. Since lookup is in linear time, it should be faster to first find out which of 10 arrays the answer is in and then search a 7.3k-word array than to search a single 73k-word array. Or, better yet, which of 100 730-word arrays.
So I coded a general solution that takes as input n, the base-10 exponent of the number of arrays. With n=0 it should roughly match the speed of the previous solution.
It makes no sense to use n=3 here, of course: 1000 73-word arrays shouldn't help much, and beyond that even less, so I didn't try.
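The dispatch itself hinges on a nameref (`declare -n`, Bash 4.3+) picking an array by the key's first digits. A stripped-down sketch of the idea, with n=1 and toy data:

```shell
#!/usr/bin/env bash
# Shard a lookup table into arrays named by first digit (n=1); a nameref
# then picks the right shard.  The keys and words here are toy data.
declare -A AA3 AA4
AA3["30883"]="sells"
AA4["490"]="the"

lookup() {                            # lookup KEY
    local key="$1"
    declare -n shard="AA${key::1}"    # nameref: requires Bash 4.3+
    echo "${shard[$key]}"
}

lookup 30883    # → sells
lookup 490      # → the
```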
```shell
#!/usr/bin/env bash
#
# SPDX-FileCopyrightText: © flandrew
# SPDX-License-Identifier: GPL-3.0-or-later
#
## p2w-aarray-many.sh
## Convert phone numbers to words

# Replace [AOUaou]\" with properly-umlauted single characters, and vice-versa.
# (A bit too hardcoded for my taste, but not worth generalizing it here.)
umlautify()   { sed 's/A"/Ä/g;s/a"/ä/g;s/O"/Ö/g;s/o"/ö/g;s/u"/ü/g;s/U"/Ü/g';}
deumlautify() { sed 's/Ä/A"/g;s/ä/a"/g;s/Ö/O"/g;s/ö/o"/g;s/ü/u"/g;s/Ü/U"/g';}

# Convert letters to digits according to the given rules.
# Because of LC_ALL=C, umlauted characters are multibyte.
2dig() { sed "s/ä/5/g;s/ü/7/g;s/ö/8/g;s/Ä/5/g;s/Ü/7/g;s/Ö/8/g" |
             tr A-Z a-z |
             tr jnqrwxdsyftamcivbkulopghze 11122233344556667778889990 ;}

# Make 10ⁿ associative arrays, declare them as global variables: key=digits of
# original dictionary, value=words. Though not required, I sanitize the
# dictionary for safety. Instead of looping through lines or through indices
# of a pair of indexed arrays and setting each pair of the associative array
# one by one, I generate the string that would set all key–value pairs at once
# — then evaluate that. $1=dictionary file; $2=exponent number of arrays
makeaa() {
    local dic kvp  # kvp: key–value pair
    read -rd'\0' dic < <(< "$1" tr -cd 'a-zA-Z"\n' | umlautify)
    read -rd'\0' kvp < <(paste -d: <(2dig <<<"$dic") <(cat <<<"$dic"))
    case "$2" in
        0) : "$(<<<"$kvp" sed -E 's|(.*):(.*)|["\1"]+="\2 "|')"
           . <(printf 'declare -gA AA; AA=(%s)' "$_") ;;
        *) while read -r i
           do : "$(<<<"$kvp" grep "^$i")"
              : "$(<<<"$_" sed -E 's|(.*):(.*)|["\1"]+="\2 "|')"
              . <(printf 'declare -gA AA%s; AA%s=(%s)' "$i" "$i" "$_")
           done < <(seq -w 0 "$((10**$2-1))")
    esac ;}

# If empty, returns nothing; if one entry, return it; if more than one entry,
# commatize it and brace it. This way, we can "abuse" Bash's brace expansion.
bracify() {
    : "$(tr '\n' ,)"
    [[ "$_" =~ ^,*$ ]] && : "" || {
        [[ "$_" =~ ^[^,]*,$ ]] && : "${_%%,}" || : "{${_%%,}}"
    } && printf "%s" "$_" ;}

# Solve the problem for a single phone number.
_match-one-phone() {  # npd = not previously a digit
    phone="$(tr -cd '0-9'<<<"$1")" szp="${#phone}" pre="$3";: "${npd:=true}"
    declare -n ARRAYNAME="AA${phone::$pre}"
    # List words matching all substrings of phone starting with its 1st digit.
    read -rd'\0' res < <(for ((i=szp;i>0;i--))
                         do printf "%s " "${ARRAYNAME[${phone::$i}]}"
                         done | grep -Eo "[^ ]+")
    # No matches? Accept 1st digit, remember that this option has been used.
    (("${#res}">0)) && npd=true || {
        "$npd" && { npd=false; res="${phone::1}" ;} || return ;}
    # UTF-8 here, or umlauted characters would not count as one.
    for ((i=szp;i>0;i--))
    do  car="$(LC_ALL="en_US.UTF-8" grep -Pix ".{$i}" <<< "$res")"
        [[ "$car" ]] && {
            bracify <<< "$car"
            # If equal to phone size, there's no right substring.
            ((i==szp)) || {
                printf "%s" "%"  # % here is just a separator.
                # The below recursively calls this very
                # function on the remaining digits of the string.
                cdr="$("${FUNCNAME[0]}" "${phone:$i}" _ "$pre")"
                [[ "$cdr" ]] && bracify <<< "$cdr" ;}; echo ;}
    done | sed -E '/%$/d' ;}

# Brace-expand it.
match-one-phone() {
    while read -r x
    # No, I don't like this. But our input is assumed safe and
    # has been cleaned. That said, I'd be happy to replace it.
    do eval echo "$x"
    done < <(_match-one-phone "$@") | sed 'y/ %/\n /' | tr -d {} ;}

# Go over the whole phone list.
# Usage: main [-u] [input_file] [dictionary_file] [n]
# Where 10ⁿ arrays (default: n=1)
main() {
    [[ "$1" == "-u" ]] && { filter=cat; shift ;} || filter=deumlautify
    makeaa "${2:-in/sample-dict}" "${3:-1}"
    while read -r aphone
    do match-one-phone "$aphone" _ "${3:-1}" | sed -E "/./ s|^|$aphone: |"
    done < "${1:-in/sample-input}" | "$filter" ;}

# Run it.
LC_ALL=C main "$@"
```
Solution 3: Simple grep
What if instead of fancier data structures we simply grepped the result directly, from the one big dictionary text file? That's what I do here.
```shell
#!/usr/bin/env bash
#
# SPDX-FileCopyrightText: © flandrew
# SPDX-License-Identifier: GPL-3.0-or-later
#
## p2w-grep.sh
## Convert phone numbers to words

# Replace [AOUaou]\" with properly-umlauted single characters, and vice-versa.
# (A bit too hardcoded for my taste, but not worth generalizing it here.)
umlautify()   { sed 's/A"/Ä/g;s/a"/ä/g;s/O"/Ö/g;s/o"/ö/g;s/u"/ü/g;s/U"/Ü/g';}
deumlautify() { sed 's/Ä/A"/g;s/ä/a"/g;s/Ö/O"/g;s/ö/o"/g;s/ü/u"/g;s/Ü/U"/g';}

# Convert letters to digits according to the given rules.
# Because of LC_ALL=C, umlauted characters are multibyte.
2dig() { sed "s/ä/5/g;s/ü/7/g;s/ö/8/g;s/Ä/5/g;s/Ü/7/g;s/Ö/8/g" |
             tr A-Z a-z |
             tr jnqrwxdsyftamcivbkulopghze 11122233344556667778889990 ;}

# Create a colon-separated lookup table from the digits words to the original.
makelook() {
    read -rd'\0' dic < <(<"$1" tr -cd 'a-zA-Z"\n' | umlautify)
    paste -d: <(2dig <<< "$dic") <(cat <<< "$dic") | sort -d ;}

# If empty, returns nothing; if one entry, return it; if more than one entry,
# commatize it and brace it. This way, we can "abuse" Bash's brace expansion.
bracify() {
    : "$(tr '\n' ,)"
    [[ "$_" =~ ^,*$ ]] && : "" || {
        [[ "$_" =~ ^[^,]*,$ ]] && : "${_%%,}" || : "{${_%%,}}"
    } && printf "%s" "$_" ;}

# Solve the problem for a single phone number.
_match-one-phone() {  # npd = not previously a digit
    phone="$(tr -cd 0-9 <<< "$1")" szp="${#phone}";: "${npd:=true}"
    # List words matching all substrings of phone starting with its 1st digit.
    read -rd'\0' res < <(for ((i=szp;i>0;i--))
                         do grep "^${phone::$i}:" <<< "$DLOOK"
                         #  do sed -n "/^${phone::$i}:/p" <<< "$DLOOK"
                         done | cut -d: -f2)
    # No matches? Accept 1st digit, remember that this option has been used.
    (("${#res}">0)) && npd=true || {
        "$npd" && { npd=false; res="${phone::1}" ;} || return ;}
    # UTF-8 here, or umlauted characters would not count as one.
    for ((i=szp;i>0;i--))
    do  car="$(LC_ALL="en_US.UTF-8" grep -Pix ".{$i}" <<< "$res")"
        [[ "$car" ]] && {
            bracify <<< "$car"
            ((i==szp)) || {
                printf "%s" "%"  # % here is just a separator.
                # The below recursively calls this very
                # function on the remaining digits of the string.
                cdr="$("${FUNCNAME[0]}" "${phone:$i}")"
                [[ "$cdr" ]] && bracify <<< "$cdr" ;}; echo ;}
    done | sed -E '/%$/d' ;}

# Brace-expand it.
match-one-phone() {
    while read -r x
    # No, I don't like this. But our input is assumed safe and
    # has been cleaned. That said, I'd be happy to replace it.
    do eval echo "$x"
    done < <(_match-one-phone "$@") | sed 'y/ %/\n /' | tr -d {} ;}

# Go over the whole phone list.
main() {
    [[ "$1" == "-u" ]] && { filter=cat; shift ;} || filter=deumlautify
    phones="${1:-in/sample-input}" dic="${2:-in/sample-dict}"
    declare -g DLOOK; DLOOK="$(makelook "$dic")"
    while read -r aphone
    do match-one-phone "$aphone" | sed -E "/./ s|^|$aphone: |"
    done < "$phones" | "$filter" ;}

# Run it.
LC_ALL=C main "$@"
```
Solution 4: Abusing the filesystem
So Bash doesn't really have trees. What a shame.
But do you know what does have trees? The filesystem. Access to filenames is "instantaneous". So what if we store each word as a directory named after itself, whose parent is its digits' translation?
This doesn't conform to the specified instruction of loading the whole dictionary into memory. But hey, we'll do it anyway to see what happens.
Let me illustrate. Suppose that our dictionary is very small and has only six words. Here they are, preceded by their computed digits according to the given translation:
```
3962803   Shirley
30883     sells
305390883 seashells
81        on
490       the
30539820  seashore
```
A phone number of "3962803" could thus be translated to "Shirley", so we want to create this empty directory:
./PMHT/3962803/Shirley
where PMHT stands for "Poor Man's Hash Table". Of course.
The beauty of this idea is that creating this "hash table" is simplicity itself. Given the pairs above, in space-separated fields, stored as a variable:
```shell
space_table="\
3962803 Shirley
30883 sells
305390883 seashells
81 on
490 the
30539820 seashore"
slash_table="$(sed -E "s| +|/|" <<< "$space_table")"
```
a robust way to create your filesystem-based table is this:
```shell
mkdir -p ./PMHT && cd ./PMHT || exit
set --
while IFS='' read -r -d '' pair
do set -- "$@" "$pair"
done < <(tr '\n' '\0' <<< "$slash_table")
# mkdir -p "3962803/Shirley" "30883/sells" . . .
mkdir -p "$@"
```
This will pass all pairs, properly quoted, to a single call of mkdir. Then tree gives us:
```
.
├── 305390883
│   └── seashells
├── 30539820
│   └── seashore
├── 30883
│   └── sells
├── 3962803
│   └── Shirley
├── 490
│   └── the
└── 81
    └── on

12 directories, 0 files
```
Passing one argument at a time to mkdir will be slow. We could manually chunk it:
```shell
# Space-separated chunks of 10000.  Assumes sanitized
# input with no strange characters, etc.
while read -r chunk
do set -- $chunk
   mkdir -p "$@"
done < <(chunksize=10000; xargs -n"$chunksize" <<< "$slash_table")
```
For 73k words, the string generation itself took less than 2 seconds on an old machine I used for testing, whereas passing it to mkdir and creating the directories took 2 minutes. Chunking speeds it up and helps avoid exceeding argument-list limits.
We can do better yet with GNU parallel, which will handle quoting and do it faster:
```shell
parallel -m mkdir -p {} <<< "$slash_table"
```
This took less than a minute to create.
Speaking of shells, here is the code:
```shell
#!/usr/bin/env bash
#
# SPDX-FileCopyrightText: © flandrew
# SPDX-License-Identifier: GPL-3.0-or-later
#
## p2w-fsystem-ascii.sh
## Convert phone numbers to words

# Replace [AOUaou]\" with properly-umlauted single characters.
# (A bit too hardcoded for my taste, but not worth generalizing it here.)
umlautify() { sed 's/A"/Ä/g;s/a"/ä/g;s/O"/Ö/g;s/o"/ö/g;s/u"/ü/g;s/U"/Ü/g';}

# Convert letters to digits according to the given rules.
# Because of LC_ALL=C, umlauted characters are multibyte.
2dig() { tr A-Z a-z | tr -d \" |
             tr jnqrwxdsyftamcivbkulopghze 11122233344556667778889990 ;}

# Create a filesystem-based hash table.
makefshash() {
    local dic; HTDIR="$2"
    read -rd'\0' dic < <(<"$1" tr -cd 'a-zA-Z"\n')
    mkdir -p "$HTDIR" &&
        (cd "$HTDIR" || { echo "Oops" >&2; exit;}
         paste <(2dig <<< "$dic" | sed -E "s|^.|&/&|") -d/ \
               <(tr \" = <<< "$dic") | parallel -m mkdir -p {}) ;}

# If empty, returns nothing; if one entry, return it; if more than one entry,
# commatize it and brace it. This way, we can "abuse" Bash's brace expansion.
bracify() {
    : "$(tr '\n' ,)"
    [[ "$_" =~ ^,*$ ]] && : "" || {
        [[ "$_" =~ ^[^,]*,$ ]] && : "${_%%,}" || : "{${_%%,}}"
    } && printf "%s" "$_" ;}

# Solve the problem for a single phone number.
_match-one-phone() {  # npd = not previously a digit
    phone="$(tr -cd 0-9 <<< "$1")" szp="${#phone}";: "${npd:=true}"
    # List words matching all substrings of phone starting with its 1st digit.
    read -rd'\0' res < <(for ((i=szp;i>0;i--))
                         do cd "$HTDIR/${phone::1}/${phone::$i}/" 2>/dev/null \
                                && printf "%s\n" *  # glob!
                         done)
    # No matches? Accept 1st digit, remember that this option has been used.
    (("${#res}">0)) && npd=true || {
        "$npd" && { npd=false; res="${phone::1}" ;} || return ;}
    for ((i=szp;i>0;i--))
    do  car="$(grep -Pix "([^=]=*){$i}" <<< "$res")"
        [[ "$car" ]] && {
            bracify <<< "$car"
            ((i==szp)) || {
                printf "%s" "%"  # % here is just a separator.
                # The below recursively calls this very
                # function on the remaining digits of the string.
                cdr="$("${FUNCNAME[0]}" "${phone:$i}")"
                [[ "$cdr" ]] && bracify <<< "$cdr" ;}; echo ;}
    done | sed -E '/%$/d' ;}

# Brace-expand it.
match-one-phone() {
    while read -r x
    # No, I don't like this. But our input is assumed safe and
    # has been cleaned. That said, I'd be happy to replace it.
    do eval echo "${x//=/\\\"}"
    done < <(_match-one-phone "$@") | sed 'y/ %/\n /' | tr -d {} ;}

# Go over the whole phone list.
# Usage: main [-u] [phones_file] [dictionary_file] [hashtable_path]
# All arguments are optional. Order above is strict.
main() {
    [[ "$1" == "-u" ]] && { filter=umlautify; shift ;} || filter=cat
    phones="${1:-in/sample-input}" dic="${2:-in/sample-dict}"
    [[ -f "$dic"    ]] || { echo "File $dic missing"    >&2; exit ;}
    [[ -f "$phones" ]] || { echo "File $phones missing" >&2; exit ;}
    # NOTE: 1. "Hash table"'s root path is hardcoded here if not chosen.
    #       2. It only creates paths if directory doesn't exist.
    #       3. Intelligently detects new dictionaries for hashing.
    # PMHT = "Poor Man's Hash Table"...
    htroot="${3:-$HOME/tmp/PMHT}"; declare -g HTDIR
    HTDIR="$htroot/wc_$(wc -l <"$dic")/$(sha256sum "$dic"| cut -c1-20)"
    # Directory would then end up being something like this:
    # $HOME/tmp/PMHT/wc_73113/291b4e05c0638d586e4a
    [[ -d "$HTDIR" ]] || { echo "Creating hash path: $HTDIR" >&2
                           makefshash "$dic" "$HTDIR" && echo "Done">&2;}
    while read -r aphone
    do match-one-phone "$aphone" | sed -E "/./ s|^|$aphone: |"
    done < "$phones" | "$filter" ;}

# Run it.
LC_ALL=C main "$@"
```
Solution 5: Binary search
There're no hash tables, but we could have binary search. For that, I use BSD's look, which is made for dictionary search. The phone string is searched against a file with two fields (key:value), generated from the dictionary. I create an actual file because there's no way I know of to make look use a dictionary stored in memory.
So this works:
```shell
look -ft: key dictdatafile   # Some actual dictionary file,
                             # colon-separated (key:value)
```
whereas none of these work:
```shell
dictdata="\
aba:122
key:123
key:456
def:777
"
look -ft: key < <(echo -e "$dictdata")          # Process substitution
look -ft: key <(echo -e "$dictdata")            # Process substitution
look -ft: key <<< "$dictdata"                   # Here-strings
echo -e "$dictdata" | look -ft: key             # STDIN implicit
echo -e "$dictdata" | look -ft: key -           # STDIN explicit
echo -e "$dictdata" | look -ft: key /dev/stdin  # STDIN explicit (fake file)
mkfifo dictdatafifo                             # Named pipes
echo -e "$dictdata" > dictdatafifo
look -ft: key dictdatafifo
```
This makes sense: look does its binary search by seeking around a regular file, and pipes and here-strings aren't seekable. But the instructions say the dictionary must be loaded into memory, so here I had to deviate from them.
Couldn't we have a function implementing binary search in Bash itself and receiving input from memory instead of from a file? We could, and I later ended up coding one such function, and it's very slow, as I suspected it would be.
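For the curious, here's a sketch of what such a function can look like: a binary search over a sorted Bash array of key:value lines. This is an illustration of the idea, not the exact function I wrote.

```shell
#!/usr/bin/env bash
# Binary search over a sorted array of "key:value" lines, in pure Bash.
# Prints the value for an exact key match; returns 1 if not found.
bsearch() {                       # bsearch KEY line...
    local key="$1"; shift
    local -a tbl=("$@")
    local lo=0 hi=$(( ${#tbl[@]} - 1 )) mid k
    while (( lo <= hi )); do
        mid=$(( (lo + hi) / 2 ))
        k="${tbl[mid]%%:*}"       # the key part of the middle line
        if   [[ "$k" < "$key" ]]; then lo=$((mid + 1))
        elif [[ "$k" > "$key" ]]; then hi=$((mid - 1))
        else printf '%s\n' "${tbl[mid]#*:}"; return 0
        fi
    done
    return 1
}

table=(122:aba 123:key 456:kez 777:def)   # must be sorted by key
bsearch 456 "${table[@]}"    # → kez
```

Even so, every probe is string slicing plus a comparison in an interpreted loop, which is exactly why it crawls next to look.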
I also didn't want to reach for less usual utilities. So look it was.
```shell
#!/usr/bin/env bash
#
# SPDX-FileCopyrightText: © flandrew
# SPDX-License-Identifier: GPL-3.0-or-later
#
## p2w-look.sh
## Convert phone numbers to words

# Replace [AOUaou]\" with properly-umlauted single characters, and vice-versa.
# (A bit too hardcoded for my taste, but not worth generalizing it here.)
umlautify()   { sed 's/A"/Ä/g;s/a"/ä/g;s/O"/Ö/g;s/o"/ö/g;s/u"/ü/g;s/U"/Ü/g';}
deumlautify() { sed 's/Ä/A"/g;s/ä/a"/g;s/Ö/O"/g;s/ö/o"/g;s/ü/u"/g;s/Ü/U"/g';}

# Convert letters to digits according to the given rules.
# Because of LC_ALL=C, umlauted characters are multibyte.
2dig() { sed "s/ä/5/g;s/ü/7/g;s/ö/8/g;s/Ä/5/g;s/Ü/7/g;s/Ö/8/g" |
             tr A-Z a-z |
             tr jnqrwxdsyftamcivbkulopghze 11122233344556667778889990 ;}

# Create a colon-separated lookup table from the digits words to the original.
makelook() {
    read -rd'\0' dic < <(<"$1" tr -cd 'a-zA-Z"\n' | umlautify)
    paste -d: <(2dig <<< "$dic") <(cat <<< "$dic") | sort -d ;}

# If empty, returns nothing; if one entry, return it; if more than one entry,
# commatize it and brace it. This way, we can "abuse" Bash's brace expansion.
bracify() {
    : "$(tr '\n' ,)"
    [[ "$_" =~ ^,*$ ]] && : "" || {
        [[ "$_" =~ ^[^,]*,$ ]] && : "${_%%,}" || : "{${_%%,}}"
    } && printf "%s" "$_" ;}

# Solve the problem for a single phone number.
_match-one-phone() {  # npd = not previously a digit
    dict="$1" phone="$(tr -cd 0-9 <<< "$2")" szp="${#phone}";: "${npd:=true}"
    # List words matching all substrings of phone starting with its 1st digit.
    read -rd'\0' res < <(for ((i=szp;i>0;i--))
                         do look -ft: "${phone::$i}:" "$dict"
                         done | cut -d: -f2)
    # No matches? Accept 1st digit, remember that this option has been used.
    (("${#res}">0)) && npd=true || {
        "$npd" && { npd=false; res="${phone::1}" ;} || return ;}
    # UTF-8 here, or umlauted characters would not count as one.
    for ((i=szp;i>0;i--))
    do  car="$(LC_ALL="en_US.UTF-8" grep -Pix ".{$i}" <<< "$res")"
        [[ "$car" ]] && {
            bracify <<< "$car"
            ((i==szp)) || {
                printf "%s" "%"  # % here is just a separator.
                # The below recursively calls this very
                # function on the remaining digits of the string.
                cdr="$("${FUNCNAME[0]}" "$1" "${phone:$i}")"
                [[ "$cdr" ]] && bracify <<< "$cdr" ;}; echo ;}
    done | sed -E '/%$/d' ;}

# Brace-expand it.
match-one-phone() {
    while read -r x
    # No, I don't like this. But our input is assumed safe and
    # has been cleaned. That said, I'd be happy to replace it.
    do eval echo "$x"
    done < <(_match-one-phone "$@") | sed 'y/ %/\n /' | tr -d {} ;}

# Go over the whole phone list.
main() {
    [[ "$1" == "-u" ]] && { filter=cat; shift ;} || filter=deumlautify
    phones="${1:-in/sample-input}" dic="${2:-in/sample-dict}"
    dlook="${dic}-$$"; makelook "$dic" > "$dlook"
    while read -r aphone
    do match-one-phone "$dlook" "$aphone" | sed -E "/./ s|^|$aphone: |"
    done < "$phones" | "$filter"; rm "$dlook" ;}

# Run it.
LC_ALL=C main "$@"
```
Solution 6: Binary search — parallelized
Since my version above processes the phones sequentially, one simple way to improve speed would be to process them in parallel. For that, we'll use GNU parallel.
The only modifications needed to the code above are at the end, after match-one-phone():
```shell
# Auxiliary
_mop() { match-one-phone "$1" "$2" | sed -E "/./ s|^|$2: |" ;}

# Exports
export -f bracify {_,}match-one-phone _mop

# Go over the whole phone list.
main() {
    [[ "$1" == "-u" ]] && { filter=cat; shift ;} || filter=deumlautify
    phones="${1:-in/sample-input}" dic="${2:-in/sample-dict}"
    dlook="${dic}-$$"; makelook "$dic" > "$dlook"
    < "$phones" parallel -k _mop "$dlook" {} | "$filter"
    rm "$dlook" ;}

# Run it.
LC_ALL=C main "$@"
```
This should divide runtime roughly by the number of cores of your machine. Nice, but still a far cry from byte-compiled Elisp.
Meta-script?
Given that the solutions above have many parts in common, I considered creating a unifying script with all the common parts and their small variations, plus a pick-your-algorithm selection menu that mixes and matches the functions. Bash's select (a shell keyword; try help select) is great for this sort of thing. I used it in my Designing a deck of cards in Bash.
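For reference, a minimal sketch of such a menu (the option names here are made up): select prints a numbered menu, reads a choice from stdin, and loops until you break.

```shell
#!/usr/bin/env bash
# A pick-your-algorithm menu with the select keyword.  The menu and the
# prompt go to stderr; the choice number is read from stdin.
PS3="Algorithm? "
select algo in aarray grep look fsystem
do  [[ "$algo" ]] && { echo "running: $algo"; break; }
done
```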
But I had already spent the previous night dreaming about the unfathomable expansions of nested braces, and other types of delimiters were becoming jealous and pleading for my attention. So I thought wrapping it up and going back to parentheses was the sensible thing to do.
Worth storing?
I pondered whether the algorithms could be improved. I wasn't storing intermediate results — namely, the results of the digits-to-string conversions of all the digits-substrings of every phone number. So for example, if there were two numbers "789237" and "789123393789", the "789" conversion to alpha might end up being calculated three times from all the splits.
Maybe this was too costly? If so, it would have been better to store all of them as we go, and look them up to avoid recalculations. Worth the trouble?
Let's check.
```shell
#!/usr/bin/env bash
#
# SPDX-FileCopyrightText: © flandrew
# SPDX-License-Identifier: GPL-3.0-or-later
#
## worth-storing.sh
## Check whether it's worth storing intermediate results.

just-dig() { tr -cd '[0-9]\n' ;}

all-subs() { while read -r w; do l="${#w}"
                 for ((k=l;k>0;k--)); do j="$((l-k))"
                     echo -e "${w::k}\n${w:k:j}"
                 done | grep .
             done ;}

get-stats() { while read -r sub
              do echo "${#sub}"
              done | datamash min 1 median 1 max 1 ;}  # GNU datamash

check-it() {
    local I S U D
    I="$(just-dig < "$1")"
    S="$(all-subs <<< "$I" | sort)"
    U="$(uniq <<< "$S")"
    D="$(comm -23 <(echo "$S") <(echo "$U"))"
    (echo "s count min med max"
     for set in I S U D
     do echo "$set"
        wc -l <<< "${!set}"
        get-stats <<< "${!set}"
     done | xargs -n5) | column -ts$' ' ;}

check-it "in/input.txt"
```
Which returns:
| s | count | min | med | max |
|---|-------|-----|-----|-----|
| I | 1000  | 1   | 22  | 46  |
| S | 41878 | 1   | 13  | 46  |
| U | 36830 | 1   | 14  | 46  |
| D | 5048  | 1   | 2   | 6   |
So from a total of 41878 substrings generated by our 1000 phone numbers, 5048 (12%) are repetitions. As a sanity check, the median size of all the substrings we are calculating is 13, whereas the median size of the repetitions is 2.
It makes sense. Since these "phone numbers" seem to be randomly generated, I'd expect that for every added digit the probability of repetition would fall roughly by 90%. In other words, shorter substrings are much more likely to be repeated. But length would hardly matter for the hash lookups, so our savings would be ~12%, minus the cost of looking up (and, if not found, storing) all the ~42k substrings as we go along.
But wait, this is not quite right. Sometimes the left-side substring will match no words (even after inserting one digit), so the right side won't be tested. Above we calculated with all left- and right-side tests. Here it is with only the left-side tests:
```shell
# Note: to do only the left-side tests,
# replace the previous function with this one:
all-subs() { while read -r w; do l="${#w}"
                 for ((k=l;k>0;k--)); do echo "${w::k}"
                 done
             done ;}
```
and:
| s | count | min | med | max |
|---|-------|-----|-----|-----|
| I | 1000  | 1   | 22  | 46  |
| S | 21439 | 1   | 13  | 46  |
| U | 19143 | 1   | 14  | 46  |
| D | 2296  | 1   | 2   | 6   |
so that 2296÷21439 ≈ 11% are repetitions. The actual ratio lies somewhere between 11% and 12%.
Not worth it.
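Had it been worth it, the cache would be just one more associative array in front of the lookup. A sketch with a stand-in lookup function (CACHE, slow_lookup and cached_lookup are my illustrative names):

```shell
#!/usr/bin/env bash
# Sketch: memoize digits->words conversions in an associative array.
declare -A CACHE
calls=0
slow_lookup() {               # stand-in for the real dictionary search
    (( ++calls ))
    REPLY="words-for-$1"      # result goes in REPLY to avoid a subshell
}
cached_lookup() {
    local key="$1"
    if [[ ! -v CACHE[$key] ]]; then
        slow_lookup "$key"
        CACHE[$key]="$REPLY"
    fi
    echo "${CACHE[$key]}"
}

cached_lookup 789 >/dev/null
cached_lookup 789 >/dev/null  # same substring: served from the cache
echo "$calls"                 # → 1
```

Note the REPLY convention: capturing slow_lookup with `$( )` would fill the cache in a subshell and throw it away.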
Perhaps with other datasets it would be. Say, if it were a city phone-book with oft-repeated prefixes, and a high number of smaller-sized phone numbers, and the code needed to be run often for the whole bulk — instead of just once and then for a few entries once in a while.
Ok, so that was about relative speed improvements. But I haven't said much so far about absolute speeds. How do the solutions compare with each other?
And how does Bash's speed compare to Elisp's for this problem? We'll see this next.