Pansymbolic: Mapping all Emacs Lisp symbols
Pansymbolic
Before anything — here's the link to what's missing.
Does any of these sound like something you could do?
Goals
- Map all Emacs Lisp symbols out there (one map for each namespace).
- Provide tools to search these maps.
Why?
- To give you the highest confidence that the symbols you want to use for your package haven't already been used. Searching this database can inform your decisions and minimize the chance of a clash of symbols.
- All sorts of other uses that people might come up with (research, statistics, etc).
Process
Overview
- Have all elisp packages
- Extract all symbols from all these packages
- Remap: from package→symbols to symbol→packages
- Have tools to query it
What do we have?
I consider #3 and #4 well addressed by my SymTree and SymTable packages.
- SymTable can convert between the two input formats accepted by SymTree.
- SymTree tokenizes symbols, builds a huge tree with them, and provides functions to search prefixes in that tree.
Which of the two to use, or both, depends on final choices about formats and data structure.
Comparison of both is given further down.
What is missing?
Most of what is missing can be summarized as:
- Do we have a few functions that can reliably extract lists of symbols (one list per namespace) defined by any given elisp package?
- Do we have comprehensive, frequently updated archives of elisp packages from which to extract these symbols?
By "comprehensive" I mean a high coverage of all public elisp packages out there. - Do we have someone who can apply the first to the second, on a regular basis, and supply results in one of the two input formats accepted by SymTree?
If all answers are "yes", then the problem is mostly solved.
Note about primitives
There are more than 1400 native Emacs functions written in C, so a "package" 'EMACS-PRIMITIVES
needs to be added with its corresponding symbols.
These symbols could be extracted:
- directly from the C sources (somehow); or
- by loading the latest minor version of Emacs and running something like this:
(let (prims) (mapatoms (lambda (symbol) (and (functionp symbol) (subr-primitive-p (symbol-function symbol)) (push symbol prims)))) (sort prims #'string<))
Design notes
One of the main motivations for this project is to avoid collisions between elisp symbols. Inevitably, this brings up the topic of prefixes.
Conventional approach: try to determine what "the prefix" of a package is, store that, repeat for every package, and done.
This sounds simple, but isn't. In the end, the question "What is this package's prefix?" may admit multiple answers, all of which in principle valid; and, crucially, selecting between them depends on factors that are external to the package.
So my approach is to ignore that question altogether.
Symbols, not prefixes
Instead of asking the package a question that usually does not admit of a singular definitive answer, we simply slurp lists of symbols defined by the package, separated by namespace.
This symbol hoarding may, at first, look like an unnecessary complication. Instead, it seems to simplify things. Collection becomes straightforward, and queries can be more flexible.
Eventually, we'll ask not what "the prefix" is, but rather "does any package here has any symbol that has this (my query) as one of its prefixes?".
The prefix paradox
Sometimes, the best way to simplify a problem is by making it bigger.
By not trying to determine "the prefix" of any package in isolation, we can sooner answer questions about all prefixes of all symbols of all packages.
Data structures
Inputs and outputs
One way to store collected symbols is to put them in a map where each package is a key and each value is a list of symbols defined by that package in one namespace. Let's call this pack→syms
. It's a one-dimensional map.
Another way is sym→packs
. It can be produced either directly or by converting pack→syms
.
Here, each symbol is a key, and the value is a list of packages that define that symbol. For almost all symbols, that list will have only one package. When not, an unintended symbol collision might be going on. We can see that pack→syms
is also a one-dimensional map.
In order to query the symbols, it's cumbersome to have them be inside the lists in the values (pack→syms
). We'd have to walk all the lists in all the values for every search. That's expensive.
Some alternatives to choose from are:
- Make
pack→syms
; convert tosym→packs
, then totree:sym→packs
; query that. - Make
pack→syms
; convert tosym→packs
; query that. - Make
sym→packs
; convert totree:sym→packs
; query that. - Make
sym→packs
; query that. - Make
pack→syms
; convert directly totree:sym→packs
; query that.
What is tree:sym→packs
? That's a tree in which the symbols have been properly tokenized and the list of packages that define each symbol is to be found at the leaves. It can provide visualization, navigation, and extremely fast "prefix" queries.
Partitioning
Given the huge size of the results (easily a few hundred thousand symbols for the functions' namespace), it seems important to horizontally partition sym→packs
and/or tree:sym→packs
.
Partitioning can be alphabetic, and perhaps first letter is enough, with an extra bucket (say, %
) to put anything starting with [^a-zA-Z0-9]
. So:
- A search for
"foo-this"
or"f-ext"
would open and read filef
. - A search for
""
,"<
", ="/"
,"+"
, or"-each"
would read file%
.
Storage formats
These flat maps could be stored in any way that we have available to represent one-dimensional key–value collections. Which one to pick will depend on whether we want to have it all in elisp or not.
All in elisp
Hash tables — that's my suggestion.
In this case:
- SymTable can convert from
pack→syms
tosym→packs
. - SymTree can convert from either of these two to
tree:sym→packs
.
Note that tree:sym→packs
here isn't a "regular tree" (nested lists), but a nested hash table.
For more information, see the respective packages.
One question is whether it'd be better to have only sym→packs
hash table(s) (without also the corresponding tree:sym→packs
). The most critical factor would be querying speed.
Speed
To query a "foo-this"
prefix:
- With
tree:sym→packs
, we'd be sequentially looking up exact matches of, respectively,"foo"
,"-"
,"this"
in a nested hash table: instantaneous. - With
sym→packs
, we'd be walking all keys of a large flat hash table to see which of them matches"^foo-this$\\|^foo-thisSEP-RX"
: clearly slower.
But how much slower, and how much would that matter?
That would need some testing. You can build a single huge sym→packs
hash table with real data and then time the queries.
Speed in this case could be enhanced by aggressively partitioning into several smaller files. Having 27 files is one option (a-z%
). Another option is having 27 directories, and then having one file per "first prefix". Example:
. ├── % ├── a ├── b ├── ... ├── n └── o ├── ob ├── ... ├── org ├── ... └── ox
Then inside the file ob
all symbols whose first prefix is "ob"
, from all packages, would be listed.
In the
sym→packs
version, all complete symbols would be keys:ob-this
,ob-that
,ob--that-too
, etc.
;; Symbols (funs) Packages that define them (h* "ob-this" '(ob-this) "ob-that" '(ob ob-plus) ; collision! "ob--that-too" '(ob))
Note:
(h*…)
is a hash table creation function from xht.#s(hash-table…)
, the native format, can be used instead.
In the
tree:sym→packs
version,ob
would be the only key, and the other tokens branch from there:
(symtree-h-make 'sp (h* "ob-this" '(ob-this) "ob-that" '(ob ob-plus) ; collision! "ob--that-too" '(ob))) H=> (h* "ob" (h* "-" (h* "this" (h* :@ '(ob-this)) "that" (h* :@ '(ob ob-plus)) ; collision! "-" (h* "that" (h* "-" (h* "too" (h* :@ '(ob))))))))
With
tree:sym→packs
, the speed gains of partitioning in many files would be negligible compared to having it all in one huge nested hash table (with several keys likeob
).
But it could still be worth doing, for two reasons:
- less memory (a much smaller table is loaded for each query)
- possibility of selective downloads (a user might want to fetch only the
o
folder or theob
file, for example, instead of the whole thing).
- less memory (a much smaller table is loaded for each query)
Other data formats
If data formats other than elisp are fine, then flat TSV is a good choice.
Both pack→syms
and sym→packs
could be put into a two-column TSV each.
Since sym→packs
could have hundreds of thousands of lines, it might still be a good idea to partition it alphabetically into several files.
Querying sym→packs
could then be done by grepping lines by regex and then uniq-ifying the values. Naturally, this could also be coded in elisp and done from inside emacs in a platform-independent fashion.
Below is a simple solution to that, using a new function candidate for my SymTable package. Note: a KVL, which means "key–value lines", is just a headerless two-column TSV that is flexible about field separators.
;; Example of a possible new function for non-elisp data format (defun symtable/query-pre-in-kvl (korkf prefix &optional kvl-sep) "Return packages that have PREFIX. KORKF is either a KVL file or a KVL. For the meaning of KVL, see ‘h<-kvl’. For the meaning of PREFIX, see ‘symtable-query-pre’. Optional argument KVL-SEP is the separator used in the KVL. If nil, use ‘h-kvl-sep-re’, whose value is \" *= *\". Pass \\t for tabs." (let* ((case-fold-search nil) (h-kvl-sep-re (if kvl-sep kvl-sep h-kvl-sep-re)) (kvl (if (file-exists-p korkf) (h-read-kvl korkf) korkf)) (rx (symtable--query-make-pre-rx prefix)) (htb)) (with-temp-buffer (save-excursion (insert kvl)) (keep-lines rx) (setq htb (h<-kvl (buffer-string)))) (h--hmap! key (s-split " " value 'no-nulls) htb) (symtable-query-rx htb rx)))
So that:
(let ((kvl " foo-this = foo foo--this-thing = foo foo-plus global-foo-mode = foo bar--that = bar global-bar-mode = bar baz-global-doing = baz globalize-things = globalize qux/mode = qux global-qux/mode = qux-lite qux qux/something = qux")) (symtable/query-pre-in-kvl kvl "global")) => '("bar" "foo" "qux" "qux-lite")
This means these four packages define at least one function that has global
as one of its prefixes. Note that globalize
isn't a match, because it's not a "prefix for packaging purposes".
(The above isn't the most efficient way of doing that, but it's simple enough. It could be optimized later, if needed, after exact formats have been chosen.)
Products and availability
Once ready, the maps should be:
- in an open format
- available for download
- mirrored in different forges
Input maps could also be made available.
A proper license should be chosen for these files.
Frequency of update has to be determined.
Besides all that, people might also welcome APIs or web interfaces to search all these maps online, if someone can offer it. At the very least, this would include "Prefix as a Service".
Development
I thought about design, data structures, the overall process. I also wrote two packages to address steps #3 and #4. That's what I can offer.
I have no interest in dealing with the first two steps, namely the retrieval of packages or the extraction of symbols.
Going forward, I might still be able to fix bugs and make improvements in SymTree and SymTable, if that's needed.
Other people are needed for the rest. In particular:
- Solving the first two steps, which means being able to reliably produce results in
pack→syms
orsym→packs
format. - Testing SymTable and/or SymTree against these inputs.
- Owning the ongoing process (which should be mostly automatable).
These are not necessarily all the same person — but could be.
If you write code that has a chance of providing an automated solution for #1 and #2 (great!), consider hosting it on (or at least mirroring it to) SourceHut.
Links
Issue trackers
If you believe you have found a bug while feeding an input file to either SymTree or SymTable and would like me to have a look, please gzip the input file, then either:
- open a new issue in the respective tracker (links above) and attach it there
- or attach it to an email to me
If, however, you believe that the bug can be reproduced with only a few pairs of the map, then paste just that instead.
There's also a tracker for Pansymbolic. It can be used to discuss this project itself, as well as to volunteer to do any of (or all) the missing parts.
Acknowledgments
I had this idea a couple years ago; I thought this database should exist. Yet I was unsure whether there'd be any interest.
Later, Jonas Bernoulli shared with me the outlines of his idea of an independent database of prefixes. Its goals were similar. This influenced my putting the time to publish SymTable, SymTree, and this document — and launch this project.
I hope this can pave the way to the first working version of a queryable database of Emacs Lisp symbols.
License
This project follows the REUSE Specification (FAQ), which in turn is built upon SPDX.
Therefore, license and copyright information can be found in:
- each file's comment header, or
- an adjacent file of the same name with the additional extension
.license
, or - the
.reuse/dep5
file
The full text of the licenses can be found in the LICENSES
subdirectory.