Follow-up to part 1.
Last time, I talked a little bit about implementing a trie to store frequency information about n-grams. The problem is that while the naïve trie is much more compact than the source data, it’s still not small enough.
One approach to dealing with this problem is to do evil bit-twiddling hacks to reduce the size, but ultimately, this just gives you worse code and little real benefit.
In information theory, the entropy, also called Shannon entropy after Claude Shannon, is a measure of the information content of a random variable.
Information theory is pretty interesting, but for our purposes we’ll just consider two facts. First, if we have a random variable that takes values in an alphabet of size \(n\), then the entropy is maximized by the uniform distribution, in which case it is \(\lg n\) bits. Second, Shannon’s source coding theorem tells us that, asymptotically, the best achievable lossless compression rate is the entropy.
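In symbols, if a random variable \(X\) takes each value \(i \in \{1, \dots, n\}\) with probability \(p_i\), then \[ H(X) = -\sum_{i=1}^{n} p_i \lg p_i \le \lg n, \] with equality exactly when every \(p_i = 1/n\).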
Now, let’s think about the space complexity of a naïve tree data structure, where we number all of the vertices and store, for each vertex, a list of its children. For \(n\) vertices, this representation takes \(\Omega(n\log n)\) bits. But suppose that our vertices are unlabeled. Let \(r(n)\) be the number of unlabeled trees on \(n\) vertices with a specified root vertex (OEIS:A000081). Then, as \(n \rightarrow \infty\), \[ r(n) \sim D \alpha^n / n^{3/2}, \] where \(D = 0.439\ldots\) and \(\alpha = 2.95\ldots\) are constants, so if we simply numbered these trees and used the number to identify the tree, we could use only \[ \lg r(n) = n \lg \alpha - \Theta(\log n) \approx 1.56\,n \] bits. Obviously, this isn’t a practical data structure (how do you perform tree operations on a number?), but the point is that the most compact representation is actually linear in size. There’s a big gap between linear and \(n\log n\), so it’s an interesting question to ask how we could build practical succinct trees. For more on this, take a look at this paper, Succinct Trees in Practice.
The point, of course, isn’t that we should expect to get all the way down to the information-theoretic minimum (in fact, if we want to label nodes with up to \(\sigma\) different labels, we need an additional \(O(n\log \sigma)\) bits for that), but just that we shouldn’t be surprised if we can do much better than the naïve approach.
It turns out that this problem has been thought about before: for instance, there’s a 2009 paper by Germann, Joanis, and Larkin, called Tightly Packed Tries: How to Fit Large Models in Memory, and Make them Load Fast, Too. I implemented their data structure, which encodes the nodes in depth-first order as follows. For each node, we write its frequency as a LEB128 varint. If the node is not a leaf, all of its children have already been written, since the encoding is depth-first, so we can compute each child’s byte offset relative to the current node. Finally, we write an index of (key, offset) pairs, with the offsets also LEB128-encoded and the index size in front so we know where the index ends.
This is maybe a little confusing, but there’s an annotated version of the binary format for a simple example here that makes it more clear.
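To make the layout concrete, here’s a minimal sketch of the depth-first encoder in Python. It follows the description above but is not the exact on-disk format from the paper or from my implementation; `Node` here is a hypothetical in-memory trie node with an integer `freq` and a `children` dict mapping a key byte (an int 0–255) to a child node.

```python
def write_varint(buf: bytearray, n: int) -> None:
    """Append n to buf as a LEB128 (base-128) varint."""
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            buf.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            buf.append(byte)
            return

def encode_node(buf: bytearray, node) -> int:
    """Serialize a node depth-first (children before parent); return its start offset."""
    # Children are written first, so their positions are known when we
    # build the parent's (key, offset) index.
    child_positions = [(key, encode_node(buf, child))
                       for key, child in sorted(node.children.items())]

    start = len(buf)
    write_varint(buf, node.freq)             # the node's frequency count
    write_varint(buf, len(child_positions))  # index size (number of child entries)
    for key, child_start in child_positions:
        buf.append(key)                         # key byte
        write_varint(buf, start - child_start)  # relative (backward) offset
    return start
```

Since the root is written last, a real encoder would also record the root’s offset somewhere (say, in a small footer) so the reader knows where to start; I’ve left that out of the sketch.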
The TPT data structure is basically the same as the original trie structure: you have nodes, which store some frequency counts, and a list of child nodes. But it’s much more efficient, for two reasons.
The first is the use of a variable-length encoding for integers. Zipf’s Law is the observation that much data in linguistics and other sciences follows a power-law distribution. From Wikipedia: “Zipf’s law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.”
The LEB128 code has a slight overhead: one continuation bit per byte. But the vast majority of the frequency counts are small, with only a few large ones, so it’s much more space-efficient than fixed-width integers.
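A quick back-of-the-envelope check of that claim (plain Python, nothing specific to the trie): LEB128 spends 7 of every 8 bits on payload, so any count below 128 fits in a single byte and anything below 16,384 fits in two, while a fixed-width field would cost four bytes every time.

```python
def varint_len(n: int) -> int:
    """Bytes needed to LEB128-encode a non-negative integer (7 payload bits per byte)."""
    return max(1, (n.bit_length() + 6) // 7)

assert varint_len(0) == 1
assert varint_len(127) == 1            # where most Zipf-distributed counts land
assert varint_len(16_383) == 2
assert varint_len(1_000_000_000) == 5  # rare huge counts pay one extra byte vs. a uint32
```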
The second reason is that instead of storing pointers to the child nodes, we store relative byte offsets, which are LEB128-encoded. By using relative addressing instead of absolute addressing, the numbers tend to be smaller, since we usually write child nodes close to their parents in the file. Smaller numbers mean fewer bytes written. Moreover, relative offsets make the encoding position-independent, so we don’t need to load the whole thing into physical memory: to deal with large data sets, just mmap the file and call it a day.
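And here’s a sketch of what the reading side might look like, assuming the same node layout as the encoder sketch above (frequency, index size, then (key, relative offset) pairs); the file path and the idea of passing in a root offset from a footer are my own placeholders, not part of the TPT paper.

```python
import mmap

def read_varint(buf, pos: int):
    """Decode a LEB128 varint starting at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def read_node(buf, pos: int):
    """Return (frequency, {key byte: absolute position of child}) for the node at pos."""
    freq, p = read_varint(buf, pos)
    n_children, p = read_varint(buf, p)
    children = {}
    for _ in range(n_children):
        key = buf[p]
        p += 1
        rel, p = read_varint(buf, p)
        children[key] = pos - rel  # relative offsets point backwards in the file
    return freq, children

def open_trie(path: str, root_offset: int):
    """mmap the encoded trie; the OS pages in only the parts we actually touch."""
    with open(path, "rb") as f:
        buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return buf, read_node(buf, root_offset)
```

Lookups then just walk from node to node, decoding one small index at a time; nothing is deserialized up front.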
In the toy example I linked, we write the whole trie in 53 bytes, instead of 280, so it’s more than five times smaller. For a bigger data set, like the English 1-million 2-grams, writing the trie this way takes 710 MB, compared to about 7 GB with the previous method (the original data set is ~80 GB).