...
Tries (also known as radix trees or prefix trees) are data structures that are typically used to store where the keys are usually .
For one, nodes in the tree do not store keys. Instead, they each store parts of keys. Traversing down from the root node to a leaf allows you to build the key as you progress. Also, there doesn't need to be a value at every node. In fact, values are typically only associated with leaf nodes.
For example, below is a graphic that is a representation of a trie that contains the following associative array.
map = { 'ape': 32, 'ball': 2, 'atom': 16, 'ate': 18, 'bait': 5 }
Tries have many advantages over their counterpart, the . They are used in many string search applications such as auto-complete, text search, and prefix matching. , a kind of trie, are often used in IP routing.
There are many variant implementations of Trie. The main consideration when we choose an implementation is whether it can support reverse lookup and save memory. The reverse loop means that we can get the original string from Tries through a certain ID.
In the following tests, the memory usage was measured for 3 million unique Unicode Russian words; "simple lookup" was a lookup for a specific word.
Description | memory usage (MegaBytes) | Unicode | Lookup time (nanosecond) | Reverse-lookup | |
---|---|---|---|---|---|
PAT-Trie | a pointer-based implementation of PAT-trie (aka Patricia-Trie and Radix-Trie) and may use a lot of memory because of that. | 242 | No | 333 | No |
HAT-Trie | Trie-HashMap hybrid. It is claimed to be the state-of-art Trie-like structure with the fastest lookups. | 125 | Yes | 195 | No |
DA-Trie | Double-Array Trie C implementation | 101 | Yes | 281 | No |
Marisa-Trie | memory-efficient recursive LOUDS-trie-based data structure implemented as C++ library | 11 | Yes | 2010 | Yes |
DAWG | Directed Acyclic Word Graphs | 2.8 | Yes | 249 | No |
Python-Dict | / | 600 | |||
Python-list | / | 300 |
The following gives another C++ definition of HistoricalStringField.
Code Block | ||
---|---|---|
| ||
class HistoricalStringField2 { MarisaTrie trie_; std::vector<StrID> segOffsetToStrID; std::vector<string> extract(const std::vector<int32>& segmentOffsets); std::vector<Blob> serialize(); void deserialize(const std::vector<Blob>&) } class Blob { std::vector<char> blob_; } using StrID = int32; |
Thus, opeations ("==", "!=",) are transformed into lookup on the trie to get the corresponding segment offset .
Operations ("<=", "<", ">", ">=") needs to modify the underlying implementation of marisa to support range queries.
For the extract interface, you only need to retrieve the corresponding String according to segment offsets and segOffsetToStrID. When the StrID is obtained, the original string is retrieved through the reverse lookup of the trie.
The Streaming Segment mentioned here refers to the segment whose data keeps growing, also called the Growing Segment.
As the data keeps increasing, it is neither convenient to sort the string array nor to use trie. One possible implementation StreamingStringField is defined as follows.
Code Block | ||
---|---|---|
| ||
class StreamingStringField {
std::vector<string> strs;
std::vector<string>
extract(const std::vector<int32>& segmentOffsets);
std::vector<Blob>
serialize();
void
deserialize(const std::vector<Blob>&)
}
class Blob {
std::vector<char> blob_;
} |
Operations ("==", "!=", "<", "<=", ">", ">=") are transformed into brute force search on strs to get the corresponding string offset, which is also the segment offset .
The basic unit of IndexNode read and write is a Field of Segment. It is used to build an index file for a Field to speed up the search. You can build an index on either a vector field or a scalar field. When IndexNode receives an index request for a certain field of a Segment, it reads the original historical data from ObjectStorage, builds an index file for the Field, and writes the index file back to ObjectStorage.
For the String type Field, IndexNode needs to construct a HistoricalStringField object from the original data, and persist it in the object storage.
If the implementation of HistoricalStringField1 is adopted, what we need to persist are the strs member and the SegmentOffsetToStrOffset member.It is easy to recover strOffsetToSegOffsets based on segOffsetToStrOffset.
If the implementation of HistoricalStringField2 is adopted, what we need to persist are the trie member and the SegmentOffsetToStrID member. Marisa has implemented serialization and deserialization methods.
Compatibility, Deprecation, and Migration Plan
This is a new feature. There is no need to consider compatibility issues, no need to deprecate old interfaces, and no need to provide data migration tools.
Test Plan
Unit tests
CI tests
References
https://en.wikipedia.org/wiki/Trie