Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

When the index file exists, QueryNode loads the index file from the object store, call the deserialize method of Stringfield, and generates a Stringfield object.


Another implementation of StringField of Historical Segment

Tries (also known as radix trees or prefix trees) are tree-based data structures that are typically used to store associative arrays where the keys are usually strings.

For one, nodes in the tree do not store keys. Instead, they each store parts of keys. Traversing down from the root node to a leaf allows you to build the key as you progress. Also, there doesn't need to be a value at every node. In fact, values are typically only associated with leaf nodes.

For example, below is a graphic that is a representation of a trie that contains the following associative array.

map = { 'ape': 32, 'ball': 2, 'atom': 16, 'ate': 18, 'bait': 5 }

Image Added


the "p" edge, and the "e" edge. When we reach a leaf, we can extract the value, which is 32.

Tries have many advantages over their counterpart, the hash table. They are used in many string search applications such as auto-complete, text search, and prefix matching. Radix trees, a kind of trie, are often used in IP routing.

There are many variant implementations of Trie. The main consideration when we choose an implementation is whether it can support reverse lookup and save memory. The reverse loop means that we can get the original string from Tries through a certain ID.

In the following tests, the memory usage was measured for 3 million unique Unicode Russian words; "simple lookup" was a lookup for a specific word.



Descriptionmemory usage (MegaBytes)UnicodeLookup time (nanosecond)Reverse-lookup
PAT-Triea pointer-based implementation of PAT-trie (aka Patricia-Trie and Radix-Trie) and may use a lot of memory because of that.242No333No
HAT-TrieTrie-HashMap hybrid. It is claimed to be the state-of-art Trie-like structure with the fastest lookups.125Yes195No
DA-TrieDouble-Array Trie C implementation101Yes281No
Marisa-Triememory-efficient recursive LOUDS-trie-based data structure implemented as C++ library11Yes2010Yes
DAWGDirected Acyclic Word Graphs2.8Yes249No
Python-Dict/600


Python-list/300



MARISA-trie is a very memory-efficient recursive trie-based data structure implemented as C++ library (repo). String data in a MARISA-trie may take up to 50x-100x less memory than in a standard Python dict; the raw lookup speed is comparable; trie also provides fast advanced methods like prefix search.MARISA-trie also support memory-mapped IO so it is possible to have on-disk trie and reduce the memory usage even further.

The following gives another C++ definition of HistoricalStringField.


Code Block
languagecpp
class HistoricalStringField2 {
MarisaTrie trie_;
std::vector<StrID> segOffsetToStrID;
std::vector<string> 
extract(const std::vector<int32>& segmentOffsets);
std::vector<Blob>
serialize();
void
deserialize(const std::vector<Blob>&)
}
class Blob {
std::vector<char> blob_;
}
using StrID = int32;