...
Code Block | ||
---|---|---|
| ||
type StringField inteface { extract(segmentOffsets []int32) []string serialize() []bytes deserialize([]bytes) } func Filter(expression string, field StringField) sgementOffsets []int32 |
...
Code Block | ||
---|---|---|
| ||
class HistoricalStringField1 { std::vector<string> strs; std::unordered_map<int32, std::vector<int32>> strOffsetToSegOffsets; std::vector<int32> segOffsetToStrOffset; std::vector<string> extract(const std::vector<int32>& segmentOffsets); std::vector<Blob> serialize(); void deserialize(const std::vector<Blob>&) } class Blob { std::vector<char> blob_; } |
...
map = { 'ape': 32, 'ball': 2, 'atom': 16, 'ate': 18, 'bait': 5 }
...
In the following tests, the memory usage was measured for 3 million unique Unicode Russian words; "simple lookup time " was measured on a lookup for a specific one word.
Description | memory usage (MegaBytes) | Unicode | Lookup time (nanosecond) | Reverse-lookup | |
---|---|---|---|---|---|
PAT-Trie | a pointer-based implementation of PAT-trie (aka Patricia-Trie and Radix-Trie) and may use a lot of memory because of that. | 242 | No | 333 | No |
HAT-Trie | Trie-HashMap hybrid. It is claimed to be the state-of-art Trie-like structure with the fastest lookups. | 125 | Yes | 195 | No |
DA-Trie | Double-Array Trie C implementation | 101 | Yes | 281 | No |
Marisa-Trie | memory-efficient recursive LOUDS-trie-based data structure implemented as C++ library | 11 | Yes | 2010 | Yes |
DAWG | Directed Acyclic Word Graphs | 2.8 | Yes | 249 | No |
Python-Dict | / | 600 | |||
Python-list | / | 300 |
...
Code Block | ||
---|---|---|
| ||
class StreamingStringField { std::vector<string> strs; std::vector<string> extract(const std::vector<int32>& segmentOffsets); std::vector<Blob> serialize(); void deserialize(const std::vector<Blob>&) } class Blob { std::vector<char> blob_; } |
...