Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
languagecpp
type StringField inteface { 
extract(segmentOffsets []int32) []string

serialize() []bytes

deserialize([]bytes)

}
func Filter(expression string, field StringField) sgementOffsets []int32

...

Code Block
languagecpp
class HistoricalStringField1 {
std::vector<string> strs;

std::unordered_map<int32, std::vector<int32>> strOffsetToSegOffsets;

std::vector<int32> segOffsetToStrOffset;

std::vector<string> 
extract(const std::vector<int32>& segmentOffsets);

std::vector<Blob>
serialize();

void
deserialize(const std::vector<Blob>&)

}

class Blob {
std::vector<char> blob_;
}

...

map = { 'ape': 32, 'ball': 2, 'atom': 16, 'ate': 18, 'bait': 5 }


The word "ape" is built by traversing the "a" edge, the "p" edge, and the "e" edge. When we reach a leaf, we can extract the value, which is 32.

...

In the following tests, the memory usage was measured for 3 million unique Unicode Russian words; "simple lookup time " was measured on a lookup for a specific one word.



Descriptionmemory usage (MegaBytes)UnicodeLookup time (nanosecond)Reverse-lookup
PAT-Triea pointer-based implementation of PAT-trie (aka Patricia-Trie and Radix-Trie) and may use a lot of memory because of that.242No333No
HAT-TrieTrie-HashMap hybrid. It is claimed to be the state-of-art Trie-like structure with the fastest lookups.125Yes195No
DA-TrieDouble-Array Trie C implementation101Yes281No
Marisa-Triememory-efficient recursive LOUDS-trie-based data structure implemented as C++ library11Yes2010Yes
DAWGDirected Acyclic Word Graphs2.8Yes249No
Python-Dict/600


Python-list/300


...

Code Block
languagecpp
class StreamingStringField {

std::vector<string> strs;

std::vector<string> 
extract(const std::vector<int32>& segmentOffsets);

std::vector<Blob>
serialize();

void
deserialize(const std::vector<Blob>&)

}
class Blob {
std::vector<char> blob_;
}

...