Current state: ["Under Discussion"]
ISSUE: #17754
PRs:
Keywords: IDMAP, BinaryIDMAP, brute force search, chunk
Released: Milvus-v2.2.0
In this MEP, we propose an IDMAP/BinaryIDMAP enhancement that lets the Knowhere index types IDMAP and BinaryIDMAP hold a pointer to external vector data instead of copying the vector data into the index.
The enhanced IDMAP/BinaryIDMAP can be used for growing-segment search, which improves code reuse and reduces code maintenance effort.
Generally, no one creates an IDMAP/BinaryIDMAP index for a sealed segment, because it brings no search performance improvement yet consumes the same amount of memory and disk space as the raw vectors.
The only reasonable use case for IDMAP/BinaryIDMAP is a growing segment. But creating an IDMAP/BinaryIDMAP index in the normal way consumes a lot of resources, because it involves the index node (to create the index file), the data node (to save the index file to S3), and rootcoord / indexcoord / datacoord (to coordinate all these operations).
So Milvus uses the following 2 strategies for growing-segment search:
The advantage of this solution is resource saving: apart from the query node, no other nodes are involved. The shortcoming is code duplication.
See the "Search Flow" chart: `FloatSearchBruteForce` and `BinarySearchBruteForce` are copied from the `Query()` interface of knowhere::IDMAP/BinaryIDMAP and slightly modified. This increases code maintenance effort: whenever a new feature such as range search is implemented for IDMAP/BinaryIDMAP in Knowhere, the implementation also has to be copied into Milvus.
If we enhance IDMAP/BinaryIDMAP so that it does not copy the vector data in but only holds a pointer to external vector data, we can call the `Query()` interface of knowhere::IDMAP/BinaryIDMAP directly without any extra cost. The user must guarantee that the memory is contiguous and safe to access while the index is in use.
In this way:
Pros: little code change
Cons: new interfaces need to be added in both Faiss and Knowhere
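To make the idea of holding an external pointer concrete, here is a minimal sketch of a non-owning flat search over a caller-owned buffer. The type `ExternalFlatView` and its method `Nearest` are hypothetical names chosen for illustration; they are not part of Faiss or Knowhere, and the sketch assumes row-major float vectors with squared L2 distance.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical non-owning flat "index": it stores only a pointer to
// caller-owned, contiguous, row-major vectors (n rows of dimension d).
struct ExternalFlatView {
    const float* data = nullptr;  // borrowed; never copied or freed here
    std::size_t n = 0;            // number of vectors
    std::size_t d = 0;            // vector dimension

    // Returns the row index of the nearest vector under squared L2
    // distance, or -1 if the view holds no vectors.
    std::int64_t Nearest(const float* query) const {
        std::int64_t best = -1;
        float best_dist = std::numeric_limits<float>::max();
        for (std::size_t i = 0; i < n; ++i) {
            float dist = 0.0f;
            for (std::size_t j = 0; j < d; ++j) {
                const float diff = data[i * d + j] - query[j];
                dist += diff * diff;
            }
            if (dist < best_dist) {
                best_dist = dist;
                best = static_cast<std::int64_t>(i);
            }
        }
        return best;
    }
};
```

Because the view only borrows the buffer, any change the caller makes to the buffer is visible to the next search, and the caller remains responsible for keeping the memory alive.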
Faiss needs a new field "xb_ex" and a new interface "add_ex" in the structure IndexBinaryFlat, and a new field "codes_ex" and a new interface "add_ex" in the structure IndexFlat.
In IndexBinaryFlat, "xb" and "xb_ex" are mutually exclusive: the user cannot set both at the same time. Likewise, in IndexFlat, "codes" and "codes_ex" are mutually exclusive.

    //============================================================================
    struct IndexBinaryFlat : IndexBinary {
        /// database vectors, size ntotal * d / 8
        std::vector<uint8_t> xb;

        /// external database vectors, size ntotal * d / 8
        uint8_t* xb_ex = nullptr;  // <==== newly added
        ... ...
    };

    void IndexBinaryFlat::add_ex(idx_t n, const uint8_t* x) {
        xb_ex = (uint8_t*)x;
        ntotal = n;
    }

    //============================================================================
    struct IndexFlatCodes : Index {
        ... ...
        /// encoded dataset, size ntotal * code_size
        std::vector<uint8_t> codes;

        /// external encoded dataset, size ntotal * code_size
        uint8_t* codes_ex = nullptr;  // <==== newly added
        ... ...
    };

    void IndexFlatCodes::add_ex(idx_t n, const float* x) {
        FAISS_THROW_IF_NOT(is_trained);
        FAISS_THROW_IF_NOT(codes.empty());
        codes_ex = (uint8_t*)x;
        ntotal = n;
    }

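The mutual-exclusion rule between owned and external storage could be enforced in both directions. The sketch below is a standalone illustration under that assumption: `BinaryFlatSketch` is a hypothetical stand-in, not the Faiss `IndexBinaryFlat`, and it uses `std::logic_error` where Faiss would use its own `FAISS_THROW_IF_NOT` macro.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a flat binary index (not the Faiss type).
struct BinaryFlatSketch {
    std::size_t code_size;                // bytes per vector (d / 8)
    std::size_t ntotal = 0;               // number of indexed vectors
    std::vector<std::uint8_t> xb;         // owned copy of the vectors
    const std::uint8_t* xb_ex = nullptr;  // borrowed external vectors

    explicit BinaryFlatSketch(std::size_t code_size) : code_size(code_size) {}

    // Copy-in path: rejected once an external pointer has been set.
    void add(std::size_t n, const std::uint8_t* x) {
        if (xb_ex != nullptr) {
            throw std::logic_error("xb_ex is set; cannot copy data into xb");
        }
        xb.insert(xb.end(), x, x + n * code_size);
        ntotal += n;
    }

    // External path: rejected once data has been copied in.
    void add_ex(std::size_t n, const std::uint8_t* x) {
        if (!xb.empty()) {
            throw std::logic_error("xb holds data; cannot set xb_ex");
        }
        xb_ex = x;
        ntotal = n;
    }

    // Search code reads through one accessor, whichever storage is active.
    const std::uint8_t* data() const {
        return xb_ex != nullptr ? xb_ex : xb.data();
    }
};
```

Routing all reads through a single accessor such as `data()` is one way to keep the search code unaware of which storage mode is active.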
Knowhere needs a new interface `AddExWithoutIds()` for both IDMAP and BinaryIDMAP.

    // set an external data pointer instead of actually adding data
    void AddExWithoutIds(const DatasetPtr&, const Config&);

When Knowhere detects that "codes_ex" is used in the current IDMAP index, or that "xb_ex" is used in the current BinaryIDMAP index, serialization is disallowed.
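One possible shape for the serialization guard is sketched below. `IdmapSketch`, `UsesExternalData`, and this `Serialize` signature are hypothetical and do not reflect the actual Knowhere interface; the sketch only shows the check described above, assuming the guard reports the violation by throwing.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for an IDMAP-like index (not the Knowhere class).
struct IdmapSketch {
    std::vector<std::uint8_t> codes;         // owned encoded data
    const std::uint8_t* codes_ex = nullptr;  // borrowed external data

    bool UsesExternalData() const { return codes_ex != nullptr; }

    // Returns a copy of the owned bytes; refuses to serialize borrowed
    // memory, since the index does not own it.
    std::vector<std::uint8_t> Serialize() const {
        if (UsesExternalData()) {
            throw std::runtime_error(
                "serialization is not allowed for an external data pointer");
        }
        return codes;
    }
};
```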
For Milvus, the APIs "FloatSearchBruteForce()" and "BinarySearchBruteForce()" will be rewritten to search through the enhanced IDMAP/BinaryIDMAP instead of calling Faiss interfaces.
In Milvus, when a growing segment needs to create an enhanced IDMAP index, it can do so as follows:

    auto idmap_index = std::make_shared<knowhere::IDMAP>();
    idmap_index->Train(train_dataset, conf);
    idmap_index->AddExWithoutIds(train_dataset, conf);  // <==== call "AddExWithoutIds"
    auto result = idmap_index->Query(query_dataset, conf, bitset);

This enhanced IDMAP index cannot be serialized, and it is destroyed automatically at no extra cost.
Knowhere needs new unit tests for the new interface `AddExWithoutIds()`.
No extra test cases are needed in Milvus, because the current growing-segment search test cases already cover this change.
Search results and performance will be identical to before.
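A unit test for the new interface might check that searching a copied buffer and searching the original buffer through a pointer return the same ids. The sketch below illustrates that check with plain C++; `TopK` and `CopyAndExternalAgree` are hypothetical helpers using squared L2 distance, not Knowhere test utilities.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: brute-force top-k ids by squared L2 distance over
// row-major data. Ties are broken by the smaller row index.
std::vector<std::size_t> TopK(const float* base, std::size_t n, std::size_t d,
                              const float* query, std::size_t k) {
    std::vector<std::pair<float, std::size_t>> scored;
    scored.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        float dist = 0.0f;
        for (std::size_t j = 0; j < d; ++j) {
            const float diff = base[i * d + j] - query[j];
            dist += diff * diff;
        }
        scored.emplace_back(dist, i);
    }
    k = std::min(k, n);
    std::partial_sort(scored.begin(),
                      scored.begin() + static_cast<std::ptrdiff_t>(k),
                      scored.end());
    std::vector<std::size_t> ids;
    for (std::size_t i = 0; i < k; ++i) {
        ids.push_back(scored[i].second);
    }
    return ids;
}

// Returns true when the "copy-in" path and the "external pointer" path
// produce the same result for the same query.
bool CopyAndExternalAgree(const std::vector<float>& chunk, std::size_t d,
                          const std::vector<float>& query, std::size_t k) {
    const std::size_t n = chunk.size() / d;
    const std::vector<float> copied(chunk);  // simulates adding data by copy
    return TopK(copied.data(), n, d, query.data(), k) ==
           TopK(chunk.data(), n, d, query.data(), k);
}
```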