MEP 34 -- IDMAP/BinaryIDMAP Enhancement

Current state: ["Under Discussion"]

PRs:

Keywords: IDMAP, BinaryIDMAP, brute force search, chunk

Released: Milvus-v2.2.0

Summary(required)

In this MEP, we propose an enhancement that lets the Knowhere index types IDMAP and BinaryIDMAP reference external vector data instead of copying the vector data into the index.

The enhanced IDMAP/BinaryIDMAP can then be used for growing segment search, which improves code reuse and reduces code maintenance effort.

Motivation(required)

Generally, no one creates an IDMAP/BinaryIDMAP index for a sealed segment: it brings no search performance improvement, yet it consumes as much memory and disk space as the raw vectors.

The only reasonable use case for IDMAP/BinaryIDMAP is the growing segment. But creating a vector index is a resource-consuming operation because it involves every kind of Milvus node: the index file is created by the index node, saved by the data node, and loaded by the query node, while proxy, rootcoord, indexcoord, datacoord, and querycoord coordinate these operations.

So Milvus currently uses the following 2 strategies for growing segment search:

small-batch index for fully filled chunks (this functionality is disabled for some particular reason)
brute-force search for partially filled chunks and for fully filled chunks without an index (the code is copied from Knowhere IDMAP)

The advantage of this solution is resource saving: no node other than the query node is involved. The drawback is code duplication. As the "Search Flow" chart shows, `FloatSearchBruteForce` and `BinarySearchBruteForce` are copied from knowhere::IDMAP and knowhere::BinaryIDMAP. The more code is duplicated, the more maintenance effort it takes. And whenever a new feature such as range search is implemented on IDMAP/BinaryIDMAP in Knowhere, the same work has to be redone in Milvus segcore.

This is why the enhanced IDMAP/BinaryIDMAP is proposed. With the enhanced IDMAP/BinaryIDMAP, vector data is not copied into the index; the index only stores a pointer to external vector data. The caller must guarantee that this memory is contiguous and stays valid for as long as the index is used.

In this way:

no CPU time, memory, or disk consumption when creating the index
resource saving: only the query node is involved
no code duplication for growing segment search
unified search results for sealed segments and growing segments
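
The idea of borrowing external memory instead of copying it can be illustrated with a minimal, self-contained sketch. `FlatIndexSketch` and its members are hypothetical names chosen for illustration; they are not the Faiss or Knowhere classes, and the squared L2 scan only stands in for the real brute-force search.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a flat float index; not the Faiss or Knowhere class.
struct FlatIndexSketch {
    std::size_t d;                    // vector dimension
    std::size_t ntotal = 0;           // number of searchable vectors
    std::vector<float> owned;         // vectors copied in by add()
    const float* external = nullptr;  // vectors borrowed by add_ex()

    explicit FlatIndexSketch(std::size_t dim) : d(dim) {}

    // Classic path: copy n vectors into index-owned storage.
    void add(std::size_t n, const float* x) {
        if (external != nullptr) throw std::logic_error("external data already set");
        owned.insert(owned.end(), x, x + n * d);
        ntotal += n;
    }

    // Enhanced path: remember the caller's pointer, copy nothing.
    void add_ex(std::size_t n, const float* x) {
        if (!owned.empty()) throw std::logic_error("owned data already added");
        external = x;
        ntotal = n;
    }

    // Both paths expose the vectors through the same read-only pointer.
    const float* data() const {
        return external != nullptr ? external : owned.data();
    }

    // Brute-force scan: row index with the smallest squared L2 distance,
    // or -1 when the index is empty.
    std::int64_t search_nearest(const float* query) const {
        assert(query != nullptr);
        const float* base = data();
        std::int64_t best = -1;
        float best_dist = std::numeric_limits<float>::max();
        for (std::size_t i = 0; i < ntotal; ++i) {
            float dist = 0.0f;
            for (std::size_t j = 0; j < d; ++j) {
                const float diff = base[i * d + j] - query[j];
                dist += diff * diff;
            }
            if (dist < best_dist) {
                best_dist = dist;
                best = static_cast<std::int64_t>(i);
            }
        }
        return best;
    }
};
```

Because the search code only reads through `data()`, the same scan serves both the copied and the borrowed case, which is the property that lets one search implementation cover sealed and growing segments.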

Public Interfaces(optional)


Faiss needs a new field "xb_ex" and a new interface "add_ex" for the structure IndexBinaryFlat, and a new field "codes_ex" and a new interface "add_ex" for the structure IndexFlat (declared in its base structure IndexFlatCodes).

In IndexBinaryFlat, "xb" and "xb_ex" are mutually exclusive: the user cannot set both at the same time. The same holds for "codes" and "codes_ex" in IndexFlat.

//============================================================================
struct IndexBinaryFlat : IndexBinary {
    /// database vectors, size ntotal * d / 8
    std::vector<uint8_t> xb;

    /// external database vectors, size ntotal * d / 8
    uint8_t* xb_ex = nullptr;  // <==== newly added
    ... ...
};

void IndexBinaryFlat::add_ex(idx_t n, const uint8_t* x) {
    xb_ex = (uint8_t*)x;
    ntotal = n;
}

//============================================================================
struct IndexFlatCodes : Index {
    ... ...
    /// encoded dataset, size ntotal * code_size
    std::vector<uint8_t> codes;

    /// external encoded dataset, size ntotal * code_size
    uint8_t* codes_ex = nullptr;  // <==== newly added
    ... ...
};

void IndexFlatCodes::add_ex(idx_t n, const float* x) {
    FAISS_THROW_IF_NOT(is_trained);
    FAISS_THROW_IF_NOT(codes.empty());
    codes_ex = (uint8_t*)x;
    ntotal = n;
}
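
The mutual exclusion between the owned and the external buffer can be enforced in both directions. The following is a minimal sketch for the binary case, assuming a Hamming-distance scan; `BinaryFlatSketch` is a hypothetical stand-in, not `faiss::IndexBinaryFlat`, and the exception types are an assumption made for illustration.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a binary flat index; not faiss::IndexBinaryFlat.
struct BinaryFlatSketch {
    std::size_t code_size;                // bytes per vector, i.e. d / 8
    std::size_t ntotal = 0;               // number of searchable vectors
    std::vector<std::uint8_t> xb;         // codes copied in by add()
    const std::uint8_t* xb_ex = nullptr;  // codes borrowed by add_ex()

    explicit BinaryFlatSketch(std::size_t d_bits) : code_size(d_bits / 8) {
        if (d_bits == 0 || d_bits % 8 != 0) {
            throw std::invalid_argument("dimension must be a positive multiple of 8");
        }
    }

    // Owned path: refuse if an external buffer is already attached.
    void add(std::size_t n, const std::uint8_t* x) {
        if (xb_ex != nullptr) throw std::logic_error("xb_ex already set");
        xb.insert(xb.end(), x, x + n * code_size);
        ntotal += n;
    }

    // External path: refuse if codes were already copied in.
    void add_ex(std::size_t n, const std::uint8_t* x) {
        if (!xb.empty()) throw std::logic_error("xb already holds data");
        xb_ex = x;
        ntotal = n;
    }

    // Number of differing bits between two codes of code_size bytes.
    std::size_t hamming(const std::uint8_t* a, const std::uint8_t* b) const {
        std::size_t bits = 0;
        for (std::size_t i = 0; i < code_size; ++i) {
            bits += std::bitset<8>(static_cast<unsigned>(a[i] ^ b[i])).count();
        }
        return bits;
    }

    // Brute-force scan: row with the smallest Hamming distance, or -1 if empty.
    std::int64_t search_nearest(const std::uint8_t* query) const {
        assert(query != nullptr);
        const std::uint8_t* base = (xb_ex != nullptr) ? xb_ex : xb.data();
        std::int64_t best = -1;
        std::size_t best_dist = std::numeric_limits<std::size_t>::max();
        for (std::size_t i = 0; i < ntotal; ++i) {
            const std::size_t dist = hamming(base + i * code_size, query);
            if (dist < best_dist) {
                best_dist = dist;
                best = static_cast<std::int64_t>(i);
            }
        }
        return best;
    }
};
```

Checking the opposite buffer in both `add` and `add_ex` keeps the two storage modes from being mixed, so the search never has to decide which of two populated buffers is authoritative.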

Knowhere needs a new interface `AddExWithoutIds()` for both IDMAP and BinaryIDMAP.

    // set an external data pointer instead of actually adding data
    void
    AddExWithoutIds(const DatasetPtr&, const Config&);

When Knowhere detects that "codes_ex" is used in the current IDMAP index, or that "xb_ex" is used in the current BinaryIDMAP index, serialization is disallowed.

For Milvus, the implementations of "FloatSearchBruteForce()" and "BinarySearchBruteForce()" will be rewritten, but their interfaces do not need to change.
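
A serialization guard of this kind could look like the sketch below. `IdmapSketch`, `uses_external_data()`, and the choice of `std::runtime_error` are all assumptions made for illustration; how Knowhere actually reports the error is not specified in this proposal.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical index wrapper; field names mirror the proposal, the rest is assumed.
struct IdmapSketch {
    std::size_t code_size = 0;               // bytes per vector
    std::size_t ntotal = 0;                  // number of vectors
    std::vector<std::uint8_t> codes;         // owned data
    const std::uint8_t* codes_ex = nullptr;  // borrowed data

    bool uses_external_data() const { return codes_ex != nullptr; }

    // Returns a copy of the owned bytes; refuses when the data is only borrowed,
    // because the index does not control the lifetime of that memory.
    std::vector<std::uint8_t> Serialize() const {
        if (uses_external_data()) {
            throw std::runtime_error("index with external data cannot be serialized");
        }
        assert(codes.size() == ntotal * code_size);
        return codes;
    }
};
```

Rejecting the call outright, rather than silently copying the borrowed bytes, keeps the enhanced index a purely in-memory object.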

Design Details(required)


In Milvus, when a growing segment needs to create an enhanced IDMAP index, it can do so as follows:

    auto idmap_index = std::make_shared<knowhere::IDMAP>();
    idmap_index->Train(train_dataset, conf);
    idmap_index->AddExWithoutIds(train_dataset, conf);  // <==== call "AddExWithoutIds"
    auto result = idmap_index->Query(query_dataset, conf, bitset);

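This enhanced IDMAP index cannot be serialized. It is destroyed automatically when the last reference to it is released, and because it does not own the vector data, its destruction frees no vector memory.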
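
Since the index only borrows memory, the caller has to keep that memory contiguous and at a stable address while rows are still being appended. One way to meet that requirement is sketched below, assuming a chunk with a fixed capacity reserved up front; `ChunkSketch`, `BorrowedView`, and `make_view` are hypothetical names and do not describe the actual Milvus segcore data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical fixed-capacity chunk of a growing segment.
class ChunkSketch {
public:
    ChunkSketch(std::size_t dim, std::size_t max_rows) : dim_(dim), max_rows_(max_rows) {
        if (dim == 0) throw std::invalid_argument("dim must be positive");
        // Reserve once so that later appends never reallocate and never move
        // rows that a borrowed view may still point at.
        data_.reserve(dim * max_rows);
    }

    // Appends one row of dim floats; returns false when the chunk is full.
    bool append(const float* row) {
        assert(row != nullptr);
        if (rows() == max_rows_) return false;
        data_.insert(data_.end(), row, row + dim_);
        return true;
    }

    std::size_t rows() const { return data_.size() / dim_; }
    std::size_t dim() const { return dim_; }
    const float* data() const { return data_.data(); }

private:
    std::size_t dim_;
    std::size_t max_rows_;
    std::vector<float> data_;
};

// Non-owning view built per search; destroying it releases nothing.
struct BorrowedView {
    const float* base;
    std::size_t rows;
    std::size_t dim;
};

// Snapshot of the rows that are filled at the time of the call.
inline BorrowedView make_view(const ChunkSketch& chunk) {
    return BorrowedView{chunk.data(), chunk.rows(), chunk.dim()};
}
```

A view taken before an append still points at valid memory afterwards, and a fresh view simply reports the larger row count, so a partially filled chunk can be searched without copying it.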

Compatibility, Deprecation, and Migration Plan(optional)

This MEP is transparent to users and does not introduce any compatibility issues.

Test Plan(required)

Knowhere needs to add unit tests for the new interface `AddExWithoutIds()`.

No extra test cases need to be added in Milvus, because the current growing segment search test cases already cover this change.

Search results and performance are expected to be identical to the current behavior.

References(optional)

