
Current state: ["Under Discussion"]

ISSUE: #17599

PRs: 

Keywords: search, range search, radius

Released: 

Summary(required)

At present, the behavior of "search" in Milvus is to return the TOPK most similar results for each query vector.

The purpose of this MEP is to implement a "range search" function. The user specifies a "radius"; Milvus performs "range search" and returns all results with distance better than this "radius" ("> radius" for IP, "< radius" for other metric types).

This function can be regarded as a "superset" of the existing search function: if we sort the results of range search and take the first `topk` of them, we get results identical to those of the current search function.
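This superset relationship can be illustrated with a small brute-force sketch (plain Python, not the Milvus API; all names here are illustrative):

```python
import random

random.seed(0)
base = [[random.random() for _ in range(8)] for _ in range(200)]
query = [random.random() for _ in range(8)]
topk = 5

def l2_sq(a, b):
    # Squared L2 distance, as Milvus reports for the L2 metric.
    return sum((x - y) ** 2 for x, y in zip(a, b))

dist = [l2_sq(v, query) for v in base]

# Regular search: ids of the topk smallest distances.
topk_ids = sorted(range(len(dist)), key=lambda i: dist[i])[:topk]

# Range search: every id whose distance is within the radius
# (for L2, "better" means smaller).
radius = sorted(dist)[topk - 1]
range_ids = [i for i in range(len(dist)) if dist[i] <= radius]

# Sorting the range search result and taking the first topk
# reproduces the regular search result.
range_topk = sorted(range_ids, key=lambda i: dist[i])[:topk]
assert range_topk == topk_ids
```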

The result outputs of search and range search are different. The original search result has a fixed length 'nq * topk', while the range search result has variable length. Besides 'IDs' and 'distances', 'lims' is also returned; it records the offset of each query vector's range search result within the result set.
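The 'lims' layout can be illustrated as follows (plain Python with hypothetical values; 'lims' has nq + 1 entries, so query i owns the slice [lims[i], lims[i+1])):

```python
# Range search results for nq = 3 query vectors (variable length):
# query 0 has 2 hits, query 1 has 0 hits, query 2 has 3 hits.
ids       = [10, 42, 7, 99, 15]
distances = [0.1, 0.3, 0.2, 0.5, 0.8]
# The hits of query i live in ids[lims[i]:lims[i+1]]
# and distances[lims[i]:lims[i+1]].
lims = [0, 2, 2, 5]

for i in range(len(lims) - 1):
    print(f"query {i}: ids={ids[lims[i]:lims[i+1]]}")
```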

Another MEP, pagination, will uniformly process the results of 'search' and 'range search' and return them to the client, so result processing will not be discussed in this MEP.

Motivation(required)

A user (developing a recommendation system) asked Milvus to support searching by distance, that is, to return all results with similarity better than a certain "radius".

Public Interfaces(optional)

  • No interface change in Milvus and all SDKs

We reuse the SDK interface "search" to perform "range search", so the interfaces of Milvus and all SDKs need not change. We only need to add the "radius" information into the params. When "radius" is specified, the parameter "limit" is ignored.

As shown in the following example, set "radius: 888" in search_params.params.

  default_index = {"index_type": "HNSW", "params":{"M": 48, "efConstruction": 500}, "metric_type": "L2"}
  collection.create_index("float_vector", default_index)
  collection.load()  
  search_params = {"metric_type": "L2", "params": {"ef": 32, "radius": 888}}
  res = collection.search(vectors[:nq], "float_vector", search_params, limit, "int64 >= 0")


  • Need to add a new interface `QueryByRange()` in Knowhere
  virtual DatasetPtr
  QueryByRange(const DatasetPtr& dataset, const Config& config, const faiss::BitsetView bitset);


There is a proposal not to add the new interface QueryByRange(), but to reuse the interface Query() for range search, so as to minimize Knowhere's public interface.

Considering the implementation in Knowhere, we prefer not to adopt this proposal, because not all index types support range search.

Knowhere currently supports 13 index types, but only 8 of them support range search (the cells filled with BLUE in the original table):

  • BinaryIDMAP, BinaryIVF
  • IDMAP, IVF_FLAT, IVF_FLAT_NM, IVF_PQ, IVF_SQ8
  • HNSW, ANNOY
  • RHNSW_FLAT, RHNSW_IVF, RHNSW_SQ, RHNSW_PQ

(The BLUE highlighting marking the supported subset did not survive extraction.)

If we add the new interface QueryByRange(), we can add a virtual QueryByRange() into "VecIndex.h". If QueryByRange() is called for an index type which does not support range search, a Knowhere exception is thrown:

virtual DatasetPtr
QueryByRange(const DatasetPtr& dataset, const Config& config, const faiss::BitsetView bitset) {
    KNOWHERE_THROW_MSG("QueryByRange not supported yet");
}

But if we reuse the interface Query() for range search, the implementation of Query() for all index types would have to change like this:

virtual DatasetPtr
Query(const DatasetPtr& dataset, const Config& config, const faiss::BitsetView bitset) {
    if (radius exist in config) {
        // do range search
    } else {
        // do search
    }
}

This is not a good code design for Knowhere.

Design Details(required)

The following figure shows the complete call stack of a search request from SDK to segcore. Range search completely reuses this call stack without any changes; it only needs to put the parameter "radius" into the search parameter "param" on the SDK side.


The following figure shows the call stack diagram of vector search in segcore, and the BLACK part shows the functions that have already been implemented.

For sealed segments, to realize the range search function, Knowhere needs to provide the interface QueryByRange().

For growing segments, because no index has been created, we can't use Knowhere IDMAP to realize brute force search. We can only re-implement the full logic of brute force search ourselves. To implement range search, we need to implement the functions shown in the RED part.
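A brute-force range search over raw (unindexed) vectors boils down to the following schematic sketch (plain Python standing in for the segcore implementation; the function name and output layout mirror QueryByRange() but are illustrative):

```python
def brute_force_range_search(vectors, queries, radius, metric="L2"):
    """Scan every vector; keep hits whose distance is better than radius.

    Returns flat ids/distances plus a lims offset array, mirroring the
    variable-length QueryByRange() output layout.
    """
    ids, distances, lims = [], [], [0]
    for q in queries:
        for i, v in enumerate(vectors):
            if metric == "IP":
                d = sum(x * y for x, y in zip(q, v))
                hit = d > radius          # larger is better for IP
            else:
                d = sum((x - y) ** 2 for x, y in zip(q, v))
                hit = d < radius          # smaller is better for L2
            if hit:
                ids.append(i)
                distances.append(d)
        lims.append(len(ids))
    return ids, distances, lims
```

For example, with base vectors [[0,0], [1,0], [3,0]], one query [0,0], and radius 2.0 under L2, this returns ids [0, 1] and lims [0, 2].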

Another solution is for Knowhere to provide a new IDMAP index which does not need to insert vector data, but only references the external memory address of the vector data. A growing segment can temporarily generate this kind of index during search, call the Query() & QueryByRange() interfaces provided by the IDMAP, and destroy the index immediately after use.

There is another MEP 34 -- IDMAP/BinaryIDMAP Enhancement describing this proposal.

Result Handling

Several data structures related to query results need to be enhanced to support range search.

  • SubSearchResult

It is used to store each chunk's search result in a segment.

We need to add a new field "lims_" and a new method "merge_range()", because the merge behavior of Query() and QueryByRange() differs significantly. For Query(), merging two "nq * topk" results still yields an "nq * topk" result. But for QueryByRange(), merging A (with "n0" results) and B (with "n1" results) yields "n0 + n1" results.

class SubSearchResult {
    ... ...
    void
    merge(const SubSearchResult& sub_result);

    void
    merge_range(const SubSearchResult& sub_result);		// <==== new added

 private:
    ... ...
    std::vector<int64_t> seg_offsets_;
    std::vector<float> distances_;
    std::vector<size_t> lims_;			// <==== new added
};
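The difference from the fixed-length merge can be sketched in Python (hypothetical (ids, distances, lims) tuples, not the segcore types):

```python
def merge_range(a, b):
    """Concatenate two variable-length range search results per query.

    a and b are (ids, distances, lims) tuples for the same nq queries;
    the merged result simply holds n0 + n1 hits, with lims rebuilt.
    """
    a_ids, a_dist, a_lims = a
    b_ids, b_dist, b_lims = b
    ids, dist, lims = [], [], [0]
    nq = len(a_lims) - 1
    for i in range(nq):
        # Append query i's hits from both inputs; order within a
        # query is not significant until the final reduce/sort.
        ids.extend(a_ids[a_lims[i]:a_lims[i + 1]])
        ids.extend(b_ids[b_lims[i]:b_lims[i + 1]])
        dist.extend(a_dist[a_lims[i]:a_lims[i + 1]])
        dist.extend(b_dist[b_lims[i]:b_lims[i + 1]])
        lims.append(len(ids))
    return ids, dist, lims
```

Merging A with 2 hits and B with 1 hit for a single query yields 3 hits, whereas the topk merge would still cap the output at topk.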
  • SearchResult

It is used to store the final result of one segment's search. We need to add a new field "lims_".

struct SearchResult {
    ... ...
 public:
    ... ...
    std::vector<float> distances_;
    std::vector<int64_t> seg_offsets_;
    std::vector<size_t> lims_;			// <==== new added
    ... ...
};
  • ReduceHelper

It is used to reduce all search results from all segments in a query node.

  • reduceSearchResultData in Proxy

It is used to reduce all search results from multiple query nodes.


The interface between pagination and range search still needs to be discussed.


Compatibility, Deprecation, and Migration Plan(optional)

  • What impact (if any) will there be on existing users?
  • If we are changing behaviors how will we phase out the older behavior?
  • If we need special migration tools, describe them here.
  • When will we remove the existing behavior?

Test Plan(required)

We can reuse the Query() test cases to test QueryByRange() with few changes.

The main difference between Query() and QueryByRange() is that Query() uses 'topk' while QueryByRange() uses 'radius'.

Do the following:

  1. do Query() first and get 'nq * topk' results
  2. for L2, find the max distance among the 'nq * topk' results, let 'radius = (max distance)^(1/2)'
  3. for IP, find the min distance among the 'nq * topk' results, let 'radius = min distance'
  4. for other binary metric types, find the max distance among the 'nq * topk' results, let 'radius = max distance'
  5. do QueryByRange() with the above 'radius'; the result will be a superset of the Query() result
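For L2, the steps above can be sketched like this (plain Python brute force standing in for Query()/QueryByRange(); the sketch stays in squared-distance space for simplicity, whereas step 2 applies a square root to convert Milvus's squared L2 distances):

```python
import random

random.seed(42)
base = [[random.random() for _ in range(4)] for _ in range(100)]
queries = [[random.random() for _ in range(4)] for _ in range(3)]
topk = 8

def l2_sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Step 1: Query() -- topk ids per query vector.
all_dists = [[l2_sq(q, v) for v in base] for q in queries]
topk_ids = [set(sorted(range(len(d)), key=d.__getitem__)[:topk])
            for d in all_dists]

# Step 2: for L2, take the max distance over all nq * topk results.
radius = max(sorted(d)[topk - 1] for d in all_dists)

# Step 5: QueryByRange() with that radius yields a superset per query.
range_ids = [{i for i, dv in enumerate(d) if dv <= radius}
             for d in all_dists]
assert all(t <= r for t, r in zip(topk_ids, range_ids))
```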

For index HNSW, we should set "range_k" before calling QueryByRange().

There are no public data sets for range search, so I have created range search data sets based on existing ANN benchmark data sets such as sift1M.

You can find them in NAS:

test/milvus/ann_hdf5/sift-128-euclidean-range.hdf5

test/milvus/ann_hdf5/glove-200-angular-range.hdf5

test/milvus/ann_hdf5/binary/sift-4096-hamming-range.hdf5

Rejected Alternatives(optional)

If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other ways.

References(optional)

Briefly list all references
