Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Keywords: search, range search, radius, low bound, high bound

Released: 

Summary(required)

...

The purpose of this MEP is to realize "range search" function. User specifies a "range" (including radius low bound, or radius high bound, or both), Milvus does "range search" to get TOPK results with distance falling in this "range".

  • only set radius low bound, return TOPK results with "low_bound <= distance"
  • only set radius high bound
  • set both radius low bound and high bound

...

  • , return TOPK results with "distance < high_bound". (These TOPK results returned by range search will be identical with the TOPK results returned by search

...

  • .

...

  • )
  • set both radius low bound and high bound, return TOPK results with "low_bound <= distance < high_bound"

The result output of search and range search are different. Search result is with fixed length 'nq * topk', while range search result is with variable length. Except 'IDs' and 'distances', 'lims' is also returned which is used to record the offset of 'ID' and 'DISTANCE' for each vector.

...

Motivation(required)

User requests Milvus to realize a function to get all results with distance falling in a "range". With this capability, user can get arbitrary TOPK search results, while currently search can mostly return 16384 results.

...

We reuse the SDK interface "search" Search() to do "range search", so the interface of Milvus and all SDKs need not to be changed. We only need add 2 more parameters "radius" information _low_bound" and "radius_high_bound" into params. When

When "radius_low_bound" or "radius_high_bound" is specified, the parameter "limit" is ignoredMilvus does range search; otherwise, Milvus does search.

As shown in the following figure, set "radius: 888" _low_bound": 1.0, "radius_high_bound": 2.0 in search_ params.params.

Code Block
languagepy
  default_index = {"index_type": "HNSW", "params":{"M": 48, "efConstruction": 500}, "metric_type": "L2"}
  collection.create_index("float_vector", default_index)
  collection.load()  
  search_params = {"metric_type": "L2", "params": {"ef": 32, "radius_low_bound": 8881.0, "radius_high_bound": 2.0}}
  res = collection.search(vectors[:nq], "float_vector", search_params, limit, "int64 >= 0")
  • Need add new interface `QueryByRange()` in knowhere
Code Block
languagecpp
  virtual DatasetPtr
  QueryByRange(const DatasetPtr& dataset, const Config& config, const faiss::BitsetView bitset);

...


  • No interface change in knowhere

Knowhere reuses interface Query() for range search

...

Considering the implementation in Knowhere, we prefer not to adopt this proposal, because not all types of index support range search.

By now, Knowhere totally can support 13 types of index, but only 8 of them can support range search (the cell filled with BLUE):

...

.

When "radius_low_bound" or "radius_high_bound" is specified in config, knowhere does range search; otherwise, knowhere does search.

Code Block
languagecpp
  virtual DatasetPtr
QueryByRange(const DatasetPtr& dataset, const Config& config, const faiss::BitsetView bitset) {
    KNOWHERE_THROW_MSG("QueryByRange not supported yet");
}

...

Query(

...

Code Block
languagecpp
virtual DatasetPtr
Query(const DatasetPtr& dataset, const Config& config, const faiss::BitsetView bitset) {
    if (radius exist in config) {
        // do range search
    } else {
        // do search
    }
}

This is not a good code design for Knowhere.

;

Design Details(required)

The following figure shows the complete Range search completely reuses the call stack of a search request from SDK to segcore, then to knowhere.

Range search completely reuses this call stack without any changes. Range search only needs to put the parameter "radius_low_bound" and "radius_high_bound" into the search parameter "param" on the SDK side.

Image Removed

The following figure shows the call stack diagram of vector search in segcore.

The BLACK part shows the functions that have already been implemented.

The RED part shows the functions that need be added for range search.

Image Removed

Result Handling

Several data structure related with query result need be enhanced to support range search.

  • SubSearchResult

It is used to store each chunk's search result in a segment. 

Need add a new field "lims_" and a new method "merge_range()". Because the behavior of merge between Query() and QueryByRange() differs a lot. For Query(), after merging two "nq * topk" results, we still get "nq * topk" result. But for QueryByRange(), when merge A (with "n0" results) and B (with "n1" results), we get "n0 + n1" results.

Code Block
languagecpp
class SubSearchResult {
    ... ...
    void
    merge(const SubSearchResult& sub_result);

    void
    merge_range(const SubSearchResult& sub_result);		// <==== new added

 private:
    ... ...
    std::vector<int64_t> seg_offsets_;
    std::vector<float> distances_;
    std::vector<size_t> lims_;			// <==== new added
}
  • SearchResult

It is used to store the final result of one segment search. we need add a new field "lims".

Code Block
languagecpp
struct SearchResult {
    ... ...
 public:
    ... ...
    std::vector<float> distances_;
    std::vector<int64_t> seg_offsets_;
    std::vector<size_t> lims_;			// <==== new added
    ... ...
};
  • ReduceHelper

It's used to reduce all search results from all segments in a query node.

  • reduceSearchResultData in Proxy

It's used to reduce all search results from multiple query nodes.

Need discuss the interface between Pagination and range search.


In Knowhere, range search reuses interface Query() with search. When "radius_low_bound" or "radius_high_bound" is set, range search is called; otherwise, search is called. 

The output of range search is exactly as same as the result of search.

  • seg_offsets_: with length "nq * topk", -1 is filled in when no enough result data
  • distances_: whth lengh "nq * topk", float_max (for "L2") or float_min (for "IP") is filled in when no enough result data

Compatibility, Deprecation, and Migration Plan(optional)

  • What impact (if any) will there be on existing users?
  • If we are changing behaviors how will we phase out the older behavior?
  • If we need special migration tools, describe them here.
  • When will we remove the existing behavior?

Test Plan(required)

We can use Query() reuse search testcases to test QueryByRange() with little case change.

The main difference between Query() and QueryByRange() is, Query() using 'topk' while QueryByRange() using 'radius'.

Do following:

  1. do Query() firstly and get 'nq * topk' results
  2. for L2, find the max distance among "nq * topk' results, let 'radius = (max distance)^(1/2)'
  3. for IP, find the min distance among "nq * topk' results, let 'radius = min distance'
  4. for other binary metric types, find the max distance among "nq * topk' results, let 'radius = max distance'
  5. do QueryByRange() with above 'radius', the result will be a superset of result of Query()

...

range search.

  1. do search, get result1
  2. set "radius_high_bound"
  3. do search again, get range search result2
  4. if "radius_high_bound" is too small, result2 should be a subset of result1; otherwise, result2 should be identical with result1



There is no public data set for range search. I have created one range search data set based on sift1M.

...

If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other ways.


References(optional)

Milvus Search Flow

Image Added


Segcore Search Flow

Image AddedBriefly list all references