...
Keywords: search, range search, radius, low bound, high bound
Released:
Summary(required)
...
The purpose of this MEP is to realize "range search" function. User specifies a "range" (including radius low bound, or radius high bound, or both), Milvus does "range search" to get TOPK results with distance falling in this "range".
- only set radius low bound, return TOPK results with "low_bound <= distance"
- only set radius high bound
- set both radius low bound and high bound
...
- , return TOPK results with "distance < high_bound". (These TOPK results returned by range search will be identical with the TOPK results returned by search
...
- .
...
- )
- set both radius low bound and high bound, return TOPK results with "low_bound <= distance < high_bound"
The result output of search and range search are different. Search result is with fixed length 'nq * topk', while range search result is with variable length. Except 'IDs' and 'distances', 'lims' is also returned which is used to record the offset of 'ID' and 'DISTANCE' for each vector.
...
Motivation(required)
User requests Milvus to realize a function to get all results with distance falling in a "range". With this capability, user can get arbitrary TOPK search results, while currently search can mostly return 16384 results.
...
We reuse the SDK interface "search" Search() to do "range search", so the interface of Milvus and all SDKs need not to be changed. We only need add 2 more parameters "radius" information _low_bound" and "radius_high_bound" into params. When
When "radius_low_bound" or "radius_high_bound" is specified, the parameter "limit" is ignoredMilvus does range search; otherwise, Milvus does search.
As shown in the following figure, set "radius: 888" _low_bound": 1.0, "radius_high_bound": 2.0 in search_ params.params.
Code Block | ||
---|---|---|
| ||
default_index = {"index_type": "HNSW", "params":{"M": 48, "efConstruction": 500}, "metric_type": "L2"} collection.create_index("float_vector", default_index) collection.load() search_params = {"metric_type": "L2", "params": {"ef": 32, "radius_low_bound": 8881.0, "radius_high_bound": 2.0}} res = collection.search(vectors[:nq], "float_vector", search_params, limit, "int64 >= 0") |
- Need add new interface `QueryByRange()` in knowhere
Code Block | ||
---|---|---|
| ||
virtual DatasetPtr
QueryByRange(const DatasetPtr& dataset, const Config& config, const faiss::BitsetView bitset); |
...
- No interface change in knowhere
Knowhere reuses interface Query() for range search
...
Considering the implementation in Knowhere, we prefer not to adopt this proposal, because not all types of index support range search.
By now, Knowhere totally can support 13 types of index, but only 8 of them can support range search (the cell filled with BLUE):
...
.
When "radius_low_bound" or "radius_high_bound" is specified in config, knowhere does range search; otherwise, knowhere does search.
Code Block | ||
---|---|---|
| ||
virtual DatasetPtr QueryByRange(const DatasetPtr& dataset, const Config& config, const faiss::BitsetView bitset) { KNOWHERE_THROW_MSG("QueryByRange not supported yet"); } |
...
Query( |
...
Code Block | ||
---|---|---|
| ||
virtual DatasetPtr Query(const DatasetPtr& dataset, const Config& config, const faiss::BitsetView bitset) { if (radius exist in config) { // do range search } else { // do search } } |
This is not a good code design for Knowhere.
; |
Design Details(required)
The following figure shows the complete Range search completely reuses the call stack of a search request from SDK to segcore, then to knowhere.
Range search completely reuses this call stack without any changes. Range search only needs to put the parameter "radius_low_bound" and "radius_high_bound" into the search parameter "param" on the SDK side.
The following figure shows the call stack diagram of vector search in segcore.
The BLACK part shows the functions that have already been implemented.
The RED part shows the functions that need be added for range search.
Result Handling
Several data structure related with query result need be enhanced to support range search.
SubSearchResult
It is used to store each chunk's search result in a segment.
Need add a new field "lims_" and a new method "merge_range()". Because the behavior of merge between Query() and QueryByRange() differs a lot. For Query(), after merging two "nq * topk" results, we still get "nq * topk" result. But for QueryByRange(), when merge A (with "n0" results) and B (with "n1" results), we get "n0 + n1" results.
Code Block | ||
---|---|---|
| ||
class SubSearchResult {
... ...
void
merge(const SubSearchResult& sub_result);
void
merge_range(const SubSearchResult& sub_result); // <==== new added
private:
... ...
std::vector<int64_t> seg_offsets_;
std::vector<float> distances_;
std::vector<size_t> lims_; // <==== new added
} |
SearchResult
It is used to store the final result of one segment search. we need add a new field "lims".
Code Block | ||
---|---|---|
| ||
struct SearchResult {
... ...
public:
... ...
std::vector<float> distances_;
std::vector<int64_t> seg_offsets_;
std::vector<size_t> lims_; // <==== new added
... ...
}; |
ReduceHelper
It's used to reduce all search results from all segments in a query node.
reduceSearchResultData in Proxy
It's used to reduce all search results from multiple query nodes.
Need discuss the interface between Pagination and range search.
In Knowhere, range search reuses interface Query() with search. When "radius_low_bound" or "radius_high_bound" is set, range search is called; otherwise, search is called.
The output of range search is exactly as same as the result of search.
- seg_offsets_: with length "nq * topk", -1 is filled in when no enough result data
- distances_: whth lengh "nq * topk", float_max (for "L2") or float_min (for "IP") is filled in when no enough result data
Compatibility, Deprecation, and Migration Plan(optional)
- What impact (if any) will there be on existing users?
- If we are changing behaviors how will we phase out the older behavior?
- If we need special migration tools, describe them here.
- When will we remove the existing behavior?
Test Plan(required)
We can use Query() reuse search testcases to test QueryByRange() with little case change.
The main difference between Query() and QueryByRange() is, Query() using 'topk' while QueryByRange() using 'radius'.
Do following:
- do Query() firstly and get 'nq * topk' results
- for L2, find the max distance among "nq * topk' results, let 'radius = (max distance)^(1/2)'
- for IP, find the min distance among "nq * topk' results, let 'radius = min distance'
- for other binary metric types, find the max distance among "nq * topk' results, let 'radius = max distance'
- do QueryByRange() with above 'radius', the result will be a superset of result of Query()
...
range search.
- do search, get result1
- set "radius_high_bound"
- do search again, get range search result2
- if "radius_high_bound" is too small, result2 should be a subset of result1; otherwise, result2 should be identical with result1
There is no public data set for range search. I have created one range search data set based on sift1M.
...
If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other ways.
References(optional)
Milvus Search Flow
Segcore Search Flow
Briefly list all references