Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Current state: ["Under DiscussionApproved"]

ISSUE: #17599

PRs: 

Keywords: search, range search, radius, low bound, high boundrange filter

Released: 

Summary(required)

By now, doing the behavior of  "search" in Milvus will return is returning TOPK most similar sorted results for each queried vector being searched.

This MEP is about to realize a functionality named "range search". User specifies a range scope -- including radius low bound and radius high boundand range filter (optional), Milvus does "range search", and also returns TOPK sorted results with distance falling in this range scope.

  • both radius low bound and radius high bound param "radius" MUST be set, param "range filter" is optionalrange search will return TOPK results with
  • for L2: low_bound <= distance < high_bound
  • for IP: low_bound < distance <= high_bound
  • falling in range scope means
Metric typeBehavior
IPradius < distance <= range_filter
L2 and other metric typesrange_filter <= distance < radius

Motivation(required)

Many users request Milvus to realize a the "range search" functionality. With this capability, user can

...

We reuse the SDK interface Search() to do "range search". Only add 2 more parameters "radius_low_bound" and "radiusrange_high_boundfilter" into params.

When When param "radius_low_bound" and "radius_high_bound" are " is specified, Milvus does range search; otherwise, Milvus does search.

As shown in the following figure, set "radiusrange_low_boundfilter": 1.0, "radius_high_bound": 2.0 in search_ params.params.

Code Block
languagepy
  default_index = {"index_type": "HNSW", "params":{"M": 48, "efConstruction": 500}, "metric_type": "L2"}
  collection.create_index("float_vector", default_index)
  collection.load()  
  search_params = {"metric_type": "L2", "params": {"ef": 32, "radiusrange_low_boundfilter": 1.0, "radius_high_bound": 2.0}}
  res = collection.search(vectors[:nq], "float_vector", search_params, limit, "int64 >= 0")

...


IPL2HAMMINGJACCARDTANIMOTOSUBSTRUCTURESUPERSTRUCTURE
BIN_IDMAP

  •   
  •   
  •   


BIN_IVF_FLAT

  •   
  •   
  •   


IDMAP
  •   
  •   





IVF_FLAT
  •   
  •   





IVF_PQ
  •   
  •   





IVF_SQ8
  •   
  •   





HNSW
  •   
  •   





ANNOY






DISKANN
  •  
  •   





If call range search API with unsupported index types or unsupported metric types, knowhere exception will throw be thrown out exception.

In Knowhere, only one parameter two new parameters "radius" is and "range_filter" are passed in via config, and range search will return all un-sorted results with distance "better" than radius.

  • For IP: > radius
  • For other metric types: < radius

falling in this range scope.

Metric typeBehavior
IPradius < distance <= range_filter
L2 or other metric typesrange_filter <= distance < radius

Knowhere run range search in 2 steps:

  1. pass param "radius" to thirdparty to call their range search APIs, and get result
  2. if param "range_filter " is set, filter above result and return; if not, return above result directly


We add 3 new APIs CheckRangeSearch(We add 3 new APIs CheckRangeSearch(), QueryByRange() and BruteForce::RangeSearch() to support range search, these APIs are already available since knowhere-v1.3.0..

  1. CheckRangeSearchCheckRangeSearch()

This API is used to do range search parameter legacy check. It will throw exception when parameter legacy check fail.

...

metric typerangesimilarnot similar
L2([0, inf)small0largeinf
IP[-1, 1]1-1
jaccard[0, 1]01
tanimoto[0, 0.5]inf)00.5inf
hamming[0, n]0n

2. QueryByRange()

This API does range search for index, it returns all unsorted results with distance "better than radius" (for IP: > radius; for others: < radius).falling in the specified range scope.

PROTO
virtual DatasetPtr
QueryByRange(
PROTO
virtual DatasetPtr
QueryByRange(const DatasetPtr& dataset, const Config& config, const faiss::BitsetView bitset)

INPUT

Dataset {
    knowhere::meta::TENSOR: -   // query data
    knowhere::meta::ROWS: -      // rows of queries
    knowhere::meta::DIM: -          // dimension
}

Config {

    knowhere::meta::RADIUS: -   // radius for range search   

    knowhere::meta::RANGE_FILTER: -   

}

OUTPUT

Dataset {
    knowhere::meta::IDS: -                // result IDs with length LIMS[nq]
    knowhere::meta::DISTANCE: -  // result DISTANCES with length LIMS[nq]
    knowhere::meta::LIMS: -            // result offset prefix sum with length nq + 1
}

...

The range search result for each query vector is: IDS[lims[n], lims[n+1]) and DISTANCE[lims[n], lims[n+1])


The memory used for IDS, DISTANCE and LIMS are allocated in Knowhere, they will be auto-freed when Dataset deconstruction.

...

PROTO
static DatasetPtr
RangeSearch(const DatasetPtr base_dataset,
const DatasetPtr query_dataset,
const Config& config,
const faiss::BitsetView bitset);

INPUT

Dataset {
    knowhere::meta::TENSOR: -   // base data
    knowhere::meta::ROWS: -      // rows of base data
    knowhere::meta::DIM: -          // dimension
}

Dataset {
    knowhere::meta::TENSOR: -   // query data
    knowhere::meta::ROWS: -      // rows of queries
    knowhere::meta::DIM: -          // dimension
}

Config {

    knowhere::meta::RADIUS: -   // radius for range search

}

OUTPUT

Dataset {
    knowhere::meta::RANGE_FILTER: -   

}

OUTPUT

Dataset {
    knowhere::meta::IDS: -                // result IDs with length LIMS[nq]
    knowhere::meta::DISTANCE: -  // result DISTANCES with length LIMS[nq]
    knowhere::meta::LIMS: -            // result offset prefix sum with length nq + 1
}

...

Segcore search flow will be updated as this flow chart, range search related change is marked RED.

In Segcore , it uses radius parameters parameter's existence to decide whether to run range search, or to run range search. When "radius_low_bound" and "radius_high_bound" are

  • when param "radius" is set, range search is called; 
  • otherwise, search is called. 

For API query::SearchOnSealedIndex() and BruteForceSearch(), they do like following:

  1. pass in correct radius parameter to knowhere (For IP: radius_low_bound; for others: radius_high_bound)radius parameters (radius / range_filter) to knowhere 
  2. get all unsorted range search result from knowhere
  3. for each NQ's results, do sort, then filter using another radius parameterheap-sort
  4. return re-generate result Dataset with TOPK results

...

How to get more than 16384 range search results ?

With above this solution, user can get maximum 16384 range search results in one call.

If user wants to get more than 16384 range search results, they can call range search multiple times with different radius range_filter parameter (use L2 as an example)

  • NQ = 1

1st call with (radiusrange_low_bound filter = 0.0, radius _high_bound = inf), get result distances like this:

{d(0), d(1), d(2), ..., d(n-1)}

2nd call with (radiusrange_low_bound filter = d(n-1), radius _high_bound = inf), get result distances like this:

{d(n), d(n+1), d(n+2), ..., d(2n-1)}

3rd call with (radiusrange_low_bound filter = d(2n-1), radius _high_bound = inf), get result distances like this:

...

  • NQ > 1 (suppose NQ = 2)

1st call with (radiusrange_low_bound filter = 0.0, radius _high_bound = inf), get result distances like this: 

{d(0,0), d(0,1), d(0,2), ..., d(0,n-1), d(1,0), d(1,1), d(1,2), ..., d(1,n-1)}

2nd call with (radiusrange_low_bound filter = min{d(0,n-1), d(1,n-1)}, radius _high_bound = inf), get result distances like this: 

{d(0,n), d(0,n+1), d(0,n+2), ..., d(0,2n-1), d(1,n), d(1,n+1), d(1,n+2), ..., d(1,2n-1)}

3rd call with (radiusrange_low_bound filter = min{d(0,2n-1), d(1,2n-1)}, radius _high_bound = inf), get result distances like this:

...

The result of each iteration will have some duplication with the result of previous iteration, use user need do duplication check and remove them.

...

Compatibility, Deprecation, and Migration Plan(optional)

There is no compatibility This is a new functionality, there is no compatibility issue.

Test Plan(required)

...

  1. Add new unittest
  2. Add benchmark to test range search runtime and recall
  3. Add benchmark to test using range search datasetQPS

There is no public dataset for range search. I have created range search data set based on sift1M and glove200.

You can find them in NAS:NAS:

  1. test/milvus/ann_hdf5/binary/sift-4096-hamming-range.hdf5
    1. base dataset and query dataset are identical with sift1m
    2. radius = 291.0
    3. result length for each nq is different
    4. total result num 1,063,078
  2. test/milvus/ann_hdf5/sift-128-euclidean-range.hdf5
    1. base dataset and query dataset are identical with sift1m
    2. radius = 186.0 * 186.0
    3. result length for each nq is different
    4. total result num 1,054,377
  3. test/milvus/ann_hdf5/sift-128-euclidean-range

...

  1. -multi.hdf5
    1. base dataset and query dataset are identical with sift1m
    2. ground truth IDs and Distances are identical with sift1m
    3. each nq's radius is set to the last ground truth distance
    4. result length for each nq is 100
    5. total result num 1,000,000
  2. test/milvus/ann_hdf5/glove-200-angular-range.hdf5

      ...

        1. base dataset and query dataset are identical with glove200
        2. radius = 0.52
        3. result length for each nq is different
        4. total result num 1,060,888

      Segcore

      1. Add new range search unittest

      ...

      1. use sift1M/glove200 dataset to test range search (radius _low_bound = = max_float / -1.0, radius_high_bound = 1000000), we expect to get identical result results as search


      Rejected Alternatives(optional)

      ...

      The project implementation of the previous proposal will be much more complicated than current proposal.is too complicated to achieve comparing with current proposal.

      Adv.

      1. Get all range search result in one call

      Cons:

      1. Need implement Merge operation for chunk, segment, query node and proxy
      2. Memory explosion caused by Merge
      3. Many API modification caused by invalid topk parameter

      ...

      Others

      Because the result length of range search from knowhere is variable, knowhere plan to afford another API to return the range search result count for each NQ.

      If there is user request to get all range search result in one call, query node team will afford another solution to save range search output of knowhere to S3 directly.

      References(optional)

      1. Knowhere Range Search Algorithm Introduction
      2. Range Search Introduction