Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Current state: ["Under DiscussionApproved"]

ISSUE: #17599

PRs: 

Keywords: search, range search, radius, range filter

Released: 

Summary(required)

目前 Milvus 已经实现的查询是为每个待查询向量返回 topk 个最相似的结果。

本项目的是为了实现另一种查询功能,用户指定查询半径 "radius",Milvus 查询并返回与待查询向量间距离优于 "radius" ("< radius" for "L2"; "> radius" for "IP" ) 的所有结果。

这一功能可以认为是现有查询功能的"超集",因为如果对 range search 的结果排序再取前 topk 个结果,就是现有查询功能的返回结果。

本项目的结果输出不同于原来的查询结果,原来的查询结果是定长的 nq * topk,而 range search 的返回结果是不定长的,除了 ids / distances 还会返回 lims 用以记录每条向量的查询结果在结果集中的偏移量。有另一个 MEP Pagination 会统一处理 Query 和 QueryByRange 的结果并返回给客户端,因此返回结果的处理不在本 MEP 讨论范围中。

Motivation(required)

有用户(做推荐系统)提需求希望 Milvus 能实现按距离查询的功能,即返回相似度优于某一阀值的所有结果。

Public Interfaces(optional)

Briefly list any new interfaces that will be introduced as part of this proposal or any existing interfaces that will be removed or changed.

我们复用了 Query 接口来实现 range search 的功能,因此 Milvus 的接口及所有 SDK 的接口不需要改变,只需要在 params 中加入 "radius" 信息即可。

当指定 "radius" 则忽略 "limit" 设定。

By now, the behavior of  "search" in Milvus is returning TOPK most similar sorted results for each queried vector.

This MEP is about to realize a functionality named "range search". User specifies a range scope -- including radius and range filter (optional), Milvus does "range search", and also returns TOPK sorted results with distance falling in this range scope.

  • param "radius" MUST be set, param "range filter" is optional
  • falling in range scope means
Metric typeBehavior
IPradius < distance <= range_filter
L2 and other metric typesrange_filter <= distance < radius

Motivation(required)

Many users request the "range search" functionality. With this capability, user can

  • get results with distance falling in a range scope
  • get search results more than 16384

Public Interfaces(optional)

  • No interface change in Milvus and all SDKs

We reuse the SDK interface Search() to do "range search". Only add 2 more parameters "radius" and "range_filter" into params.

When param "radius" is specified, Milvus does range search; otherwise, Milvus does search.

As shown in the following figure, set "range_filter": 1.0, "radius": 2.0 in search_ params.params.

...

Code Block
languagepy
  default_index = {"index_type": "HNSW", "params":{"M": 48, "efConstruction": 500}, "metric_type": "L2"}
  collection.create_index("float_vector", default_index)
  collection.load()  
  search_params = {"metric_type": "L2", "params": {"ef": 32, "range_filter": 1.0, "radius": 8882.0}}
  res = collection.search(vectors[:nq], "float_vector", search_params, limit, "int64 >= 0")

...

  • 3 new interfaces in knowhere
Code Block
languagecpp
  // range search parameter legacy check
  virtual bool
  CheckRangeSearch(Config& cfg, const IndexType type, const IndexMode mode);
  
  // range search
  virtual DatasetPtr
  QueryByRange(const DatasetPtr& dataset, const Config& config, const faiss::BitsetView bitset);

  // brute force range search
  static DatasetPtr
  BruteForce::RangeSearch(const DatasetPtr base_dataset, const DatasetPtr query_dataset, const Config& config, const faiss::BitsetView bitset);

Design Details(required)

Describe the new thing you want to do in appropriate detail. This may be fairly extensive and have large subsections of its own. Or it may be a few sentences. Use judgment based on the scope of the change.

下图为一个 Search 请求从 SDK 到 SEGCORE 的完整调用栈,range search 完全复用该调用栈,不需要做任何改动。range search 只需要将参数 "radius" 在 SDK 端放入查询参数 "param" 中。

Image Removed

下图为 SEGCORE 内部向量查询时的调用栈示意图,黑色部分为现已实现的功能。

对于 sealed segment,要实现 range search 功能需要 knowhere 提供 QueryByRange 功能;

对于 growing segment 时,由于没有创建 index ,所以无法用 knowhere IDMAP 实现的暴搜功能,只能自己额外实现了暴搜的全套逻辑。若要实现 range search,需要另外实现红色部分所示的功能函数。

另一解决方案是,knowhere 提供一种新的 IDMAP 索引,该索引不需要插入向量数据,只需要指定向量数据的内存地址。growing segment 在查询时可临时生成该种索引,继而调用 IDMAP 提供的 Query & QueryByRange 接口,该索引在用完后也即刻销毁。此方案没有额外的内存开销,但是否可行需进一步调研。

Image Removed

查询结果的处理

Query 返回的查询结果有2种,一种为 SubSearchResult,用于存放 segment 中每个 chunk 的查询结果。

需要添加成员变量 "is_range_search_" 和 "lims_",以及成员函数 "merge_range()" 和 "merge_range_impl()"。

Code Block
languagecpp
class SubSearchResult {
    ... ...
    static SubSearchResult
    merge(const SubSearchResult& left, const SubSearchResult& right);

    void
    merge(const SubSearchResult& sub_result);

    static SubSearchResult
    merge_range(const SubSearchResult& left, const SubSearchResult& right);		// <==== new added

    void
    merge_range(const SubSearchResult& sub_result);								// <==== new added

 private:
    template <bool is_desc>
    void
    merge_impl(const SubSearchResult& sub_result);

    void
    merge_range_impl(const SubSearchResult& sub_result);						// <==== new added

 private:
    bool is_range_search_;				// <==== new added
    int64_t num_queries_;
    int64_t topk_;
    int64_t round_decimal_;
    MetricType metric_type_;
    std::vector<int64_t> seg_offsets_;
    std::vector<float> distances_;
    std::vector<size_t> lims_;			// <==== new added

另一种为 SearchResult,用于存放该 segment 的查询结果。

需要添加成员变量 "is_range_search_" 和 "lims_"。

Code Block
languagecpp
struct SearchResult {
    ... ...
 public:
    int64_t total_nq_;
    int64_t unity_topK_;
    void* segment_;

    // first fill data during search, and then update data after reducing search results
	bool is_range_search_;				// <==== new added
    std::vector<float> distances_;
    std::vector<int64_t> seg_offsets_;
    std::vector<size_t> lims_;			// <==== new added
    ... ...
};

在 Query node 中还需要实现多个 segment 之间的 reduce_range();

在 Proxy 中还需要实现多个 SearchResult 之间的 reduce_range();

Knowhere

Index types and metric types supporting range search are listed below:


IPL2HAMMINGJACCARDTANIMOTOSUBSTRUCTURESUPERSTRUCTURE
BIN_IDMAP

  •   
  •   
  •   


BIN_IVF_FLAT

  •   
  •   
  •   


IDMAP
  •   
  •   





IVF_FLAT
  •   
  •   





IVF_PQ
  •   
  •   





IVF_SQ8
  •   
  •   





HNSW
  •   
  •   





ANNOY






DISKANN
  •  
  •   





If call range search API with unsupported index types or unsupported metric types, exception will be thrown out.

In Knowhere, two new parameters "radius" and "range_filter" are passed in via config, and range search will return all un-sorted results with distance falling in this range scope.

Metric typeBehavior
IPradius < distance <= range_filter
L2 or other metric typesrange_filter <= distance < radius

Knowhere run range search in 2 steps:

  1. pass param "radius" to thirdparty to call their range search APIs, and get result
  2. if param "range_filter " is set, filter above result and return; if not, return above result directly


We add 3 new APIs CheckRangeSearch(), QueryByRange() and BruteForce::RangeSearch() to support range search.

  1. CheckRangeSearch()

This API is used to do range search parameter legacy check. It will throw exception when parameter legacy check fail.

The valid scope for "radius" is defined as: -1.0 <= radius <= float_max

metric typerangesimilarnot similar
L2[0, inf)0inf
IP[-1, 1]1-1
jaccard[0, 1]01
tanimoto[0, inf)0inf
hamming[0, n]0n

2. QueryByRange()

This API returns all unsorted results with distance falling in the specified range scope.

PROTO
virtual DatasetPtr
QueryByRange(const DatasetPtr& dataset, const Config& config, const faiss::BitsetView bitset)

INPUT

Dataset {
    knowhere::meta::TENSOR: -   // query data
    knowhere::meta::ROWS: -      // rows of queries
    knowhere::meta::DIM: -          // dimension
}

Config {

    knowhere::meta::RADIUS: -   

    knowhere::meta::RANGE_FILTER: -   

}

OUTPUT

Dataset {
    knowhere::meta::IDS: -                // result IDs with length LIMS[nq]
    knowhere::meta::DISTANCE: -  // result DISTANCES with length LIMS[nq]
    knowhere::meta::LIMS: -            // result offset prefix sum with length nq + 1
}

LIMS is with length "nq+1", it's the offset prefix sum for result IDS and result DISTANSE. The length of IDS and DISTANCE are the same but variable.

Suppose N queried vectors are with label: {0, 1, 2, ..., n-1}

The result counts for each queried vectors are: {r(0), r(1), r(2), ..., r(n-1)}

Then the data in LIMS will be like this: {0, r(0), r(0)+r(1), r(0)+r(1)+r(2), ..., r(0)+r(1)+r(2)+...+r(n-1)}

The total range search result num is: LIMS[nq]

The range search result for each query vector is: IDS[lims[n], lims[n+1]) and DISTANCE[lims[n], lims[n+1])


The memory used for IDS, DISTANCE and LIMS are allocated in Knowhere, they will be auto-freed when Dataset deconstruction.

3. BruteForce::RangeSearch()

This API does range search for no-index dataset, it returns all unsorted results with distance "better than radius" (for IP: > radius; for others: < radius).

PROTO
static DatasetPtr
RangeSearch(const DatasetPtr base_dataset,
const DatasetPtr query_dataset,
const Config& config,
const faiss::BitsetView bitset);

INPUT

Dataset {
    knowhere::meta::TENSOR: -   // base data
    knowhere::meta::ROWS: -      // rows of base data
    knowhere::meta::DIM: -          // dimension
}

Dataset {
    knowhere::meta::TENSOR: -   // query data
    knowhere::meta::ROWS: -      // rows of queries
    knowhere::meta::DIM: -          // dimension
}

Config {

    knowhere::meta::RADIUS: -   

    knowhere::meta::RANGE_FILTER: -   

}

OUTPUT

Dataset {
    knowhere::meta::IDS: -                // result IDs with length LIMS[nq]
    knowhere::meta::DISTANCE: -  // result DISTANCES with length LIMS[nq]
    knowhere::meta::LIMS: -            // result offset prefix sum with length nq + 1
}

The output is as same as QueryByRange().

Segcore

Segcore search flow will be updated as this flow chart, range search related change is marked RED.

Image Added

Segcore uses radius parameter's existence to decide whether to run search, or to run range search.

  • when param "radius" is set, range search is called; 
  • otherwise, search is called. 

For API query::SearchOnSealedIndex() and BruteForceSearch(), they do like following:

  1. pass radius parameters (radius / range_filter) to knowhere 
  2. get all unsorted range search result from knowhere
  3. for each NQ's results, do heap-sort
  4. return result Dataset with TOPK results

Whatever do range search or search, the output structure are same:

  • query::SearchOnSealedIndex() => SearchResult
  • BruteForceSearch() => SubSearchResult

Both SearchResult and SubSearchResult contain TOPK sorted result for each NQ.

  • seg_offsets_: with length "nq * topk", -1 is filled in when no enough result data
  • distances_: with lengh "nq * topk", data is undefined when no enough result data

Milvus

Range search completely reuses the call stack from SDK to segcore.

Image Added


How to get more than 16384 range search results ?

With this solution, user can get maximum 16384 range search results in one call.

If user wants to get more than 16384 range search results, they can call range search multiple times with different range_filter parameter (use L2 as an example)

  • NQ = 1

1st call with (range_filter = 0.0, radius = inf), get result distances like this:

{d(0), d(1), d(2), ..., d(n-1)}

2nd call with (range_filter = d(n-1), radius = inf), get result distances like this:

{d(n), d(n+1), d(n+2), ..., d(2n-1)}

3rd call with (range_filter = d(2n-1), radius = inf), get result distances like this:

{d(2n), d(2n+1), d(2n+2), ..., d(3n-1)}

...

  • NQ > 1 (suppose NQ = 2)

1st call with (range_filter = 0.0, radius = inf), get result distances like this: 

{d(0,0), d(0,1), d(0,2), ..., d(0,n-1), d(1,0), d(1,1), d(1,2), ..., d(1,n-1)}

2nd call with (range_filter = min{d(0,n-1), d(1,n-1)}, radius = inf), get result distances like this: 

{d(0,n), d(0,n+1), d(0,n+2), ..., d(0,2n-1), d(1,n), d(1,n+1), d(1,n+2), ..., d(1,2n-1)}

3rd call with (range_filter = min{d(0,2n-1), d(1,2n-1)}, radius = inf), get result distances like this:

{d(0,2n), d(0,2n+1), d(0,2n+2), ..., d(0,3n-1), d(1,2n), d(1,2n+1), d(1,2n+2), ..., d(1,3n-1)}

...

The result of each iteration will have some duplication with the result of previous iteration, user need do duplication check and remove them.

...


Compatibility, Deprecation, and Migration Plan(optional)

  • What impact (if any) will there be on existing users?
  • If we are changing behaviors how will we phase out the older behavior?
  • If we need special migration tools, describe them here.
  • When will we remove the existing behavior?

Test Plan(required)

Describe in a few sentences how the MEP will be tested. We are mostly interested in system tests (since unit tests are specific to implementation details). How will we know that the implementation works as expected? How will we know nothing broke?

Rejected Alternatives(optional)

If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other ways.

References(optional)

This is a new functionality, there is no compatibility issue.

Test Plan(required)

Knowhere

  1. Add new unittest
  2. Add benchmark to test range search runtime and recall
  3. Add benchmark to test range search QPS

There is no public dataset for range search. I have created range search data set based on sift1M and glove200.

You can find them in NAS:

  1. test/milvus/ann_hdf5/binary/sift-4096-hamming-range.hdf5
    1. base dataset and query dataset are identical with sift1m
    2. radius = 291.0
    3. result length for each nq is different
    4. total result num 1,063,078
  2. test/milvus/ann_hdf5/sift-128-euclidean-range.hdf5
    1. base dataset and query dataset are identical with sift1m
    2. radius = 186.0 * 186.0
    3. result length for each nq is different
    4. total result num 1,054,377
  3. test/milvus/ann_hdf5/sift-128-euclidean-range-multi.hdf5
    1. base dataset and query dataset are identical with sift1m
    2. ground truth IDs and Distances are identical with sift1m
    3. each nq's radius is set to the last ground truth distance
    4. result length for each nq is 100
    5. total result num 1,000,000
  4. test/milvus/ann_hdf5/glove-200-angular-range.hdf5
    1. base dataset and query dataset are identical with glove200
    2. radius = 0.52
    3. result length for each nq is different
    4. total result num 1,060,888

Segcore

  1. Add new range search unittest

Milvus

  1. use sift1M/glove200 dataset to test range search (radius = max_float / -1.0), we expect to get identical results as search


Rejected Alternatives(optional)

The previous proposal of this MEP is let range search return all results with distances better than a "radius".

The project implementation of the previous proposal is too complicated to achieve comparing with current proposal.

Adv.

  1. Get all range search result in one call

Cons:

  1. Need implement Merge operation for chunk, segment, query node and proxy
  2. Memory explosion caused by Merge
  3. Many API modification caused by invalid topk parameter

Others

Because the result length of range search from knowhere is variable, knowhere plan to afford another API to return the range search result count for each NQ.

If there is user request to get all range search result in one call, query node team will afford another solution to save range search output of knowhere to S3 directly.

References(optional)

  1. Knowhere Range Search Algorithm Introduction
  2. Range Search Introduction

...