Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  1.  querynode reads segment's binlog files from Minio, and saves them into structure Blob (internal/querynode/segment_loader.go::loadSegmentFieldsData)

    Code Block
    type Blob struct {
        Key   string    // binlog file path
        Value []byte    // binlog file data
    }

    The data in Blob is deserialized, raw-data in it is saved into structure InsertData

  2.  querynode invokes search engine to get SearchResult (internal/query_node/query_collection.go::search)

    Code Block
    languagecpp
    // internal/core/src/common/Types.h
    struct SearchResult {
     ...
     public:
        int64_t num_queries_;
        int64_t topk_;
        std::vector<float> result_distances_;
    ​
     public:
        void* segment_;
        std::vector<int64_t> internal_seg_offsets_;
        std::vector<int64_t> result_offsets_;
        std::vector<std::vector<char>> row_data_;
    };


    At this time, only "result_distances_" and "internal_seg_offsets_" of "SearchResult" are filled into data.

    querynode reduces all SearchResult returned by segment, fetches all other fields' data, and saves them into "row_data_" with row-based format. (internal/query_node/query_collection.go::reduceSearchResultsAndFillData)

  3. querynode organizes SearchResult again, and save them into structure milvus.Hits

    Code Block
    // internal/proto/milvus.proto
    message Hits {
      repeated int64 IDs = 1;
      repeated bytes row_data = 2;
      repeated float scores = 3;
    }
  4.  Row-based data saved in milvus.Hits is converted to column-based data, and saved into schemapb.SearchResultData (internal/query_node/query_collection.go::translateHits)

    Code Block
    // internal/proto/schema.proto
    message SearchResultData {
      int64 num_queries = 1;
      int64 top_k = 2;
      repeated FieldData fields_data = 3;
      repeated float scores = 4;
      IDs ids = 5;
      repeated int64 topks = 6;
    }
  5.  schemapb.SearchResultData is serialized, encapsulated as internalpb.SearchResults, saved into SearchResultMsg, and send into pulsar channel (internal/query_node/query_collection.go::search)

    Code Block
    // internal/proto/internal.proto
    message SearchResults {
      common.MsgBase base = 1;
      common.Status status = 2;
      string result_channelID = 3;
      string metric_type = 4;
      repeated bytes hits = 5;  // search result data
    ​
      // schema.SearchResultsData inside
      bytes sliced_blob = 9;
      int64 sliced_num_count = 10;
      int64 sliced_offset = 11;
    ​
      repeated int64 sealed_segmentIDs_searched = 6;
      repeated string channelIDs_searched = 7;
      repeated int64 global_sealed_segmentIDs = 8;
    }
  6.  Proxy collects all SearchResultMsg from querynodes, gets schemapb.SearchResultData by deserialization, then gets milvuspb.SearchResults by reducing, finally send back to SDK visa gRPC. (internal/proxy/task.go::SearchTask::PostExecute)

    Code Block
    // internal/proto/milvus.proto
    message SearchResults {
      common.Status status = 1;
      schema.SearchResultData results = 2;
    }
  7.  SDK receives milvuspb.SearchResult


In above 2 data flows, we can see frequent format conversion between column-based data and row-based data (marked as RED dashed line).

If we use Arrow as all in-memory data format, we can:

  • omit the serialization and deserialization between SDK and proxy
  • remove all format conversion between column-based data and row-based data
  • use Parquet as binlog file format, and write from arrow data directly

Public Interfaces(optional)

...