Current state: "Under Discussion"

PRs:

Keywords: arrow/column-based/row-based

Released: Milvus 2.0

Summary

Milvus 2.0 is a cloud-native, multi-language vector database that uses gRPC and Pulsar for communication between the SDKs and its components.

Because of the data volume involved, especially when inserting data and returning search results, Milvus spends a lot of CPU cycles on serialization and deserialization.

In this enhancement proposal, we suggest adopting Apache Arrow as the Milvus in-memory data format. In the big data field, Apache Arrow has become the de facto standard for in-memory analytics: it specifies a standardized, language-independent columnar memory format.
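The following minimal, standard-library-only Go sketch makes "columnar memory format" concrete. It models two Arrow layouts that matter for Milvus:

- a nullable Int64 column, stored as a validity bitmap plus a contiguous values buffer;
- a FixedSizeList<float32, dim> column for vectors, whose child values are stored back to back.

All type and function names are hypothetical. The sketch follows the layout rules of the Arrow columnar format but ignores buffer padding, alignment and array offsets. It does not use the Arrow Go library.

```go
package main

import "fmt"

// int64Column is a simplified model of a nullable Arrow Int64 array.
// Real Arrow buffers are also padded and aligned; this sketch ignores that.
type int64Column struct {
	validity []byte  // bit i set => slot i is non-null (least-significant bit first)
	values   []int64 // one 8-byte slot per row, including null slots
}

// newInt64Column builds the validity bitmap from a parallel slice of flags;
// valid must have the same length as vals.
func newInt64Column(vals []int64, valid []bool) int64Column {
	c := int64Column{validity: make([]byte, (len(vals)+7)/8), values: vals}
	for i, ok := range valid {
		if ok {
			c.validity[i/8] |= 1 << (uint(i) % 8)
		}
	}
	return c
}

// isValid reports whether slot i holds a value.
func (c int64Column) isValid(i int) bool {
	return c.validity[i/8]&(1<<(uint(i)%8)) != 0
}

// vectorRow returns row i of a FixedSizeList<float32, dim> column. The child
// values are stored back to back, so no per-row offsets are needed.
func vectorRow(values []float32, dim, i int) []float32 {
	return values[i*dim : (i+1)*dim]
}

func main() {
	col := newInt64Column([]int64{10, 0, 30}, []bool{true, false, true})
	for i, v := range col.values {
		if col.isValid(i) {
			fmt.Println(i, v)
		} else {
			fmt.Println(i, "null")
		}
	}
	fmt.Println(vectorRow([]float32{0, 0, 1, 1, 2, 2}, 2, 1))
}
```

Because a whole column sits in a few contiguous buffers, it can be handed between processes or languages without being rebuilt value by value. This is the property the proposal relies on.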

Motivation(required)

From a data perspective, Milvus has two main data flows:

Insert data flow
Search data flow

Insert Data Flow

[Figure: insert data flow. BLUE: column-based data structure; ORANGE: row-based data structure; RED DASHED LINE: data format conversion]

The insert data flow includes the following steps:

pymilvus creates an insert request of type milvuspb.InsertRequest (client/prepare.py::bulk_insert_param)

```
// grpc-proto/milvus.proto
message InsertRequest {
  common.MsgBase base = 1;
  string db_name = 2;
  string collection_name = 3;
  string partition_name = 4;
  repeated schema.FieldData fields_data = 5;	// fields' data
  repeated uint32 hash_keys = 6;
  uint32 num_rows = 7;
}
```

Data is stored in fields_data column by column. schemapb.FieldData is defined as follows:

```
// grpc-proto/schema.proto
message ScalarField {
  oneof data {
    BoolArray bool_data = 1;
    IntArray int_data = 2;
    LongArray long_data = 3;
    FloatArray float_data = 4;
    DoubleArray double_data = 5;
    StringArray string_data = 6;
    BytesArray bytes_data = 7;
  }
}

message VectorField {
  int64 dim = 1;
  oneof data {
    FloatArray float_vector = 2;
    bytes binary_vector = 3;
  }
}

message FieldData {
  DataType type = 1;
  string field_name = 2;
  oneof field {
    ScalarField scalars = 3;
    VectorField vectors = 4;
  }
  int64 field_id = 5;
}
```
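In the column-based layout, each FieldData carries one whole column. A float vector column stores all vectors concatenated, with `dim` floats per row, so every column must describe the same number of rows.

The minimal Go sketch below checks that invariant and derives the row count that `num_rows` should carry. It assumes simplified, hypothetical stand-ins (`longColumn`, `floatVectorColumn`) for the LongArray and FloatArray payloads rather than the generated protobuf types.

```go
package main

import (
	"errors"
	"fmt"
)

// longColumn and floatVectorColumn are hypothetical Go stand-ins for the
// LongArray and FloatArray payloads carried by schemapb.FieldData.
type longColumn struct {
	FieldName string
	Data      []int64
}

type floatVectorColumn struct {
	FieldName string
	Dim       int
	Data      []float32 // all vectors concatenated: row i is Data[i*Dim:(i+1)*Dim]
}

// rowCount checks that every column describes the same number of rows and
// returns that number, i.e. the value InsertRequest.num_rows should carry.
func rowCount(longs []longColumn, vecs []floatVectorColumn) (int, error) {
	n := -1
	check := func(name string, rows int) error {
		if n == -1 {
			n = rows
		} else if rows != n {
			return fmt.Errorf("field %q has %d rows, want %d", name, rows, n)
		}
		return nil
	}
	for _, c := range longs {
		if err := check(c.FieldName, len(c.Data)); err != nil {
			return 0, err
		}
	}
	for _, v := range vecs {
		if v.Dim <= 0 || len(v.Data)%v.Dim != 0 {
			return 0, fmt.Errorf("field %q: %d floats do not split into vectors of dim %d", v.FieldName, len(v.Data), v.Dim)
		}
		if err := check(v.FieldName, len(v.Data)/v.Dim); err != nil {
			return 0, err
		}
	}
	if n == -1 {
		return 0, errors.New("no fields")
	}
	return n, nil
}

func main() {
	longs := []longColumn{{FieldName: "field_val", Data: []int64{1, 2, 3}}}
	vecs := []floatVectorColumn{{FieldName: "field_vector", Dim: 2, Data: []float32{1, 1, 2, 2, 3, 3}}}
	n, err := rowCount(longs, vecs)
	fmt.Println(n, err)
}
```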

milvuspb.InsertRequest is serialized and sent via gRPC.
The proxy receives the milvuspb.InsertRequest, creates an InsertTask for it, and adds the task to the execution queue (internal/proxy/impl.go::Insert).

When the InsertTask is executed, the column-based data stored in InsertTask.req is converted to row-based format and saved into another internal message of type internalpb.InsertRequest (internal/proxy/task.go::transferColumnBasedRequestToRowBasedData).

```
// internal/proto/internal.proto
message InsertRequest {
  common.MsgBase base = 1;
  string db_name = 2;
  string collection_name = 3;
  string partition_name = 4;
  int64 dbID = 5;
  int64 collectionID = 6;
  int64 partitionID = 7;
  int64 segmentID = 8;
  string channelID = 9;
  repeated uint64 timestamps = 10;
  repeated int64 rowIDs = 11;
  repeated common.Blob row_data = 12;  // row-based data
}
```

A rowID and a timestamp are added for each row.

The proxy encapsulates the internalpb.InsertRequest into an InsertMsg and sends it to a Pulsar channel.
The datanode receives the InsertMsg from the Pulsar channel and restores the data to column-based format in an InsertData structure (internal/datanode/flow_graph_insert_buffer_node.go::insertBufferNode::Operate).
```
type InsertData struct {
    Data  map[FieldID]FieldData // field id to field data
    Infos []BlobInfo
}
```
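On the datanode side, the same data has to be pulled apart again. The Go sketch below reverses the conversion, walking every row blob and appending each field's value to that field's column.

It reuses the hypothetical [int64][dim x float32] little-endian row layout, which is an assumption rather than Milvus's real encoding. `insertData` is a simplified stand-in for the datanode's InsertData. The field IDs 100 and 101 are arbitrary example values.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// insertData is a simplified, hypothetical stand-in for the datanode's
// InsertData: field ID -> column of values.
type insertData struct {
	Longs   map[int64][]int64
	Vectors map[int64][]float32
}

// encodeRow builds one row blob in the assumed [int64][dim x float32]
// little-endian layout; it only exists to feed the example.
func encodeRow(v int64, vec []float32) []byte {
	row := make([]byte, 8+4*len(vec))
	binary.LittleEndian.PutUint64(row[0:8], uint64(v))
	for j, f := range vec {
		binary.LittleEndian.PutUint32(row[8+4*j:], math.Float32bits(f))
	}
	return row
}

// rowsToColumns walks every row blob and appends each field's value to that
// field's column, restoring the column-based form.
func rowsToColumns(rows [][]byte, longFieldID, vecFieldID int64, dim int) (insertData, error) {
	if dim <= 0 {
		return insertData{}, fmt.Errorf("invalid dim %d", dim)
	}
	rowSize := 8 + 4*dim
	longs := make([]int64, 0, len(rows))
	vecs := make([]float32, 0, len(rows)*dim)
	for i, row := range rows {
		if len(row) != rowSize {
			return insertData{}, fmt.Errorf("row %d: got %d bytes, want %d", i, len(row), rowSize)
		}
		longs = append(longs, int64(binary.LittleEndian.Uint64(row[0:8])))
		for j := 0; j < dim; j++ {
			off := 8 + 4*j
			vecs = append(vecs, math.Float32frombits(binary.LittleEndian.Uint32(row[off:off+4])))
		}
	}
	return insertData{
		Longs:   map[int64][]int64{longFieldID: longs},
		Vectors: map[int64][]float32{vecFieldID: vecs},
	}, nil
}

func main() {
	rows := [][]byte{encodeRow(5, []float32{1.5, 2.5}), encodeRow(6, []float32{3.5, 4.5})}
	data, err := rowsToColumns(rows, 100, 101, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(data.Longs[100], data.Vectors[101])
}
```

Together with the proxy-side conversion, every inserted value is copied twice just to change its layout. The proposal aims to remove these two copies.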
InsertData is written to MinIO in Parquet format (internal/datanode/flow_graph_insert_buffer_node.go::flushSegment).

Search Data Flow

[Figure: search data flow. BLUE: column-based data structure; ORANGE: row-based data structure; RED DASHED LINE: data format conversion]

The search data flow includes the following steps:

The querynode reads the segment's binlog files from MinIO and saves them into Blob structures (internal/querynode/segment_loader.go::loadSegmentFieldsData).
```
type Blob struct {
    Key   string    // binlog file path
    Value []byte    // binlog file data
}
```
The data in each Blob is deserialized, and the raw data is saved into an InsertData structure.
The querynode invokes the search engine to get a SearchResult (internal/query_node/query_collection.go::search).
```
// internal/core/src/common/Types.h
struct SearchResult {
 ...
 public:
    int64_t num_queries_;
    int64_t topk_;
    std::vector<float> result_distances_;

 public:
    void* segment_;
    std::vector<int64_t> internal_seg_offsets_;
    std::vector<int64_t> result_offsets_;
    std::vector<std::vector<char>> row_data_;
};
```
At this point, only "result_distances_" and "internal_seg_offsets_" of "SearchResult" are filled in.
The querynode reduces the SearchResults returned by all segments, fetches the data of all other fields, and saves it into "row_data_" in row-based format (internal/query_node/query_collection.go::reduceSearchResultsAndFillData).
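The reduce step keeps only the best candidates across segments. Only those surviving (segment, offset) pairs then need their other fields fetched into "row_data_".

The minimal Go sketch below shows one way to do this for a single query. It assumes a smaller distance is better (as with L2) and uses a hypothetical `segmentHit` type. It concatenates and sorts, which is simple but O(n log n). A k-way heap merge over the already sorted per-segment lists is the usual faster alternative. The sketch is not the segcore implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// segmentHit is a hypothetical record of one candidate for one query: the
// segment it came from, its offset in that segment (cf. internal_seg_offsets_)
// and its distance (cf. result_distances_).
type segmentHit struct {
	SegmentID int64
	Offset    int64
	Distance  float32
}

// reduceTopK merges the per-segment candidates of a single query into a
// global top-k. It assumes a smaller distance is better (e.g. L2); a
// similarity metric such as inner product would need the comparison flipped.
func reduceTopK(perSegment [][]segmentHit, k int) []segmentHit {
	if k <= 0 {
		return nil
	}
	var all []segmentHit
	for _, hits := range perSegment {
		all = append(all, hits...)
	}
	// A stable sort keeps ties in segment order, so the result is deterministic.
	sort.SliceStable(all, func(i, j int) bool { return all[i].Distance < all[j].Distance })
	if len(all) > k {
		all = all[:k]
	}
	return all
}

func main() {
	seg1 := []segmentHit{{SegmentID: 1, Offset: 10, Distance: 0.1}, {SegmentID: 1, Offset: 11, Distance: 0.7}}
	seg2 := []segmentHit{{SegmentID: 2, Offset: 3, Distance: 0.2}, {SegmentID: 2, Offset: 4, Distance: 0.5}}
	// Only these surviving (segment, offset) pairs need their other fields fetched.
	for _, h := range reduceTopK([][]segmentHit{seg1, seg2}, 3) {
		fmt.Println(h.SegmentID, h.Offset, h.Distance)
	}
}
```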

The querynode reorganizes the SearchResult and saves it into a milvus.Hits structure.

```
// internal/proto/milvus.proto
message Hits {
  repeated int64 IDs = 1;
  repeated bytes row_data = 2;
  repeated float scores = 3;
}
```

The row-based data in milvus.Hits is converted to column-based data and saved into schemapb.SearchResultData (internal/query_node/query_collection.go::translateHits).

```
// internal/proto/schema.proto
message SearchResultData {
  int64 num_queries = 1;
  int64 top_k = 2;
  repeated FieldData fields_data = 3;
  repeated float scores = 4;
  IDs ids = 5;
  repeated int64 topks = 6;
}
```
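SearchResultData holds the hits of all queries in flat, column-based arrays. A consumer has to cut those arrays back into per-query groups.

The Go sketch below does that under these assumptions:

- hits are stored query by query;
- `topks[q]` is the number of hits actually returned for query `q`, which may be fewer than `top_k`;
- the `IDs` message is simplified to a plain `[]int64`.

The function name `splitByQuery` is hypothetical.

```go
package main

import "fmt"

// splitByQuery cuts the flattened scores and IDs of a SearchResultData-like
// result into one group per query, assuming topks[q] hits per query q.
func splitByQuery(scores []float32, ids []int64, topks []int64) ([][]float32, [][]int64, error) {
	if len(scores) != len(ids) {
		return nil, nil, fmt.Errorf("%d scores but %d ids", len(scores), len(ids))
	}
	outScores := make([][]float32, 0, len(topks))
	outIDs := make([][]int64, 0, len(topks))
	var offset int64
	for q, n := range topks {
		end := offset + n
		if n < 0 || end > int64(len(scores)) {
			return nil, nil, fmt.Errorf("query %d: topk %d overflows %d hits", q, n, len(scores))
		}
		// Sub-slices share the flat arrays, so no values are copied.
		outScores = append(outScores, scores[offset:end])
		outIDs = append(outIDs, ids[offset:end])
		offset = end
	}
	if offset != int64(len(scores)) {
		return nil, nil, fmt.Errorf("topks cover %d hits, but %d were returned", offset, len(scores))
	}
	return outScores, outIDs, nil
}

func main() {
	// Two queries: the first returned 2 hits, the second only 1.
	scores, ids, err := splitByQuery([]float32{0.1, 0.4, 0.2}, []int64{7, 9, 3}, []int64{2, 1})
	if err != nil {
		panic(err)
	}
	for q := range scores {
		fmt.Println("query", q, ids[q], scores[q])
	}
}
```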

schemapb.SearchResultData is serialized, encapsulated in internalpb.SearchResults, saved into a SearchResultMsg, and sent to a Pulsar channel (internal/query_node/query_collection.go::search).

```
// internal/proto/internal.proto
message SearchResults {
  common.MsgBase base = 1;
  common.Status status = 2;
  string result_channelID = 3;
  string metric_type = 4;
  repeated bytes hits = 5;  // search result data

  // schema.SearchResultsData inside
  bytes sliced_blob = 9;
  int64 sliced_num_count = 10;
  int64 sliced_offset = 11;

  repeated int64 sealed_segmentIDs_searched = 6;
  repeated string channelIDs_searched = 7;
  repeated int64 global_sealed_segmentIDs = 8;
}
```

The proxy collects all SearchResultMsgs from the querynodes, deserializes them into schemapb.SearchResultData, reduces them into milvuspb.SearchResults, and finally sends the result back to the SDK via gRPC (internal/proxy/task.go::SearchTask::PostExecute).
```
// internal/proto/milvus.proto
message SearchResults {
  common.Status status = 1;
  schema.SearchResultData results = 2;
}
```
The SDK receives the milvuspb.SearchResults.

In the two data flows above, data is frequently converted between column-based and row-based formats (marked with RED dashed lines).

If we use Arrow as the in-memory data format throughout, we can:

omit the serialization and deserialization between the SDK and the proxy
remove all format conversions between column-based and row-based data
use Parquet as the binlog file format and write it directly from Arrow data

Public Interfaces(optional)

We will use Arrow to replace all in-memory data formats used in Insert/Search/Query.

The changes to the proto definitions used for communication between the SDK and the proxy are listed below:

Proto changed for Insert

```
// internal/proto/milvus.proto
message InsertRequest {
  ...
- repeated schema.FieldData fields_data = 5;
+ bytes record_batch = 5;
  ...
}
```

Proto changed for Search

```
// internal/proto/milvus.proto
message Hits {
  ...
- repeated bytes row_data = 2;
+ bytes record_batch = 2;
  ...
}
```

```
// internal/proto/schema.proto
message SearchResultData {
  ...
- repeated FieldData fields_data = 3;
+ bytes record_batch = 3;
  ...
}
```

Proto changed for Query

```
// internal/proto/milvus.proto
message QueryResults {
  ...
- repeated schema.FieldData fields_data = 2;
+ bytes record_batch = 2;
}
```

Design Details(required)

We divide this MEP into two stages. All changes that affect compatibility will be completed in Stage 1, before Milvus 2.0.0; other internal changes can be deferred.

Stage 1

Update InsertRequest in milvus.proto, change Insert to use Arrow format
Update SearchRequest/Hits in milvus.proto, and SearchResultData in schema.proto, change Search to use Arrow format
Update QueryResults in milvus.proto, change Query to use Arrow format

Stage 2

Update the storage module to use Go Arrow to write Parquet directly from Arrow and read Arrow directly from Parquet, and remove the C++ Arrow dependency.
Remove all internal row-based data structures, including "RowData" in internalpb.InsertRequest, "row_data" in milvuspb.Hits, and "row_data_" in the C++ SearchResult.
Optimize the search result flow.

Insert Data Flow

| Task | Current Behavior | Proposal |
|---|---|---|
| 1 | SDK (python/go/JS) sends InsertRequest with fields_data | Update all SDKs (python/go/JS) to write inserted data into an Arrow record, then send InsertRequest with the Arrow bytes |
| 2 | Proxy receives InsertRequest; if the autoID field is empty, it creates a field with generated IDs and inserts this field into fields_data | Proxy receives InsertRequest and decodes the Arrow record from the record_batch bytes; if the autoID field is empty, it re-creates the Arrow record filled with generated IDs |
| 3 | Convert column-based data to row-based data | No longer needed |
| 4 | Based on hash_key, save row_data into internal.InsertRequest, encapsulate it into InsertMsg, then send it to Pulsar | Based on hash_key, save row_data (obtained via Arrow array.NewSlice) into internal.InsertRequest, encapsulate it into InsertMsg, then send it to Pulsar |
| 5 | Datanode receives InsertMsg, converts row_data back to column-based data, then saves it to MinIO in Parquet format | Datanode receives InsertMsg, assembles the data slices into an Arrow table, then saves it to MinIO in Parquet format |
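Tasks 2 and 4 in the table are independent of the data format: the proxy may need to fill in primary keys, and it routes every row to a channel by its hash key.

The Go sketch below covers both steps. It is a hedged illustration, not Milvus code:

- `fillAutoIDs` stands in for the ID allocator and simply returns a consecutive range.
- `hashKey` uses 32-bit FNV-1a from the Go standard library, an illustrative choice that is not necessarily the hash Milvus uses for hash_keys.
- `groupRowsByChannel` only returns row offsets per channel. Materializing each group (row blobs today, Arrow slices in the proposal) is left out.

All three names are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// fillAutoIDs returns n consecutive IDs starting at base, standing in for the
// proxy's ID allocator when the autoID field is empty (Task 2).
func fillAutoIDs(base int64, n int) []int64 {
	if n <= 0 {
		return nil
	}
	ids := make([]int64, n)
	for i := range ids {
		ids[i] = base + int64(i)
	}
	return ids
}

// hashKey hashes an int64 primary key with 32-bit FNV-1a (illustrative only).
func hashKey(pk int64) uint32 {
	var buf [8]byte
	binary.LittleEndian.PutUint64(buf[:], uint64(pk))
	h := fnv.New32a()
	h.Write(buf[:]) // hash.Hash.Write never returns an error
	return h.Sum32()
}

// groupRowsByChannel returns, for each channel, the row offsets whose hash
// key maps to it (Task 4).
func groupRowsByChannel(pks []int64, numChannels int) [][]int {
	if numChannels <= 0 {
		return nil
	}
	groups := make([][]int, numChannels)
	for row, pk := range pks {
		ch := int(hashKey(pk) % uint32(numChannels))
		groups[ch] = append(groups[ch], row)
	}
	return groups
}

func main() {
	pks := fillAutoIDs(1000, 6)
	for ch, rows := range groupRowsByChannel(pks, 2) {
		fmt.Println("channel", ch, "rows", rows)
	}
}
```

Rows routed to the same channel are generally not contiguous in the batch. Producing per-channel Arrow data with slices therefore either works row by row or requires reordering the batch first.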

Search Data Flow

| Task | Current Behavior | Proposal |
|---|---|---|
| 1 | segcore gets SearchResult, reduces it, fills row_data_, re-organizes SearchResult and saves it to MarshaledHits (C++); MarshaledHits (C++) is serialized and copied in Go | segcore gets SearchResult, reduces it, fills row_data_, re-organizes SearchResult and saves it in Arrow format (C++); the Arrow-format search result is serialized and copied in Go |
| 2 | MarshaledHits is decoded and converted to column-based data saved in schema.SearchResultData | The serialized Arrow-format search result is saved into schema.SearchResultData |
| 3 | SearchResultData is serialized, saved into internal.SearchResults, encapsulated into SearchResultMsg, then sent to Pulsar | No change |
| 4 | Proxy receives SearchResultMsg, decodes SearchResultData, then reduces | Proxy receives SearchResultMsg, decodes SearchResultData, then reduces the Arrow-format search result |
| 5 | Send SearchResultData with fields_data to SDK | Send SearchResultData with Arrow bytes to SDK |

Test Plan(required)

Pass all CI flows

References(optional)

https://arrow.apache.org/docs/

Arrow Test Code (Go)

```
package arrowtest // hypothetical package name; use the package of your test file

import (
	"bytes"
	"fmt"
	"testing"

	"github.com/apache/arrow/go/arrow"
	"github.com/apache/arrow/go/arrow/array"
	"github.com/apache/arrow/go/arrow/ipc"
	"github.com/apache/arrow/go/arrow/memory"
)

const (
	_DIM = 4
)

var pool = memory.NewGoAllocator()

// CreateArrowSchema builds a schema with a float vector field
// (FixedSizeList of _DIM float32) and an int64 field.
func CreateArrowSchema() *arrow.Schema {
	fieldVector := arrow.Field{
		Name: "field_vector",
		Type: arrow.FixedSizeListOf(_DIM, arrow.PrimitiveTypes.Float32),
	}
	fieldVal := arrow.Field{
		Name: "field_val",
		Type: arrow.PrimitiveTypes.Int64,
	}
	schema := arrow.NewSchema([]arrow.Field{fieldVector, fieldVal}, nil)
	return schema
}

// CreateArrowRecord builds a record from column-based values; vValues must
// hold len(iValues)*_DIM floats. The caller owns the returned record.
func CreateArrowRecord(schema *arrow.Schema, iValues []int64, vValues []float32) array.Record {
	rb := array.NewRecordBuilder(pool, schema)
	defer rb.Release()

	rowNum := len(iValues)
	rb.Reserve(rowNum)
	for i, field := range rb.Schema().Fields() {
		switch field.Type.ID() {
		case arrow.INT64:
			vb := rb.Field(i).(*array.Int64Builder)
			vb.AppendValues(iValues, nil)
		case arrow.FIXED_SIZE_LIST:
			lb := rb.Field(i).(*array.FixedSizeListBuilder)
			// Every vector row is non-null.
			valid := make([]bool, rowNum)
			for j := range valid {
				valid[j] = true
			}
			lb.AppendValues(valid)
			vb := lb.ValueBuilder().(*array.Float32Builder)
			vb.AppendValues(vValues, nil)
		}
	}

	return rb.NewRecord()
}

// WriteArrowRecord serializes rec in the Arrow IPC stream format and releases rec.
func WriteArrowRecord(schema *arrow.Schema, rec array.Record) []byte {
	defer rec.Release()
	var buf bytes.Buffer

	// internal/arrdata/ioutil.go
	writer := ipc.NewWriter(&buf, ipc.WithSchema(schema), ipc.WithAllocator(pool))
	if err := writer.Write(rec); err != nil {
		panic(fmt.Sprintf("could not write record: %v", err))
	}
	// Close writes the end-of-stream marker; call it exactly once, before reading buf.
	if err := writer.Close(); err != nil {
		panic(fmt.Sprintf("could not close writer: %v", err))
	}

	return buf.Bytes()
}

// ReadArrowRecords decodes one record from each IPC blob and copies their
// column values into a single new record. The caller owns the returned record.
func ReadArrowRecords(schema *arrow.Schema, blobs [][]byte) array.Record {
	iValues := make([]int64, 0)
	vValues := make([]float32, 0)
	for _, blob := range blobs {
		reader, err := ipc.NewReader(bytes.NewReader(blob), ipc.WithSchema(schema), ipc.WithAllocator(pool))
		if err != nil {
			panic(fmt.Sprintf("create reader fail: %v", err))
		}

		// rec is owned by reader and stays valid until the next Read or reader.Release.
		rec, err := reader.Read()
		if err != nil {
			panic(fmt.Sprintf("read record fail: %v", err))
		}

		for _, col := range rec.Columns() {
			switch col.DataType().ID() {
			case arrow.INT64:
				arr := col.(*array.Int64)
				iValues = append(iValues, arr.Int64Values()...)
			case arrow.FIXED_SIZE_LIST:
				arr := col.(*array.FixedSizeList).ListValues().(*array.Float32)
				vValues = append(vValues, arr.Float32Values()...)
			}
		}
		// The values were copied above, so the reader (and its record) can go.
		reader.Release()
	}
	ret := CreateArrowRecord(schema, iValues, vValues)
	ShowArrowRecord(ret)

	return ret
}

// ReadArrowRecordsToTable decodes one record from each IPC blob and assembles
// them into a table without copying column data. The caller owns the table.
func ReadArrowRecordsToTable(schema *arrow.Schema, blobs [][]byte) array.Table {
	recs := make([]array.Record, 0, len(blobs))
	for _, blob := range blobs {
		reader, err := ipc.NewReader(bytes.NewReader(blob), ipc.WithSchema(schema), ipc.WithAllocator(pool))
		if err != nil {
			panic(fmt.Sprintf("create reader fail: %v", err))
		}

		rec, err := reader.Read()
		if err != nil {
			panic(fmt.Sprintf("read record fail: %v", err))
		}
		// Retain rec so it outlives the reader, which releases its current record.
		rec.Retain()
		recs = append(recs, rec)
		reader.Release()
	}
	// The table retains the column chunks it uses, so the records can be released.
	table := array.NewTableFromRecords(schema, recs)
	for _, rec := range recs {
		rec.Release()
	}
	ShowArrowTable(table)

	return table
}

func ShowArrowRecord(rec array.Record) {
	fmt.Printf("\n=============================\n")
	fmt.Printf("Schema: %v\n", rec.Schema())
	fmt.Printf("NumCols: %v\n", rec.NumCols())
	fmt.Printf("NumRows: %v\n", rec.NumRows())
	for i, col := range rec.Columns() {
		fmt.Printf("Column[%d] %q: %v\n", i, rec.ColumnName(i), col)
	}
}

func ShowArrowTable(tbl array.Table) {
	fmt.Printf("\n=============================\n")
	fmt.Printf("Schema: %v\n", tbl.Schema())
	fmt.Printf("NumCols: %v\n", tbl.NumCols())
	fmt.Printf("NumRows: %v\n", tbl.NumRows())
	for i := 0; i < int(tbl.NumCols()); i++ {
		col := tbl.Column(i)
		fmt.Printf("Column[%d] %s: %v\n", i, tbl.Schema().Field(i).Name, col.Data().Chunks())
	}
}

func TestArrowIPC(t *testing.T) {
	schema := CreateArrowSchema()
	rec0 := CreateArrowRecord(schema, []int64{0}, []float32{0, 0, 0, 0})
	rec1 := CreateArrowRecord(schema, []int64{1, 2, 3}, []float32{1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3})
	blob0 := WriteArrowRecord(schema, rec0)
	blob1 := WriteArrowRecord(schema, rec1)
	ReadArrowRecords(schema, [][]byte{blob0, blob1}).Release()
	ReadArrowRecordsToTable(schema, [][]byte{blob0, blob1}).Release()
}
```

MEP 13 -- Support Apache Arrow As In-Memory Data Format

Summary

Motivation(required)

Insert Data Flow

Search Data Flow

Public Interfaces(optional)

Design Details(required)

Insert Data Flow

Search Data Flow

Test Plan(required)

References(optional)