Current state: "Rejected"

ISSUE: #7210

PRs: 

Keywords: arrow/column-based/row-based

...

Arrow memory usage: we tested the following 3 scenarios used in Milvus:

Data Type                          | Raw Data Size (Byte) | Arrow Array Buffer Size (Byte)
int64                              | 80000                | 80000
FixedSizeList (float32, dim = 128) | 5120000              | 5160064
string (len = 512)                 | 5120000              | 5160064

For scalar data, an Arrow Array uses the same amount of memory as the raw data;

for vector or string data, an Arrow Array uses slightly more memory than the raw data (about 4 extra bytes per row).
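
As a rough sketch of how these buffer sizes can be inspected, assuming the Go Arrow library (github.com/apache/arrow/go/arrow); the 10000-row count matches the table above (80000 bytes / 8 bytes per int64), and exact totals may differ slightly due to validity-bitmap allocation and padding:

    package main

    import (
        "fmt"

        "github.com/apache/arrow/go/arrow/array"
        "github.com/apache/arrow/go/arrow/memory"
    )

    func main() {
        mem := memory.NewGoAllocator()

        // Build an int64 Arrow Array with 10000 rows (80000 bytes of raw data).
        bld := array.NewInt64Builder(mem)
        defer bld.Release()
        bld.AppendValues(make([]int64, 10000), nil)
        arr := bld.NewInt64Array()
        defer arr.Release()

        // Inspect the Array's underlying buffers (validity bitmap, values;
        // a string Array would also carry a 4-byte-per-row offsets buffer,
        // which is where the extra ~4 bytes per row in the table come from).
        for i, buf := range arr.Data().Buffers() {
            if buf != nil {
                fmt.Printf("buffer %d: %d bytes\n", i, buf.Len())
            }
        }
    }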


...


Here is an example to illustrate the problems we would encounter if we used Arrow.

During insert, after the Proxy receives data, it encapsulates the data into InsertMsgs row by row, and then sends them to the Datanode through Pulsar.

Splitting by row is done for two reasons (a rough sketch of the split follows the list):

  1. Each collection has two or more physical channels; insertion performance improves when writing to multiple channels concurrently.
  2. Pulsar limits the size of each InsertMsg.
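
A rough, hypothetical sketch of this row-wise split in Go; insertMsg, maxMsgSize, and the round-robin channel assignment below are simplified placeholders, not the actual Milvus definitions:

    package main

    // insertMsg is a simplified stand-in for Milvus's InsertMsg.
    type insertMsg struct {
        Channel string
        Rows    [][]byte // one serialized row per entry
    }

    // maxMsgSize is a placeholder for Pulsar's message size limit.
    const maxMsgSize = 4 << 20

    // splitRows spreads rows over the collection's physical channels
    // (reason 1) and caps every message at maxMsgSize (reason 2).
    func splitRows(rows [][]byte, channels []string) []insertMsg {
        var msgs []insertMsg
        for i, ch := range channels {
            cur := insertMsg{Channel: ch}
            size := 0
            for j := i; j < len(rows); j += len(channels) { // round-robin rows
                if size+len(rows[j]) > maxMsgSize && len(cur.Rows) > 0 {
                    msgs = append(msgs, cur)
                    cur = insertMsg{Channel: ch}
                    size = 0
                }
                cur.Rows = append(cur.Rows, rows[j])
                size += len(rows[j])
            }
            if len(cur.Rows) > 0 {
                msgs = append(msgs, cur)
            }
        }
        return msgs
    }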

We tried 4 solutions; each has its own problems:

  • Solution-1

After the Proxy receives the inserted data, it creates only one Arrow RecordBatch, encapsulates the data into InsertMsgs row by row, and then sends them to the Datanode through Pulsar.

PROBLEM: There is no interface to read data row by row from an Arrow RecordBatch. RecordBatch has a NewSlice interface, but the return value of NewSlice cannot be used for anything except printing.
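
A minimal sketch of that limitation with the Go Arrow library; the schema and values are made up for illustration:

    package main

    import (
        "fmt"

        "github.com/apache/arrow/go/arrow"
        "github.com/apache/arrow/go/arrow/array"
        "github.com/apache/arrow/go/arrow/memory"
    )

    func main() {
        schema := arrow.NewSchema(
            []arrow.Field{{Name: "id", Type: arrow.PrimitiveTypes.Int64}}, nil)

        bld := array.NewRecordBuilder(memory.NewGoAllocator(), schema)
        defer bld.Release()
        bld.Field(0).(*array.Int64Builder).AppendValues([]int64{1, 2, 3, 4}, nil)
        rec := bld.NewRecord()
        defer rec.Release()

        // NewSlice returns a zero-copy view of rows [1, 3) ...
        slice := rec.NewSlice(1, 3)
        defer slice.Release()
        fmt.Println(slice.Column(0)) // ... which can be printed,

        // ... but Record has no row-level accessor: to read values back, every
        // column must be downcast to its concrete Array type, so there is no
        // generic way to peel rows off a RecordBatch into an InsertMsg.
    }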

  • Solution-2

After the Proxy receives the inserted data, it creates multiple Arrow RecordBatches in advance, according to Pulsar's size limit for InsertMsg. The data is serialized per RecordBatch, inserted into InsertMsgs, and then sent to the Datanode through Pulsar. The Datanode combines the multiple RecordBatches back into one complete RecordBatch.

PROBLEM: Multiple RecordBatches can only be logically restored into one Arrow Table; each column's data remains physically discontiguous, so subsequent columnar operations cannot be performed.
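
A sketch of that outcome with the Go Arrow library; two tiny RecordBatches stand in for the Pulsar-sized ones:

    package main

    import (
        "fmt"

        "github.com/apache/arrow/go/arrow"
        "github.com/apache/arrow/go/arrow/array"
        "github.com/apache/arrow/go/arrow/memory"
    )

    func makeRecord(schema *arrow.Schema, vals []int64) array.Record {
        bld := array.NewRecordBuilder(memory.NewGoAllocator(), schema)
        defer bld.Release()
        bld.Field(0).(*array.Int64Builder).AppendValues(vals, nil)
        return bld.NewRecord()
    }

    func main() {
        schema := arrow.NewSchema(
            []arrow.Field{{Name: "id", Type: arrow.PrimitiveTypes.Int64}}, nil)

        rec1 := makeRecord(schema, []int64{1, 2})
        defer rec1.Release()
        rec2 := makeRecord(schema, []int64{3, 4})
        defer rec2.Release()

        // The two batches combine into one logical Table ...
        tbl := array.NewTableFromRecords(schema, []array.Record{rec1, rec2})
        defer tbl.Release()

        // ... but each column stays chunked: the underlying buffers remain
        // physically separate, which blocks columnar operations that expect
        // one contiguous buffer per column.
        fmt.Println("chunks in column 0:", len(tbl.Column(0).Data().Chunks())) // 2
    }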

  • Solution-3

After the Proxy receives the inserted data, it creates multiple Arrow Arrays, one per field, instead of a RecordBatch.

PROBLEM: The primitive unit of serialization in Arrow is the RecordBatch; Arrow provides no interface to serialize a standalone Arrow Array.
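
This can be seen in the IPC API of the Go Arrow library, sketched below: the writer consumes whole Records, and there is no equivalent entry point for a standalone Array:

    package main

    import (
        "bytes"

        "github.com/apache/arrow/go/arrow"
        "github.com/apache/arrow/go/arrow/array"
        "github.com/apache/arrow/go/arrow/ipc"
        "github.com/apache/arrow/go/arrow/memory"
    )

    func main() {
        schema := arrow.NewSchema(
            []arrow.Field{{Name: "id", Type: arrow.PrimitiveTypes.Int64}}, nil)

        bld := array.NewRecordBuilder(memory.NewGoAllocator(), schema)
        defer bld.Release()
        bld.Field(0).(*array.Int64Builder).AppendValues([]int64{1, 2, 3}, nil)
        rec := bld.NewRecord()
        defer rec.Release()

        // Serialization goes through ipc.Writer, whose Write method takes a
        // Record; the per-field Arrays of Solution-3 have no such entry point,
        // so they cannot be sent through Pulsar without first being wrapped
        // in a RecordBatch.
        var buf bytes.Buffer
        w := ipc.NewWriter(&buf, ipc.WithSchema(schema))
        defer w.Close()
        if err := w.Write(rec); err != nil {
            panic(err)
        }
    }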

  • Solution-4

After the Proxy receives the inserted data, it creates multiple RecordBatches in advance, according to Pulsar's size limit for InsertMsg. The data is serialized per RecordBatch, inserted into InsertMsgs, and then sent to the Datanode through Pulsar. The Datanode receives the multiple RecordBatches, fetches the data from each column, and regenerates a new RecordBatch.

PROBLEM: This seems to offer no advantage over the current implementation.
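
For completeness, a sketch of the Datanode-side regeneration (Go Arrow library; a single int64 column, and mergeRecords is our own illustration, not Milvus code). Every value is copied through a builder, which is exactly the copying Arrow is designed to avoid:

    import (
        "github.com/apache/arrow/go/arrow"
        "github.com/apache/arrow/go/arrow/array"
        "github.com/apache/arrow/go/arrow/memory"
    )

    // mergeRecords rebuilds one contiguous RecordBatch from the batches
    // received via Pulsar by re-appending every value through a builder.
    func mergeRecords(schema *arrow.Schema, recs []array.Record) array.Record {
        bld := array.NewRecordBuilder(memory.NewGoAllocator(), schema)
        defer bld.Release()
        ib := bld.Field(0).(*array.Int64Builder)
        for _, rec := range recs {
            col := rec.Column(0).(*array.Int64) // downcast per concrete type
            for i := 0; i < col.Len(); i++ {
                ib.Append(col.Value(i))
            }
        }
        return bld.NewRecord()
    }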


To summarize the limitations we hit when using Arrow:

  1. Arrow data can only be serialized and deserialized in units of RecordBatch;
  2. Row data cannot be copied out of a RecordBatch;
  3. A RecordBatch must be regenerated after being sent via Pulsar.


The query data flow runs into the same problems:

  1. The query results produced by segcore are reduced twice: once when the querynode merges the SearchResults of multiple segments, and once when the Proxy merges the query results of multiple querynodes. If the query results were RecordBatches, these reduce steps would be awkward, because data cannot be copied row by row (see the sketch after this list).
  2. The querynode sends SearchResult to the Proxy through Pulsar; after receiving the data, the Proxy would have to rebuild the RecordBatch, which defeats Arrow's zero-copy design.
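
To make the reduce problem concrete, here is an illustrative sketch (Go Arrow library; a single float32 score column stands in for a full SearchResult, and mergeTopK is our own example, not Milvus code). Merging two score-sorted batches means interleaving individual rows, and each selected row has to be re-appended through builders, column by column:

    import (
        "github.com/apache/arrow/go/arrow"
        "github.com/apache/arrow/go/arrow/array"
        "github.com/apache/arrow/go/arrow/memory"
    )

    // mergeTopK merges two RecordBatches whose "score" column (field 0) is
    // sorted ascending, keeping the k best rows. Because a row cannot be
    // copied as a unit, every selected row is re-appended value by value;
    // a real SearchResult would repeat this for every column.
    func mergeTopK(schema *arrow.Schema, a, b array.Record, k int) array.Record {
        bld := array.NewRecordBuilder(memory.NewGoAllocator(), schema)
        defer bld.Release()
        sb := bld.Field(0).(*array.Float32Builder)

        ca := a.Column(0).(*array.Float32)
        cb := b.Column(0).(*array.Float32)
        i, j := 0, 0
        for n := 0; n < k && (i < ca.Len() || j < cb.Len()); n++ {
            if j >= cb.Len() || (i < ca.Len() && ca.Value(i) <= cb.Value(j)) {
                sb.Append(ca.Value(i))
                i++
            } else {
                sb.Append(cb.Value(j))
                j++
            }
        }
        return bld.NewRecord()
    }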

Arrow is suitable for data analysis scenarios, where data is sealed and read-only.

In Milvus, we need to split and concatenate data.

So Arrow does not seem to be a good choice for Milvus.


Design Details (required)

We divide this MEP into 2 stages. All compatibility changes will be delivered in Stage 1, before Milvus 2.0.0; the other internal changes can be done later.

...