
To evaluate Arrow memory usage, we tested the following 3 scenarios used in Milvus (each holds 10,000 rows, as the sizes below imply):

Data Type                          | Raw Data Size (Byte) | Arrow Array Buffer Size (Byte)
int64                              | 80000                | 80000
FixedSizeList (float32, dim = 128) | 5120000              | 5160064
string (len = 512)                 | 5120000              | 5160064

For scalar data, an Arrow Array uses the same amount of memory as the raw data; for vector or string data, an Arrow Array uses slightly more memory than the raw data (about 4 extra bytes per row).
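
The int64 case can be reproduced with a short Go program. This is a minimal sketch against the Arrow Go bindings; the import paths follow the older github.com/apache/arrow/go/arrow layout, so adjust them to the Arrow release in use:

    package main

    import (
        "fmt"

        "github.com/apache/arrow/go/arrow/array"
        "github.com/apache/arrow/go/arrow/memory"
    )

    func main() {
        pool := memory.NewGoAllocator()
        b := array.NewInt64Builder(pool)
        defer b.Release()

        // 10,000 rows of int64 -> 80000 bytes of raw data.
        for i := 0; i < 10000; i++ {
            b.Append(int64(i))
        }
        arr := b.NewInt64Array()
        defer arr.Release()

        // Sum the sizes of the physical buffers backing the Array.
        total := 0
        for _, buf := range arr.Data().Buffers() {
            if buf != nil { // the validity buffer is nil when there are no nulls
                total += buf.Len()
            }
        }
        fmt.Println("Arrow buffer bytes:", total) // expected: 80000
    }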


Here is an example to illustrate the problems we would encounter if we used Arrow.

During insert, after the Proxy receives data, it encapsulates the data row by row into InsertMsg, and then sends it to the Datanode through Pulsar.

Splitting by row is done for two reasons (a sketch of the splitting follows the list):

  1. Each collection has two or more physical channels. Data insertion performance can be improved by inserting into multiple channels at the same time.
  2. Pulsar limits the size of each InsertMsg.
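
To make the row-wise splitting concrete, here is a hypothetical sketch; Row, maxMsgSize, and the hashing rule are illustrative placeholders, not actual Milvus types or limits. Rows are hashed to a physical channel by primary key and accumulated into chunks that stay under the Pulsar message size limit:

    package sketch

    // Row is an illustrative stand-in for one inserted row.
    type Row struct {
        PK   int64  // primary key, used to pick a physical channel
        Data []byte // serialized row payload
    }

    const maxMsgSize = 4 << 20 // assumed Pulsar message size limit

    // split distributes rows across channels by PK hash and cuts each
    // channel's stream into chunks small enough for one InsertMsg.
    func split(rows []Row, numChannels int) [][][]Row {
        chunks := make([][][]Row, numChannels)
        sizes := make([]int, numChannels)
        for _, r := range rows {
            ch := int(r.PK % int64(numChannels))
            if len(chunks[ch]) == 0 || sizes[ch]+len(r.Data) > maxMsgSize {
                chunks[ch] = append(chunks[ch], nil) // start a new chunk
                sizes[ch] = 0
            }
            last := len(chunks[ch]) - 1
            chunks[ch][last] = append(chunks[ch][last], r)
            sizes[ch] += len(r.Data)
        }
        return chunks
    }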

We tried 4 solutions, each has its own problems:

  • Solution-1

After the Proxy receives the inserted data, it creates only one Arrow RecordBatch, encapsulates the data row by row into InsertMsg, and then sends it to the Datanode through Pulsar.

PROBLEM: There is no interface to read data row by row from an Arrow RecordBatch. RecordBatch has a NewSlice interface, but the value returned by NewSlice cannot be used for anything except printing.
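
A minimal sketch of this dead end, with the same assumed import paths as the earlier sketch: NewSlice returns a zero-copy one-row Record, but the Record interface exposes no accessor to consume that row.

    package main

    import (
        "fmt"

        "github.com/apache/arrow/go/arrow"
        "github.com/apache/arrow/go/arrow/array"
        "github.com/apache/arrow/go/arrow/memory"
    )

    func main() {
        pool := memory.NewGoAllocator()
        schema := arrow.NewSchema([]arrow.Field{
            {Name: "val", Type: arrow.PrimitiveTypes.Int64},
        }, nil)

        rb := array.NewRecordBuilder(pool, schema)
        defer rb.Release()
        rb.Field(0).(*array.Int64Builder).AppendValues([]int64{1, 2, 3, 4}, nil)
        rec := rb.NewRecord()
        defer rec.Release()

        // Zero-copy view of row 1 only; array.Record offers no row-level
        // getters, so printing is about all we can do with the slice.
        row := rec.NewSlice(1, 2)
        defer row.Release()
        fmt.Println(row)
    }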

  • Solution-2

After the Proxy receives the inserted data, it creates multiple Arrow RecordBatches in advance, sized to Pulsar's limit on InsertMsg. Each RecordBatch is serialized into an InsertMsg and sent to the Datanode through Pulsar. The Datanode combines the multiple RecordBatches into one complete RecordBatch.

PROBLEM: Multiple RecordBatches can only be logically combined into one Arrow Table; each column's data remains physically discontiguous, so subsequent columnar operations cannot be performed.
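
This can be observed directly (a sketch under the same import-path assumptions): building a Table from two RecordBatches leaves each column split into two physical chunks.

    package main

    import (
        "fmt"

        "github.com/apache/arrow/go/arrow"
        "github.com/apache/arrow/go/arrow/array"
        "github.com/apache/arrow/go/arrow/memory"
    )

    func makeRecord(pool memory.Allocator, schema *arrow.Schema, vals []int64) array.Record {
        rb := array.NewRecordBuilder(pool, schema)
        defer rb.Release()
        rb.Field(0).(*array.Int64Builder).AppendValues(vals, nil)
        return rb.NewRecord()
    }

    func main() {
        pool := memory.NewGoAllocator()
        schema := arrow.NewSchema([]arrow.Field{
            {Name: "val", Type: arrow.PrimitiveTypes.Int64},
        }, nil)

        rec1 := makeRecord(pool, schema, []int64{1, 2})
        rec2 := makeRecord(pool, schema, []int64{3, 4})

        // The Table is only a logical view: column 0 still consists of two
        // physically separate chunks, one per original RecordBatch.
        tbl := array.NewTableFromRecords(schema, []array.Record{rec1, rec2})
        defer tbl.Release()
        fmt.Println("chunks in column 0:", len(tbl.Column(0).Data().Chunks())) // prints 2
    }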

  • Solution-3

After the Proxy receives the inserted data, it creates multiple Arrow Arrays, one per field, instead of a RecordBatch.

PROBLEM: The primitive unit of serialization in Arrow is the RecordBatch; Arrow provides no interface to serialize a standalone Arrow Array.
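
For instance, Arrow's Go IPC writer only accepts a Record; to serialize a single Array you would first have to wrap it in a one-column RecordBatch (a sketch, same import-path assumptions as above):

    package main

    import (
        "bytes"
        "log"

        "github.com/apache/arrow/go/arrow"
        "github.com/apache/arrow/go/arrow/array"
        "github.com/apache/arrow/go/arrow/ipc"
        "github.com/apache/arrow/go/arrow/memory"
    )

    func main() {
        pool := memory.NewGoAllocator()
        schema := arrow.NewSchema([]arrow.Field{
            {Name: "val", Type: arrow.PrimitiveTypes.Int64},
        }, nil)

        rb := array.NewRecordBuilder(pool, schema)
        defer rb.Release()
        rb.Field(0).(*array.Int64Builder).AppendValues([]int64{1, 2, 3}, nil)
        rec := rb.NewRecord()
        defer rec.Release()

        // ipc.Writer.Write takes an array.Record; there is no Array-level
        // equivalent, so the RecordBatch is the unit of serialization.
        var buf bytes.Buffer
        w := ipc.NewWriter(&buf, ipc.WithSchema(schema))
        if err := w.Write(rec); err != nil {
            log.Fatal(err)
        }
        w.Close()
    }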

  • Solution-4

After the Proxy receives the inserted data, it creates multiple RecordBatches in advance, sized to Pulsar's limit on InsertMsg. Each RecordBatch is serialized into an InsertMsg and sent to the Datanode through Pulsar. The Datanode receives the multiple RecordBatches, fetches the data of each column, and regenerates a new RecordBatch.

PROBLEM: This appears to offer no advantage over the current implementation.

To summarize, the limitations we ran into with Arrow:

1. Arrow data can only be serialized and deserialized at RecordBatch granularity.

2. RecordBatch does not support copying data row by row.

3. The RecordBatch must be re-created on the receiving side of Pulsar.

The same problems will be encountered in the query path:

1. The query results produced by segcore need to be reduced twice: once when the QueryNode merges the SearchResults of multiple segments, and again when the Proxy merges the query results of multiple QueryNodes. If the query results were held in RecordBatch format, the reduce would be inconvenient because data cannot be copied row by row (see the sketch after this list).

2. The QueryNode needs to send the SearchResult to the Proxy through Pulsar. After receiving the data, the Proxy has to rebuild the RecordBatch, which defeats Arrow's zero-copy design goal.
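
To illustrate point 1, here is a hedged sketch (a fragment, same import-path assumptions as the earlier sketches) of what a row-wise copy would have to look like on top of the Go bindings: because nothing like a row-copy primitive exists, every reduce step needs a hand-rolled type switch over every supported column type.

    package reducesketch

    import (
        "fmt"

        "github.com/apache/arrow/go/arrow/array"
    )

    // appendRow copies row i of rec into the per-column builders of rb.
    // Arrow provides no single-call row copy, so a merge/reduce over
    // RecordBatches must handle each field type case by case.
    func appendRow(rb *array.RecordBuilder, rec array.Record, i int) {
        for c := 0; c < int(rec.NumCols()); c++ {
            switch col := rec.Column(c).(type) {
            case *array.Int64:
                rb.Field(c).(*array.Int64Builder).Append(col.Value(i))
            case *array.Float32:
                rb.Field(c).(*array.Float32Builder).Append(col.Value(i))
            case *array.String:
                rb.Field(c).(*array.StringBuilder).Append(col.Value(i))
            // ... and so on for every other supported type
            default:
                panic(fmt.Sprintf("unsupported column type %T", col))
            }
        }
    }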

So I don't think Arrow is suitable for Milvus's application scenario.

