
Current state: Accepted

ISSUE: https://github.com/milvus-io/milvus/issues/15604

PRs: 

Keywords: bulk load, import

Released: with Milvus 2.1 

Authors:  

Summary

Import data through a dedicated bulk-load path to get better performance than calling insert().


Motivation

Typically, it costs several hours to insert one billion entities with 128-dimensional vectors via insert(). We need a new bulk-load interface for the following purposes:

  1. Import data from JSON format files. (first stage)
  2. Import data from NumPy format files. (first stage)
  3. Copy a collection from one Milvus 2.0 server to another. (second stage)
  4. Import data from Milvus 1.x into Milvus 2.0. (third stage)
  5. Import Parquet/Faiss files. (TBD)

Design Details

Some points to consider:

  • JSON is a flexible format; ideally, the import API ought to parse a user's JSON files without asking the user to reformat them according to a strict rule.
  • Users can store scalar fields and vector fields in a JSON file, laid out either row-based or column-based; the import API ought to support both layouts, as shown in the examples below.

         A row-based example (field names and values are illustrative):

{
  "table": {
    "rows": [
      {"id": 1, "year": 2021, "vector": [1.0, 1.1, 1.2]},
      {"id": 2, "year": 2022, "vector": [2.0, 2.1, 2.2]},
      {"id": 3, "year": 2023, "vector": [3.0, 3.1, 3.2]}
    ]
  }
}

         A column-based example (field names and values are illustrative):

{
  "table": {
    "columns": {
      "id": [1, 2, 3],
      "year": [2021, 2022, 2023],
      "vector": [[1.0, 1.1, 1.2], [2.0, 2.1, 2.2], [3.0, 3.1, 3.2]]
    }
  }
}


  • A NumPy file is in binary format, so we treat it only as vector data; each NumPy file represents one vector field (see the first sketch after this list).
  • Transferring a large file from the client to the proxy and then to the datanode is time-consuming and occupies too much network bandwidth. Instead, we ask users to upload data files to MinIO/S3, where the datanode can access them directly, and let the datanode read and parse the files from MinIO/S3 (see the second sketch after this list).
  • The parameters of the import API should be easy to extend in the future.
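
         A minimal sketch of preparing a NumPy file for a vector field, assuming one .npy file per vector field with shape (num_rows, dim); naming the file after the field is an assumption for illustration, not a confirmed rule:

import numpy as np

# Assumption: one .npy file per vector field, shaped (num_rows, dim);
# naming the file after the field ("vector") is illustrative, not a confirmed rule.
vectors = np.random.random((1000, 128)).astype(np.float32)
np.save("vector.npy", vectors)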
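
         And a minimal sketch of uploading the data files to MinIO so the datanode can read them directly, using the minio Python client; the endpoint, credentials, bucket, and object names are placeholders:

from minio import Minio

# Endpoint, credentials, and bucket/object names below are placeholders.
client = Minio("localhost:9000",
               access_key="minioadmin",
               secret_key="minioadmin",
               secure=False)
client.fput_object("milvus-bucket", "import/rows.json", "rows.json")
client.fput_object("milvus-bucket", "import/vector.npy", "vector.npy")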

SDK Interfaces
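
A hypothetical Python sketch of the import call, assuming a utility-level function that takes the collection name, a row-based/column-based flag, and the list of files already uploaded to MinIO/S3; the function name and parameters are illustrative, not the actual pymilvus API:

from typing import List

# Hypothetical sketch -- the name and parameters are illustrative,
# not the actual pymilvus API.
def bulk_load(collection_name: str,
              row_based: bool,
              files: List[str],
              partition_name: str = "",
              **options) -> List[int]:
    """Ask the server to import files already uploaded to MinIO/S3.

    Returns a list of import task ids. The trailing **options keeps
    the parameter list easy to extend, as noted in Design Details.
    """
    raise NotImplementedError  # design sketch only

The trailing **options mirrors the design point above that the import API's parameters stay easy to extend.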



RPC Interfaces



Internal Machinery



Test Plan
