...

Typically, it takes several hours to insert one billion entities with 128-dimensional vectors. We need a new interface to do bulk load, for the following purposes:

  1. import data from JSON format files (first stage)
  2. import data from numpy format files (first stage)
  3. copy a collection from one Milvus 2.0 server to another (second stage)

...

  1. Milvus 1.x to Milvus 2.0 (third stage)
  2. parquet/faiss files (TBD)

Design Details

Some points to consider:

  • JSON format is flexible; ideally, the import API ought to parse users' JSON files without asking them to reformat the files according to a strict rule.
  • Users can store scalar fields and vector fields in a JSON file, in either row-based or column-based layout; the import API ought to support both.

         A row-based example:

{
  "table": {
    "rows": [
      {"id": 1, "year": 2021, "vector": [1.0, 1.1, 1.2]},
      {"id": 2, "year": 2022, "vector": [2.0, 2.1, 2.2]},
      {"id": 3, "year": 2023, "vector": [3.0, 3.1, 3.2]}
    ]
  }
}

         A column-based example:

{
  "table": {
    "columns": {
      "id": [1, 2, 3],
      "year": [2021, 2022, 2023],
      "vector": [
        [1.0, 1.1, 1.2],
        [2.0, 2.1, 2.2],
        [3.0, 3.1, 3.2]
      ]
    }
  }
}
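As a sketch of how the importer could normalize both layouts into one internal representation (the "table"/"rows"/"columns" keys follow the examples above; the helper name is hypothetical, not the actual Milvus implementation):

```python
import json

def load_table(json_text):
    """Hypothetical helper: normalize a row-based or column-based JSON
    table into a column-based dict of {field_name: [values...]}."""
    table = json.loads(json_text)["table"]
    if "rows" in table:
        # Row-based: a list of {field: value} objects, one per entity.
        columns = {}
        for row in table["rows"]:
            for field, value in row.items():
                columns.setdefault(field, []).append(value)
        return columns
    if "columns" in table:
        # Column-based: already {field: [values...]}.
        return dict(table["columns"])
    raise ValueError("expected a 'rows' or 'columns' key under 'table'")

row_based = '{"table": {"rows": [{"id": 1, "vector": [1.0, 1.1]}, {"id": 2, "vector": [2.0, 2.1]}]}}'
col_based = '{"table": {"columns": {"id": [1, 2], "vector": [[1.0, 1.1], [2.0, 2.1]]}}}'
assert load_table(row_based) == load_table(col_based)
```

Both layouts reduce to the same columnar form, which is also the natural shape for handing data to the datanode's insert path.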


  • A numpy file is in binary format; we treat it only as vector data. Each numpy file represents one vector field.
  • Transferring a large file from the client to the proxy and then to the datanode is time-consuming and occupies too much network bandwidth. Instead, we will ask users to upload data files to MinIO/S3, which the datanode can access directly, and let the datanode read and parse the files from MinIO/S3.
  • The parameters of the import API should be easy to extend in the future.
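The numpy layout described above can be illustrated with a minimal sketch: one .npy file per vector field, shaped (num_rows, dim). The file name "vector.npy" is an assumption mirroring the "vector" field in the JSON examples, not a required naming convention.

```python
import os
import tempfile
import numpy as np

# Hypothetical example: a single vector field saved as one .npy file.
# In practice vectors would be 128-dimensional; 3-d here for brevity.
vectors = np.array([
    [1.0, 1.1, 1.2],
    [2.0, 2.1, 2.2],
    [3.0, 3.1, 3.2],
], dtype=np.float32)

path = os.path.join(tempfile.gettempdir(), "vector.npy")
np.save(path, vectors)

# The datanode side can read the file back and infer the row count and
# dimension directly from the array shape -- no extra schema is needed
# for the vector field itself.
loaded = np.load(path)
assert loaded.shape == (3, 3)
assert loaded.dtype == np.float32
```

Because the .npy header already records dtype and shape, the parser can validate dimensions against the collection schema before ingesting any rows.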

SDK Interfaces



RPC Interfaces

...