Current state: Under Discussion
Keywords: HA, primary-backup, RootCoord, DataCoord, QueryCoord, IndexCoord
Released:

Summary


Achieve high availability (HA) by supporting a primary-backup mechanism for all coordinators (RootCoord, DataCoord, QueryCoord, IndexCoord).

Motivation


Currently, each Milvus coordinator (coord) runs as a single node, so each one is a single point of failure (SPOF). As a distributed system, Milvus needs to mitigate the impact of SPOFs and provide a highly available service to users. This project aims to support a basic primary-backup mechanism, which is a popular high-availability solution.

Public Interfaces


No public interfaces need to be added or changed. The change is transparent to users.

Design Details


Key points for primary-backup mechanism

...

  1. Try to lock the key in etcd. If it succeeds, this node becomes the primary. → 5. If it fails, another node holds the lock. → 2.
  2. Enter StandbyMode. → 3, 4
  3. Watch the primary key in etcd. On receiving a key DELETE event, go back to 1 and campaign for the lock.
  4. Start a loop goroutine that prints logs and calls WarmUp(). WarmUp does work such as updating the meta, to speed up the Restart if necessary.
  5. Check whether the node is in StandbyMode. If true → 6. If false → 8.
  6. Restart: call the internal Start. → 7
  7. Exit StandbyMode. This stops the loop started in 4. → 8
  8. Register the service in etcd as primary. → 9
  9. Start the LivenessCheck goroutine.
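
The control flow of the steps above can be sketched in Go as follows. This is a minimal illustration, not the Milvus implementation: LockStore, memStore, Coord and the event names are hypothetical, and an in-memory map stands in for etcd so that the example has no external dependencies. WarmUp is simplified to one call per wait instead of a background loop, and steps 6 to 9 only record that they happened.

```go
package main

import (
	"errors"
	"sync"
)

// ErrLockHeld is returned when another node already holds the primary key.
var ErrLockHeld = errors.New("primary key is held by another node")

// LockStore is a hypothetical stand-in for the etcd operations used in steps 1 and 3.
type LockStore interface {
	TryLock(key, owner string) error
	Unlock(key, owner string)
	WaitDeleted(key string) <-chan struct{}
}

// memStore is an in-memory LockStore used only for illustration.
type memStore struct {
	mu      sync.Mutex
	owners  map[string]string
	waiters map[string][]chan struct{}
}

func newMemStore() *memStore {
	return &memStore{owners: map[string]string{}, waiters: map[string][]chan struct{}{}}
}

func (s *memStore) TryLock(key, owner string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, held := s.owners[key]; held {
		return ErrLockHeld
	}
	s.owners[key] = owner
	return nil
}

// Unlock deletes the key if owner holds it and wakes every watcher.
func (s *memStore) Unlock(key, owner string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.owners[key] != owner {
		return
	}
	delete(s.owners, key)
	for _, ch := range s.waiters[key] {
		close(ch)
	}
	delete(s.waiters, key)
}

// WaitDeleted returns a channel that is closed once the key is absent.
func (s *memStore) WaitDeleted(key string) <-chan struct{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	ch := make(chan struct{})
	if _, held := s.owners[key]; !held {
		close(ch)
		return ch
	}
	s.waiters[key] = append(s.waiters[key], ch)
	return ch
}

// Coord is a hypothetical coordinator node campaigning for the primary role.
type Coord struct {
	Name, Key string
	Store     LockStore
	InStandby chan struct{} // closed when the node enters StandbyMode
	Events    []string      // ordered trace of the steps taken
}

func NewCoord(name, key string, store LockStore) *Coord {
	return &Coord{Name: name, Key: key, Store: store, InStandby: make(chan struct{})}
}

// Run blocks until this node has become the primary.
func (c *Coord) Run() {
	standby := false
	for c.Store.TryLock(c.Key, c.Name) != nil { // step 1
		if !standby {
			standby = true
			c.Events = append(c.Events, "enter-standby") // step 2
			close(c.InStandby)
		}
		c.Events = append(c.Events, "warmup") // step 4 (simplified)
		<-c.Store.WaitDeleted(c.Key)          // step 3, then back to step 1
	}
	if standby { // step 5
		c.Events = append(c.Events, "restart", "exit-standby") // steps 6 and 7
	}
	c.Events = append(c.Events, "register-primary", "start-liveness-check") // steps 8 and 9
}
```

In this sketch, a node that wins the lock on its first attempt goes straight from step 1 to steps 8 and 9, while a node that had to wait also passes through steps 6 and 7 before registering as primary.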


Test Plan


1. Deploy a cluster with primary and standby coords. Manually remove the primary coord or simulate a crash in it. The standby coord should take over successfully, and the cluster must resume working after a short period of partial failure.

...

3. The test should be run for each kind of coord.

Future work