Current state: Under Discussion
Keywords: HA, primary-backup, RootCoord, DataCoord, QueryCoord, IndexCoord
Released:
To achieve HA by supporting primary-backup mechanism for all Coordinators(RootCoord, DataCoord, QueryCoord, IndexCoord).
Currently, each Milvus coordinator(coord) is a single node, so there are risks of single points of failure(SPOF). As a distributed system, Milvus needs to mitigate the impact of SPOFs and provides high available service to users. This project aims to support basic primary-backup mechanism, which is a popular high availability solution.
No public interfaces need to be added or changed. It's transparent to users.
Key points for primary-backup mechanism
Current start-up process
The start-up progress of a coordinator is like the graph above (take DataCoord as an example, others are very similar). Currently, each component in Milvus cluster maintains a keepAlive lease with ETCD(internal Register in the graph), and Milvus establishes service discovery based on it. Therefore, we can easily detect failure by watching the coords' key in ETCD. The design are mainly in internal register as followed.
internal Register
Develop a cluster with primary and standby coords. Manually remove the primary coord or mock some crash in the primary coord. The standby coord should take over successfully. The cluster must keep working after a short time of partial failure. The test should be done for each kind of coords.