...

Keywords: Connection Manager, Component startup

Released:


Summary

Add a ConnectionManager to manage and monitor grpc client connections. With ConnectionManager, we will also unify the component startup process.

Motivation

We have faced many problems caused by grpc connections. The current grpc connection layer exposes few external parameters and is difficult for the outer layer to control. There is also no reasonable way to handle the online and offline information of sessions. This causes the problems listed below.

...

  1. The startup logic of each component is not uniform.
  2. A method call on the grpc client cannot be stopped by client.stop(), and the total connection time is too long because the grpc connections are established serially.
  3. When a method is called again, a server cannot know that other servers have unregistered, leading to a long delay before the method fails.

Public Interfaces

type ConnectionManager struct {

...

WatchClients(serverName string) []interface{}


Design Details

Goal:

  1. The startup logic of each component is clear and unified, and a connection to a dependent service is required only when a method is called, not during startup.
  2. If the kv registered by a node on etcd fails, the related grpc requests should fail quickly and inform the SDK or the initiator of the request.
  3. After a failed node is restarted, the grpc link should be restored automatically. The automatic restoration process should be observable and non-blocking.

...

Through this mechanism, we can achieve goals 2 and 3.
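
A minimal sketch of the behaviour that goals 2 and 3 describe, not the proposed implementation. The type nodeRegistry, its methods OnAdd, OnDelete and Call, and the error ErrOffline are hypothetical names introduced only for illustration. The sketch assumes that etcd add and delete events for session keys are delivered to the registry by some watcher that is not shown.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrOffline is a hypothetical error returned immediately when the
// target node is not currently registered.
var ErrOffline = errors.New("node is offline")

// nodeRegistry is a hypothetical stand-in for the bookkeeping a connection
// manager could keep from etcd session events. It is not the Milvus type.
type nodeRegistry struct {
	mu     sync.RWMutex
	online map[string]string // node name -> service address
}

func newNodeRegistry() *nodeRegistry {
	return &nodeRegistry{online: make(map[string]string)}
}

// OnAdd would be driven by an etcd add event for a session key (assumption).
func (r *nodeRegistry) OnAdd(name, addr string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.online[name] = addr
}

// OnDelete would be driven by an etcd delete event for a session key (assumption).
func (r *nodeRegistry) OnDelete(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.online, name)
}

// Call fails fast with ErrOffline instead of waiting on a dead connection.
// When the node is online, fn receives its address and performs the request.
func (r *nodeRegistry) Call(name string, fn func(addr string) error) error {
	r.mu.RLock()
	addr, ok := r.online[name]
	r.mu.RUnlock()
	if !ok {
		return fmt.Errorf("%s: %w", name, ErrOffline)
	}
	return fn(addr)
}

func main() {
	reg := newNodeRegistry()
	request := func(addr string) error { return nil }

	// The node is not registered yet, so the call returns an error at once.
	fmt.Println(reg.Call("querynode-1", request))

	// After the node registers again, calls go through without a restart
	// of the caller and without blocking other callers.
	reg.OnAdd("querynode-1", "10.0.0.1:21123")
	fmt.Println(reg.Call("querynode-1", request))
}
```

Because the lookup only holds a read lock for the map access, a slow request to one node does not block status updates or calls to other nodes.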


Compatibility, Deprecation, and Migration Plan

With ClientManager, the startup process will be standardized as follows.

...

  1. Initialize the state code in NewServer().
  2. ClientManager adds the Coordinator service dependencies. Take QueryCoord, which depends on RootCoordinator, as an example. ClientManager obtains the required RootCoordinator service address from etcd. If the service has been registered and the address is obtained successfully, the grpc connection is established. If the grpc connection cannot be established, ClientManager enters the retry logic and retries 3 times, 30 seconds each. If the connection still fails after 3 retries, ClientManager gives up retrying but does not panic. It marks RootCoordinator offline and expects a future etcd Add event to re-trigger the link to RootCoordinator.
  3. ClientManager processes the Node services that already exist in the meta. Taking QueryCoord as an example, ClientManager obtains the service addresses of the existing QueryNodes from etcd and opens a separate coroutine to process each QueryNode. The processing logic is as follows: get the address and try to establish a grpc link. If the grpc connection cannot be established, retry 3 times, 30 seconds each. If the connection still fails, give up processing. If the connection succeeds, create the corresponding Client and add it to ClientManager.
  4. Register its own service address with etcd from its own session.
  5. Set the state code to Healthy.
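
The retry logic in steps 2 and 3 can be sketched as follows. This is an illustration under stated assumptions, not the ClientManager implementation: dialFunc, connectWithRetry, connectAll and errUnreachable are hypothetical names, "retry 3 times" is read as one initial attempt plus 3 retries, and "30 seconds" is read as the wait between attempts. The dial hook stands in for establishing a real grpc connection.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errUnreachable is a hypothetical error used by the demo dialer.
var errUnreachable = errors.New("unreachable")

// dialFunc is a hypothetical hook standing in for establishing a grpc connection.
type dialFunc func(addr string) error

// connectWithRetry makes one initial attempt plus up to `retries` more,
// sleeping `interval` between attempts. On final failure it returns the
// last error instead of panicking, so the caller can mark the service offline.
func connectWithRetry(addr string, retries int, interval time.Duration, dial dialFunc) error {
	var err error
	for attempt := 0; attempt <= retries; attempt++ {
		if attempt > 0 {
			time.Sleep(interval)
		}
		if err = dial(addr); err == nil {
			return nil
		}
	}
	return fmt.Errorf("connect to %s: gave up after %d retries: %w", addr, retries, err)
}

// connectAll processes every node in its own goroutine, as in step 3, and
// returns which nodes ended up online. A slow or dead node only delays its
// own goroutine, not the attempts for the other nodes.
func connectAll(nodes map[string]string, retries int, interval time.Duration, dial dialFunc) map[string]bool {
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		online = make(map[string]bool, len(nodes))
	)
	for name, addr := range nodes {
		wg.Add(1)
		go func(name, addr string) {
			defer wg.Done()
			err := connectWithRetry(addr, retries, interval, dial)
			mu.Lock()
			online[name] = err == nil // false means "marked offline"
			mu.Unlock()
		}(name, addr)
	}
	wg.Wait()
	return online
}

func main() {
	// Demo dialer: an empty address simulates a node that cannot be reached.
	dial := func(addr string) error {
		if addr == "" {
			return errUnreachable
		}
		return nil
	}
	nodes := map[string]string{"querynode-1": "10.0.0.1:21123", "querynode-2": ""}

	// The proposal uses 3 retries at 30 seconds; a short interval keeps the demo fast.
	status := connectAll(nodes, 3, 10*time.Millisecond, dial)
	fmt.Println(status["querynode-1"], status["querynode-2"])
}
```

In this sketch connectAll waits for all goroutines only so that the demo can print a result. A startup sequence that must not block on unreachable nodes could skip the wait and proceed to steps 4 and 5 while the goroutines keep running.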


Test Plan

Unit tests will be added for ClientManager to verify that it reacts quickly to connection changes.
Recovery will be tested against the existing issues #5976, #6098, #6110, and #6236.